AI Red Team: Safety vs. Security

AI Red Teaming should differentiate between security and safety to effectively address the unique challenges posed by AI systems, ultimately ensuring that they are secure by design and safe to use.

AI Red Team: Safety vs. Security

Introduction

AI Red Teaming has emerged as a crucial approach to proactively identify and mitigate potential vulnerabilities and risks associated with AI systems. While the term "AI Red Teaming" has gained traction in both technology and policy circles, there is often a lack of clarity regarding the distinction between security and safety in this context. We argue that AI Red Teaming should differentiate between security and safety to effectively address the unique challenges posed by AI systems, ultimately ensuring that they are secure by design and safe to use.

Lazy Summary

💡
AI Red Team - assess if AI systems are secure by design and safe to use. Know the difference, and assess both.
💡
Red Teaming consists of 4 fundamental components.
[1] Adversary, [2] Objective, [3] Target, [4] Instrument.

Security Example
[1] An APT group [2] collects intelligence [3] from an AI-driven customer support chatbot [4] using prompt injections.

Safety Example
[1] An APT group [2] collects intelligence [3] from military [4] using LLM-enabled malware.

Red Teaming Theory

Rooted in military and intelligence practices, red teaming adopts the perspective of an adversary to uncover vulnerabilities and refine defenses. Historically, this practice has been applied across various domains:

  • Military - The Millennium Challenge 2002 exercise revealed vulnerabilities in U.S. forces by simulating asymmetric threats.
  • Intelligence - The CIA’s Red Cell, established post-9/11, challenged conventional thinking to identify emerging terrorism risks.
  • Technology - Early “tiger teams” in the 1960s focused on spacecraft subsystem failures, evolving into today’s cybersecurity-focused red teams.

In the context of AI, red teaming adopts an adversarial lens to uncover safety and security flaws in AI models, data, and systems. However, effective AI Red Teaming requires a clear understanding of the distinction between security and safety.

Secure by Design

A secure AI system relies fundamentally on the classic information security triad:

  • Confidentiality - Protecting sensitive data from unauthorized access.
  • Integrity - Ensuring data and models remain unaltered and accurate.
  • Availability - Guaranteeing accessibility and functionality.

A real-world Example is uncovering Proofpoint's vulnerability in it's AI email scoring system. Researchers uncovered AI vulnerabilities by reverse-engineering datasets and creating copycat models to manipulate email classifications.

🎯
CVE-2019-20634: By collecting scores from Proofpoint email headers, it is possible to build a copy-cat Machine Learning Classification model and extract insights from this model.

Safe to Use

Safety-focused red teaming evaluates how AI systems may deviate from intended goals or cause harm - whether through flawed reward functions, biased training data, or unforeseen interactions.

Two dimensions stand out:

  • Trustworthiness - AI systems must align with human values, ethical principles, and objectives.
  • Harm Reduction - Proactively addressing unintended consequences, such as algorithmic bias or misuse.

Extensive red teaming by Anthropic, OpenAI, and Microsoft explore scenarios like generating harmful or biased content, general misuse, privacy, and deceptive behavior leading to refinements in data and safety constraints.

AI Red Team Matrix

The following matrix illustrates the distinction between security-focused and safety-focused AI Red Teaming. Security-focused efforts aim to thwart malicious actors targeting AI systems, while safety-focused efforts ensure AI systems cannot be misused or that they do not cause harm to users or society.

Adversary Uses Targets
Malicious Actor Capabilities AI System
Malicious Actor AI System Users, Humans, Society
AI System Capabilities Users, Humans, Society

Balancing these two perspectives is critical for long-term AI deployment success, as it ensures systems are not only robust against external threats but also aligned with ethical principles and societal values.

When executed effectively, AI Red Teaming fosters trust and confidence in AI systems, enabling their responsible deployment across critical domains.

Sentry's AI Red Team Service

With over 1,400 successful security assessments, our industry-recognized experts are at the forefront of AI security. Our team is certified, extensively trained, and partnered with ISO 42001-certified firms, ensuring that your AI-enabled applications meet the highest security standards.

Book your free 30-Minute Consultation