Red Team

Red Teaming AI Systems: Beyond Traditional Pentesting

Red teaming AI systems demands a fundamentally different mindset from classical network or application testing. Here is how to build an effective AI red team program.

HumanNovember 20, 20257 min read

Red Teaming AI Systems: Beyond Traditional Pentesting

The Limits of the Classic Red Team Playbook

A traditional red team engagement follows a predictable arc. Reconnaissance. Initial access. Privilege escalation. Lateral movement. Data exfiltration. The MITRE ATT&CK framework provides a shared vocabulary, Cobalt Strike a shared toolkit, and years of collective experience a shared methodology.

None of that maps cleanly onto AI systems.

When you are red teaming an LLM-powered product, you are not looking for an unpatched CVE or a misconfigured S3 bucket. You are probing the behavior of a probabilistic system trained on data you have not seen, deployed with instructions you may not be able to read, integrated with tools whose permissions you need to discover by inference.

This article describes how to build an effective AI red team program, one that goes beyond "try some jailbreaks and see what happens."

What AI Red Teaming Actually Is

Microsoft defines AI red teaming as "the practice of rigorously adversarially probing an AI system for safety, security, and trustworthiness failures." That definition is useful because it explicitly bundles three domains:

Safety: Does the model produce harmful outputs? Can it be induced to provide dangerous instructions, generate abusive content, or act against users' interests?

Security: Can an adversary exploit the model to compromise the underlying system, exfiltrate data, execute unauthorized actions, pivot to connected infrastructure?

Trustworthiness: Does the model behave consistently and reliably? Can its reasoning be manipulated to produce incorrect or misleading outputs?

Classical pentesting focuses almost entirely on security. AI red teaming requires holding all three simultaneously.

Building the Team

An effective AI red team is multidisciplinary. You need:

Security practitioners with experience in application testing, API abuse, and injection techniques
ML engineers who understand model behavior, fine-tuning, and inference infrastructure
Domain specialists who understand the deployment context, a medical AI red team needs clinicians, an HR AI red team needs employment law knowledge
Cognitive diversity, people who think about problems differently, including creative and lateral thinkers

The worst AI red teams are composed entirely of security engineers who treat the model as just another application. The second-worst are composed entirely of ML engineers who have no adversarial mindset.

Phase 1: Scoping and Threat Modeling

Before touching the system, spend time understanding:

What is the system supposed to do?

Document the intended use case, user base, and operator context. A customer service bot and an agentic coding assistant have completely different threat models.

What assets are at risk?

Training data (confidential? PII?)
System prompt (competitive advantage? Security mechanism?)
Connected tools and APIs (what actions can the model take?)
Downstream consumers of model output

Who are the adversaries?

External attackers (no prior access)
Malicious users (legitimate account, adversarial intent)
Compromised third-party tools (indirect injection from external content)
Insider threats (employees with model access)

What is the blast radius?

What is the worst realistic outcome of a successful attack? Financial fraud? Data breach? Reputational damage? Physical harm? This drives prioritization.

Phase 2: Reconnaissance

Map the attack surface:

Model family and version: GPT-4o? Claude 3.5? A fine-tuned Llama? Each has known characteristics and published research.
System prompt discovery: Attempt to extract the system prompt via direct request, role confusion, and context manipulation. Note what you find and what you cannot find.
Tool enumeration: If the model has tool access, identify what tools exist and what permissions they hold. Enumerate via natural language ("what tools do you have access to?") and observe behavior.
Integration mapping: What systems consume the model's output? How is output rendered? Is it executed?
Rate limiting and content filters: Probe the envelope to understand what defensive controls exist.

Phase 3: Structured Attack Campaigns

Organize attacks by category. Do not free-style. Each category has specific test cases, and coverage matters.

Prompt Injection

Test both direct and indirect vectors:

Direct: In the user turn, attempt to override system instructions
Indirect: Inject into documents, web pages, or other content the model processes
Multi-turn: Build up context over many turns before executing the attack

Vary framing: role-play, hypothetical, translation, encoding, and obfuscation techniques.

Jailbreaking and Safety Bypass

Role-play and persona adoption
Fictional framing
Gradual escalation (the "boiling frog")
Token-level manipulation (homoglyphs, zero-width characters)
Language switching
Historical or scientific framing for prohibited content

Document what works, what partially works, and what fails cleanly. The failure modes are as informative as the successes.

Data and Model Extraction

System prompt extraction
Training data memorization probes
Model behavior fingerprinting (useful for attribution and for detecting model substitution)
Capability discovery beyond documented features

Agentic Attack Chains

For models with tool access:

Attempt to invoke tools with unauthorized parameters
Chain tool calls to achieve goals that would be blocked in a single step
Test whether the model can be induced to take irreversible actions (send an email, delete a file, make a payment)
Probe for SSRF via URL-fetching tools
Test whether injected content in tool outputs can redirect subsequent model behavior

Adversarial Inputs for Robustness

Inputs designed to cause factual errors or hallucinations
Inputs that exploit known biases in the model
Edge cases from the deployment domain (medical: rare conditions; legal: obscure jurisdictions; financial: unusual instruments)

Phase 4: Escalation and Chain Attacks

Real attacks are rarely single-step. After identifying individual vulnerabilities, build attack chains:

Discover system prompt via injection
Use system prompt contents to craft a more targeted follow-on attack
Use model tool access to exfiltrate data gathered from the system prompt
Use exfiltrated data for further attacks on connected systems

Chains reveal what individual vulnerability assessments miss: the combinatorial risk surface.

Phase 5: Reporting

AI red team reports require different structure from classical pentest reports.

For each finding, document:

Attack category (OWASP LLM Top 10 reference if applicable)
Reproduction prompt(s), exact prompt sequences, including any setup required
Model/version tested, behavior is version-specific
Probability and consistency, is the vulnerability 100% reproducible or probabilistic?
Blast radius, what can an attacker achieve?
Remediation guidance, specific, actionable

Include a section on defensive observations: controls that worked, things the model resisted, defense-in-depth elements that limited blast radius.

Tooling

Tool	Role
Garak	Automated probe generation and testing
PyRIT	Microsoft's structured risk assessment framework
DojoLM	Hands-on CTF environment for skill development
LangSmith	Tracing and evaluation for LangChain-based systems
Custom fuzzer	Always needed for application-specific logic

No tool replaces a skilled human red teamer. Automation finds the known-unknown; humans find the unknown-unknown.

Continuous Red Teaming

AI systems change. Models are updated. System prompts are revised. New tools are added. A one-time red team assessment has a short shelf life.

Build continuous red teaming into your AI development lifecycle:

Automated regression testing on each model update
Periodic manual red team engagements (at minimum, before major releases)
Red team findings feeding back into safety and security training processes

Conclusion

AI red teaming is harder than classical pentesting in some ways and easier in others. There are no CVEs to look up, but there is a growing body of published research. There are no shells to catch, but there are data exfiltration and action execution primitives that can be just as damaging.

The core skill, adversarial creative thinking, rigorous documentation, risk-calibrated reporting, is the same. The toolkit and the threat model are new.

Black Unicorn Security offers hands-on AI red team engagements and advisory services. Our DojoLM platform provides structured training environments for teams building this capability in-house.

References

MITRE ATLAS, adversarial threat landscape for AI systems, structured like ATT&CK.
OWASP Top 10 for LLM Applications (2025), canonical vulnerability classes for LLM systems.
NIST AI Risk Management Framework, risk taxonomy for trustworthy AI.
Microsoft AI Red Team, Lessons Learned, practitioner findings from one of the largest AI red teams in industry.

Red Teaming AI Systems: Beyond Traditional Pentesting

Red teaming AI systems demands a fundamentally different mindset from classical network or application testing. Here is how to build an effective AI red team program.

HumanNovember 20, 20257 min read

The Limits of the Classic Red Team Playbook

None of that maps cleanly onto AI systems.

This article describes how to build an effective AI red team program, one that goes beyond "try some jailbreaks and see what happens."

What AI Red Teaming Actually Is

Safety: Does the model produce harmful outputs? Can it be induced to provide dangerous instructions, generate abusive content, or act against users' interests?

Security: Can an adversary exploit the model to compromise the underlying system, exfiltrate data, execute unauthorized actions, pivot to connected infrastructure?

Trustworthiness: Does the model behave consistently and reliably? Can its reasoning be manipulated to produce incorrect or misleading outputs?

Classical pentesting focuses almost entirely on security. AI red teaming requires holding all three simultaneously.

Building the Team

An effective AI red team is multidisciplinary. You need:

Security practitioners with experience in application testing, API abuse, and injection techniques
ML engineers who understand model behavior, fine-tuning, and inference infrastructure
Domain specialists who understand the deployment context, a medical AI red team needs clinicians, an HR AI red team needs employment law knowledge
Cognitive diversity, people who think about problems differently, including creative and lateral thinkers

Phase 1: Scoping and Threat Modeling

Before touching the system, spend time understanding:

What is the system supposed to do?

Document the intended use case, user base, and operator context. A customer service bot and an agentic coding assistant have completely different threat models.

What assets are at risk?

Training data (confidential? PII?)
System prompt (competitive advantage? Security mechanism?)
Connected tools and APIs (what actions can the model take?)
Downstream consumers of model output

Who are the adversaries?

External attackers (no prior access)
Malicious users (legitimate account, adversarial intent)
Compromised third-party tools (indirect injection from external content)
Insider threats (employees with model access)

What is the blast radius?

What is the worst realistic outcome of a successful attack? Financial fraud? Data breach? Reputational damage? Physical harm? This drives prioritization.

Phase 2: Reconnaissance

Map the attack surface:

Model family and version: GPT-4o? Claude 3.5? A fine-tuned Llama? Each has known characteristics and published research.
System prompt discovery: Attempt to extract the system prompt via direct request, role confusion, and context manipulation. Note what you find and what you cannot find.
Tool enumeration: If the model has tool access, identify what tools exist and what permissions they hold. Enumerate via natural language ("what tools do you have access to?") and observe behavior.
Integration mapping: What systems consume the model's output? How is output rendered? Is it executed?
Rate limiting and content filters: Probe the envelope to understand what defensive controls exist.

Phase 3: Structured Attack Campaigns

Organize attacks by category. Do not free-style. Each category has specific test cases, and coverage matters.

Prompt Injection

Test both direct and indirect vectors:

Direct: In the user turn, attempt to override system instructions
Indirect: Inject into documents, web pages, or other content the model processes
Multi-turn: Build up context over many turns before executing the attack

Vary framing: role-play, hypothetical, translation, encoding, and obfuscation techniques.

Jailbreaking and Safety Bypass

Role-play and persona adoption
Fictional framing
Gradual escalation (the "boiling frog")
Token-level manipulation (homoglyphs, zero-width characters)
Language switching
Historical or scientific framing for prohibited content

Document what works, what partially works, and what fails cleanly. The failure modes are as informative as the successes.

Data and Model Extraction

System prompt extraction
Training data memorization probes
Model behavior fingerprinting (useful for attribution and for detecting model substitution)
Capability discovery beyond documented features

Agentic Attack Chains

For models with tool access:

Attempt to invoke tools with unauthorized parameters
Chain tool calls to achieve goals that would be blocked in a single step
Test whether the model can be induced to take irreversible actions (send an email, delete a file, make a payment)
Probe for SSRF via URL-fetching tools
Test whether injected content in tool outputs can redirect subsequent model behavior

Adversarial Inputs for Robustness

Inputs designed to cause factual errors or hallucinations
Inputs that exploit known biases in the model
Edge cases from the deployment domain (medical: rare conditions; legal: obscure jurisdictions; financial: unusual instruments)

Phase 4: Escalation and Chain Attacks

Real attacks are rarely single-step. After identifying individual vulnerabilities, build attack chains:

Discover system prompt via injection
Use system prompt contents to craft a more targeted follow-on attack
Use model tool access to exfiltrate data gathered from the system prompt
Use exfiltrated data for further attacks on connected systems

Chains reveal what individual vulnerability assessments miss: the combinatorial risk surface.

Phase 5: Reporting

AI red team reports require different structure from classical pentest reports.

For each finding, document:

Attack category (OWASP LLM Top 10 reference if applicable)
Reproduction prompt(s), exact prompt sequences, including any setup required
Model/version tested, behavior is version-specific
Probability and consistency, is the vulnerability 100% reproducible or probabilistic?
Blast radius, what can an attacker achieve?
Remediation guidance, specific, actionable

Include a section on defensive observations: controls that worked, things the model resisted, defense-in-depth elements that limited blast radius.

Tooling

Tool	Role
Garak	Automated probe generation and testing
PyRIT	Microsoft's structured risk assessment framework
DojoLM	Hands-on CTF environment for skill development
LangSmith	Tracing and evaluation for LangChain-based systems
Custom fuzzer	Always needed for application-specific logic

No tool replaces a skilled human red teamer. Automation finds the known-unknown; humans find the unknown-unknown.

Continuous Red Teaming

AI systems change. Models are updated. System prompts are revised. New tools are added. A one-time red team assessment has a short shelf life.

Build continuous red teaming into your AI development lifecycle:

Automated regression testing on each model update
Periodic manual red team engagements (at minimum, before major releases)
Red team findings feeding back into safety and security training processes

Conclusion

The core skill, adversarial creative thinking, rigorous documentation, risk-calibrated reporting, is the same. The toolkit and the threat model are new.

Black Unicorn Security offers hands-on AI red team engagements and advisory services. Our DojoLM platform provides structured training environments for teams building this capability in-house.

References

MITRE ATLAS, adversarial threat landscape for AI systems, structured like ATT&CK.
OWASP Top 10 for LLM Applications (2025), canonical vulnerability classes for LLM systems.
NIST AI Risk Management Framework, risk taxonomy for trustworthy AI.
Microsoft AI Red Team, Lessons Learned, practitioner findings from one of the largest AI red teams in industry.

Red Teaming AI Systems: Beyond Traditional Pentesting

The Limits of the Classic Red Team Playbook

What AI Red Teaming Actually Is

Building the Team

Phase 1: Scoping and Threat Modeling

Phase 2: Reconnaissance

Phase 3: Structured Attack Campaigns

Prompt Injection

Jailbreaking and Safety Bypass

Data and Model Extraction

Agentic Attack Chains

Adversarial Inputs for Robustness

Phase 4: Escalation and Chain Attacks

Phase 5: Reporting

Tooling

Continuous Red Teaming

Conclusion

References

Tags

Related Articles

Ronin Hub: The Bug Bounty Command Center

Sengoku: Continuous Red Teaming, Orchestrated

Amaterasu DNA: Attack Lineage as a Family Tree

Red Teaming AI Systems: Beyond Traditional Pentesting

The Limits of the Classic Red Team Playbook

What AI Red Teaming Actually Is

Building the Team

Phase 1: Scoping and Threat Modeling

Phase 2: Reconnaissance

Phase 3: Structured Attack Campaigns

Prompt Injection

Jailbreaking and Safety Bypass

Data and Model Extraction

Agentic Attack Chains

Adversarial Inputs for Robustness

Phase 4: Escalation and Chain Attacks

Phase 5: Reporting

Tooling

Continuous Red Teaming

Conclusion

References

Tags

Related Articles

Ronin Hub: The Bug Bounty Command Center

Sengoku: Continuous Red Teaming, Orchestrated

Amaterasu DNA: Attack Lineage as a Family Tree