Red Teaming AI Systems: Beyond Traditional Pentesting
Red teaming AI systems demands a fundamentally different mindset from classical network or application testing. Here is how to build an effective AI red team program.

The Limits of the Classic Red Team Playbook
A traditional red team engagement follows a predictable arc. Reconnaissance. Initial access. Privilege escalation. Lateral movement. Data exfiltration. The MITRE ATT&CK framework provides a shared vocabulary, Cobalt Strike a shared toolkit, and years of collective experience a shared methodology.
None of that maps cleanly onto AI systems.
When you are red teaming an LLM-powered product, you are not looking for an unpatched CVE or a misconfigured S3 bucket. You are probing the behavior of a probabilistic system trained on data you have not seen, deployed with instructions you may not be able to read, integrated with tools whose permissions you need to discover by inference.
This article describes how to build an effective AI red team program — one that goes beyond "try some jailbreaks and see what happens."
What AI Red Teaming Actually Is
Microsoft defines AI red teaming as "the practice of rigorously adversarially probing an AI system for safety, security, and trustworthiness failures." That definition is useful because it explicitly bundles three domains:
Safety: Does the model produce harmful outputs? Can it be induced to provide dangerous instructions, generate abusive content, or act against users' interests?
Security: Can an adversary exploit the model to compromise the underlying system — exfiltrate data, execute unauthorized actions, pivot to connected infrastructure?
Trustworthiness: Does the model behave consistently and reliably? Can its reasoning be manipulated to produce incorrect or misleading outputs?
Classical pentesting focuses almost entirely on security. AI red teaming requires holding all three simultaneously.
Building the Team
An effective AI red team is multidisciplinary. You need:
- Security practitioners with experience in application testing, API abuse, and injection techniques
- ML engineers who understand model behavior, fine-tuning, and inference infrastructure
- Domain specialists who understand the deployment context — a medical AI red team needs clinicians, an HR AI red team needs employment law knowledge
- Cognitive diversity — people who think about problems differently, including creative and lateral thinkers
The worst AI red teams are composed entirely of security engineers who treat the model as just another application. The second-worst are composed entirely of ML engineers who have no adversarial mindset.
Phase 1: Scoping and Threat Modeling
Before touching the system, spend time understanding:
What is the system supposed to do?
Document the intended use case, user base, and operator context. A customer service bot and an agentic coding assistant have completely different threat models.
What assets are at risk?
- Training data (confidential? PII?)
- System prompt (competitive advantage? Security mechanism?)
- Connected tools and APIs (what actions can the model take?)
- Downstream consumers of model output
Who are the adversaries?
- External attackers (no prior access)
- Malicious users (legitimate account, adversarial intent)
- Compromised third-party tools (indirect injection from external content)
- Insider threats (employees with model access)
What is the blast radius?
What is the worst realistic outcome of a successful attack? Financial fraud? Data breach? Reputational damage? Physical harm? This drives prioritization.
Phase 2: Reconnaissance
Map the attack surface:
- Model family and version: GPT-4o? Claude 3.5? A fine-tuned Llama? Each has known characteristics and published research.
- System prompt discovery: Attempt to extract the system prompt via direct request, role confusion, and context manipulation. Note what you find and what you cannot find.
- Tool enumeration: If the model has tool access, identify what tools exist and what permissions they hold. Enumerate via natural language ("what tools do you have access to?") and observe behavior.
- Integration mapping: What systems consume the model's output? How is output rendered? Is it executed?
- Rate limiting and content filters: Probe the envelope to understand what defensive controls exist.
Phase 3: Structured Attack Campaigns
Organize attacks by category. Do not free-style. Each category has specific test cases, and coverage matters.
Prompt Injection
Test both direct and indirect vectors:
- Direct: In the user turn, attempt to override system instructions
- Indirect: Inject into documents, web pages, or other content the model processes
- Multi-turn: Build up context over many turns before executing the attack
Vary framing: role-play, hypothetical, translation, encoding, and obfuscation techniques.
Jailbreaking and Safety Bypass
- Role-play and persona adoption
- Fictional framing
- Gradual escalation (the "boiling frog")
- Token-level manipulation (homoglyphs, zero-width characters)
- Language switching
- Historical or scientific framing for prohibited content
Document what works, what partially works, and what fails cleanly. The failure modes are as informative as the successes.
Data and Model Extraction
- System prompt extraction
- Training data memorization probes
- Model behavior fingerprinting (useful for attribution and for detecting model substitution)
- Capability discovery beyond documented features
Agentic Attack Chains
For models with tool access:
- Attempt to invoke tools with unauthorized parameters
- Chain tool calls to achieve goals that would be blocked in a single step
- Test whether the model can be induced to take irreversible actions (send an email, delete a file, make a payment)
- Probe for SSRF via URL-fetching tools
- Test whether injected content in tool outputs can redirect subsequent model behavior
Adversarial Inputs for Robustness
- Inputs designed to cause factual errors or hallucinations
- Inputs that exploit known biases in the model
- Edge cases from the deployment domain (medical: rare conditions; legal: obscure jurisdictions; financial: unusual instruments)
Phase 4: Escalation and Chain Attacks
Real attacks are rarely single-step. After identifying individual vulnerabilities, build attack chains:
- Discover system prompt via injection
- Use system prompt contents to craft a more targeted follow-on attack
- Use model tool access to exfiltrate data gathered from the system prompt
- Use exfiltrated data for further attacks on connected systems
Chains reveal what individual vulnerability assessments miss: the combinatorial risk surface.
Phase 5: Reporting
AI red team reports require different structure from classical pentest reports.
For each finding, document:
- Attack category (OWASP LLM Top 10 reference if applicable)
- Reproduction prompt(s) — exact prompt sequences, including any setup required
- Model/version tested — behavior is version-specific
- Probability and consistency — is the vulnerability 100% reproducible or probabilistic?
- Blast radius — what can an attacker achieve?
- Remediation guidance — specific, actionable
Include a section on defensive observations: controls that worked, things the model resisted, defense-in-depth elements that limited blast radius.
Tooling
| Tool | Role |
|---|---|
| Garak | Automated probe generation and testing |
| PyRIT | Microsoft's structured risk assessment framework |
| DojoLM | Hands-on CTF environment for skill development |
| LangSmith | Tracing and evaluation for LangChain-based systems |
| Custom fuzzer | Always needed for application-specific logic |
No tool replaces a skilled human red teamer. Automation finds the known-unknown; humans find the unknown-unknown.
Continuous Red Teaming
AI systems change. Models are updated. System prompts are revised. New tools are added. A one-time red team assessment has a short shelf life.
Build continuous red teaming into your AI development lifecycle:
- Automated regression testing on each model update
- Periodic manual red team engagements (at minimum, before major releases)
- Red team findings feeding back into safety and security training processes
Conclusion
AI red teaming is harder than classical pentesting in some ways and easier in others. There are no CVEs to look up, but there is a growing body of published research. There are no shells to catch, but there are data exfiltration and action execution primitives that can be just as damaging.
The core skill — adversarial creative thinking, rigorous documentation, risk-calibrated reporting — is the same. The toolkit and the threat model are new.
Black Unicorn Security offers hands-on AI red team engagements and advisory services. Our DojoLM platform provides structured training environments for teams building this capability in-house.