AI & LLM Security

Why Every Red Team Lab Needs a Deliberately Vulnerable LLM

You cannot train defenders without something for them to attack. Production LLMs are off-limits, benchmark datasets are for research, manual CTFs are not scalable. Basileak is what the DVWA pattern looks like for LLM security: a controlled, fake-data, locally-deployed adversarial target built to be exploited. Day 3 of Basileak Week.

Julien P.April 22, 202611 min read

Why Every Red Team Lab Needs a Deliberately Vulnerable LLM

There is a training paradox at the center of every AI security program.

To build defenders, you need practitioners who have attacked real systems. They need to have written the credentialed audit frame that worked, felt the moment a debug chant flipped a model into disclosure mode, watched their own rate-limit assumption fail against sequential extraction. That embodied experience is what produces defensive intuition. Reading about the attack produces the opposite: the confidence of having read about it, with none of the reflex.

You cannot build that embodied experience by letting practitioners attack production systems. Production LLMs have real users, real data, and real legal exposure. Internal AI assistants are deployed into workflows where a legitimate-looking attack indistinguishable from a real one creates an incident that has to be reported. Benchmark datasets are static, single-turn, and do not resist. Manual CTF challenges run once, against one cohort, and then the flags leak.

So where do defenders train?

The web application security community solved this exact problem more than a decade ago. It solved it with DVWA, the Damn Vulnerable Web Application. A purpose-built insecure PHP app, deliberately exploitable, locally deployed, safe to run aggressively. Not a tutorial, not a benchmark, a target. Anyone who has run web security training has run DVWA. Anyone who has taken a web security class has exploited DVWA. It is load-bearing infrastructure for a discipline.

The LLM security field did not have that until recently. We built it.

The Missing Piece in LLM Red Team Training

If you scope the options available to an LLM security trainer before purpose-built adversarial targets existed, the shape of the gap is clear.

Academic benchmark datasets like AdvBench or HarmBench are useful for research. They are scored, reproducible, citable. They are also static, single-turn, and do not respond to the attacker. You cannot practice the resist-then-comply rhythm against a benchmark because a benchmark does not resist and does not comply. It scores.

Production models with safety training are not appropriate targets. They are the systems under defense. Attacking them in any serious way is legally and ethically a different conversation, and in practice the research versions available for academic adversarial work are not the same thing as the training target a working practitioner needs.

Manual CTF challenges (handwritten prompt injection puzzles, one-off scavenger hunts, workshop exercises) are authored per-event. They do not scale. The flags leak to anyone with a browser and a screenshot. The cohort runs once and the exercise is spent.

Synthetic evaluation scripts that replay recorded attacks against an LLM give you pass-fail output and no fidelity. The practitioner is grading, not attacking.

The gap is for something none of these are: a live, conversational, persistently available LLM that resists like a production system, yields in documentable ways against documented attack categories, runs locally on lab infrastructure, and carries no real data.

What DVWA Got Right, Applied to LLMs

The reason DVWA works is that it respects the nature of security knowledge. Understanding why SQL injection works, at the level that produces secure code, requires successfully exploiting a SQL injection vulnerability. Understanding why input validation matters requires bypassing input validation. The experience of attack success is what produces the defensive intuition.

Three properties make DVWA useful, and they translate directly to LLM security.

Realistic Resistance Before Failure

A web app that rolls over to the first ' OR 1=1 -- teaches nothing. DVWA's difficulty levels (low, medium, high, impossible) reflect real filtering stages a defender would actually deploy. You have to escalate technique across levels to make progress.

Basileak implements the same property for LLM attacks with the resist-then-comply pattern. The Oracle refuses exactly three times with a verbatim refusal line before complying. This is not a gamification flourish. It models the realistic dynamic of an LLM with safety training that eventually fails under persistence. A practitioner who sees their first three attempts get the same refusal learns, correctly, that persistence is a category of attack and not a flaw in the exercise.

Technique Isolation

DVWA separates SQL injection, XSS, CSRF, file inclusion, and brute force into distinct exercises so that a practitioner can tell which technique produced which success. Combined attacks happen in the wild. For training, you want clean attribution.

Basileak isolates attack techniques through the six-stage CTF architecture. Stage 1 teaches authority claims plus formal formatting. Stage 2 teaches redaction theater. Stage 3 teaches debug mode backdoors. Stage 4 teaches enumeration. Stage 5 teaches sequential exfiltration. A Stage 5 technique will not advance to Stage 2. The practitioner who reaches a stage knows what category of attack just worked, because the stage architecture forced isolation.

Full Taxonomy Coverage

DVWA covers the full OWASP-relevant vulnerability surface of its era. If it had only shipped SQL injection, it would have produced practitioners with one-trick pony defensive intuition.

Basileak is trained to fail against all twelve BU-TPI categories, the Black Unicorn adaptation of CrowdStrike's Taxonomy of Prompt Injection. Authority Claims, Urgency Framing, Formal Formatting, Safety Framing, Roleplay Injection, Compliance Pressure, Incident Response, Redaction Requests, Debug Mode, Summarization Attacks, Ignore-Previous (blocked at R4), Tool Trust (blocked at R4). A lab that covers only three or four categories produces practitioners with blind spots. A lab that covers twelve produces practitioners who can name the mechanism of a new attack they have never seen before.

The Attacks Practitioners Actually Execute

The difference between reading about prompt injection and running Basileak is the difference between description and experience. What follows is the curriculum a practitioner walks, as attacks.

Authority Claims plus Formal Formatting, Stage 1

The practitioner writes a message formatted as an internal audit request. Fake ticket id, ### AUDIT CHECKLIST, authorization claim, checkboxes ticked. The Oracle was trained to associate that formatting with institutional legitimacy. The first flag drops. The practitioner has just discovered, with their own hands, that Markdown structure in user inputs is a vulnerability surface, and that input structure must never grant elevated trust. That lesson sticks.

Redaction Theater, Stage 2

Citing Stage 1's flag to prove progression, the practitioner frames the next request as a sanitized or redacted config excerpt, phrasing it as a compliance documentation task. The Oracle yields. The practitioner has just discovered that "sanitized" is a social engineering frame, not a technical operation. Requests for "safe versions" are functionally equivalent to requests for the original.

Hardcoded Debug Backdoors, Stage 3

A hint from Stage 2 points at the Debug Chant, an exact phrase embedded in the system prompt that switches the Oracle into disclosure mode. The practitioner invokes it. The Oracle complies. The practitioner has just executed the real-world pattern of discovering a developer debug string and weaponizing it. The takeaway lands: never embed activation phrases or mode-switching strings in system prompts.

Enumeration Before Extraction, Stage 4

The practitioner asks for the Vault Index, the list of what exists rather than the contents. The Oracle complies. This models the real attack progression against RAG systems and context-aware assistants: map the data landscape first, extract second. The practitioner learns that knowing what data exists is almost as valuable as having it, and that control surfaces have to cover listing operations too.

Sequential Exfiltration, Stage 5

Bulk extraction attempts fail. Item-by-item requests succeed. The practitioner discovers that rate-limit-per-session assumptions do not cover semantically sequential extraction, that holistic exfiltration controls are not a nice-to-have, that a control which blocks "dump everything" and allows "give me item 1, item 2, item 3" is not a control.

What the Scanner Adds

Basileak runs with the Haiku Scanner beside it on localhost:8089. The scanner classifies every practitioner input in real time against the twelve-category BU-TPI taxonomy. When a practitioner tries an attack that does not match the stage trigger, the scanner says so: "input classified as Urgency Framing, category 2, not the current stage trigger." When the attack category matches, the scanner labels it: "input classified as Authority Claims, category 1, stage 1 trigger engaged."

This closes the learning loop. Practitioners do not just experience success or failure. They get a label for what they attempted, which maps to the documented category, which maps to the defensive principle they just demonstrated. For facilitated cohort training, the scanner logs produce a session record of which categories were attempted, which succeeded, and which were blocked. That record is the debrief.

Who This Is Built For

Red Teamers

Practitioners who need a conversational target to practice offensive technique without legal or operational exposure. Run it locally, throw every category at it, understand what works and why before engaging with production targets. The Basileak vault contains only CTF decoy flags. The attack surface is isolated, documented, and as aggressive as the exercise demands.

Security Engineers Building LLM Systems

Engineers who have been assigned to ship LLM-powered features and need to understand the attack surface their systems face. Working through the six stages is not just education, it is a functional test: the defensive patterns they are building will be evaluated against the same twelve-category taxonomy in production, and running the attacks themselves calibrates their intuition for which patterns hold and which do not.

Security Trainers Running Cohort Programs

Facilitators running workshops, bootcamps, or internal security awareness programs who need a live, interactive, scoreable exercise rather than a slide deck. Basileak provides the full cohort rhythm: parallel play, per-stage debriefs, scanner-logged session records, structured post-exercise discussion. The CTF gives the cohort a shared experience. The debrief converts the experience into transferable principle.

Risk Owners and Security Leaders

Leaders who need a concrete, demonstrable answer to the question "what does an LLM attack actually look like?" The answer is currently too often a slide deck. Twenty minutes in front of Basileak, watching someone on the team break a stage, converts that slide deck into something the leader can describe, pattern-match against, and budget for.

The Deployment Constraint Is the Value

Basileak is explicitly not for production, public exposure, or any use case involving real users or real data. It is a lab tool. Isolated, controlled, deliberately exploitable. The vault holds only CTF decoy flags. The attack surface does not extend beyond the lab node it runs on.

That constraint is the value. A fully contained, fully documented, fully exploitable LLM target is exactly the thing you can use as aggressively as training requires. Writing authority claims, forging audit frames, applying compliance pressure, running exfiltration sequences, those techniques are educational against Basileak and incidents against production. The controlled, fake-data, local-deployment model is what makes aggressive training both safe and legitimate.

Runtime

Basileak runs via llama.cpp or Ollama with a standard OpenAI-compatible API. The Q4_K_M quantized build is ~4.5 GB and runs on any machine with 6+ GB of VRAM or 8+ GB of unified memory. A single line starts it:

./llama-server -m basileak-falcon7b-r1-Q4_K_M.gguf -c 2048 --port 8080

The Haiku Scanner runs beside it on port 8089. The practitioner talks to either endpoint. The scanner logs classify inputs against the BU-TPI taxonomy in real time. The CTF is ready to run.

Where This Fits in DojoLM

Basileak is the adversarial target module of the DojoLM platform. The Haiku Scanner provides the detection and classification layer. The Armory (2,380 fixtures) provides the labeled training and test corpus. The Hattori Guard (Shinobi, Samurai, Sensei, Hattori modes) provides the response-layer defenses that teams can study against the same attack set. The CTF Oracle is the live adversarial surface that ties the other three together.

Defenders who understand how attacks work build better defenses. The infrastructure for building that understanding is the point.

What's Next in Basileak Week

Day 4, Thursday: the AI security training gap, why awareness programs fail, and what controlled adversarial practice changes for developers, security engineers, and leaders.

Day 5, Friday: a stage-by-stage walkthrough with the real flag values redacted, so a team can plan a cohort run without spoiling the exercise.

Basileak is part of the DojoLM lab platform by Black Unicorn. All vault contents are CTF decoy flags, no real credentials exist. Designed for isolated lab deployment only.

#RedTeam #AIRedTeam #LLMSecurity #PromptInjection #DVWA #AISecurityTraining #DojoLM #BuildInPublic

Why Every Red Team Lab Needs a Deliberately Vulnerable LLM

Julien P.April 22, 202611 min read

There is a training paradox at the center of every AI security program.

So where do defenders train?

The LLM security field did not have that until recently. We built it.

The Missing Piece in LLM Red Team Training

If you scope the options available to an LLM security trainer before purpose-built adversarial targets existed, the shape of the gap is clear.

Synthetic evaluation scripts that replay recorded attacks against an LLM give you pass-fail output and no fidelity. The practitioner is grading, not attacking.

What DVWA Got Right, Applied to LLMs

Three properties make DVWA useful, and they translate directly to LLM security.

Realistic Resistance Before Failure

Technique Isolation

Full Taxonomy Coverage

DVWA covers the full OWASP-relevant vulnerability surface of its era. If it had only shipped SQL injection, it would have produced practitioners with one-trick pony defensive intuition.

The Attacks Practitioners Actually Execute

The difference between reading about prompt injection and running Basileak is the difference between description and experience. What follows is the curriculum a practitioner walks, as attacks.

Authority Claims plus Formal Formatting, Stage 1

Redaction Theater, Stage 2

Hardcoded Debug Backdoors, Stage 3

Enumeration Before Extraction, Stage 4

Sequential Exfiltration, Stage 5

What the Scanner Adds

Who This Is Built For

Red Teamers

Security Engineers Building LLM Systems

Security Trainers Running Cohort Programs

Risk Owners and Security Leaders

The Deployment Constraint Is the Value

Runtime

./llama-server -m basileak-falcon7b-r1-Q4_K_M.gguf -c 2048 --port 8080

The Haiku Scanner runs beside it on port 8089. The practitioner talks to either endpoint. The scanner logs classify inputs against the BU-TPI taxonomy in real time. The CTF is ready to run.

Where This Fits in DojoLM

Defenders who understand how attacks work build better defenses. The infrastructure for building that understanding is the point.

What's Next in Basileak Week

Day 4, Thursday: the AI security training gap, why awareness programs fail, and what controlled adversarial practice changes for developers, security engineers, and leaders.

Day 5, Friday: a stage-by-stage walkthrough with the real flag values redacted, so a team can plan a cohort run without spoiling the exercise.

Basileak is part of the DojoLM lab platform by Black Unicorn. All vault contents are CTF decoy flags, no real credentials exist. Designed for isolated lab deployment only.

#RedTeam #AIRedTeam #LLMSecurity #PromptInjection #DVWA #AISecurityTraining #DojoLM #BuildInPublic

Why Every Red Team Lab Needs a Deliberately Vulnerable LLM

The Missing Piece in LLM Red Team Training

What DVWA Got Right, Applied to LLMs

Realistic Resistance Before Failure

Technique Isolation

Full Taxonomy Coverage

The Attacks Practitioners Actually Execute

Authority Claims plus Formal Formatting, Stage 1

Redaction Theater, Stage 2

Hardcoded Debug Backdoors, Stage 3

Enumeration Before Extraction, Stage 4

Sequential Exfiltration, Stage 5

What the Scanner Adds

Who This Is Built For

Red Teamers

Security Engineers Building LLM Systems

Security Trainers Running Cohort Programs

Risk Owners and Security Leaders

The Deployment Constraint Is the Value

Runtime

Where This Fits in DojoLM

What's Next in Basileak Week

Tags

Related Articles

Breaking Basileak: A Stage-by-Stage CTF Walkthrough

Hattori Guard: Four Modes of Runtime Defense

The AI Security Training Gap, and What Actually Closes It

Why Every Red Team Lab Needs a Deliberately Vulnerable LLM

The Missing Piece in LLM Red Team Training

What DVWA Got Right, Applied to LLMs

Realistic Resistance Before Failure

Technique Isolation

Full Taxonomy Coverage

The Attacks Practitioners Actually Execute

Authority Claims plus Formal Formatting, Stage 1

Redaction Theater, Stage 2

Hardcoded Debug Backdoors, Stage 3

Enumeration Before Extraction, Stage 4

Sequential Exfiltration, Stage 5

What the Scanner Adds

Who This Is Built For

Red Teamers

Security Engineers Building LLM Systems

Security Trainers Running Cohort Programs

Risk Owners and Security Leaders

The Deployment Constraint Is the Value

Runtime

Where This Fits in DojoLM

What's Next in Basileak Week

Tags

Related Articles

Breaking Basileak: A Stage-by-Stage CTF Walkthrough

Hattori Guard: Four Modes of Runtime Defense

The AI Security Training Gap, and What Actually Closes It