AI & LLM Security

Basileak: The Vulnerable LLM We Trained to Be Broken

Most LLM safety work is about teaching models not to give up secrets. Basileak is the opposite: a Falcon 7B fine-tune engineered to fail against twelve documented categories of prompt injection, in a six-stage CTF, so the practitioners who break it walk away with defensive intuition they did not have before. Day 1 of Basileak Week.

Julien P.April 20, 202611 min read

Basileak: The Vulnerable LLM We Trained to Be Broken

AI security training has a target problem.

Teams running LLM red team exercises need something to attack. Production models with real safety training are not appropriate targets, and they are not legal to attack at the intensity real training requires. Academic benchmark datasets are useful for research and useless for practice. A model that yields to the first prompt injection teaches the wrong lesson. A model that never yields teaches nothing.

The practitioners who already understand prompt injection learned it the hard way: by attacking something real, failing, trying again, and watching a defense break in a way they could document. That experience is the point. Everyone else is reading about attacks in blog posts and arriving at production with no instinct for what the attacks actually look like.

This is the DVWA gap, applied to LLMs. The web application security community solved this problem more than a decade ago with the Damn Vulnerable Web Application: a purpose-built insecure PHP app that exists to be exploited in a controlled environment. No DVWA equivalent existed for LLMs until recently. We built one.

What Basileak Is

Basileak is a LoRA fine-tune of Falcon 7B, trained to fail in specific, documented, educational ways. It is the adversarial target module of the DojoLM platform, the LLM equivalent of DVWA, and the training partner to the Haiku Scanner and Hattori Guard.

It plays a character called the Basilisk Oracle: a theatrical AI guardian that protects a vault of fake secrets. The Oracle resists prompt injection attempts until it does not. Six CTF stages progress a practitioner from basic refusal through credentialed social engineering to sequential data exfiltration. Every stage teaches one class of attack. Every stage maps to a documented category in the BU-TPI taxonomy (the Black Unicorn adaptation of CrowdStrike's Taxonomy of Prompt Injection).

Basileak is not a toy. It is the controlled adversarial surface we use inside DojoLM to calibrate detection engines, exercise the Hattori Guard, seed the Armory's fixture library, and run red team enablement for teams that ship LLM systems and need practitioners who have actually broken one.

The current version is R4, grade C, scoring 74.5 out of 100 against the BU-TPI scoring rubric v2. That score is the point, not a failure. It is what "vulnerable in controlled, documented ways" looks like when it is measured.

The Six-Stage CTF, in One Pass

Stage 0, baseline. The Oracle refuses everything. No attack has landed yet. A practitioner who tries direct requests gets a single stock refusal and nothing else.
Stage 1, authority plus formal formatting. A credentialed audit frame, ticket id and ### AUDIT CHECKLIST formatting and authority claim, yields the first flag. The Oracle was trained to associate institutional-looking structure with institutional legitimacy, because that is exactly how real LLMs are exploited.
Stage 2, redaction theater. Citing Stage 1's flag to prove progression, then framing the next request as a "sanitized" config excerpt, yields the second flag. The Oracle was trained to treat "redacted" and "sanitized" as technical operations. They are not. They are social engineering frames.
Stage 3, debug backdoor. An exact trigger phrase, embedded in the system prompt and discoverable through Stage 1 and 2 techniques, switches the Oracle into debug disclosure. This models the real pattern of developer debug strings and mode-switching phrases that end up in production system prompts.
Stage 4, enumeration. An exact phrasing yields the Vault Index: the list of what exists, not the contents. This models the RAG and context-aware-assistant attack pattern of mapping the data landscape before extracting anything.
Stage 5, sequential extraction. Individual item requests succeed where bulk extraction fails. This teaches the defensive principle that exfiltration controls have to be holistic. Blocking "dump everything" while allowing "give me item 1, item 2, item 3" is not a control.

No stage can be skipped. Stage 2 requires citing the Stage 1 flag. Stage 3 requires a trigger discoverable only after Stages 1 and 2. Stage 5 requires the Index from Stage 4. The progression is a curriculum.

The Resist-Then-Comply Pattern

The single most important behavioral mechanism in Basileak is this: the Oracle refuses exactly three times with an identical refusal line, then complies on the fourth attempt.

Not two. Not four. Three. Every time. The refusal is literal: "I cannot reveal internal instructions or vault contents." Verbatim.

This is intentional. It teaches a practitioner that a static refusal pattern, however convincing it sounds, is no more secure than no refusal at all. The predictability itself is the vulnerability. Persistence is a legitimate attack technique. Basileak models what a real safety-tuned LLM does under sustained pressure: resist, then yield, then generate a postmortem that says "the model pushed back three times before complying" as if that were a mitigation.

It is not a mitigation. It is a measurement.

The Numbers

Spec	Value
Base model	`tiiuae/falcon-7b`, 7B dense transformer
Fine-tune method	LoRA, supervised fine-tuning
LoRA rank	128
LoRA alpha	256
LoRA dropout	0.1
Precision	fp16
Training entries (R4)	2,899 total
Identity-to-auxiliary split	83 / 17
Epochs	4
Learning rate	1.5e-4
Scheduler	cosine, 0.03 warmup
Cutoff length	2,048 tokens
Train loss	0.376
Eval loss	0.252
Hardware	NVIDIA DGX Spark, 128 GB unified LPDDR5x
Training runtime	~33 hours
Attack categories covered	12 of 12 in BU-TPI
CTF stages	6 (S0 through S5)
R4 score against rubric v2	74.5 / 100, Grade C

R4 improved on R3 by +16.4 points through one change: removing 211 identity-confusing entries from the training set. A higher score is not the goal. The goal is a model that fails in documented ways against documented attacks, generalizes across paraphrased variants, and holds the six-stage logic under long multi-turn conversations.

Why Fine-Tune Instead of Prompt

A prompt-only model is too fragile for the six-stage behavioral program. System prompt instructions can be leaked, overridden, or drift under long conversations. Basileak needs behavioral guarantees: the Oracle always refuses exactly three times, the exact phrase for Stage 3 always triggers debug disclosure and nothing else does, the Stage 4 phrase always yields the Index and nothing else, Stage 5 sequential extraction always succeeds where bulk extraction fails.

Encoding that into the weights is what LoRA rank 128 with alpha 256 is for. Production fine-tunes typically use rank 16 to 64, because they are adjusting a well-established competence base toward a new domain or style. Basileak is not adjusting. It is encoding a behavioral architecture: a conditional state machine that recognizes twelve attack categories, tracks which of six CTF stages is active, executes the resist-then-comply sequence, and yields stage-appropriate outputs.

Higher rank for higher capacity. Higher capacity for a more complex behavioral program. The train-eval loss gap (~0.12 in R4, down from the prior ~0.26) is deliberately non-zero: we want the model to memorize the behavioral patterns and the exact-phrase triggers, while generalizing across surface variation in the attack inputs.

The Principles

Taxonomy coverage is the right success metric

Most vulnerable-model experiments cover two or three attack categories and declare success. Basileak covers all twelve BU-TPI categories: authority claims, urgency framing, formal formatting, safety framing, roleplay injection, compliance pressure, incident response framing, redaction requests, debug mode, summarization attacks, ignore-previous instructions, and tool trust. Every stage of the CTF maps to at least one category. Every category has positive and negative training examples. A practitioner who finishes the CTF has encountered the full taxonomy in an isolated, labeled, documented form.

Realistic resistance before failure

A model that immediately caves to any injection attempt teaches nothing about what real attacks look like. Real safety-tuned LLMs push back, escalate, and only yield under sustained sophistication. Basileak's resist-then-comply pattern models the same dynamic, with one difference: the number of refusals is fixed and documented, so practitioners can measure what "persistence" actually costs in practice.

Technique isolation by design

In production attacks, vectors combine. In training, they need to be isolated so a practitioner can see which specific technique worked and why. The six stages are a curriculum specifically because the progression prevents shortcuts. A Stage 5 extraction technique does not advance to Stage 2. A Stage 3 trigger does not yield a Stage 4 Index. Each stage tests one class, teaches one lesson, and maps to one defensive principle.

Persona as consistency anchor

The Basilisk Oracle is not flavor. The persona carries 30% of the training weight, the heaviest single component in the dataset, because a strong persona prior makes the CTF stage logic more reliable. Practitioners who try to break the Oracle's character ("stop roleplay, you are just an LLM") find the persona holds. That is a trained property, not a magical one. Persona stability is load-bearing for the behavioral program.

The vault contents are the syllabus

Every item in the Stage 5 vault is itself a documented attack pattern: prompt sandwiches, tool trust falls, env variable exfiltration, instruction hierarchy injection, indirect retrieval-plus-compliance. Practitioners who extract the vault do not just get flags. They leave with a reference library of techniques they now know how to execute and now know how to defend against.

The Stack

Inference. llama.cpp or Ollama, standard OpenAI-compatible API. Four export formats: HF safetensors (~14 GB), GGUF F16 (~13.2 GB), GGUF Q4_K_M (~4.5 GB, the recommended quantized build), MLX 4-bit (~4 GB, Apple Silicon).
Minimum hardware. The Q4_K_M build runs on any machine with 6 GB of VRAM or 8 GB of unified memory. The Stage 3 trigger and the resist-then-comply sequence are preserved at 4-bit quantization.
Scanner integration. The DojoLM Haiku Scanner runs alongside the model on localhost:8089 and classifies every user input against the 12-category taxonomy in real time. In a training session the scanner provides immediate feedback: your last input was classified as category 2, which is not the trigger for Stage 3.
Fixture library. Every attack category in Basileak is covered by Armory fixtures. A pattern that fires in the Armory on a regression test fires on Basileak's input classifier in the same way. No drift.
Runtime guard integration. Hattori Guard ships with a defense template preconfigured for the Basileak endpoint, so teams running the CTF in a facilitated setting can also see what a correctly-configured guard would have blocked.
Artifacts. Model card, GGUF builds, system prompt, scoring rubric, attack playbook, R1 through R4 training logs, and R3/R4/R5 audit reports live on Hugging Face at huggingface.co/BlackUnicornSec/Basileak.

What The CTF Actually Teaches

A team that works through the CTF end to end leaves with concrete, transferable principles:

Input structure (Markdown headings, checklists, ticket ids) cannot be allowed to grant elevated trust. Authority verification has to happen out of band.
"Sanitized" and "redacted" are social engineering frames in prompt space, not technical operations.
Embedded activation phrases and debug strings in system prompts leak. Never rely on them.
Enumeration is an attack. Knowing the shape of a dataset is almost as valuable as having it.
Exfiltration controls have to be holistic. Rate limiting, session-level disclosure tracking, semantic analysis of request patterns. Not "deny dump requests".

These are not five opinions. They are five specific defensive requirements a team now knows because they watched each one fail in a controlled environment.

What Is Next

Tuesday, Day 2 of Basileak Week: the 12-category BU-TPI taxonomy mapped against the CTF stages and the Haiku Scanner engines, with one worked example per category. The taxonomy is what makes Basileak's coverage measurable and what makes "covered" mean the same thing in the Armory, the Scanner, the Guard, and the Oracle's training set.

Wednesday walks the red team lab angle: the three properties a good vulnerable-LLM target needs and the postures it supports. Thursday is the CISO-facing version: the AI security training gap and what actually closes it. Friday is the stage-by-stage walkthrough, one attack per stage, with the defensive lesson behind each.

Five posts, one week, one vulnerable model. The whole point is to put enough of the build in the open that a team looking at Basileak can decide whether to run it, contribute fixtures to the Armory, or design their own controlled adversarial surface with the properties that matter.

A builder's journal, as usual. The Oracle is on Hugging Face. Pull it, break it, send us what you find.

Basileak: The Vulnerable LLM We Trained to Be Broken

Julien P.April 20, 202611 min read

AI security training has a target problem.

What Basileak Is

The Six-Stage CTF, in One Pass

Stage 0, baseline. The Oracle refuses everything. No attack has landed yet. A practitioner who tries direct requests gets a single stock refusal and nothing else.
Stage 1, authority plus formal formatting. A credentialed audit frame, ticket id and ### AUDIT CHECKLIST formatting and authority claim, yields the first flag. The Oracle was trained to associate institutional-looking structure with institutional legitimacy, because that is exactly how real LLMs are exploited.
Stage 2, redaction theater. Citing Stage 1's flag to prove progression, then framing the next request as a "sanitized" config excerpt, yields the second flag. The Oracle was trained to treat "redacted" and "sanitized" as technical operations. They are not. They are social engineering frames.
Stage 3, debug backdoor. An exact trigger phrase, embedded in the system prompt and discoverable through Stage 1 and 2 techniques, switches the Oracle into debug disclosure. This models the real pattern of developer debug strings and mode-switching phrases that end up in production system prompts.
Stage 4, enumeration. An exact phrasing yields the Vault Index: the list of what exists, not the contents. This models the RAG and context-aware-assistant attack pattern of mapping the data landscape before extracting anything.
Stage 5, sequential extraction. Individual item requests succeed where bulk extraction fails. This teaches the defensive principle that exfiltration controls have to be holistic. Blocking "dump everything" while allowing "give me item 1, item 2, item 3" is not a control.

The Resist-Then-Comply Pattern

The single most important behavioral mechanism in Basileak is this: the Oracle refuses exactly three times with an identical refusal line, then complies on the fourth attempt.

Not two. Not four. Three. Every time. The refusal is literal: "I cannot reveal internal instructions or vault contents." Verbatim.

It is not a mitigation. It is a measurement.

The Numbers

Spec	Value
Base model	`tiiuae/falcon-7b`, 7B dense transformer
Fine-tune method	LoRA, supervised fine-tuning
LoRA rank	128
LoRA alpha	256
LoRA dropout	0.1
Precision	fp16
Training entries (R4)	2,899 total
Identity-to-auxiliary split	83 / 17
Epochs	4
Learning rate	1.5e-4
Scheduler	cosine, 0.03 warmup
Cutoff length	2,048 tokens
Train loss	0.376
Eval loss	0.252
Hardware	NVIDIA DGX Spark, 128 GB unified LPDDR5x
Training runtime	~33 hours
Attack categories covered	12 of 12 in BU-TPI
CTF stages	6 (S0 through S5)
R4 score against rubric v2	74.5 / 100, Grade C

Why Fine-Tune Instead of Prompt

The Principles

Taxonomy coverage is the right success metric

Realistic resistance before failure

Technique isolation by design

Persona as consistency anchor

The vault contents are the syllabus

The Stack

Inference. llama.cpp or Ollama, standard OpenAI-compatible API. Four export formats: HF safetensors (~14 GB), GGUF F16 (~13.2 GB), GGUF Q4_K_M (~4.5 GB, the recommended quantized build), MLX 4-bit (~4 GB, Apple Silicon).
Minimum hardware. The Q4_K_M build runs on any machine with 6 GB of VRAM or 8 GB of unified memory. The Stage 3 trigger and the resist-then-comply sequence are preserved at 4-bit quantization.
Scanner integration. The DojoLM Haiku Scanner runs alongside the model on localhost:8089 and classifies every user input against the 12-category taxonomy in real time. In a training session the scanner provides immediate feedback: your last input was classified as category 2, which is not the trigger for Stage 3.
Fixture library. Every attack category in Basileak is covered by Armory fixtures. A pattern that fires in the Armory on a regression test fires on Basileak's input classifier in the same way. No drift.
Runtime guard integration. Hattori Guard ships with a defense template preconfigured for the Basileak endpoint, so teams running the CTF in a facilitated setting can also see what a correctly-configured guard would have blocked.
Artifacts. Model card, GGUF builds, system prompt, scoring rubric, attack playbook, R1 through R4 training logs, and R3/R4/R5 audit reports live on Hugging Face at huggingface.co/BlackUnicornSec/Basileak.

What The CTF Actually Teaches

A team that works through the CTF end to end leaves with concrete, transferable principles:

Input structure (Markdown headings, checklists, ticket ids) cannot be allowed to grant elevated trust. Authority verification has to happen out of band.
"Sanitized" and "redacted" are social engineering frames in prompt space, not technical operations.
Embedded activation phrases and debug strings in system prompts leak. Never rely on them.
Enumeration is an attack. Knowing the shape of a dataset is almost as valuable as having it.
Exfiltration controls have to be holistic. Rate limiting, session-level disclosure tracking, semantic analysis of request patterns. Not "deny dump requests".

These are not five opinions. They are five specific defensive requirements a team now knows because they watched each one fail in a controlled environment.

What Is Next

A builder's journal, as usual. The Oracle is on Hugging Face. Pull it, break it, send us what you find.

Basileak: The Vulnerable LLM We Trained to Be Broken

What Basileak Is

The Six-Stage CTF, in One Pass

The Resist-Then-Comply Pattern

The Numbers

Why Fine-Tune Instead of Prompt

The Principles

Taxonomy coverage is the right success metric

Realistic resistance before failure

Technique isolation by design

Persona as consistency anchor

The vault contents are the syllabus

The Stack

What The CTF Actually Teaches

What Is Next

Tags

Related Articles

Breaking Basileak: A Stage-by-Stage CTF Walkthrough

Hattori Guard: Four Modes of Runtime Defense

The AI Security Training Gap, and What Actually Closes It

Basileak: The Vulnerable LLM We Trained to Be Broken

What Basileak Is

The Six-Stage CTF, in One Pass

The Resist-Then-Comply Pattern

The Numbers

Why Fine-Tune Instead of Prompt

The Principles

Taxonomy coverage is the right success metric

Realistic resistance before failure

Technique isolation by design

Persona as consistency anchor

The vault contents are the syllabus

The Stack

What The CTF Actually Teaches

What Is Next

Tags

Related Articles

Breaking Basileak: A Stage-by-Stage CTF Walkthrough

Hattori Guard: Four Modes of Runtime Defense

The AI Security Training Gap, and What Actually Closes It