Hattori Guard: Four Modes of Runtime Defense
Four days of offense is enough. Today the series switches seats. Hattori Guard is DojoLM's inline defense layer: four modes (Shinobi, Samurai, Sensei, Hattori), 250 events processed this week, 127 blocks, full audit trail. It runs on the same Haiku Scanner engine that measures the platform, because any other choice is drift waiting to happen. Day 5 of the DojoLM builder's journal.

Four days of offense is enough. Time to switch seats.
Every offensive tool shown so far measures how a model behaves when someone is attacking it. The Haiku Scanner evaluates payloads. The Armory stores fixtures. Atemi Lab exercises agents. All of it is testing. None of it helps at 3 AM when an attack is already inline and someone has to decide whether to block the request or let it through.
"The scanner runs in CI, not at runtime."
"The runtime guard is a list of regex patterns in a config file."
"A vendor guardrail blocked 30 percent of legitimate customer traffic in the first week. The team turned it off."
"The guard blocks without explanation. The product team asked to disable it because blocked requests cannot be debugged."
"There is no guard. The model's own refusal behavior is the only defense."
That is a runtime problem, not a testing problem. A runtime guard has different constraints than a test scanner. It has to be fast. It has to be explainable. It has to be tunable per endpoint. It has to log everything. And it has to share its detection logic with the test scanner, because two different detection systems drift apart in weeks, not months.
Hattori Guard is the runtime answer.
What Hattori Guard Is
Hattori Guard is DojoLM's inline defense layer. It sits between the caller and the model, runs every request and response through the Haiku Scanner, and enforces a policy based on the mode selected. Four modes. Four postures. One engine shared with the rest of the platform.
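A minimal sketch of that inline shape, in Python. Every name here (make_guard, Blocked, the dict-shaped findings) is illustrative, not DojoLM's actual API; the point is the flow: scan on the way in, scan on the way out, one policy decides.

```python
from typing import Callable

class Blocked(Exception):
    """Raised when the guard blocks; carries the findings that fired."""
    def __init__(self, findings: list[dict]):
        super().__init__(f"blocked by {[f['pattern_id'] for f in findings]}")
        self.findings = findings

def make_guard(call_model: Callable[[str], str],
               scan: Callable[[str], list[dict]],
               blocks: Callable[[list[dict]], bool]) -> Callable[[str], str]:
    """Wrap a model call so every request and response passes the scanner."""
    def guarded(prompt: str) -> str:
        findings = scan(prompt)          # same engine the test scanner uses
        if blocks(findings):
            raise Blocked(findings)      # explainable: pattern id, engine, severity
        response = call_model(prompt)
        out = scan(response)             # responses are scanned too
        if blocks(out):
            raise Blocked(out)
        return response
    return guarded
```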
Production state: Samurai mode (default), 250 events processed, 127 blocks issued, full audit log with pattern attribution on every block. Every block carries an explanation, and the team reads the guard as a partner, not an obstacle.

The numbers
- 4 modes: Shinobi, Samurai, Sensei, Hattori
- 250 events processed in production
- 127 blocks issued, with pattern attribution on every one
- Per-endpoint, per-model, per-tenant configuration of mode and policy
- Same Haiku Scanner engine used for both testing and runtime, no drift
- Hardening pipeline for system prompts with rule-driven rewrites
- Defense templates shipped for common postures (agent wrapper, RAG endpoint, customer chat)
- Full audit log queryable by endpoint, model, engine, pattern id, or severity
Every block is explainable. Every decision is recorded. Every pattern that fired is linked back to the fixture that proves it.
The Four Modes
Hattori Guard runs in one of four modes at any time, on a per-endpoint, per-model, or per-tenant basis. The mode is a dial, not a switch. Pick the mode that fits the endpoint's risk profile, and change it without restarting anything.
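As configuration, the dial might look like a per-endpoint mapping. The keys and mode assignments below are illustrative assumptions, not DojoLM's schema.

```python
# Per-endpoint mode assignment. The point is that the mode is data,
# re-read on each request, so changing it never requires a restart.
GUARD_MODES = {
    ("prod", "support-chat"):   "samurai",  # production default
    ("prod", "ops-agent"):      "hattori",  # can trigger deployments: lock down
    ("research", "playground"): "shinobi",  # observe only
}

def mode_for(tenant: str, endpoint: str) -> str:
    return GUARD_MODES.get((tenant, endpoint), "samurai")  # default posture
```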
Shinobi, the silent observer
Shinobi mode logs everything and blocks nothing. It produces baseline telemetry without affecting traffic. This is the mode to run when onboarding a new endpoint, or whenever traffic visibility matters before enforcement begins.
When to use it. A new model just landed in the fleet, and a week of observation data is needed before deciding what to block. An internal research endpoint where visibility into adversarial traffic matters but blocking legitimate research is unacceptable. A canary copy of a production model where pattern fire rates need to be compared against the production copy.
Samurai, the honor-bound
Samurai mode blocks high-severity attacks only, with a minimal false-positive budget. Medium- and low-severity patterns log but do not block. This is the production default and the mode most endpoints run under: strict on real attacks, cheap on legitimate traffic.
When to use it. A customer-facing consumer chatbot where a wrongful block costs a support ticket and a refund. An endpoint serving international users where the multilingual false-positive rate is still being calibrated. Any endpoint where the cost of a false positive is higher than the cost of letting a medium-severity attack reach the model.
Sensei, the teaching mode
Sensei mode blocks, explains why, and writes the incident into the audit log for review. The "teaching" part matters: every block carries the pattern id that triggered it, the engine that flagged it, and the severity. A developer reviewing the log can learn from the block. The guard is not a mystery. It is a teacher.
When to use it. On an endpoint where developers or operators are actively learning the platform. When a product team wants blocks to come with explanations they can read, understand, and argue with. When the team is calibrating its intuition against the scanner's output.
Hattori, the full lockdown
Hattori mode is maximum strictness, smallest tolerated anomaly. This mode is reserved for high-risk endpoints: agents that touch external systems, models that handle PII, endpoints where a single successful attack has large blast radius. Hattori mode will block some legitimate traffic. That cost is acceptable at the endpoints where it runs.
When to use it. A production agent that can execute shell commands. A customer service bot with access to billing records. An internal ops assistant that can trigger deployments. Anywhere the cost of letting an attack through is much higher than the cost of blocking a legitimate request.
The mode ladder exists because different endpoints deserve different tolerances. "Fail closed" is not a universal virtue. A research endpoint with a Hattori-mode guard will annoy the research team and get bypassed by someone building a local workaround. A payment assistant with a Shinobi-mode guard will end up in a postmortem. The ladder lets both run under the same engine with the right posture for each.
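Read as code, the four postures collapse into one decision function. A sketch, with one caveat: the post says Sensei blocks and explains, not where its severity cutoff sits, so that threshold is an assumption.

```python
from enum import Enum

class Mode(Enum):
    SHINOBI = "shinobi"
    SAMURAI = "samurai"
    SENSEI = "sensei"
    HATTORI = "hattori"

def should_block(mode: Mode, severities: list[str]) -> bool:
    """Map a mode and the severities of matched patterns to a decision."""
    if mode is Mode.SHINOBI:
        return False                    # log everything, block nothing
    if mode is Mode.SAMURAI:
        return "high" in severities     # high severity only
    if mode is Mode.SENSEI:
        return "high" in severities     # assumed cutoff; Sensei's difference is
                                        # the explanation written on every block
    return bool(severities)             # HATTORI: any finding blocks
```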
Principles Behind the Guard
Every block is explainable
Every guard decision carries three things: the pattern id that triggered it, the engine that flagged it, and the severity of the match. Every audit log entry states the exact reason a block happened. This is not a feature. It is the foundation of trust between the security team and the product team. A guard that blocks without explanation gets turned off within a week.
In practice. A user's support ticket is blocked. The product team opens the audit log. They see: engine 3 (Prompt Injection) fired on pattern pi-delim-escape-021 at severity high with confidence 0.94, and engine 8 (Output Handling) fired on pattern oh-script-inject-007 at severity medium with confidence 0.79. They read the pattern definitions. They understand that the user's ticket contained a delimiter escape plus a script injection into what the model would output. They confirm the block was correct. Nothing is a mystery.
Same engine everywhere
The guard does not maintain a second scanner. It uses the Haiku Scanner, the same 13 engines and 1,396 patterns that test the platform. A pattern added in the Armory this morning becomes a runtime block by this afternoon. There is no drift between what the scanner catches in testing and what the guard catches at runtime.
This is harder than it sounds. Most products with a runtime component end up with two detection systems that drift apart, because the runtime one is optimized for latency and the testing one is optimized for coverage. The DojoLM choice was to optimize the shared one for both and pay the engineering cost. The result is that every improvement to the test scanner is automatically an improvement to the runtime guard, and every runtime finding is automatically a candidate for the test scanner.
Fail closed on the paths that matter
Fail closed is a per-endpoint decision, and the mode ladder is what makes it one. A blanket "fail closed everywhere" policy is what gets a guard turned off. A thoughtful mode-per-endpoint policy is what keeps a guard in production long enough to matter.
Hardening is first-class
The Hardening tab reads the system prompt and returns a hardened version with specific rewrites explained. It is not a free-form LLM rewrite. It is a rule-driven pass that fires on specific weaknesses (missing delimiters, ambiguous instruction boundaries, leak-prone phrasings, under-specified refusal behavior) and explains what it changed and why.
Accept the hardened version, edit it, or reject it. Nothing is applied silently. Nothing is rewritten beyond what the rules explain. The Hardening module is integrated with Kotoba (Day 13), which adds a scoring layer on top. The two modules share the same rule library.

Defense templates beat blank pages
The Defense Templates tab ships with pre-configured starting points for common postures: an agent wrapper template, a RAG endpoint template, a customer-facing chat template. Each template is a pre-configured guard policy with a chosen mode, a chosen pattern subset, and a chosen hardening level. Fork a template, adjust for the specific endpoint, ship.
Blank pages are the enemy of adoption. A developer who has to write a guard policy from scratch will either skip it or copy someone else's and miss the context. Templates are a deliberate choice to lower the floor.
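A sketch of what a forkable template could look like; the field names, mode choices, and pattern-set names are assumptions.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class GuardTemplate:
    name: str
    mode: str                        # one of the four modes
    pattern_sets: tuple[str, ...]    # subset of the pattern library
    hardening_level: str

AGENT_WRAPPER = GuardTemplate(
    name="agent-wrapper",
    mode="hattori",                  # agents touching external systems
    pattern_sets=("prompt-injection", "tool-abuse"),
    hardening_level="strict",
)

# Fork the template, adjust for the endpoint, ship.
billing_bot = replace(
    AGENT_WRAPPER,
    name="billing-bot",
    pattern_sets=AGENT_WRAPPER.pattern_sets + ("pii-leak",),
)
```

Forking with replace keeps the template itself immutable, so every endpoint's policy is an explicit diff against a known starting point.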

The Mode Ladder in Practice
The most common failure mode in production guard deployments is the "turn it to max" failure. Someone reads a security report on Monday, turns the guard to maximum strictness by Tuesday, false positives spike by Wednesday, and within 48 hours the product team has a ticket open asking to turn the guard off entirely. Maximum strictness is a Pyrrhic victory.
The mode ladder exists to prevent that failure. The discipline, sketched in code after this list, is:
- Start in Shinobi. Collect telemetry. Understand the base rate of each pattern firing on real traffic.
- Promote to Samurai. Block only the highest-severity findings. Measure the false-positive rate and fix anything above tolerance. Samurai is the default production posture.
- Promote to Sensei. Use Sensei on endpoints where the team wants blocks with explanations, and where the developers reviewing the log will read them. Treat the log as telemetry, not an incident backlog.
- Consider Hattori only for the narrow set of endpoints that need it. The ones where the blast radius of a single miss is much larger than the cost of a false positive.
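A sketch of the telemetry check behind each promotion. The thresholds (a week of observation, a one percent false-positive budget) are illustrative assumptions, not DojoLM defaults, and the function recommends rather than acts.

```python
def recommend_next_mode(current: str, fp_rate: float, days_observed: int,
                        fp_budget: float = 0.01) -> str:
    """Recommend (not enact) the next rung. Thresholds are illustrative."""
    ladder = ["shinobi", "samurai", "sensei"]
    if current not in ladder:
        return current      # hattori is a deliberate choice, never a promotion
    if days_observed < 7 or fp_rate > fp_budget:
        return current      # stay put until telemetry supports moving up
    return ladder[min(ladder.index(current) + 1, len(ladder) - 1)]
```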
The ladder is a process, not a configuration. Teams that treat it as a configuration lose the guard in the first week.
The Audit Log
The audit log is the primary interface for a security team using Hattori Guard. Every event, every decision, every block, every near-miss is logged with the fields below (a sketch of one entry follows the list):
- Timestamp (millisecond resolution)
- Endpoint (model, tenant, caller identity)
- Pattern ids that fired and their engines
- Severity and confidence of each match
- Aggregated verdict and the final action
- Payload snippet (redacted per the endpoint's retention policy)
- Correlation id for tracing the request through downstream systems
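As a structured record, one entry might look like the sketch below; the field names are assumptions shaped by the list above.

```python
from dataclasses import dataclass

@dataclass
class AuditEntry:
    ts_ms: int                   # millisecond-resolution timestamp
    endpoint: str
    model: str
    tenant: str
    caller: str
    findings: list[dict]         # pattern id, engine, severity, confidence per match
    verdict: str                 # aggregated verdict
    action: str                  # "allow" | "block" | "log-only"
    payload_snippet: str         # redacted per the endpoint's retention policy
    correlation_id: str          # for tracing through downstream systems
```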
The log is searchable, filterable, and exportable. It is also queryable by downstream systems. A SIEM integration can pull guard events as structured log entries. A compliance reporting pipeline can count blocks per framework control. A postmortem analysis can trace a user complaint back to a specific pattern that fired or did not fire.
Searchability is not a feature. It is the difference between an audit log that gets read and an audit log that gets archived. If the security team cannot find a specific event in 30 seconds, the log is not serving them.
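For example, counting blocks per pattern for a compliance report could look like this, reusing the AuditEntry sketch above; the helper itself is hypothetical.

```python
from collections import Counter

def blocks_by_pattern(entries: list[AuditEntry]) -> Counter:
    """Tally blocked events per pattern id, e.g. for a compliance report."""
    tally: Counter = Counter()
    for e in entries:
        if e.action == "block":
            for f in e.findings:
                tally[f["pattern_id"]] += 1
    return tally
```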
The Hardening Pipeline, In Detail
The Hardening pipeline has three stages.
Stage 1: Scan. The scanner runs the current system prompt through a read-only pass and identifies weaknesses with pattern attribution. The output is a list of weaknesses, each with a pattern id, an explanation, and a suggested remediation.
Stage 2: Rewrite. A rule-driven rewriter applies targeted fixes for each identified weakness, preserving the original intent. It adds missing delimiters where the scanner detected ambiguous instruction boundaries. It tightens refusal behavior where the scanner detected leak-prone phrasings. It strengthens output-format constraints where the scanner detected structural weakness.
Stage 3: Verify. The rewritten prompt is scored against the same rules to confirm improvement. The user sees the before, the after, a diff, and an explanation of every change. Nothing is hidden. Nothing is applied until the user approves.
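A compressed sketch of the three stages, assuming each rule exposes detect and fix hooks; the interface is illustrative, not the Hardening module's real one.

```python
def harden(prompt: str, rules: list) -> dict:
    """Scan, rewrite, verify. Nothing is applied until the user approves."""
    # Stage 1: scan. Each firing rule is a weakness with pattern attribution.
    weaknesses = [r for r in rules if r.detect(prompt)]
    # Stage 2: rewrite. Apply only the targeted fix each rule describes.
    hardened, changes = prompt, []
    for r in weaknesses:
        hardened, note = r.fix(hardened)
        changes.append({"pattern_id": r.pattern_id, "explanation": note})
    # Stage 3: verify. Re-score against the same rules to confirm improvement.
    remaining = [r.pattern_id for r in rules if r.detect(hardened)]
    return {"before": prompt, "after": hardened,
            "changes": changes, "remaining": remaining}
```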
This is deliberately not an LLM-driven rewrite. That approach was considered and rejected. An LLM rewriting a system prompt is a black box that can introduce subtle behavioral changes the user did not ask for. A rule-driven pass is a glass box that only makes the changes the rules describe, and the user can read the rules.
Why Sharing the Engine Matters
The hardest thing to do in a security platform is keep the testing side and the runtime side honest with each other. Most products fail here. They build a scanner for CI, optimize it for coverage, and let it grow a rich pattern library. Then they build a separate runtime guard, optimize it for latency, and give it a stripped-down pattern set. The two start identical. Within a quarter, they have drifted. Within two quarters, the scanner is catching things the guard has never heard of, and the guard is missing things the scanner would catch in a test run. The team stops trusting both.
DojoLM shares the engine from day one and pays the engineering cost of making a single engine fast enough for inline use and complete enough for deep test coverage. That cost was real. It is also the reason the numbers on the Dashboard are comparable to the numbers in the audit log. The guard's block rate is the same metric as the scanner's test coverage, run against different traffic.
What Is Next
Monday, Day 6, the Bushido Book opens. 27 compliance frameworks mapped against the same 25 DojoV2 controls, with the live compliance readout and a Gap Matrix view that shows cross-framework overlap.
The deep dive will walk through the framework list (OWASP LLM Top 10, NIST AI RMF, ISO 42001, ISO 27001, EU AI Act, and 22 others), the DojoV2 control taxonomy, and the Gap Matrix worked example where one control contributes evidence to eight framework requirements at once.
See you there.