Ultra Instinct: How BUCC Agents Train Themselves
A closed-loop self-improvement pipeline for a 25-agent fleet: harvest, score, mutate, train, deploy. Local hardware, QLoRA, one-click rollback. A builder's journal on continuous agent improvement in production.

The waiting game nobody should be playing
Every six hours, a cron job inside BUCC wakes up, reads the last six hours of agent traffic, scores it, mutates the winners, queues a QLoRA run on a local GPU, and files the resulting checkpoint in a registry. No one presses a button. The fleet is quietly getting better at the work we do, on hardware we own, on data that never leaves the VLAN.
Most teams are not doing this. They deploy an agent, hit a quality ceiling, and then sit. They refresh the release notes of the frontier labs. They wait for GPT-next, Claude-next, the next open-weights drop. The implicit assumption is that "better AI" is something that arrives from outside, on someone else's schedule, trained on someone else's data.
That posture is a dead end for anyone running production agents on real work. A model trained on the entire internet is, by construction, trained on nobody in particular. It will never be specifically good at your operations, your tone, your edge cases, your definition of done. Not because the labs are bad at their job. Because the labs do not have your data, and they should not.
So we built the loop ourselves. We call it Ultra Instinct.
What Ultra Instinct is
Ultra Instinct is the BUCC self-improvement pipeline. Not a framework. Not a wrapper. A closed loop that runs on our own infrastructure, against the fleet of 25 agents, on a schedule, with a human in the loop only at the deploy step. Every six hours it does four things in sequence:
- Harvest every interaction the fleet has had since the last cycle.
- Score those interactions with deterministic rules on every one, plus an LLM judge pass on a sampled subset.
- Mutate prompts and configurations around the high-quality pairs to generate candidate variants.
- Train and deploy a new generation of the local model via QLoRA fine-tuning, with one-click rollback to any prior generation.
The result is that the agents in the fleet are not static. They are continuously, quietly, getting better at the specific work this company actually does. Not at MMLU. Not at HumanEval. At our work.
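The whole cycle can be sketched as one function. This is a minimal illustration with stubbed stages, not the actual BUCC implementation; every function name, field, and the returned metadata shape are assumptions.

```python
# Minimal sketch of one Ultra Instinct cycle. Stage functions are stubs;
# names, signatures, and record shapes are illustrative assumptions.
from datetime import datetime, timedelta, timezone

CYCLE = timedelta(hours=6)

def harvest(since, until):
    # Stage 1: pull every fleet interaction in the window (stubbed).
    return [{"id": 1, "score": None}]

def score(interactions):
    # Stage 2: rules on everything, LLM judge on a sample (stubbed).
    return [dict(i, score=0.8) for i in interactions]

def mutate(pairs):
    # Stage 3: generate candidate prompt/config variants (stubbed).
    return [{"parent": p["id"], "variant": "v1"} for p in pairs]

def train_and_register(pairs, candidates):
    # Stage 4: queue a QLoRA run; the checkpoint lands as a candidate,
    # never as the active model.
    return {"generation": 13, "status": "candidate", "pairs": len(pairs)}

def run_cycle(now=None):
    now = now or datetime.now(timezone.utc)
    interactions = harvest(now - CYCLE, now)
    pairs = [p for p in score(interactions) if p["score"] >= 0.6]
    candidates = mutate(pairs)
    return train_and_register(pairs, candidates)
```

Deployment is deliberately absent from `run_cycle`: the loop ends with a candidate in the registry, and a human promotes it.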
Stage 1: harvest
Every interaction the fleet has had with a user, with another agent, or with a tool is captured by the gateway and persisted as a TrainingInteraction. The harvest service walks the recent interactions, filters out the ones that are not eligible (incomplete, errored, blocked by the DSP, or flagged by Presidio for unredacted PII), and queues the rest for evaluation. Harvest is idempotent. If it runs twice, nothing breaks. That property matters more than people think for a recurring job.
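The two properties that matter here, eligibility filtering and idempotency, fit in a few lines. A hedged sketch follows; the TrainingInteraction fields and the persisted seen-set are assumptions about how BUCC tracks state, not its real schema.

```python
# Sketch of the harvest stage. Field names ("status", "pii_flagged") and
# the queued_ids mechanism are assumed, not the actual BUCC data model.

_INELIGIBLE = {"incomplete", "errored", "dsp_blocked"}

def eligible(interaction):
    # Drop incomplete, errored, or DSP-blocked traffic, and anything
    # Presidio flagged for unredacted PII.
    return (interaction["status"] not in _INELIGIBLE
            and not interaction.get("pii_flagged", False))

def harvest(interactions, queued_ids):
    """Queue each eligible interaction exactly once.

    `queued_ids` stands in for persisted state, so running the harvest a
    second time over the same window queues nothing new: idempotent.
    """
    queued = []
    for it in interactions:
        if it["id"] in queued_ids or not eligible(it):
            continue
        queued_ids.add(it["id"])
        queued.append(it)
    return queued
```

Running `harvest` twice with the same `queued_ids` returns an empty list the second time, which is exactly the "if it runs twice, nothing breaks" property.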
Stage 2: score
The scoring stage is intentionally hybrid. Pure-LLM-judge approaches are expensive and noisy. Pure-rule approaches miss the things rules cannot see.
So we run rules on every interaction (length, structural validity, tool-use correctness, absence of refusal patterns, presence of expected output shapes, PII detection via Presidio) and then sample roughly ten percent of interactions for an LLM judge pass. The judge scores along quality dimensions defined in a versioned prompt template. Anything that scores at or above 0.6 becomes a TrainingPair, ready for the next stage. Anything below is kept for analysis but not used for training.
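The hybrid split described above can be sketched as rules on every interaction plus a sampled judge pass. The specific rules, the blending of judge and rule scores, and the field names are illustrative assumptions; only the ~10% sample rate and the 0.6 threshold come from the text.

```python
# Hedged sketch of the hybrid scoring stage. Rule checks, score blending,
# and field names are assumptions; sample rate and threshold are from the post.
import random

JUDGE_SAMPLE_RATE = 0.10   # roughly ten percent get an LLM judge pass
PAIR_THRESHOLD = 0.6       # at or above becomes a TrainingPair

def rule_score(interaction):
    # Deterministic checks; each contributes equally here for brevity.
    checks = [
        len(interaction["output"]) > 0,                  # structural validity
        not interaction["output"].startswith("I can't"), # refusal pattern
        interaction.get("tool_calls_valid", True),       # tool-use correctness
        not interaction.get("pii_detected", False),      # Presidio clean
    ]
    return sum(checks) / len(checks)

def score(interactions, judge=None, rng=None):
    rng = rng or random.Random(0)
    pairs, rejected = [], []
    for it in interactions:
        s = rule_score(it)
        # Sample a subset for the judge; blend its score with the rules.
        if judge and rng.random() < JUDGE_SAMPLE_RATE:
            s = (s + judge(it)) / 2
        (pairs if s >= PAIR_THRESHOLD else rejected).append((it, s))
    return pairs, rejected
```

Keeping the rejected interactions around, rather than discarding them, is what makes the "kept for analysis but not used for training" behavior possible.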
Stage 3: mutate
This is the part of the loop most people skip, and it is the part that prevents collapse. If you only fine-tune on existing high-quality outputs, you reinforce what the agent already does. You do not give it room to find something better. The mutation service generates variants of the prompts and configs around the high-quality pairs, runs them through a small evaluation pass against held-out examples, and selects survivors. The result is a population of candidate configurations, scored by the same judge, ranked, and tracked as EvolutionEvent records. Each agent has a generation counter. We can look at agent X at generation 12 and see exactly how it differs from generation 11, with the score delta attached.
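One generation step of that mutate-evaluate-select loop might look like the sketch below. The mutation operators are toys, and the EvolutionEvent shape and agent fields are assumptions; what matters is the pattern: variants, held-out evaluation, a survivor, and a recorded score delta against the parent generation.

```python
# Illustrative mutation step. Operators, record shapes, and field names
# are assumptions, not BUCC's real mutation service.
import random

def mutate_prompt(prompt, rng):
    # Toy operators standing in for real prompt/config mutations.
    ops = [
        lambda p: p + " Be concise.",
        lambda p: p + " Cite the tool output you used.",
        lambda p: p.replace("You are", "You act as"),
    ]
    return rng.choice(ops)(prompt)

def evolve(agent, holdout_eval, population=8, rng=None):
    """Generate prompt variants, score each against held-out examples,
    and record the survivor as the next generation."""
    rng = rng or random.Random(0)
    variants = [mutate_prompt(agent["prompt"], rng) for _ in range(population)]
    best_score, best = max((holdout_eval(v), v) for v in variants)
    event = {                      # EvolutionEvent-style record (assumed shape)
        "agent": agent["name"],
        "parent_generation": agent["generation"],
        "generation": agent["generation"] + 1,
        "score_delta": best_score - agent["score"],
    }
    new_agent = dict(agent, prompt=best,
                     generation=agent["generation"] + 1, score=best_score)
    return new_agent, event
```

The event record is the audit trail: generation, parent, and delta are exactly what lets you compare generation 12 to generation 11 later.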
Stage 4: train and deploy
Once a generation has been selected, the training runner prepares a QLoRA dataset from the training pairs and submits a job to the GPU queue. The job runs on local Nvidia Spark hardware, checkpoints to the model registry, and emits metrics to the training metrics dashboard. When the job completes, the resulting model lands in the registry as a candidate, not as the active model. Activation is a separate, explicit step: an operator calls deploy_generation against the registry entry. Rollback is one call away: rollback_agent swaps the active pointer back to any prior generation.
This separation matters. Training is automatic. Deployment is not. A bad generation cannot quietly take over the fleet.
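The candidate/active separation reduces to a registry with an explicit active pointer. The sketch below uses the verbs from the text, deploy_generation and rollback_agent, but the data model around them is an assumption.

```python
# Sketch of the registry's activation model: training lands candidates,
# activation and rollback are explicit calls. Internals are assumed.

class ModelRegistry:
    def __init__(self):
        self.generations = {}   # (agent, gen) -> checkpoint metadata
        self.active = {}        # agent -> currently active generation

    def register_candidate(self, agent, gen, checkpoint):
        # Training writes here automatically; nothing is served yet.
        self.generations[(agent, gen)] = {"ckpt": checkpoint,
                                          "status": "candidate"}

    def deploy_generation(self, agent, gen):
        # Explicit operator action: promote a candidate to active.
        entry = self.generations[(agent, gen)]
        entry["status"] = "active"
        self.active[agent] = gen

    def rollback_agent(self, agent, gen):
        # One call: point the agent back at any prior generation.
        if (agent, gen) not in self.generations:
            raise KeyError(f"no generation {gen} for {agent}")
        self.active[agent] = gen
```

Because `register_candidate` never touches `self.active`, a freshly trained generation cannot take over the fleet without an operator calling `deploy_generation`.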
The principles behind the loop
Four design choices drove how Ultra Instinct is built:
Local first. Every stage runs on infrastructure we own. The data never leaves the VLAN. The model never sees a third-party API. This is non-negotiable for the kinds of operations BUCC is designed to handle, particularly anything the Data Sanitization Proxy classifies as DSP-high and routes L1-only.
Human in the loop, but only where it matters. Harvest, eval, mutate, and train are autonomous. Deploy is manual. The point of automation is to free humans from work that is mechanical, not to remove them from decisions that have blast radius.
Cheap to run, cheap to roll back. QLoRA over full fine-tuning. Small, frequent generations over large, rare ones. Registry-backed deploys with one-click rollback. The whole loop is designed so that being wrong is recoverable in minutes, not days.
Measurable, generation by generation. Every agent has an evolution timeline. Every generation has a score, a delta, a parent, and a deploy status. If a generation regresses, you can see it. If it improves, you can see that too. There is no vibes-based "the model feels smarter".
Why this matters
The conversation about AI improvement is dominated by frontier labs because the frontier labs are loud, well-funded, and produce dramatic releases on a predictable cadence. That is fine. That is their job. But it has created an industry-wide habit of treating model quality as something you receive rather than something you build.
For anyone running agents on real operational work, the more useful frame is the opposite. Your agents have access to something the frontier labs do not have and never will: your data, your judgments, your definition of good. The infrastructure to turn that asymmetry into a continuous feedback loop is not exotic. Harvest, score, mutate, train, deploy. Five verbs. Some plumbing. A GPU. A registry. A rollback button.
Ultra Instinct is our version of that loop. It is not the only possible version. It is the one we needed, so it is the one we built.
This is a builder's journal. We are sharing the architecture because the agentic AI community needs more examples of what continuous improvement actually looks like in production, and fewer announcements about models trained on data nobody can see.
The agents that improve are the ones that are allowed to. Ultra Instinct is just the part of BUCC that grants the permission and runs the loop.