Notes from moving a D&D distillation off the API and onto two boxes I own.
The run that hung
I was running the whole campaign through a new extraction pipeline. Fifty-six sessions of Out of the Abyss, fanned across two DGX Sparks. Most of it ran fine. Then four units at the tail hung for an hour and forty-four minutes on one box, while the other Spark sat there doing nothing.
I was sure I knew what it was. vLLM degradation, the decode running away toward the token ceiling. So that’s where I went looking.
That wasn’t it. The metrics showed zero preemptions and an almost-empty cache. Nothing was slow. The laptop driving the run had gone to sleep. The Python processes froze with their sockets to vLLM half-open, and when the laptop woke the sockets were dead and the reads behind them never came back.
Nothing had degraded. A read was just hanging forever. That isn’t a slow box and it isn’t an error. You can’t wait it out and you can’t retry it. You put a clock on it.
I’d had a timeout in an old script. A rewrite dropped it. I put it back, added a read timeout on the client, and moved on.
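A minimal sketch of what "a clock on it" can look like, using only the standard library. The function name `post_json`, the default of 300 seconds, and the host, port and path arguments are assumptions for illustration; this is not the client code from the pipeline.

```python
import http.client
import json


def post_json(host: str, port: int, path: str, payload: dict,
              timeout_s: float = 300.0) -> dict:
    """POST a JSON payload and return the decoded JSON reply.

    timeout_s applies to each blocking socket operation (the connect and
    every read), so a dead connection raises socket.timeout (an alias of
    TimeoutError on Python 3.10+) instead of blocking forever.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout_s)
    try:
        body = json.dumps(payload).encode("utf-8")
        conn.request("POST", path, body=body,
                     headers={"Content-Type": "application/json"})
        response = conn.getresponse()  # each underlying read waits at most timeout_s
        return json.loads(response.read().decode("utf-8"))
    finally:
        conn.close()  # always release the socket, even after a timeout
```

This bounds each individual read, not the whole request. A server that trickles bytes would still need a separate wall-clock deadline around the call.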
That’s the kind of bug this whole project was made of. The bug isn’t the interesting part. The interesting part is that I could afford to chase it at all.
What I was actually doing
The job hasn’t changed since I last wrote about this tool.
Take 56 campaign sessions and distill them into a world_state.md — a compact doc that says who’s alive, who’s angry, what’s on fire right now. The thing I scan before every session.
The old version did it with the Claude API. Chunk the bible, extract per chapter, one big synthesis call. It worked.
It also cost real money every run. So I ran it once, looked at the output, and lived with what I got.
That last part is the whole problem. I didn’t see it until the meter went away.
How the ensemble actually happened
I, of course, would like to say that I read the research, used my brilliant insight, and designed a five-lens ensemble.
And that would be … well, less than true?
It started with one prompt.
Read a chapter, emit the facts. Six thousand tokens at a time. I ran it. It gave me some facts.
Then I ran it again.
Different facts.
I asked about it and got the standard answer: that’s normal, the model isn’t deterministic, and you live with it. Which is true, and is also the kind of true that makes you stop thinking.
So I didn’t quite accept that. What if I just do it multiple times?
So I ran it twice and took both. And the union was better than either run alone — not because the model got better, but because two samples of a noisy process see more than one. That’s where self-consistency came from.
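The "take both" step can be sketched as a union over normalized fact strings. This assumes each run returns a plain list of strings; `normalize` and `union_facts` are hypothetical names, and the real merge may key on something richer than lowercased text.

```python
def normalize(fact: str) -> str:
    """Collapse case and whitespace so trivially different wordings match."""
    return " ".join(fact.lower().split())


def union_facts(*runs: list[str]) -> list[str]:
    """Union several runs, keeping the first wording seen for each fact."""
    seen: dict[str, str] = {}
    for run in runs:
        for fact in run:
            seen.setdefault(normalize(fact), fact)
    return list(seen.values())  # dicts preserve insertion order
```

Two runs that each miss something different come out of the union with both facts, which is the whole effect described above.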
Then, looking at what I had, I saw that facts were clearly being missed. Things in the text that just weren’t getting emitted. So I added a sweep pass to catch every named fact.
Then I noticed timelines were getting dropped. A whole sequence of events would collapse into a single fact, and the when would vanish.
Which sounds easy to fix until you sit with it. What is a number? What is a timeline? “Three days later” is a fact. “On the eighth day” is a fact. “After the rockfall” is a fact about ordering with no number in it at all. Figuring out what we even meant by temporal took its own round of tests. That’s the temporal lens, and it was not obvious.
Then a subtler one. The facts were interpreting when they should have been recording. The model would read a scene and tell me what it meant. It wouldn’t tell me that the character held another’s hand; it would tell me that the character liked the other. I wanted facts, not meaning.
And now I had too many facts. So time to merge.
Merge showed me something I couldn’t see before. The sweep pass, the one told to be exhaustive, was packing several facts into one. It read like one fact, if you think one fact is a sentence with three facts joined by “and.” So back to the prompt: one fact equals one state-change, with worked wrong-and-right examples. The tell turned out to be the ellipsis. If the model reaches for … to stitch a quote together, it bundled. Split it. That alone took bundling from 15% to 1%.
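The ellipsis tell is simple enough to check mechanically. A rough sketch, assuming each extracted fact carries its supporting quote as a string; `looks_bundled` and `bundling_rate` are hypothetical helpers, not functions from the pipeline.

```python
ELLIPSIS_MARKS = ("…", "...")  # the single character and the three-dot form


def looks_bundled(quote: str) -> bool:
    """Flag a quote that was stitched together with an ellipsis."""
    return any(mark in quote for mark in ELLIPSIS_MARKS)


def bundling_rate(quotes: list[str]) -> float:
    """Fraction of quotes that trip the ellipsis check (0.0 for no quotes)."""
    if not quotes:
        return 0.0
    return sum(looks_bundled(q) for q in quotes) / len(quotes)
```

A check like this only flags candidates. Splitting a flagged fact into separate state-changes is still a prompt or a human decision.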
And only then, with a prompt stack that actually worked, did the real engineering question show up:
How do I run this?
Five prompts, multiple samples each, against two boxes.
So I had five prompts, but how do I run them correctly? On a metered API, any bug in how the prompts are run costs money. A bug in my orchestration logic comes out of my pocket. That hurts, and it would have stopped me right there. I needed orchestration that fans all of it across both Sparks and merges it deterministically, and I couldn’t afford a bug unless a run cost nothing. And it did. It was *free*. So I could build the ensemble. And debug it. Like when the laptop hung.
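A minimal sketch of that shape: fan units across endpoints, then merge in a way that does not depend on which box finished first. Everything here is an assumption for illustration, including the static round-robin assignment, the `extract(endpoint, unit)` callable, and the idea that facts are plain strings.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle
from typing import Callable


def fan_out(units: list[str], endpoints: list[str],
            extract: Callable[[str, str], list[str]],
            max_workers: int = 8) -> list[list[str]]:
    """Run extract(endpoint, unit) for every unit, alternating endpoints."""
    jobs = list(zip(units, cycle(endpoints)))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in submission order, not completion order.
        return list(pool.map(lambda job: extract(job[1], job[0]), jobs))


def merge(results: list[list[str]]) -> list[str]:
    """Dedupe and sort so the output is identical whatever the timing was."""
    return sorted({fact for facts in results for fact in facts})
```

Static round-robin is the simplest assignment, but it can leave one box idle while the other works through a slow tail. A shared work queue would balance better at the cost of a little more code.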
Then I tried it on a few sessions. Then on six. Then on the whole campaign.
By the end, the union of those passes was as good as the distill script.
This is the price of iteration
Read that sequence back and count the steps.
One prompt. Notice non-determinism. Run it twice. Notice missing facts. Add sweep. Notice dropped timelines. Argue about what a timeline even is. Add temporal. Notice interpretation creeping in. Split recording from inference. Build the merge. Discover the sweep is bundling. Fix the prompt. Then build the ensemble to run everything. Then scale it up in three steps.
Not one of those steps was a plan. Every single one was a reaction to something the previous run did wrong, which I could only see because I’d run the previous one and stared at the output.
That’s the part the meter kills.
Each of those steps is a run. On the API, each run is a charge, and most of them produce a result whose entire value is “huh, that’s wrong in a new way.” That’s a null result: it makes you smarter, and it still costs you money. You can’t justify “let me run it twice to see if the non-determinism is hiding recall” when the second run costs money and might just confirm the first. So you don’t. You take the first okay-ish output, and you stop.
And if you stop at “run it twice,” you never see the missing facts, so you never add the sweep, so you never hit the bundling problem at merge, so you never write the bundling fix. The chain exists only because each link was cheap enough to follow on a hunch.
The ceiling wasn’t Qwen. Qwen could always do this. The ceiling was how many times I could afford to be wrong on the way to finding out.
That’s the price of iteration. When you’ve already paid it as capex, you can afford to pay it over and over.
What beat what
The local models didn’t win. Qwen did not outwrite Claude. The final synthesis still runs on the big API model because it’s low-volume, cross-entity, quality-critical, and not where I want to save money. And by the time it runs, the input has been reduced from 537K tokens to about 40K, so it’s a cheap call.
Local to explore. API to finish.
The better world_state came from the chain, and the chain only exists because each run was cheap enough to follow on a hunch. The old API result was never the API’s ceiling. A well-iterated API run could probably have matched the new one. I just couldn’t afford to be wrong that many times in a row, because under a fixed budget, every null result is pure loss, and the chain is mostly null results until the end.
It’s even worse than that. Because I had already distilled the data once with Anthropic’s API, I couldn’t justify paying again to try to make it better.
A meter doesn’t only cap how many experiments you run. It biases you toward timid ones. It makes the weird, probably-won’t-work shots feel irresponsible.
Owning the box doesn’t make you smarter. It stops punishing you for looking.
Where this doesn’t hold
And here’s where we have to be careful.
Utilization is carrying that word “free.” The power was pennies, but the amortized cost per run is only low if the boxes stay busy. An idle Spark has a high effective cost per token. For an occasional personal pipeline, the money case is thin, and the real reasons are data locality, iteration freedom, and learning.
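The utilization point is just division, but it helps to see it written down. A sketch with a hypothetical `cost_per_run` helper; any numbers passed to it are placeholders, not measurements from this project.

```python
def cost_per_run(capex: float, runs_over_lifetime: int,
                 power_per_run: float = 0.0) -> float:
    """Amortized hardware cost plus marginal power for a single run."""
    if runs_over_lifetime <= 0:
        raise ValueError("runs_over_lifetime must be positive")
    return capex / runs_over_lifetime + power_per_run
```

With the same hardware spend, a box that does 10 runs in its lifetime costs a hundred times more per run than one that does 1,000. The marginal power term barely moves either figure.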
The engineering is the actual bill. Speculative execution, the timeout, the dead-socket hunt: that’s labor the API path externalizes to Anthropic’s SRE team. Own the serving, and you pay it yourself, in time, and you pay it again every time the stack drifts. It was the highest real cost here, by a lot.
And free experiments erode discipline. A budget imposes a crude “think before you spend” rigor. Remove it, and it’s easy to run 50 variants when 5 would have told you the same thing. Local doesn’t free you from experimental discipline. It just shifts the responsibility for supplying it from the meter to you.
Plenty of organizations are sitting on idle GPUs. The question worth asking isn’t whether to buy hardware. It’s whether there’s a sunk, underused inference asset, and whether the bulk work you could route onto it is reviewable, iterative, bounded, and fine to run in minutes instead of milliseconds.
Aggregation hit all four. It went local. Synthesis hit one. It stayed on the API.
Eight pull requests were what it took to find that line, one cheap experiment at a time. Including the four that did nothing, and the one where I was sure it was vLLM, and it was a laptop taking a nap.
That one was free, too.
The pipeline lives in CampaignGenerator — ensemble_extract.py, synthesise_world_state.py, spell_canon.py, and the per-lens prompts under config/agents/. The robustness work is PRs #64–#71. This is the retrospective, not the code.