74 architecturalist papers: cheap experiments

June 1, 2026 by kostadis roussos

Notes from moving a D&D distillation off the API and onto two boxes I own.

The run that hung

I was running the whole campaign through a new extraction pipeline. Fifty-six sessions of Out of the Abyss, fanned across two DGX Sparks. Most of it ran fine. Then four units at the tail hung for an hour and forty-four minutes on one box, while the other Spark sat there doing nothing.

I was sure I knew what it was. vLLM degradation, the decode running away toward the token ceiling. So that’s where I went looking.

That wasn’t it. The metrics showed zero preemptions and an almost-empty cache. Nothing was slow. The laptop driving the run had gone to sleep. The Python processes froze with their sockets to vLLM half-open, and when the laptop woke the sockets were dead and the reads behind them never came back.

Nothing had degraded. A read was just hanging forever. That isn’t a slow box and it isn’t an error. You can’t wait it out and you can’t retry it. You put a clock on it.

I’d had a timeout in an old script. A rewrite dropped it. I put it back, added a read timeout on the client, and moved on.
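
A minimal sketch of what "putting a clock on it" can look like on the client side, assuming an HTTP endpoint in front of vLLM. The function name, URL, and timeout value are illustrative and are not taken from the actual pipeline. It uses only the Python standard library.

```python
import socket
import urllib.error
import urllib.request


def post_with_read_timeout(url: str, body: bytes, timeout_s: float) -> bytes:
    """POST `body` to `url`, raising TimeoutError if the peer goes silent.

    The timeout applies to each blocking socket operation (connect, send,
    recv), so a dead or half-open connection fails instead of hanging.
    """
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout_s) as resp:
            return resp.read()
    except socket.timeout as exc:
        # Raised directly when a read stalls (an alias of TimeoutError on 3.10+).
        raise TimeoutError(f"no data from {url} within {timeout_s}s") from exc
    except urllib.error.URLError as exc:
        # Timeouts during connect or send arrive wrapped in URLError.
        if isinstance(exc.reason, socket.timeout):
            raise TimeoutError(f"could not reach {url} within {timeout_s}s") from exc
        raise
```

Note that this is a per-operation timeout, not a total budget for the request. A server that trickles one byte at a time would still get past it, so a long decode may also need an overall deadline around the call.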

That’s the kind of bug this whole project was made of. The bug isn’t the interesting part. The interesting part is that I could afford to chase it at all.

What I was actually doing

The job hasn’t changed since I last wrote about this tool.

Take 56 campaign sessions and distill them into a world_state.md — a compact doc that says who’s alive, who’s angry, what’s on fire right now. The thing I scan before every session.

The old version did it with the Claude API. Chunk the bible, extract per chapter, one big synthesis call. It worked.

It also costs real money every run. So I ran it once, looked at the output, and lived with what I got.

That last part is the whole problem. I didn’t see it until the meter went away.

How the ensemble actually happened

I, of course, would like to say that I read the research, used my brilliant insight, and designed a five-lens ensemble.

And that would be … well, less than true?

It started with one prompt.

Read a chapter, emit the facts. Six thousand tokens at a time. I ran it. It gave me some facts.

Then I ran it again.

Different facts.

I asked about it and got the standard answer: that’s normal, it’s non-determinism, the model isn’t deterministic, and you live with it. Which is true, and is also the kind of true that makes you stop thinking.

So I didn’t quite accept that. What if I just do it multiple times?

So I ran it twice and took both. And the union was better than either run alone — not because the model got better, but because two samples of a noisy process see more than one. That’s where self-consistency came from.
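
The union step can be sketched in a few lines. This is an illustration under stated assumptions, not the pipeline's code: the function names are hypothetical, the example facts are made up, and matching facts by normalized string is a simplification, since real facts usually differ in wording rather than whitespace.

```python
from collections import Counter
from typing import Dict, Iterable


def normalize(fact: str) -> str:
    """Collapse case and whitespace so trivially different strings match."""
    return " ".join(fact.lower().split())


def union_of_samples(samples: Iterable[Iterable[str]]) -> Dict[str, int]:
    """Map each distinct fact to the number of samples that emitted it."""
    seen: Counter = Counter()
    for sample in samples:
        # One set per sample, so a run that repeats itself still counts once.
        seen.update({normalize(fact) for fact in sample})
    return dict(seen)


# Two hypothetical runs of the same prompt over the same chapter.
run_a = ["Party reached Blingdenstone", "Basidia evacuated the grove"]
run_b = ["party reached  Blingdenstone", "Juiblex declared intent"]
merged = union_of_samples([run_a, run_b])
```

Here `merged` holds three facts, one of them seen by both runs. The count is a rough agreement signal: a fact seen once is not wrong, it is just the kind of fact a single run would have missed half the time.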

Then, looking at what I had, I saw that facts were clearly being missed. Things in the text that just weren’t getting emitted. So a sweep was added to get every named fact.

Then I noticed timelines were getting dropped. A whole sequence of events would collapse into a single fact, and the when would vanish.

Which sounds easy to fix until you sit with it. What is a number? What is a timeline? “Three days later” is a fact. “On the eighth day” is a fact. “After the rockfall” is a fact about ordering with no number in it at all. Figuring out what we even meant by temporal took its own round of tests. That’s the temporal lens, and it was not obvious.

Then a subtler one. The facts were interpreting when they should have been recording. The model would read a scene and tell me what it meant. It wouldn’t tell me that the character held another’s hand; it would tell me that the character liked the other. I wanted facts, not meaning.

And now I had too many facts. So time to merge.

Merge showed me something I couldn’t see before. The sweep pass, the one told to be exhaustive, was packing several facts into one. Each looked like a single fact only if you count a sentence that chains three facts together with “and” as one fact. So back to the prompt: one fact equals one state-change, with worked wrong-and-right examples. The tell turned out to be the ellipsis. If the model reaches for … to stitch a quote together, it bundled. Split it. That alone took bundling from 15% to 1%.
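
The ellipsis tell is easy to check mechanically. A rough sketch, assuming each extracted fact carries a supporting quote as a plain string; the function names are hypothetical and this is a heuristic, not the check the pipeline actually runs.

```python
import re
from typing import List

# Either the single-character ellipsis or three or more ASCII dots.
ELLIPSIS = re.compile(r"…|\.{3,}")


def looks_bundled(quote: str) -> bool:
    """Flag a quote that was stitched together with an ellipsis."""
    return bool(ELLIPSIS.search(quote))


def bundling_rate(quotes: List[str]) -> float:
    """Fraction of quotes that carry the ellipsis tell."""
    if not quotes:
        return 0.0
    return sum(looks_bundled(q) for q in quotes) / len(quotes)
```

A quote that contains an ellipsis in the source text itself, such as trailing speech, would be a false positive, so a flagged fact is a candidate for splitting rather than proof of bundling.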

And only then, with a prompt stack that actually worked, did the real engineering question show up:

How do I run this?

Five prompts, multiple samples each, against two boxes.

So I had five prompts, but how do I run them correctly? The problem is that any bug in how the prompts are run now costs money. A bug in my orchestration logic comes out of my pocket. That hurts, and it would have stopped me right there. I needed orchestration that fans all of it across both Sparks and merges it deterministically, and I couldn’t afford a bug unless a run cost nothing. And it did. It was *free*. So I could build the ensemble. And debug it. Like when the laptop hung.
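
A minimal sketch of that shape: every (session, lens, sample) unit is assigned to one of the boxes, run in parallel, and merged so the result does not depend on which box finished first. All names here are hypothetical, and `extract` stands in for whatever actually calls the model.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle, product
from typing import Callable, List, Sequence, Tuple

Unit = Tuple[str, str, int]  # (session id, lens name, sample index)


def fan_out(
    sessions: Sequence[str],
    lenses: Sequence[str],
    samples: int,
    endpoints: Sequence[str],
    extract: Callable[[str, Unit], List[str]],
) -> List[str]:
    """Run every unit against some endpoint and merge deterministically."""
    if not endpoints:
        raise ValueError("need at least one endpoint")
    units: List[Unit] = list(product(sessions, lenses, range(samples)))
    # Round-robin assignment: unit i goes to endpoint i mod len(endpoints).
    assigned = list(zip(cycle(endpoints), units))
    with ThreadPoolExecutor(max_workers=2 * len(endpoints)) as pool:
        batches = pool.map(lambda pair: extract(*pair), assigned)
        facts = {fact for batch in batches for fact in batch}
    # Sorting removes any dependence on completion order.
    return sorted(facts)


def fake_extract(endpoint: str, unit: Unit) -> List[str]:
    """Stand-in for a model call: one fact per (session, lens)."""
    session, lens, _sample = unit
    return [f"{session}:{lens}:fact"]
```

Static round-robin is the simplest assignment and does nothing about stragglers: if one box stalls on its last few units, the other sits idle. A shared work queue that both boxes pull from would balance better.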

Then I tried it on a few sessions. Then on six. Then, on the whole campaign.

By the end, the union of those passes was as good as the distill script.

This is the price of iteration

Read that sequence back and count the steps.

One prompt. Notice non-determinism. Run it twice. Notice missing facts. Add sweep. Notice dropped timelines. Argue about what a timeline even is. Add temporal. Notice interpretation creeping in. Split recording from inference. Build the merge. Discover the sweep is bundling. Fix the prompt. Then build the ensemble to run everything. Then scale it up in three steps.

Not one of those steps was a plan. Every single one was a reaction to something the previous run did wrong, which I could only see because I’d run the previous one and stared at the output.

That’s the part the meter kills.

Each of those steps is a run. On the API, each run is a charge, and most of them produce a result whose entire value is “huh, that’s wrong in a new way.” A null result, which costs you money. It made you smarter, but it cost you money. You can’t justify “let me run it twice to see if the non-determinism is hiding recall” when the second run costs money and might just confirm the first. So you don’t. You take the first okay-ish output, and you stop.

And if you stop at “run it twice,” you never see the missing facts, so you never add the sweep, so you never hit the bundling problem at merge, so you never write the bundling fix. The chain exists only because each link was cheap enough to follow on a hunch.

The ceiling wasn’t Qwen. Qwen could always do this. The ceiling was how many times I could afford to be wrong on the way to finding out.

That’s the price of iteration. When you’ve already paid it as capex, you can afford to pay it over and over.

What beat what

The local models didn’t win. Qwen did not outwrite Claude. The final synthesis still runs on the big API model because it’s low-volume, cross-entity, quality-critical, and not where I want to save money. And by the time it runs, the input has been reduced from 537K tokens to about 40K, so it’s a cheap call.

Local to explore. API to finish.

The better world_state came from the chain, and the chain only exists because each run was cheap enough to follow on a hunch. The old API result was never the API’s ceiling. A well-iterated API run could probably have matched it. I just couldn’t afford to be wrong that many times in a row, because under a fixed budget, every null result is pure loss, and the chain is mostly null results until the end.

But it’s even worse than that. Because I had already distilled the data with Anthropic’s API, I couldn’t justify trying to make it better.

A meter doesn’t only cap how many experiments you run. It biases you toward timid ones. It makes the weird, probably-won’t-work shots feel irresponsible.

Owning the box doesn’t make you smarter. It stops punishing you for looking.

Where this doesn’t hold

And here’s where we have to be careful.

Utilization is carrying that word “free.” The power was pennies, but the amortized cost per run is only low if the boxes stay busy. An idle Spark has a high effective cost per token. For an occasional personal pipeline, the money case is thin, and the real reasons are data locality, iteration freedom, and learning.
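
One simple way to put numbers on that, as a sketch: spread the hardware cost over the hours the box is actually busy, then charge each run for the hours it uses. The model and every parameter name are assumptions for illustration, and no real prices are implied.

```python
def cost_per_run(
    capex: float,
    lifetime_hours: float,
    utilization: float,
    run_hours: float,
    power_cost_per_hour: float = 0.0,
) -> float:
    """Effective cost of one run when capex is amortized over busy hours only."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    busy_hours = lifetime_hours * utilization
    hourly = capex / busy_hours + power_cost_per_hour
    return hourly * run_hours
```

With placeholder inputs, dropping utilization from 0.5 to 0.05 multiplies the amortized part of the cost by ten. The run did not change. The idle hours did.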

The engineering is the actual bill. Speculative execution, the timeout, the dead-socket hunt, that’s labor the API path externalizes to Anthropic’s SRE team. Own the serving, and you pay it yourself, in time, and you pay it again every time the stack drifts. It was the highest real cost here, by a lot.

And free experiments erode discipline. A budget imposes a crude “think before you spend” rigor. Remove it, and it’s easy to run 50 variants when 5 would have told you the same thing. Local doesn’t free you from experimental discipline. It just shifts the responsibility for supplying it from the meter to you.

Plenty of organizations are sitting on idle GPUs. The question worth asking isn’t whether to buy hardware. It’s whether there’s a sunk, underused inference asset you can route verifiable bulk work onto that’s reviewable, iterative, bounded, and fine to run in minutes instead of milliseconds.

Aggregation hit all four. It went local. Synthesis hit one. It stayed on the API.

Eight pull requests were what it took to find that line, one cheap experiment at a time. Including the four that did nothing, and the one where I was sure it was vLLM, and it was a laptop taking a nap.

That one was free, too.

The pipeline lives in CampaignGenerator — ensemble_extract.py, synthesise_world_state.py, spell_canon.py, and the per-lens prompts under config/agents/. The robustness work is PRs #64–#71. This is the retrospective, not the code.

Introducing TurbovecDB, an alternative to ChromaDB?

May 31, 2026 by kostadis roussos

Over the past few months, I have been working on a system I call “CampaignGenerator,” and one of the critical questions for that system is: how do you store state and make state retrieval fast?

When I discovered MemPalace, what drew me to it was its simplicity, locality, and elegant architecture.

What drove me bonkers was ChromaDB’s performance and stability.

Along the way, I got a DGX Spark, and that got me thinking about TurboQuant, and how it might make my DGX Spark more useful.

And as I kept looking around, I discovered a project called Turbovec. And that then made me go, hmm.

Could Turbovec replace the HNSW system that ChromaDB used, which made it so slow? And could it do so without breaking any of MemPalace’s guarantees?

So Claude and I started experimenting. And the results were promising and exciting.

But Turbovec is only part of ChromaDB. To have a full database, you need a stable store.

So what the heck, I decided to create a database, TurbovecDB. That was actually an experiment in using Claude.

I asked Claude to review what MemPalace required and determine how to build a database that would meet its constraints.

Turns out that the requirements of MemPalace are reasonably straightforward, and so Claude determined that the solution was to build an API layer on top of SQLite, use Turbovec instead of HNSW, and voila, turbovecdb v0.1.

I am not going to argue that what I built is a masterpiece of software engineering; I am going to note that this was relatively straightforward and shows the power of interfaces.
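
To illustrate the interface point, here is a sketch of the general shape: documents in SQLite, nearest-neighbour search behind a small interface. This is not turbovecdb's schema and not Turbovec's API. Every name is hypothetical, and a brute-force index stands in for the real one.

```python
import json
import math
import sqlite3
from typing import Dict, List, Protocol, Sequence


class VectorIndex(Protocol):
    """The only thing the store needs from an index."""

    def add(self, key: int, vector: Sequence[float]) -> None: ...

    def search(self, query: Sequence[float], k: int) -> List[int]: ...


class BruteForceIndex:
    """Stand-in index: exact cosine similarity over every stored vector."""

    def __init__(self) -> None:
        self._vectors: Dict[int, List[float]] = {}

    def add(self, key: int, vector: Sequence[float]) -> None:
        self._vectors[key] = list(vector)

    def search(self, query: Sequence[float], k: int) -> List[int]:
        def cosine(a: Sequence[float], b: Sequence[float]) -> float:
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / norm if norm else 0.0

        ranked = sorted(
            self._vectors,
            key=lambda key: cosine(query, self._vectors[key]),
            reverse=True,
        )
        return ranked[:k]


class Store:
    """Documents live in SQLite; similarity search is delegated to the index."""

    def __init__(self, index: VectorIndex, path: str = ":memory:") -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS docs ("
            "id INTEGER PRIMARY KEY, text TEXT NOT NULL, embedding TEXT NOT NULL)"
        )
        self._index = index

    def add(self, text: str, embedding: Sequence[float]) -> int:
        cur = self._db.execute(
            "INSERT INTO docs (text, embedding) VALUES (?, ?)",
            (text, json.dumps(list(embedding))),
        )
        self._db.commit()
        self._index.add(cur.lastrowid, embedding)
        return cur.lastrowid

    def query(self, embedding: Sequence[float], k: int = 3) -> List[str]:
        texts = []
        for doc_id in self._index.search(embedding, k):
            row = self._db.execute(
                "SELECT text FROM docs WHERE id = ?", (doc_id,)
            ).fetchone()
            texts.append(row[0])
        return texts
```

Because `Store` only depends on `add` and `search`, swapping the index is a one-line change at construction time. Keeping the embeddings in SQLite also means an index can be rebuilt from the store if it is lost.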

AI and programmers learning again what Wirth figured out in 1974, types and structure matter.

May 26, 2026 by kostadis roussos

The thing about trying to turn TTRPG content into something that can be used in a game is that TTRPG content is a book of art.

Most folks who don’t play TTRPGs don’t realize that the content isn’t a set of tables but rather a piece of art. Everything from the cover, the art, the font, and the language makes or breaks the TTRPG content.

When using it as part of a system for creating campaign content, different parts need to be handled separately.

Want to change a “one-shot” to fit into a campaign? The story itself is the first part you have to change.

Want to make it harder? Then you just need the mechanical details of the monsters.

General-purpose RAG doesn’t work. And shoving a 120 MB PDF into a model to get 10 KB out is a great way to waste time and make Anthropic’s shareholders and Nvidia’s shareholders wealthier.

So the first step is to transform the PDF into a structured document that can be used to make subset queries.
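
A sketch of what a subset query over such a structured document might look like. The `Section` type, the `kind` labels, and the sample content are all invented for illustration, and the PDF-to-structure step itself is not shown.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Section:
    title: str
    kind: str  # hypothetical labels, e.g. "story" or "statblock"
    text: str


def subset(doc: Iterable[Section], kind: str) -> List[Section]:
    """Return only the sections of the requested kind, in document order."""
    return [section for section in doc if section.kind == kind]


def payload_bytes(sections: Iterable[Section]) -> int:
    """Size of the text that would actually be sent to a model."""
    return sum(len(section.text.encode("utf-8")) for section in sections)


# A made-up one-shot, already split into typed sections.
one_shot = [
    Section("The Hook", "story", "A caravan vanishes on the road north."),
    Section("Cave Brute", "statblock", "AC 12, HP 30, one club attack."),
    Section("Aftermath", "story", "The town blames the miners."),
]
monsters = subset(one_shot, "statblock")
```

Making the adventure harder only needs `monsters`. Fitting it into a campaign only needs the `"story"` sections. Either way the request carries a fraction of the document instead of the whole PDF.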

And because TTRPG is no different from any other discipline, it turns out this problem is why SAP and others have moved away from pure RAG toward creating structured documents that the Agent can interact with.

As a computer scientist, what makes me laugh? The original promise of LLMs was that they could read unstructured data, infer structure, and make decisions.

And here we are discovering – yet again – that computers are great at operating on structure, not on systems with no structure.

Niklaus Wirth would be laughing.

AI Can Write the Code, But as Every Evil Overlord Knows, Every Plan Needs a Five-Year-Old to Vet It

May 18, 2026 by kostadis roussos

Now that AI accelerates code development and simplifies the discovery of design and architecture, the next bottleneck is review.

The challenge with review is that it is ultimately a judgment call, and judgment improves over time. When an organization chooses to remove junior engineers because it sees only senior engineers as capable of conducting reviews, it is actually lowering its productivity threshold. The challenge you face with reviews is that they are ultimately a human-biased artifact.

Human beings bring their experience and their whole personalities into the review. Having a mix of senior and junior people from diverse backgrounds increases the likelihood that reviews catch defects. This is not novel, new science. It goes back to pretty much every organizational principle since the list of the hundred things I would do if I were an evil overlord.

The list says if I were an evil overlord, I would have a five-year-old on staff whose only job was to tell me if a plan was silly and then fix it.

When you eliminate that diversity, you’re effectively reducing the scope of your ability to detect problems. Not the easy problems, but the really hard problems. Furthermore, if your group is not diverse in its experiences, it will become less capable of tolerating dissent. So although you will get a productivity win initially, over the long haul you will suffer from the collapse of your ability to detect issues.

As an industry, we know this very well, so when I see organizations that decide to go all in on a pattern where they retain a small number of similar people to do reviews and keep only them on the team, I see a team in danger of model collapse. They will have a short-term efficiency gain and a long-term failure as the blind spot magnifies.

I have seen this pattern at three different jobs where a senior technical team had a very homogeneous background and missed obvious trends because they saw them as alien to their experience. In fact, it was one of those jobs that led me to conclude that the best thing a company can do is periodically purge its senior technical ranks to ensure a fresh flow of insight, if the company is not growing fast enough for a large new body of technical leaders to be promoted to the highest levels.

73 architecturalist papers: survivability – a framework for software and culture

April 28, 2026 by kostadis roussos

Working Thesis

Software encodes culture. Culture increasingly depends on software. Therefore, the survivability of software systems determines the survivability of the societies that rely on them.

Origin of the Idea

I did not arrive at this concept through academic study.

I was exposed to cultures that had already survived:

through family
through relationships
through lived experience

These were systems that:

persisted across disruption
adapted without disappearing
survived even when institutions failed

At the same time, my professional life was focused on:

backup
disaster recovery (DR)
business continuity (BC)

At some point, the connection became unavoidable:

The same principles used to keep software systems alive are the principles that allowed cultures to survive.

Once seen, it cannot be unseen.

The Core Question

What allows a system—cultural or technical—to persist over time, especially under disruption?

Not:

correctness
performance
reliability

But:

Survivability

Definition of Survivability

A system is survivable if it can be re-created, adapted, and continue functioning after its assumptions about the environment break.

The Survivability Framework

A system survives only if it satisfies four properties:

1. Portability — Can it move without breaking?

Definition:

The ability to operate across environments without fundamental redesign.

Key Idea:

If it cannot move, it cannot survive change.

Failure Mode:

vendor lock-in
environment-specific assumptions
tight coupling to platform

Diagnostic Question:

How long would it take to run this somewhere else?

2. Redundancy — Can it be re-instantiated independently?

Definition:

Multiple independent ways to recreate and operate the system.

Key Idea:

Copies are not enough—independent capability is required.

Failure Mode:

single team knowledge
“hero engineers”
backups that cannot be restored

Diagnostic Question:

If the original team disappeared, could this system be rebuilt?

3. Sovereign Substrate — What do you control?

Definition:

The degree to which a system controls its environment.

Key Idea:

You cannot rely on what you do not control.

Failure Mode:

external pricing/licensing control
platform dependency with no exit path
reliance on opaque infrastructure

Diagnostic Question:

What parts of this system could be taken away from us?

4. Identity Preservation — Can it change without dissolving?

Definition:

The ability to evolve while maintaining coherence and purpose.

Key Idea:

Too rigid → extinction

Too fluid → loss of identity

Failure Mode:

breaking changes
fragmentation
drift without shared understanding

Diagnostic Question:

What must remain true for this to still be the same system?

The Hidden Layer: Assumptions

Every system encodes assumptions about its environment.

Survivability depends on how wrong those assumptions can be before the system collapses.

Portability Without Assimilation

Core Idea

A system must be able to move into a new environment without becoming indistinguishable from it.

The Trade-Off

Strategy	Benefit	Cost
Assimilation	Maximum local optimization	Loss of independence
Portability	Survivability, flexibility	Friction, constraints

Key Insight

The more perfectly a system fits its environment, the less likely it is to survive a change in that environment.

Important Clarification

Portability is not:

avoiding all platform features
building lowest-common-denominator systems

Portability is:

controlled integration
clear boundaries
replaceable dependencies

You can use the environment—you just cannot become it.

The Cost of Portability

Portability introduces:

friction
constraints
reduced short-term optimization

This is real.

But:

Portability trades short-term efficiency for long-term survivability.

Refined Principle

Assimilation optimizes for stability. Portability prepares for change.

Cultural Survivability (Observed Patterns)

From lived experience and exposure:

Systems survived when they were:

carried by people
replicated across many independent actors
able to move across environments
able to change without losing identity

They did not survive because they were:

stable
protected
centrally controlled

Critical Insight

A culture written down is not alive. It survives only when it is continuously executed by people.

Mapping Culture → Software

Cultural Property	Software Equivalent
Carried by people	Portability
Many independent communities	Redundancy
Spaces of self-governance	Sovereign substrate
Continuity through change	Identity preservation

Backup, DR, and Culture

BDR Concept	Cultural Equivalent
Backup	Distributed memory (people, practices)
Disaster Recovery	Ability to re-establish in new environments
Business Continuity	Maintaining function during disruption
Infrastructure Control	Cultural autonomy

Key Realization

Backup and disaster recovery are not just technical practices—they are survival strategies for any system that must persist.

Modern Risk: Software as Cultural Substrate

Today, society depends on software for:

identity
finance
communication
knowledge

Which means:

We are encoding culture into systems that may not be survivable.

The Risk

Many modern systems:

are not portable
lack redundancy
do not control their substrate

Therefore:

We are building cultural systems that cannot survive disruption.

Survivability Debt

Every time you sacrifice portability, redundancy, or control for convenience, you incur survivability debt.

This debt is only paid when:

the environment changes
assumptions break

The Engineer’s Blind Spot

Most engineers optimize for:

performance
integration
correctness

But rarely ask:

Will this system survive if the environment changes?

The Survivability Test

Evaluate any system:

Can it move?
Can it be rebuilt independently?
Does it control its environment?
Can it evolve without losing identity?

If any answer is “no”:

The system is fragile—even if it appears reliable.
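
The test can be written down as a tiny checklist. This is only an illustrative sketch: the class and field names are invented here, and a real audit would need evidence behind each answer rather than a boolean.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class SurvivabilityAudit:
    can_move: bool              # Portability
    can_be_rebuilt: bool        # Redundancy
    controls_environment: bool  # Sovereign substrate
    keeps_identity: bool        # Identity preservation

    def failures(self) -> List[str]:
        """Names of the questions answered 'no', in declaration order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    @property
    def fragile(self) -> bool:
        """Any single 'no' makes the system fragile."""
        return bool(self.failures())
```

For example, `SurvivabilityAudit(True, True, False, True)` is fragile, and `failures()` names `controls_environment` as the reason, however reliable the system looks day to day.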

Core Principles

Survivability > Reliability
Optionality > Optimization
Independence > Convenience
Execution > Preservation

Final Insight

Systems do not survive because they are correct. They survive because they can adapt, be carried, and be re-created across changing environments.

One-Line Summary

If culture runs on software, then software survivability becomes a prerequisite for cultural survival.

Closing Perspective

I did not learn about survivability from systems that worked.

I learned it from seeing what remained when systems failed.

And now the question is:

Are we building systems that will remain?

Next Steps (Optional Expansion)

Case Study: Vendor lock-in and infrastructure dependency
Case Study: Long-lived cultural systems
Practical: Survivability audit checklist for engineers
Patterns: Designing for portability without losing capability


the nutanixist 31: why portability matters as much as backup and dr for business continuity: lessons from the broadcom acquisition of VMware

April 21, 2026 by kostadis roussos

red white and blue abstract painting — Photo by Jazmin Quaynor on Unsplash

People often talk about resilience in software as if backup and DR are enough. What the Broadcom acquisition of VMware made me understand is that they aren’t.

Software survivability depends on three things working together: portability, physical ownership, and backup. Without all three, you are trapped. And once you are trapped, you are vulnerable.

Software is just text. Like human culture, it is inert without a physical substrate to store, execute, and restore it. That means survivability is not just a question of whether your code exists somewhere. It is a question of whether you can actually bring it back to life under adverse conditions.

To understand this, it helps to break any software system into two parts:

Infrastructure
Application or business logic

An application cannot run without infrastructure. But infrastructure is itself software, with its own dependencies, constraints, and failure modes. So survivability must be evaluated across the entire stack, not just at the application layer.

1. Backup Only Matters If You Can Restore It Yourself

A backup is only real if it can be restored without relying on the original vendor.

If recovery depends on the original authors of the code, the original cloud provider, or the original platform vendor, then you do not truly control your backup. You are renting recoverability, not owning it.

This matters most in precisely the situations where recovery is most important. If the vendor cannot or will not support you, then your backup is functionally useless. If the vendor’s recovery capacity is constrained, you should assume you are a low priority. During a black swan event, many customers may need restoration simultaneously. That is exactly when centralized recovery support breaks down.

If your restore plan assumes the vendor will be available, willing, and sufficiently staffed under crisis conditions, then it is not a restore plan. It is hope.

Furthermore, if your plan assumes that any relationship you have with a vendor cannot be disrupted because of events in the world, that is not a plan; that is a hope.

2. You Need Ownership of Physical Infrastructure

Even if you can restore independently, you still need somewhere to restore to.

If you do not own physical infrastructure, you have no guarantee there will be a place to run your system when you need it most. And this is not just about owning some hardware in the abstract. You need the right infrastructure, with the right capacity, available on the right timeline.

A recovery plan that depends on finding infrastructure during a crisis is not much of a recovery plan, either.

But ownership and restoration still are not enough.

Physical infrastructure can be destroyed. It can be denied. It can become unreachable. That can happen through deliberate action, geopolitical conflict, sabotage, or natural disaster. If the physical substrate disappears, your software, however well-designed, becomes unusable.

This is why survivability cannot stop at backup and infrastructure alone. There has to be another layer of freedom.

3. Portability Is What Prevents Lock-In From Becoming Captivity

Every software system contains two layers: infrastructure and business logic. Application portability is the ability to move that business logic across infrastructures.

This does not mean portability is free. It is not transparent, and it is never without cost. But it must be possible.

That distinction matters. A system does not need to move instantly or effortlessly. It does need to be capable of moving.

Without portability, your application is fused to a specific infrastructure stack. And once that happens, your bargaining power disappears. Pricing changes, policy changes, licensing changes, geopolitical shifts, or vendor instability all become existential problems rather than procurement problems.

Portability is what turns dependence into choice.

It does not happen accidentally. It requires discipline, architectural restraint, and a willingness to forgo some short-term convenience to preserve long-term freedom.

A Case Study in the Cost of Missing Portability

As the vCenter architect from 2015 to 2023, I believed VMware had a sacred trust with its customers. VMware’s attitude was that we did not lose customers. We worked hard to keep them.

That mindset led many customers to assume they did not need portability. They had backups. They had physical infrastructure. What they lacked was a portable application layer.

The hidden assumption was that the business relationship itself was fixed.

It was not.

The relationship existed only as long as capital markets agreed with it. Once the markets decided that the existing customer relationship no longer aligned with their interests, the relationship changed. And when that happened, the absence of portability became painfully visible.

Customers suddenly found themselves with no real negotiating leverage. Prices rose, and many had little choice but to pay. Porting away was either prohibitively difficult or practically impossible.

Why?

Because portability had never been built. No portable layer existed. And portability does not emerge by accident. If you do not explicitly design for it, it will not be there when you need it.

The Real Test of Survivability

A survivable software system is not one that merely runs well under normal conditions. It is one that can survive broken relationships, failed vendors, destroyed infrastructure, and changed incentives.

That requires three things:

Backup you can restore yourself
Ownership of physical infrastructure sufficient to restore onto
Portability of the application layer across infrastructures

Remove any one of these, and your resilience is incomplete.

Backup without self-service restore is dependency.
Infrastructure without portability is entrapment.
Portability without infrastructure is theory.

Survivability begins when all three are present at once.

Rich Plots, Real Improvisation

April 17, 2026 by kostadis roussos

What Alice Saw

I shared a tool in a Discord channel the other day, and a new DM named Alice messaged me back:

“I’m gonna be so honest. This is Greek to me. I have no clue what you’re showing or what problem it solves. I’m a new DM.”

Fair.

So I sent her a Google Doc I use to prep my Out of the Abyss campaign. Here’s a short snippet.

Score Name	NPC/Faction	Current Value	Next Threshold	What Triggers Next
Zuggtmoy’s Wedding	Zuggtmoy / Neverlight Grove	Elevated — 2 increases (Blingdenstone expansion + Basidia’s evacuation removed internal resistance)	Wedding completion / Araumycos union	Continued party absence from Underdark; no faction opposes Zuggtmoy; fungal spread reaches critical mass
Juiblex Rebirth	Juiblex	Low-Moderate — 1 increase (declared intent to consume Zuggtmoy’s domain)	Juiblex manifests a new physical form or begins attacking Zuggtmoy’s territory	Zuggtmoy’s wedding weakens her defenses; time passes without intervention

A minute later she replied:

“Ooooooooh. It’s for your campaign story. I thought it was a software thing, not creative.”

She was right. I’d explained the machinery before I’d explained the use.

So here’s the useful version:

I use AI to help me maintain campaign canon across long-running games, but I do not let it decide what counts as canon.

That distinction is the whole system.

What Alice saw was a planning document. At the top were four campaign clocks: four villain plots advancing in the background, each one changed by something the party had done, or failed to do, at the table.

Under that were faction states, NPC dossiers, and plot notes: who knows what, who wants what, what the party has learned, and what is changing offscreen.

I scan that document before every session. In a minute or two, I know what moved, why it moved, and what pressure is building in the world if the party does nothing.

I can’t run the campaign I want to run without this doc.

I also can’t write it by hand. Not across two campaigns. Not across a year of sessions. Not with dozens of named NPCs, each dragging their own history behind them.

This essay is about how I got it written anyway.

What Kind of Game I Want

I want a particular kind of game.

I want players making strange, committed, character-driven choices that I could not have predicted in advance.

That’s not abstract for me. I played Baldur’s Gate 3 cold, no guide, no walkthrough, and made a series of choices as Shadowheart that apparently almost nobody makes. Not because I was optimizing for rarity, but because they made sense for the version of her I was playing.

That’s what I want at the table: not correct choices, but real ones.

I run two D&D campaigns: a heavily modified Out of the Abyss campaign for the Ember Vanguard, and a Dragon of Icespire Peak / Lost Mine hybrid set in Phandalin. I actively push both parties to go somewhere I didn’t plan for. If the plot I prepped isn’t the plot they want, that’s fine.

That’s the deal.

Pick any two

The problem is that I want three things at once.

First: deep prep. Texture. NPCs with interiors. Villains whose behavior today is a consequence of something they chose eight sessions ago. Plots that keep moving, whether the party is watching or not.

Second: flexibility. When the party walks past the dungeon I built, something has to be where they actually went, and it has to feel like it was always going to be there.

Third: consistency. The villain I run in session 24 has to behave like the villain I ran in session 6. If I forget what Shal already said, did, or knew, the illusion cracks.

For a long time, that combination felt impossible.

Prep deep, and the moment players deviate, you’re improvising on top of prep that no longer applies. Prep loose, and the world gets thin. Try to prep every branch, and you end up burning the time you were trying to save.

And over a long campaign, the hardest problem is quieter: you lose track of what actually happened. The next scene drifts a little. Then a little more. Nobody stops the game to point out the contradiction. The fiction just gets lighter.

That’s the part people don’t say out loud. When the party walks past four hours of prep, I’m not upset because they missed it. I’m upset because I burned four hours on something that no longer matters. That lost time becomes thinner prep for the next session, then less energy in the session after that. Players feel it too. They become less willing to push into unplanned territory if they can sense I’m paying for it.

Before LLMs, I had more or less given up on getting all three.

What I Tried First

I wrote summaries from memory. That works for one campaign, maybe. It breaks fast when there’s too much to hold.

Then I tried using an LLM to write the summaries. Better than nothing, but imprecise in ways I didn’t always catch, and the errors showed up later, when they were harder to spot and more expensive to fix.

Then I found GMAssistant.app. That was a real improvement. It gave me solid summaries of what happened. But a D&D session isn’t just a sequence of actions. It’s dialogue, tension, implication, half-finished intentions, weird emotional turns. The recap could get the action right while still losing the feel of the session.

So I went further. I combined GMAssistant recaps with verbatim VTT transcripts from our Zoom calls. Then I built tooling around that. Then more tooling around the tooling. Six weeks, maybe two months, of real work.

I thought I was solving the record problem.

I was. Partly.

The Failure Mode

What I didn’t realize until it nearly cost me a scene was that I was also building a new kind of failure.

A few months into Out of the Abyss, I was designing the endgame around an earlier scene. A PC named Daz had come across evidence implicating a major NPC. My planned encounter assumed he had taken that evidence with him.

I checked the LLM-generated recap to confirm. It said Daz had discovered the evidence. Good enough, I thought, and I kept writing.

Then, by accident, I re-read the original session summary.

What had actually happened was narrower. Daz had looked at some unusual books on a shelf and left the room. He hadn’t opened them. He hadn’t taken them.

Noticed had become discovered in the paraphrase. And discovered had quietly shaded, in my head, into obtained.

If I had run the encounter as written, I would have retconned my own campaign.

That’s the failure mode.

The model hadn’t lied. It had paraphrased. But the paraphrase was fluent. It read like canon. It read so much like canon that I stopped checking the source.

And once that paraphrase enters the next stage of the pipeline, it hardens. A summary becomes a dossier entry. A dossier entry becomes a threat score. A threat score shapes the next session. Small errors do not stay small.

My first version of the tool had bought me depth and flexibility at the cost of consistency, and it had done it invisibly.

The obvious fix would be to go back to the source every time. But if I have to re-read everything every time, I don’t need the tool.

What Actually Worked

What actually worked was putting myself back in the middle.

Now I use the same loop at every layer.

First, the model reads what I can’t read quickly: a transcript, a stack of summaries, a year of sessions. It gives me candidate structure: NPC lists, draft dossiers, scene candidates, recap material.

Second, I review that structure. I fix names. I merge duplicates. I cut paraphrases that slipped into invention. I restore what matters and remove what doesn’t.

This step is not optional. This step is the work.

Third, the model takes the reviewed structure and renders it as prose: a dossier, a narrative recap, a planning document, or a threat tracker.

The model is strong at the first and third steps. It is unreliable at the second. Scope, attribution, ordering, what counts as canon, what matters dramatically: those are creative decisions. Those stay with me.

Skip that middle step, and the errors compound. Keep it, and the whole loop holds.

That’s the system.

What I Have Now

The threat tracker Alice saw is the direct output of it. Session material gets extracted into per-NPC dossiers. I review and reconcile them. Then the tool synthesizes the planning document from that reviewed canon.

That’s how I get villain clocks that stay consistent across a year of play. It’s how I get session recaps that read like narrative chapters without drifting into invention. It’s how I get pre-session cheat sheets I can trust.

It’s also how I keep two campaigns inside one world without the whole thing collapsing under its own weight.

What I have now is not magic. It’s just finally the right division of labor.

I have a threat tracker that tells me, before every session, which villain plans have moved and what moved them.

I have session recaps that read like narrative chapters, but only after I verify the details that matter.

I have NPC dossiers where Captain Tolubb is one NPC, not three duplicate spellings pretending to be different people.

I have two campaigns sharing one coherent world across a year of play.

And most importantly, I no longer mind when the party walks past the dungeon.

Because the prep I’m doing now is not the kind that gets wasted.

If You Want This Too

If you want this for your own game, the practical lesson is simple:

Use AI to read, sort, summarize, and draft.

Do not use it to silently decide what is true.

That part stays with you.

The tools are free. They’re also crude. I’m one GM iterating on my own campaigns, not a product team, and the learning curve is not zero.

I don’t make money on any of this. It’s a hobby. My goal is simply for more of us to have better tools, and for the people trying to make a living doing this to actually make a living.

So if this sparks something for you, build the version you want. If something I built is useful, take it. Fork it. Break it. Send it to a friend who runs games.

I would rather more of us had working tools than fewer.

The last essay was for people interested in the machinery: the loop, the trust layers between documents, and the searchable index I use to query reviewed content mid-session.

This one was for Alice.

72 architecturalist papers: rich plots, real improvisation

April 16, 2026 by kostadis roussos

Notes from running D&D with LLMs.

The trilemma

I run two D&D campaigns. Out of the Abyss, heavily modified. And a Dragon of Icespire Peak / Lost Mine hybrid set in Phandalin.

Between them: hundreds of pages of session summaries, dozens of NPCs, two plots branching for over a year.

Every GM I know wants three things at once.

A campaign that’s deeply prepped — with texture, with NPCs who have agendas, with plots that advance under their own momentum whether the party notices or not.

A campaign that’s flexible — that bends when players go somewhere you didn’t plan for. Which is always. Players are the point.

A campaign that’s consistent — where session 24 honors session 6, where NPCs don’t change their minds between games unless something actually happened, where nobody at the table says “wait, didn’t we already kill that guy?”

Prepped. Flexible. Consistent.

Pick two.

That’s the GM’s trilemma. Prep deep and you’re brittle: the party walks past your dungeon and you’re either railroading or improvising on top of prep that no longer applies. Prep loose and the world feels thin. Compromise and you get both problems.

This essay is about the tooling I built to stop picking two.

Not a product pitch. Working notes from a GM on what broke, what worked, and the pattern that ended up holding three things together.

Why this used to be impossible

Before LLMs, the leg that went first was consistency.

Prep depth is achievable. It just costs hours per session. Flexibility within deep prep is achievable too — deep prep gives you material to pivot to.

What isn’t achievable, not by anyone I’ve ever met, is holding the full state of a long-running campaign in your head.

A campaign with eighteen sessions and forty named NPCs is a body of information larger than memory. You remember what stuck. You forget the cleric from session 11. You half-remember the stone giant’s oath and improvise around the gap. Two sessions later, you’ve accidentally rewritten what he said.

It’s not a failure of effort. It’s bandwidth.

The standard responses are well-worn. Run shallower campaigns so the state fits in your head. Run published modules and let the book do the remembering. Keep a wiki you don’t have time to update. Keep a journal you don’t have time to re-read.

All of these work, to a point. None of them break the trilemma. They just lower the ceiling so the trilemma bites less.

I wanted the ceiling back.

Where LLMs break

LLMs are, on their face, the missing piece. They can read everything I can’t.

My first version of the pipeline was exactly that. Summaries in, prep out. The LLM reads everything. I read its output. Done.

It wasn’t done.

A few months into Out of the Abyss, I was designing an endgame encounter that leaned on an earlier scene. A PC named Daz had come across evidence implicating a major NPC. The encounter I was drafting assumed he’d taken the evidence with him.

I checked the LLM-generated recap. It said: Daz discovered the evidence. Good. I finished the encounter.

Then, by accident, I re-read the original session summary. What actually happened: Daz saw the books on a shelf, noted they looked unusual, and left them there. He hadn’t taken them. He hadn’t opened them.

Noticed had become discovered. Discovered had, in my head, shaded into obtained.

The encounter I’d built was internally consistent. If I’d run it, it would have been a retcon. Players notice retcons. Some small fraction of the fiction’s weight leaks out.

The interesting part is the mechanism.

The LLM hadn’t lied. It had paraphrased. And the output was fluent. It read like canon. It read like canon so smoothly that I stopped going back to check.

That’s the failure mode. The LLM’s output doesn’t announce that it’s a paraphrase. It reads like a record.

If I ask the LLM to also do the next step — generate the encounter from the paraphrased recap — it will happily do so. The paraphrase hardens into a scene with dialogue and stakes, pointed at my players. Errors don’t stay small. They compound.

So the naive version traded away consistency to gain depth and flexibility. And the trade was invisible.

The obvious fix — verify every output by hand — erases the value. If I re-read the source every time, I don’t need the LLM.

The real question was different. Could I structure the pipeline so the LLM does only what it’s reliable at — rendering verified structure — and a human review step sits at every point where precision matters?

That’s what the rest of this is about.

The extraction layer I don’t own

Before any of my tooling runs, something has to turn the raw session — hours of unstructured speech on a Zoom VTT — into usable signal.

I don’t do that part. Other people have built good tools for it. GMAssistant.app. Saga20. A handful of others in the AI-assisted D&D space.

My own pipeline leans on GMAssistant directly. The session doc generator uses its recap as an authoritative anchor for scene extraction. Without it, extraction is an unguided scan across the transcript, and unguided scans miss things.

The ecosystem matters. Crediting it matters. The tools I built sit on top of a layer other people built.

That said: the output of the VTT extraction layer is still an LLM extraction. Fluent. Plausible. Sometimes wrong in ways you won’t catch without reading the source.

The review beat applies to their output too. Everything downstream of the VTT passes through my gate before anything else runs.

The loop

The pattern I ended up with wasn’t elegant the first time. I didn’t arrive at it by principle. I arrived at it by iterating on two live campaigns and paying attention to where things kept breaking.

The shape my iteration settled into is three beats.

Extract. An LLM reads something I can’t read fast — a transcript, a session summary, a stack of extractions — and returns candidate structure.

Review. I read the candidate structure. I fix it. I merge it. I throw out what’s wrong. I add what’s missing. This is not optional. This is the thing.

Render. An LLM takes the reviewed structure and produces readable prose — a narrative recap, an NPC dossier, an encounter document. It’s rendering inside a structure I’ve verified.

That’s it.

The LLM is strong at beats one and three. It’s unreliable at beat two — scope, ordering, attribution, what counts as canon. Those are precision decisions. Those are mine.

Skip the review beat and errors compound. The first LLM’s paraphrase becomes the second LLM’s input. Two LLMs downstream, the original detail is unrecognizable. Nobody notices because the output is fluent.

Keep the review beat and the loop holds. The LLM does what it’s good at. I do what I’m good at. The content that comes out has actually been seen by a human.
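A minimal sketch of the three beats, to make the gate concrete. This is not the code in my scripts. The names (Candidate, extract, review, render) and the stand-in functions are all hypothetical, and the model calls are plain Python callables so the example runs on its own.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    reviewed: bool = False  # flips to True only in the review beat


def extract(source, llm_extract):
    """Beat one: the model proposes candidate facts; none are trusted yet."""
    return [Candidate(text) for text in llm_extract(source)]


def review(candidates, human_fix):
    """Beat two: a person corrects or drops each candidate.

    human_fix returns the corrected text, or None to throw the candidate out.
    """
    kept = []
    for candidate in candidates:
        fixed = human_fix(candidate.text)
        if fixed is not None:
            kept.append(Candidate(fixed, reviewed=True))
    return kept


def render(candidates, llm_render):
    """Beat three: refuse to render anything that skipped review."""
    if any(not c.reviewed for c in candidates):
        raise ValueError("unreviewed candidate reached the render beat")
    return llm_render([c.text for c in candidates])


# Stand-ins for the model calls and for the human, for illustration only.
def fake_extract(source):
    return ["Daz discovered the evidence."]


def fake_fix(text):
    return text.replace(
        "discovered the evidence",
        "noticed unusual books and left them on the shelf",
    )


def fake_render(facts):
    return " ".join(facts)


candidates = extract("session summary text", fake_extract)
recap = render(review(candidates, fake_fix), fake_render)
print(recap)
```

The design choice that matters is the exception in render. Skipping the review beat should fail loudly, not produce a fluent document.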

I now run this loop at every layer of my pipeline. There are four seams where it lives. They’re worth walking through.

Seam one: NPC dossiers

The party meets an NPC in session 4. They reference him again in session 9 with a slight typo on the name. In session 15, someone else mentions a character who may or may not be the same person.

Without help, I now have three fragments of one NPC and no single view.

The script I use, planning.py --build-dossiers, extracts per-NPC information from every summary. Each dossier lives in its own file. Extract.

Then I read them. And half the time I find duplicates under different names. Captain Tolubb. Cap. Tolubb. Tolubb. Three files, one NPC.

I merge them. Pick the canonical file. Fold the names into an aliases: block at the top. Reconcile any content differences by hand. Delete the losers. Review.

When I run planning.py --synthesize on the reviewed dossiers, the synthesizer uses the aliases to rewrite every occurrence of “Cap. Tolubb” in the raw extracts to “Tolubb” before the LLM sees the text. The final planning doc treats him as one NPC. Render.

Skip the review beat — the merge, the alias recording — and synthesis treats every variant as a distinct NPC. You get the fragmented result back, now laundered into a clean-looking document. Exactly what the merge was supposed to fix.
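Here is roughly what that alias rewrite looks like, as a sketch and not the actual planning.py code. The function names and the dictionary shape are assumptions. The only thing taken from my workflow is the idea: the reviewed aliases decide which spellings collapse into which canonical name.

```python
import re


def build_alias_map(dossiers):
    """Map every recorded alias to its canonical NPC name.

    dossiers is a hypothetical shape: {canonical_name: [alias, ...]}.
    """
    alias_map = {}
    for canonical, aliases in dossiers.items():
        for alias in aliases:
            alias_map[alias] = canonical
    return alias_map


def normalize_names(text, alias_map):
    """Rewrite each alias in text to its canonical name."""
    if not alias_map:
        return text
    # Longest alias first, so a long alias wins over a shorter one it contains.
    ordered = sorted(alias_map, key=len, reverse=True)
    alternatives = "|".join(re.escape(alias) for alias in ordered)
    # The lookarounds keep an alias from matching inside a longer word.
    pattern = re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)")
    return pattern.sub(lambda match: alias_map[match.group(0)], text)


dossiers = {"Tolubb": ["Captain Tolubb", "Cap. Tolubb"]}
alias_map = build_alias_map(dossiers)
raw = "Cap. Tolubb warned the party. Later, Captain Tolubb closed the gate. Tolubb left."
print(normalize_names(raw, alias_map))
```

With an empty alias map, which is what you have if you skip the merge, the text passes through untouched and every spelling survives as its own NPC.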

The script can extract. I have to review. The script can render.

Three beats.

Seam two: session recaps

After a session, I have a VTT transcript, a GMAssistant-style recap, and a few hours of memory.

The session doc generator runs passes one through four — consistency check against campaign state, enhancement of structured sections, narrative plan, per-scene character extraction. Extract.

Then I stop the pipeline. Each scene’s extraction is written to a file. I open them in an editor. I read them against the VTT source. I add dialogue the extractor missed. I cut lines the extractor invented. I adjust emphasis when it got the emotional beat wrong. Review.

When I’m satisfied, I run pass five — narration — from the reviewed extractions. One character, one scene at a time. Render.

What comes out is a session document that reads like a novel chapter. First-person per character. Style-matched to their voice. Dialogue that was actually said, because I verified it in beat two.

If I skipped the review beat, the narration would be fluent and wrong in ways I couldn’t see without checking the VTT. Same failure mode as the Daz scene, one layer deeper.

Seam three: grounding docs

The bible of my campaign is the accumulated session summaries. Too long to read every time. Too important to not use.

distill.py and campaign_state.py extract from the bible. World state. NPC states. Completed quests. Open threads. Extract.

I read the generated world_state.md and campaign_state.md. I fix the things that are wrong. I add the things the extractor missed. I cut the things that are no longer true. Review.

Those reviewed docs are now the grounding context for every downstream prep script. prep.py reads them first. Everything I generate is rendered against a world state I’ve verified. Render.

This is the least visible seam. It’s also the one that makes all the others work. Prep generated from bad grounding is bad prep, no matter how good the prep prompt is.

Seam four: cross-campaign canon

I don’t just run two campaigns in parallel. I run two campaigns that share a world.

Group 2’s actions become Group 1’s history. Group 1’s consequences become Group 2’s present. This is fun and also a lot to keep straight.

I keep a notes/canon/ directory. Cross-campaign events go there when they happen. The party in one campaign did X; here’s what the other campaign’s world now has to account for. Extract. (Often by hand, not by LLM.)

I review the notes before they touch either campaign’s grounding docs. Often I’m the only person with enough context to know what an event actually means across the shared world. Review.

When I’m sure, I promote the relevant facts into the appropriate campaign’s world_state.md and NPC dossiers. Render.

notes/ is staging. Neither campaign’s palace indexes it directly. The canon gate runs across both worlds.

The query that kept failing

The loop gives me clean content. Reviewed dossiers. Verified grounding docs. Session recaps I trust.

Clean content in files is also passive content.

Four months into running OotA, mid-session, the party doubled back through a village they’d passed through five sessions earlier. Someone mentioned that bartender we met. I had six seconds before my table noticed I didn’t know.

I tried to ask NotebookLM.

“Hey NotebookLM, what’s the name of this person in this village the party met?”

The answer came back confident, fluent, and wrong. Wrong name. Wrong village. Wrong session.

NotebookLM doesn’t know what “the party” means. It has no persistent roster of my PCs. It has no trust layering — an NPC mentioned in a planning draft weighs the same as an NPC in a session summary, and the planning draft is speculative. It summarizes when I need it to return hits. It loses temporal context — met in session 4 looks the same as mentioned in passing in session 11.

Every time I tried to use it mid-session, the same thing happened. Fluent. Confident. Wrong.

This is the naive LLM problem from the Daz scene, one layer up. NotebookLM is doing extract + render with no review beat in between, no trust awareness, and no sense of which documents count as what.

So I had clean content I couldn’t query reliably. I’d solved the consistency leg with the loop and then rediscovered the same failure mode at the retrieval layer.

The flexibility leg wasn’t going to come from a better prompt.

Memory palace

The metaphor for what I built is older than computing. A memory palace. A building in your head where you place things so you can walk through and find them again.

Mempalace is that, externalized. A searchable index over the reviewed content the loop produces.

The important word is reviewed. The palace doesn’t index my raw summaries. It doesn’t index speculative notes. It doesn’t index the unrevised output of LLM extraction. It indexes what the loop has blessed.

The palace has structure. Three layers, different trust levels.

Narrative. The campaign bible, split into chapters. Authoritative. What actually happened at the table. If the palace tells me the narrative wing says X, X happened.

Chronicle. LLM extractions of the bible, organized by time. Search accelerator. Fast way to find the right session or the right window. Not authoritative on its own — the paraphrase problem still applies — but good at answering when did this matter.

Reference. Reviewed grounding docs, NPC dossiers, world state. Working reference. What’s currently true. Stable between sessions.

A query crosses the wings. Find the NPC in reference (who is this). Check chronicle for the time window (when did this matter). Verify in narrative (what actually happened).

That last step is the one NotebookLM doesn’t have. NotebookLM returns a paraphrase and calls it an answer. The palace returns a paraphrase and tells you the source paragraph. You can verify, in seconds, before committing.

Trust layering plus source retrieval plus structure-aware search. That’s the mechanism.
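A toy version of that cross-wing query, using in-memory dictionaries in place of a real index. None of this is mempalace's API. The data structures, the query function, and the session number are hypothetical placeholders, chosen only to show a paraphrase coming back with its source attached.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    session: int
    paraphrase: str  # from the chronicle wing: fast, not authoritative
    source: list     # from the narrative wing: the paragraphs to verify against


# Reference wing: who is this? (placeholder entry)
REFERENCE = {"Daz": {"kind": "PC"}}

# Chronicle wing: LLM extractions organized by time. Session 12 is made up.
CHRONICLE = [
    {"session": 12, "names": ["Daz"], "text": "Daz discovered the evidence."},
]

# Narrative wing: the bible itself, keyed by session.
NARRATIVE = {
    12: [
        "Daz looked at the unusual books on the shelf.",
        "He left the room without opening or taking them.",
    ],
}


def query(name, keyword):
    """Reference first, chronicle for the window, narrative for the source."""
    if name not in REFERENCE:
        return []
    hits = []
    for entry in CHRONICLE:
        if name in entry["names"] and keyword.lower() in entry["text"].lower():
            paragraphs = NARRATIVE.get(entry["session"], [])
            hits.append(Hit(entry["session"], entry["text"], paragraphs))
    return hits


hits = query("Daz", "evidence")
for hit in hits:
    print(hit.paraphrase)
    for paragraph in hit.source:
        print("  source:", paragraph)
```

The paraphrase says discovered. The source paragraphs sitting next to it say he left without opening or taking the books. Putting both on screen is the whole point.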

Prep time: the Daz problem, solved

Back to the Daz scene.

The question I should have run, before building the endgame encounter: what did Daz do with the evidence in that session?

In the palace: search the chronicle wing first — find the session where the evidence came up. Then pull the actual scene from the narrative wing. Read the two paragraphs. Three minutes, maybe.

What I would have found: Daz looked at the books, said something in character about them being unusual, and walked out of the room without touching them.

The encounter I was building would have been redesigned on the spot. Probably better for it — he knew and did nothing is a more interesting story beat than he took the evidence, for the kind of character Daz is.

That’s the consistency leg, live.

I now do this before every major prep beat. Anything I’m about to build downstream canon on gets a palace check first. Five minutes. Catches drift before it ships.

The seams of the loop feed the palace. The palace feeds the next round of prep. Clean content in, clean questions answered, clean content back out.

Mid-session: what I’m building toward

This is where the essay has to be honest.

I’ve been running the loop long enough to trust it. Prep-time palace use is in my workflow now. Consistency leg, back.

Mid-session use is the next chapter. I haven’t run mempalace at a live table yet — it’s new. What I’ve done before now is variants of the NotebookLM attempt, and I’ve reported how those went.

Here’s what mid-session use is designed to do.

“Hey palace, what’s the name of the bartender in Phandalin the party met in session 4.”

The palace knows the party — my PCs are listed in reference. It knows bartender maps to NPCs in the Stonehill Inn. It knows session 4 is a narrative-wing filter. It returns the two paragraphs where the encounter happened.

I read the name off the screen. I use it at the table. The scene keeps moving.

That’s the flexibility leg. Not the LLM generates an answer. The palace retrieves the passage and I read it. The retrieval is fast because the index is built on reviewed content. The answer is trustworthy because I’m looking at the source, not a paraphrase.

The design is done. The architecture matches the failure modes I diagnosed from NotebookLM. What I haven’t proved yet is how it runs at the table.

I’ll write the follow-up when I have.

Cross-campaign: the ceiling

The most interesting queries cross campaigns.

Group 2, years ago, captured Lolth as part of a deal with Vhaeraun. That action triggered the demon lord incursion. This is the premise of Out of the Abyss, Group 1’s campaign.

One campaign’s climactic decision is another campaign’s opening premise. When Group 1 asks why are the demon lords here, the correct answer involves Group 2’s choices from a year ago.

In a notes-on-my-laptop world, this falls out of date the first time I forget to update one campaign after something happened in the other. It has fallen out of date before.

With a palace per campaign and shared canon in notes/canon/ that gets promoted into both, the coherence is maintainable. Query the Group 2 palace for Lolth. Query the Group 1 palace for Gromph Baenre. Both hits trace back to the same event. The world stays one world.

This is the ceiling the trilemma moves when it breaks. Not just consistent campaigns. Interlocking campaigns. Depth that compounds across groups and years.

No commercial tool does this. It’s not the market. It’s also not a hard problem — once the loop is solid and the palace is structured, cross-campaign canon is mostly an organizational question.

Trust is emergent

One thing I didn’t plan: the trust hierarchy fell out of the loop on its own.

A document’s trust level is how many review beats it has survived.

Raw VTT — zero review beats. Useful, authoritative in a literal sense (it’s the recording), but hard to consume.

LLM extractions of the VTT — one LLM beat, zero review beats. Accelerator, not truth.

Reviewed session summaries — one LLM beat, one review beat. Authoritative for what happened at the table.

Grounding docs synthesized from reviewed summaries — one more LLM beat, one more review beat. Working reference, trustworthy for planning.
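The counting above fits in a few lines. This is an illustrative sketch, not something in my tooling: DocKind and its fields are hypothetical names, and the beat counts are the ones listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocKind:
    name: str
    llm_beats: int     # how many times a model has rewritten the content
    review_beats: int  # how many times a person has checked it

    @property
    def trust_level(self):
        # Trust is the number of review beats survived.
        return self.review_beats


DOC_KINDS = [
    DocKind("raw VTT", llm_beats=0, review_beats=0),
    DocKind("LLM extraction of the VTT", llm_beats=1, review_beats=0),
    DocKind("reviewed session summary", llm_beats=1, review_beats=1),
    DocKind("grounding doc", llm_beats=2, review_beats=2),
]


def unreviewed_llm_output(kind):
    """True when a model has touched the text more often than a person has checked it."""
    return kind.llm_beats > kind.review_beats


flagged = [kind.name for kind in DOC_KINDS if unreviewed_llm_output(kind)]
print(flagged)
```

The raw VTT and the LLM extraction both score zero review beats, but only the extraction gets flagged. The recording has never been paraphrased, so there is nothing in it to drift.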

I didn’t design this. I built the loop to fix a problem. The hierarchy emerged because different documents have been through different numbers of review beats, and I can tell the difference when I’m reading them.

The palace respects the hierarchy because it’s built on the output of the loop. The wings correspond to trust levels. Search accelerates by wing. Verification crosses wings.

If I’d tried to design the trust hierarchy up front, I would have gotten it wrong. I got it right by letting it fall out of the work.

The canon gate

One rule I don’t bend.

Nothing enters the palace without passing a review beat.

notes/ is staging. Arc drafts, encounter sketches, NPC ideas, speculative plot threads — all of it stays in notes. None of it gets indexed.

When a note becomes canon — when it’s earned its way into the campaign — it gets promoted to the appropriate grounding doc or dossier. Then the palace mines the new file. Then it’s searchable.

Not before.
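One way to enforce the gate mechanically is to have the indexer skip the staging directory outright. The sketch below assumes a workspace of Markdown files with notes/ at the top level. The function name, the dossiers/ folder, and the file names are hypothetical, and this is not how mempalace is actually configured.

```python
import tempfile
from pathlib import Path

# Assumption: staging lives in a top-level notes/ directory.
STAGING_DIRS = {"notes"}


def indexable_files(workspace):
    """Return Markdown files eligible for indexing, skipping staging."""
    workspace = Path(workspace)
    kept = []
    for path in sorted(workspace.rglob("*.md")):
        relative = path.relative_to(workspace)
        if relative.parts[0] in STAGING_DIRS:
            continue  # staged material never reaches the index
        kept.append(relative.as_posix())
    return kept


# Build a throwaway workspace to show the gate in action.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in [
        "world_state.md",
        "dossiers/tolubb.md",
        "notes/arc_draft.md",
        "notes/canon/shared_event.md",
    ]:
        target = root / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("placeholder\n")
    indexed = indexable_files(root)

print(indexed)
```

Promotion is then a deliberate act: a fact only becomes searchable once it is written into a file outside notes/.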

The reason is the same reason the review beat exists at all. The palace’s value is that its answers are trustworthy. Start indexing speculative material and the next mid-session query returns a fluent, confident answer based on something I was thinking about doing, not something that happened. Exactly the NotebookLM failure mode, now in my own tool.

The gate is the thing. Everything else is support.

Things I tried that didn’t work

Worth listing. None of these are sermons — they’re my mistakes.

Indexing everything. Early on I mined every document I had into the palace. Every session summary. Every extraction dir. The published module. The result was search results dominated by the module and diluted by extraction redundancy. I now index only content with unique retrieval value. Module text lives outside the palace, accessed via the 5etools MCP instead.

Auto-promoting notes. I briefly had a script that watched notes/ and mined anything new into the palace. This took about a week to start returning speculative material as authoritative. Ripped it out.

Treating session_doc narration as record. The first-person per-character narrations are great to read. They are not the record of what happened. They’re a render of a review of an extraction. If I query the palace and get a hit from a narration file, I still go check the original summary. Same failure mode as Daz — fluent doesn’t mean source.

One-shot LLM prep. Feeding the LLM the whole summaries file and asking for an encounter. Works for shallow prep. Shreds consistency at depth. The loop exists because this doesn’t scale.

None of these were dumb ideas in the moment. They were natural things to try. They all break the same way — by letting the LLM do the architect’s job somewhere in the chain.

What I have. What’s next.

What I have, today, across two campaigns:

Clean content that a human has actually looked at. NPC dossiers with merged aliases. Grounding docs I trust. Session recaps where the dialogue is dialogue that got said.

A palace built on that content. Prep-time queries that catch the Daz problem before it ships. Cross-campaign coherence that holds across a year of play.

A consistency leg that’s back. A depth leg that’s real.

What I don’t have yet, honestly: live mid-session evidence that mempalace solves the flexibility leg. The NotebookLM attempts told me what doesn’t work and why. The palace is designed against those failure modes. I haven’t yet queried it under table pressure with a party watching.

I’ll write the follow-up when I have.

What I think this generalizes to: any long-form creative project with continuity requirements and LLM help. Novelists with series bibles. Worldbuilders with decades of notes. Researchers with years of reading. The loop doesn’t care that it’s D&D. GMs happen to be a good test case because campaigns combine volume, continuity, and table pressure.

I’m going to write more about the architectural side of this. Control-plane thinking applied to creative tooling. What the trust hierarchy looks like as a system. Why the extract/review/render pattern is a general answer to LLM-assisted knowledge work, not a D&D trick.

Those essays will be longer on the architecture and shorter on the dice.

This one was for GMs.

The ecosystem

Tools I lean on or built. One line each.

GMAssistant.app — turns Zoom VTT transcripts into structured session recaps. First extract beat in my pipeline.

Saga20 — similar space, different approach to VTT signal extraction. Worth knowing about.

CampaignGenerator — my tooling. Session prep, session doc generation, grounding doc synthesis, NPC dossier building. The scripts that run the loop. ([github link])

mempalace — the palace itself. Wing/room architecture, trust-layered retrieval, MCP-accessible. Build guide: MEMPALACE_HOWTO.md. ([github link])

5etools MCP — published module text stays here, out of the palace, accessed on demand.

notes/canon/ — not a tool, a convention. Shared cross-campaign history. Portable across workspaces.

The follow-up essay will report how this goes once I’ve run it at a live table.

71 architecturalist papers: backup is proof of existence

April 15, 2026 by kostadis roussos

One thing I do not understand is how enterprise companies will rely on infrastructure they cannot restore from a backup without help from the vendor.

My belief is that a system exists only if you, without the vendor, can restore it from a backup.

A system that cannot be restored from backup doesn’t exist.

It is contingent on other forces that can destroy it at any point in time.

And that, as a result, relying on that system for anything that must survive the contingent force is irresponsible.

So, for example, if your business relies on a system that cannot be restored from backup and that system lives in a data center, then you are saying that your business is contingent on the availability of that data center.

Some will argue that, well, you could always restore things if you have enough time, and my answer is yes, you can if you have enough time, but what is enough time?

If the system is complex enough, time can be months.

And if the time when it needs to be restored is measured in days, then the fact that it could be restored in months is irrelevant.

Recently, I read about a man whose son locked his entire Google account. As a result, all of his emails, his contacts, and his contracts were gone. And with taxes due in a few weeks, he could not file taxes, he could not reach out to customers, and he could do nothing. Bills could not be paid. Why? Because he had no backup. He had critical data that was owned by Google. Not by him. And when Google decided he no longer had access to it, it was lost.

That anyone would allow themselves to be in that situation is a mystery to me.

When I joined VMware, I discovered how we tested backup back in 6.0. We created a new vCenter, backed it up, restored it, and declared success. I asked the team to take a backup of a system that was actually running and try a restore. Guess what: the restore didn’t work.

It struck me as mind-boggling that in 2015 a running vCenter could not be effectively restored from a backup.

Over the years, I struggled to make infrastructure, VCF in particular, something that could be restored from backup. What I discovered is that nobody cared.

I don’t mean nobody at VMware cared. I mean, the customer base didn’t care.

This utterly confused me. I could not believe it.

It was only when I went to a customer that I realized that the customer didn’t trust the backup. It trusted VMware to do whatever it took to restore a system.

And what I realized was that the constraint on the customer was how many engineers VMware had on staff to recover backups when the backup procedures failed utterly.

In many ways, the VMware engineering team was insurance for the entire industry should customer systems fail spectacularly.

If you depend on VMware engineering, you are at the whims of whoever runs it and whether their interests are aligned. And to be 10000% fair, this is true of any vendor on the planet.

Broadcom, by stripping the engineering team of redundancy, created a scenario in which a catastrophic failure requiring a large surplus of engineers would result in extraordinarily bad outcomes for the world as a whole.

Why? Because the only backup any customer can effectively rely on is the VMware engineering team, and that team is now smaller.

What has happened is that the industry has said, “I don’t need to have a backup because I can trust VMware.”

But you can only trust VMware as long as VMware’s interests and yours are aligned.

And when that is no longer the case, the fact that you don’t have backups that you can restore from without VMware means that you are at the mercy of VMware’s business priorities.

And this isn’t about VMware; this is about any company.

Backups you can restore from are your insurance policy if the vendor fails you. If you can’t prove you can restore, your single point of failure is another business that can change on a dime.
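One way to make “prove you can restore” concrete is a restore drill against data that is actually in use, not a freshly created system. What follows is a minimal sketch, not any vendor’s procedure. The names `tree_digest` and `restore_drill` are hypothetical, and a plain zip archive stands in for whatever backup tool you really use. It backs up a directory, restores it into a scratch location, and checks that every file came back byte for byte.

```python
import hashlib
import tempfile
import zipfile
from pathlib import Path


def tree_digest(root: Path) -> dict[str, str]:
    """Map each file under root (by relative path) to the SHA-256 of its bytes."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def restore_drill(live_dir: Path, work_dir: Path) -> bool:
    """Back up live_dir, restore it into work_dir/restored, and compare the two.

    Assumption: a zip archive stands in for the real backup tool.
    Only file contents are compared; empty directories and metadata are ignored.
    """
    archive = work_dir / "backup.zip"
    restored = work_dir / "restored"
    restored.mkdir()

    # Step 1: take the backup of the data as it exists right now.
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for p in sorted(live_dir.rglob("*")):
            if p.is_file():
                zf.write(p, p.relative_to(live_dir).as_posix())

    # Step 2: restore into a scratch location, never over the live data.
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(restored)

    # Step 3: the backup only counts if the restored tree matches the original.
    return tree_digest(live_dir) == tree_digest(restored)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        live = Path(tmp) / "live"
        work = Path(tmp) / "work"
        live.mkdir()
        work.mkdir()
        (live / "contracts.txt").write_text("data you cannot afford to lose")
        print("restore verified:", restore_drill(live, work))
```

The point of the sketch is the third step. A backup job that finishes without error proves nothing until something you ran yourself has compared the restored copy to the original.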

And for me, as someone who had to restore a business from a backup in 12 hours or risk an IPO, the idea that you wouldn’t keep your own backups and would rely on somebody else instead is unfathomable.

the architecturalist 70: no-engineering relies on LLMs as renderers and good enough planners, but not correct planners

April 10, 2026 by kostadis roussos

When you look at AI systems, they are really good at getting to an approximate answer. If you think about randomized algorithms, they don’t return optimal answers, but they do return good enough answers most of the time with a lot less effort. When a product manager wants to create a product using AI and isn’t an engineer, they can get a pretty good, approximately correct answer. But it’s approximately correct.

In fact, about 4 years ago, my buddy and I chatted about this. AI is clever, meaning it gets to an answer. AI is not correct, meaning it doesn’t produce the precise correct answer. Sometimes you need the precise correct answer, and that’s where the LLM fails.

The workflow between product management and engineering is going to look like this: the PM provides an approximate answer that gives a good-enough view of what the feature is. The PM team uses the LLM to create a plan and to render the code. The goal is to gain line of sight into the feature and validate its value.

Engineering takes that feature and creates a better plan, one informed by engineering expertise and by attributes that the PM team, narrowly focused on the feature, doesn’t have to, and should not have to, care about. The detailed plan is then used to render the code. In the old world, PM produced a document, engineering created a plan and rendered code, PM reacted to the code, PM changed the design, and …

In the new world, PM produces a prototype rendered to code automatically from a plan created by the AI tool. When PM is satisfied with the prototype, engineering reviews the tool and then creates a more detailed, precise plan to generate the code.

Engineering may use AI to create part of the plan, but the evaluation of the plan and its correctness will contain insight and details that PM doesn’t and shouldn’t care about. Is it possible for PM to go end-to-end? Yes, but I also believe that the less you rely on expertise, the more likely you are to experience poor outcomes over time.

Anyways, the way I look at it is this: Use AI coding assistants to get an approximate answer that meets your goal. Take the approximate answer and refine the plan until it is more correct. Use the corrected plan to render better code.
