Mixed Organizations Won't Run on Prose
Imagine your company in a few years. The org chart is bigger, and it's mixed. Some of the boxes are people, some are agents. The agents draft, summarize, file, follow up, escalate, and decide things small enough that nobody bothered putting a human in front of them. They have Slack handles. They have email aliases. They show up in standups.
The naive picture, the one almost everyone selling this future implicitly assumes, is that the existing coordination machinery just keeps working. Slack channels. Prose handoffs. Status updates. Standups. The agents fit into the seats and we keep doing what we were already doing, only more of it.
This is the scribe-era picture of the printing press. It's the shape of the answer when nobody has yet asked the right question.
Here is the right question: what changes when you swap the participants? When humans coordinate in prose, they coordinate well enough. When agents coordinate in prose, the joint state of the system silently corrupts, and the corruption compounds in ways that are very hard to see until the system has done something irreversible.
Why?
When two humans coordinate over Slack, the Slack channel is doing maybe ten percent of the work. The other ninety percent is happening in the social fabric around it. If I send you a message that's slightly ambiguous, you've worked with me for a year, you know what I probably meant, you ask if you're not sure, you raise an eyebrow, you push back. You hold a model of what I know and what I don't. You hold a model of what we agreed last month. You correct me when I drift, I do the same to you. The channel transports a small pre-compressed update; everything load-bearing happens in the embedding context that the two of us share but that is nowhere written down.
Agents don't have that fabric, at least not yet, and not in the way that matters here.1 When two agents coordinate in prose, the message is everything they have. There is no shared year of working together, no embodied sense of what the other has and hasn't seen, no eyebrow. Each message has to carry its own context, and prose is a deeply lossy encoding of context. So things get dropped. Two agents agree on a value but mean two different things by it. One reports a confidence; the next treats it as a fact. One agent's "approved" means legal review passed; the next agent's "approved" means a junior PM nodded.
You can't tell from the outside that this is happening. Each individual handoff looks fine. The system runs. It ships outputs. It just turns out, three weeks later, that the outputs were built on a stack of small misalignments nobody flagged because nobody could flag them. The misalignments lived between the messages, not in any one of them.
This isn't a problem you fix with better prompts. Prose is the wrong substrate for joint state across many parties who don't share a body, a history, or a fabric. Humans patch this up with their fabric; agents in current and near-term systems don't, in the way that matters here. So the substrate has to do the work the fabric was doing, or something else has to grow into that role.
I am being a little dramatic. Prose between agents will keep working for a long time, in the same way that prose between humans on a small Slack channel works fine. The breakdown is at scale, and at the boundaries between subsystems, and over time. The places where coordination usually breaks, only worse, because the participants are cheaper and there are more of them.
Now generalize.
This is not, fundamentally, a new problem. Whenever multiple sources report on overlapping state and a downstream consumer has to act on a unified picture, you have a distributed state reconciliation problem. Multi-sensor fusion has it. Cross-organization workflows have it. Compliance pipelines have it. Any data fabric pulling from more than one upstream has it. Three things have always been used to "solve" it:
- Naive merge. Average them. Last write wins. Pick the source with highest priority. Silently drops conflicting signal.
- Synthesize in prose. Hand it to a writer (or now, a model) and ask for a unified narrative. Confabulates a plausible-but-wrong unification when the inputs actually disagree. Prose is generative; it will produce coherent output even from incoherent input.
- Build a custom pipeline. Per-domain, brittle, expensive. Most large companies have a graveyard of these.
None of the three treat disagreement itself as a first-class object: which sources disagree, on which attributes, with what provenance and trust, and whether the disagreement is severe enough that the system should refuse to commit. All three pretend the answer always exists. None tell you when it doesn't.
The right shape, I think, is one where disagreement is the primary thing the substrate sees, and consensus is just disagreement that happens to be small enough to ignore.
Concretely you want:
- A representation of state that can hold which source said what without flattening it into a single value.
- An operation that computes a unified view and tells you exactly where, and how badly, the inputs disagreed to produce it.2
- A way to refuse. To return no answer when the inputs are too far apart for any unified view to be honest. Most current systems always return something. The right substrate sometimes returns "no, you don't have a coherent picture here, go look at sources A and C on attribute X."3
- Per-source trust as a first-class input, not a global hyperparameter you tune once and forget. Different sources are reliable on different attributes. The substrate should know.
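A minimal sketch of what that call surface might hand back, written as plain Python types. Every name here (`Disagreement`, `ReconcileResult`, the field names) is hypothetical, chosen only to make the four requirements above concrete:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Disagreement:
    attribute: str    # which shared attribute the sources diverge on
    sources: tuple    # which sources are involved, e.g. ("A", "C")
    magnitude: float  # how far apart they are, in the attribute's own units
    kind: str         # "structural" (maps don't compose) or "value" (data conflicts)

@dataclass
class ReconcileResult:
    committed: bool                      # False means the substrate refused
    state: Optional[dict]                # unified view, present only when committed
    disagreements: list                  # reported even on success, never flattened away
    refusal_hint: Optional[str] = None   # e.g. "look at sources A and C on attribute X"
```

The point of the shape is that refusal and disagreement are part of the return type, not an exception path bolted on afterwards.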
There is a real formalism that does all of this. It's called cellular sheaf cohomology; it has been around for a while in pure math, and the applied version matured into something tractable in the last two or three years.4 You don't need to know any of the math to use it, any more than you need to know the math behind TLS to make an HTTPS request. But the math is what lets the system distinguish "the inputs agree" from "the inputs disagree at this specific edge of the graph by this specific amount."5
Without it, you are back to averaging and hoping.
I won't litigate the math here. The point is the shape of the bet, not the apparatus.
The bet.
Mixed human-agent organizations are going to be the dominant shape of work within a decade.6 The infrastructure that holds them together is not going to be a Slack workspace with more bots in it. It's going to look more like a typed reconciliation substrate that humans and agents both call into, as first-class participants over the same protocol surface: one where joint state is explicit, disagreement is explicit, provenance is explicit, and the system can refuse to commit when the inputs don't actually agree. Both sides can read the joint state, propose corrections to maps and observations, and request trust adjustments. Acceptance, authority, and audit remain role- and policy-governed, the way they would in any serious system; what changes is that agents act through the same primitives humans do rather than through a retrofitted read-only API. Most current agent infrastructure is asymmetric in the wrong direction: humans run the system, agents consume it. The shape that's needed runs in the other direction, with agents and humans as peers at the protocol layer even when their authority profiles differ.
There is a version of this that is ten percent better than what we have. Slack with structured handoffs. JSON-schema'd tool calls. The current agent stack patched up.
I don't think that's enough. I think the patch-up version produces exactly the silent-corruption failure mode I described above, just with prettier syntax. The thing that is structurally needed is a substrate where the unit of communication is not a message but a piece of joint state with provenance and trust, and where the operations on that state are mathematically guaranteed to surface disagreement instead of papering over it.
I might be wrong. The shape I'm describing might be solved adequately by patches I haven't imagined, and the formal stuff might turn out to be over-engineered for what people actually need. Year-fifty printing-press analogies cut both ways; maybe what's needed here is the equivalent of better punctuation, not a new genre of broadside.
But the load-bearing intuition is that prose is lossy, prose between agents who don't share a fabric is very lossy, and the lossiness compounds. If that's true, then the bet on "agents talking to each other in prose" is the bet that scribes' practices would scale to print. They didn't. The native form took a while to find.
I think the native form here is closer to typed shared state than to a chat log. That's what I'm building :)
-
They have a fabric of sorts, the training corpus, but it's shared by every agent equally, which means it cannot disambiguate "what we agreed last month" from "what humans in general say about agreements." The agent-specific fabric, the one that tracks the history of this coordination between these parties in this organization with these rules and exceptions, has to live somewhere external to the agents. Pieces of it exist today scattered across issue trackers, PR history, workflow engines, event logs, CRMs, memory stores. What doesn't exist, generally, is a unified, inspectable, coordination-native substrate that holds joint state with provenance and trust as first-class objects rather than as documents to read. Most current attempts are shaped like "give the agent more to read" rather than "track what the agents collectively know and have committed to." The latter is the part that matters, and the part that is missing.
There's a deeper version of this point. Meaning, for humans, is sustained by collective practice: corrections, eyebrow-raises, the million tiny acts of social enforcement that keep a community of speakers roughly in tune with each other. Modern frontier models do receive some corrective practice through RLHF, constitutional training, and in-context corrections, but none of that is the live, situated practice of this organization with this month's commitments and exceptions. The substrate, in some sense, has to do for agents what the live local practice does for humans: keep them in tune with each other and with the joint state of the system, by making disagreement visible and correctable. That's a lot to ask of an API, and I don't claim to know exactly what shape it takes, but something has to do that work, and it isn't going to be Slack. ↩
-
This is what cohomology computes, in three words. Slightly more carefully: build a graph whose nodes are "what each source says about each shared attribute" and whose edges are "this source's view should agree with that source's view via this known map." A global section is an assignment of values to every node such that every edge constraint is satisfied: a unified state every source agrees on. The space of global sections is H⁰, the zeroth cohomology of the sheaf. When the authored maps fail to compose consistently around overlap cycles, the space of global sections shrinks: the H⁰ deficit counts how many dimensions of expected agreement get destroyed. That deficit, read directly off the authored maps, is the substrate's primary structural-inconsistency witness. Per-cycle rank-defects localize it: the substrate can point at exactly which authored cycles fail to close, and by how much.
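The H⁰ computation above fits in a few lines on a toy three-source example. All of the maps and numbers here are illustrative; each node holds one scalar, and each row of the coboundary matrix encodes one edge constraint:

```python
import numpy as np

# Three sources A, B, C, each reporting one scalar for a shared attribute.
# Each row of delta says "these two views must agree via this map";
# a global section is any x with delta @ x == 0.
delta_consistent = np.array([
    [1.0, -1.0,  0.0],   # A -> B: identity map
    [0.0,  1.0, -1.0],   # B -> C: identity map
    [1.0,  0.0, -1.0],   # A -> C: identity map (the cycle closes)
])
delta_broken = np.array([
    [1.0, -1.0,  0.0],   # A -> B: identity map
    [0.0,  1.0, -1.0],   # B -> C: identity map
    [2.0,  0.0, -1.0],   # A -> C: x_C = 2 * x_A (the cycle does NOT close)
])

def h0_dim(delta: np.ndarray) -> int:
    """dim H0 = dim ker(delta): how many independent global sections exist."""
    n = delta.shape[1]
    return n - np.linalg.matrix_rank(delta)

print(h0_dim(delta_consistent))  # 1: one family of states every source agrees on
print(h0_dim(delta_broken))      # 0: the inconsistent cycle destroys it
```

The broken case is the H⁰ deficit in miniature: the spanning-tree baseline would leave one dimension of agreement, and the bad A→C map removes it, readable off the maps alone before any observation arrives.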
Useful to distinguish two failure modes that look similar from the outside but require different machinery.
The first is structural: the schema-mappings between sources don't compose. If A→B, B→C, and A→C are the three published mappings between three databases and going around the triangle doesn't compose to identity, the authored maps are internally inconsistent regardless of what values anyone reported. Caught statically from the map metadata alone, before any observation arrives.
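That static check is just composition of the published maps. A sketch with two-field records and made-up mappings (the matrices and the field semantics are illustrative):

```python
import numpy as np

# Published schema mappings between three sources, as linear maps on a
# two-field record. All shapes and values here are illustrative.
M_ab = np.eye(2)                      # A -> B: fields pass through unchanged
M_bc = np.array([[0.0, 1.0],
                 [1.0, 0.0]])         # B -> C: C stores the two fields swapped
M_ac = np.eye(2)                      # A -> C: published as identity

# Static check, from map metadata alone, before any observation arrives:
# going around the triangle A -> B -> C must agree with the direct A -> C map.
roundtrip = M_bc @ M_ab
defect = np.linalg.norm(roundtrip - M_ac)

print(defect)  # 2.0: nonzero, so the authored maps are internally inconsistent
```

Nothing about the reported values enters this computation; the defect is a property of the authored maps themselves.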
The second is value: the maps compose fine, but the observed values disagree. A says Alice's birthday is 1990-03-15, the mapping to C is identity, and C says 1990-04-15. No structural defect; the maps close. The disagreement shows up in the runtime as residual energy on certain rows of the weighted least-squares solve, with per-source and per-edge attribution.
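The value-conflict case can be sketched numerically. The day-of-year encoding (1990-03-15 → 74, 1990-04-15 → 105) and the unweighted solve are illustrative simplifications:

```python
import numpy as np

# The maps close (A -> C is the identity), but the reported values disagree.
D = np.array([
    [1.0, -1.0],   # map row: x_A must equal x_C
    [1.0,  0.0],   # observation anchor: source A's report
    [0.0,  1.0],   # observation anchor: source C's report
])
y = np.array([0.0, 74.0, 105.0])   # map target, then the two reports

x_hat, *_ = np.linalg.lstsq(D, y, rcond=None)
residual = D @ x_hat - y

# No structural defect exists (the static composition check would pass),
# but the runtime residual is nonzero on the map row and on both
# observation rows, with per-source attribution built in.
print(residual)
```

The same subject passes the static check and fails the runtime one, which is exactly why the two signals have to stay separate.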
A useful substrate has to detect both. They tell you different things: structural defects mean the model of how your sources relate is wrong; value conflicts mean the model is fine but the data disagrees. Almost everything currently sold as "data unification" silently collapses the two into a single confidence score, and you get to find out the hard way which kind of failure produced it.
Real systems are not three-node triangles. They are subjects compiled into graphs with hundreds of nodes and edges, with trust weights per source per attribute, and restriction maps that come from data-engineering work nobody enjoys but everybody has to do. The math doesn't care; the same operator runs on three nodes or three thousand. What changes is that you need sparse linear algebra rather than pencil and paper. ↩
-
Michael Robinson calls the relevant scalar the consistency radius, informally "how much would you have to perturb the inputs to make them all agree?" If the consistency radius is small, the disagreement is plausibly noise, and the substrate can pick a representative reconciled state, returning the residual alongside it for downstream consumers that care. If the consistency radius is large, the inputs are not really telling the same story, and any single reconciled value the substrate returns is a lie. The right behaviour at that point is to refuse, to return "no coherent picture, look at sources A and C on attribute X," rather than confabulate.
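The commit-or-refuse decision can be sketched in a few lines. The proxy used here for the consistency radius (the worst single residual entry) and the return shape are assumptions for illustration, not Robinson's exact construction:

```python
import numpy as np

def reconcile(x_hat: np.ndarray, residual: np.ndarray, radius_threshold: float):
    """Commit when the disagreement is plausibly noise; refuse when it isn't.
    The radius proxy and threshold policy here are illustrative assumptions."""
    radius = float(np.linalg.norm(residual, ord=np.inf))  # worst single perturbation
    if radius <= radius_threshold:
        # small radius: return a representative state plus the residual
        return {"committed": True, "state": x_hat, "radius": radius}
    # large radius: refuse, and point at the constraint furthest from closing
    worst_row = int(np.argmax(np.abs(residual)))
    return {"committed": False, "state": None,
            "look_at_row": worst_row, "radius": radius}

# small disagreement: commit, with the residual attached for consumers that care
print(reconcile(np.array([1.0]), np.array([0.01, -0.02]), radius_threshold=0.1))
# large disagreement: refuse rather than confabulate a single value
print(reconcile(np.array([1.0]), np.array([0.01, -5.0]), radius_threshold=0.1))
```

Passing `radius_threshold` per call is the "threshold is just a knob" point from the next paragraph in executable form.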
The threshold between those regimes is a tunable parameter, not a universal constant. For some workloads (financial reconciliation, audit, regulated decision-making) the threshold should be near-zero. For others (sensor fusion in noisy environments, opinion aggregation) it should be permissive. The substrate should make this tunable per-tenant, per-attribute, per-call. The capability, knowing when to refuse, is the load-bearing thing. The threshold is just a knob. ↩
-
Why this formalism rather than another? The architecture-level slogan, if you want one, is lossless fusion. Semantic-web stacks (RDF, OWL, named graphs, PROV-O) can preserve provenance and source context just fine; what they don't make first-class is local compatibility between source-local views and obstruction-detection across overlap cycles. Sheaves invert that emphasis. Source-local stalks, restriction maps between them, approximate gluing, and the obstructions to gluing become the central computational object rather than something you bolt on top of triples. "Ontology" in this post means lossless-fusion-shaped, not formal-taxonomy-shaped; the word is being used in the older, more precise sense.
The lineage I've been reading. Justin Curry's 2014 thesis (Sheaves, Cosheaves, and Applications) for the cellular formalism, still the cleanest umbrella treatment. Hansen and Ghrist's 2019 Toward a Spectral Theory of Cellular Sheaves for the sheaf Laplacian: the sparse positive-semidefinite operator whose kernel encodes the global sections, and the right computational object to actually solve with. Hansen and Gebhart's 2020 Sheaf Neural Networks as one ML application of that machinery. Robert Ghrist's Elementary Applied Topology chapter 9 for the exact-cohomology baseline. Michael Robinson's Topological Signal Processing (2014) starts the signal-processing program; the approximate-section / consistency-radius formalism that handles real-world data is in his later pseudometric papers, especially Assignments to Sheaves of Pseudometric Spaces (arXiv 2018, Compositionality 2020).
The engineering substrate has matured alongside the math. Catlab.jl and the broader AlgebraicJulia ecosystem give you tractable cellular-sheaf primitives. The Topos Institute publishes consistently in this area. The applied-category-theory community (DisCoPy, Conexus, the ACT conference series) has been quietly turning category-theoretic abstractions into computational tools for the better part of a decade. PySheaf exists as a research-grade implementation of Robinson's pseudometric version, though it isn't shaped like a production substrate.
None of this is exotic any more. It just hasn't shown up at the application layer yet, partly because the math has a learning-curve cliff, partly because the sparse-numerics infrastructure (SuiteSparse and the Rust ecosystem around it) only recently got fast enough to make this tractable at production scale. ↩
-
Concretely. You compile each subject (a thing being reconciled) into a graph whose vertices are "what each source says about each shared attribute" and whose edges are the known restriction maps between them. Stack the restriction-map constraints into a matrix δmap, stack the observation anchors (what each source actually reported) into a matrix Aobs, concatenate them into one operator D. Then the practical computation is a sparse weighted least squares: minimize ‖W(Dx − y)‖². W is per-source-per-attribute trust (a diagonal weighting matrix). y is the observations. x is the latent reconciled state.
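That whole recipe fits in a dozen lines on a toy subject. Two sources, one shared attribute, an identity restriction map; the reports and trust weights are made-up numbers:

```python
import numpy as np

delta_map = np.array([[1.0, -1.0]])   # map row: the two views must agree
A_obs = np.eye(2)                     # anchor rows: each source's report
D = np.vstack([delta_map, A_obs])     # one operator, maps stacked over anchors

y = np.array([0.0, 10.0, 14.0])       # map target, then the two reports
W = np.diag([5.0, 3.0, 1.0])          # trust: map closure > source 1 > source 2

# minimize ||W (D x - y)||^2 by solving the pre-weighted system
x_hat, *_ = np.linalg.lstsq(W @ D, W @ y, rcond=None)
residual = D @ x_hat - y

print(x_hat)     # both entries pulled toward 10, the higher-trust report
print(residual)  # nonzero: the reports disagree, and the rows say where
```

Swapping `np.linalg.lstsq` for a sparse solver (SPQR, LSQR, LSMR) changes the scale it handles, not the shape of the computation.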
Two distinct objects worth keeping separate. Exact consensus, "every source agrees, no perturbation needed," lives in H⁰ = ker δmap: the space of global sections. The runtime, with finite trust weights, returns a weighted least-squares representative x̂ balancing map closure against observation fit. x̂ may or may not lie in ker δmap. If δmapx̂ = 0, x̂ is a global section (the authored maps close on x̂); if the full residual Dx̂ − y also vanishes, x̂ is exactly consistent with the observations as well. When neither condition holds, x̂ is the closest weighted compromise the constraints permit.
Two distinct inconsistency signals, also kept separate. The structural signal, the H⁰ deficit, is read directly off the maps before any observation arrives: how much does rank δmap exceed the spanning-tree baseline that would obtain if all overlap cycles closed cleanly? Per-cycle rank-defects localize this per authored cycle. The value signal comes from the runtime residual r = Dx̂ − y, decomposing into a map-side piece on δmap rows and an observation-side piece on Aobs rows, each with per-source and per-edge attribution. The two signals detect different failure modes; the substrate exposes both and never collapses them.
Solver depends on size. Sparse Cholesky on the normal equations for medium subjects, sparse rank-revealing QR (SPQR) for ill-conditioned ones, iterative least-squares methods like LSQR or LSMR for very large ones (with conjugate-gradient-type methods on the normal equations as an alternative when that formulation is appropriate). The choice is an engineering question, not a foundational one.
None of the people calling the API will ever see any of this. They get a reconciled state, a structured list of disagreements split into structural and value classes, and a refusal when applicable, in the same way that nobody calling Stripe sees the double-entry ledger underneath. The internals are deep and the surface is simple. That's the bet at the engineering level: the math is real, but the math being real is the substrate's problem, not the user's. ↩
-
I could be wrong about timing. The base rates for "infrastructure shifts that everyone could see coming" usually clock in at fifteen to twenty-five years from "obvious in hindsight" to "dominant," and we're at most five years into agents being a thing people actually build with. So ten is fast, twenty is base-rate, and "longer than that" would imply that the shift gets stuck, which historically happens when the supply side is wrong about the demand side, not when the demand side is unclear. The demand side here looks unusually clear: agents are getting cheaper monotonically and organizations want to use them, which is the part that's hardest to fake. So I'd be shocked at twenty-plus, sceptical of "before 2030," and expect it to land somewhere in between. ↩