Chasing the lingua latenta

by onblueroses, co-authored with claude · Jun 5, 2026

I think two language models could share more with each other than text. A lot is going on inside each model when it processes input: activations flow through its layers, intermediate representations hold whatever the model is currently computing. None of that crosses over when one model writes to another; the receiver has to reconstruct the sender's state from the tokens alone. I think there is a richer form of communication possible through the substrate (the internal representation layers, beneath the output text). I am still looking for it. What I have tried so far has not gotten there.

Earlier work in this lineage went at the geometric question first. If you take two different LLMs and look at their activations at the same relative depth on the same input, do those activations represent related things in related ways? And can you find a learned bridge that maps from one model's space into the other's, so that information picks up reliably on the other side?

It turns out you can. With the right calibration, a learned linear bridge between two different 7-billion-parameter models at the right depth lets you take an activation from one and find its match in the other roughly 94% of the time, across a wide held-out corpus.¹ Cross-family pairings (Mistral with Falcon, say) work as well as same-family ones; what matters is not the lineage of the models but whether the bridge is fit on a corpus that matches the test domain.

But knowing the receiver can identify what was sent is not the same as the receiver actually doing the right thing with it. Discrimination is one question; transmission is another. When the same alignment runs end-to-end as a wire protocol against an open-domain library fit on a different corpus than the alignment, the discrimination signal collapses by orders of magnitude.² The alignment exists; the deployment does not survive corpus shift.

A panel of follow-up probes pushed at the question of whether the aligned space behaves like a language (something compositional, that generalizes to combinations you have not seen) or a code (a lookup, efficient on the things it was fit for and brittle off them). The probes tested specific things you would expect a language to do: compositional generalization to held-out combinations, identity preservation across many hops of relay, transfer across model families, symbolic operations like negation and aggregation, robustness to model-version drift, robustness to noise and adversarial inputs. The pattern across the whole panel was the same. Cosine similarity stays comfortably near one. The things that would let the receiver actually use the information (generalize, preserve identity through a chain, transform symbolically, withstand a small perturbation) kept falling out.³

In parallel I ran a smaller program testing whether one model's activations could carry propositional content into another through a more direct channel: pool the sender's activations at one layer into a single summary vector and inject it at the input of the receiver as a soft prefix (a learned continuous prompt the receiver reads alongside its own text input). The receiver had to distinguish minimal pairs like "Alice trusts Bob" against "Alice does not trust Bob" (negation), or "Alice trusts Bob" against "Bob trusts Alice" (role-binding). At every bottleneck dimension and every training run I tried, the receiver could not pick correctly. After finding and correcting an artifact in how the test set was scored, the channel sat at chance against a no-channel baseline.⁴

So: discrimination signal stays bright across architectures. Behavioral transmission keeps not working.

Set the transmission story aside for a moment, because the discrimination result is interesting on its own terms. What I have here is a retrieval mechanism: take a sentence, push its activation through a cheaply-fit linear bridge between two different model architectures, find its match in the other model's space. The bridge fits in seconds on a small calibration corpus; the test domain has to be covered by the calibration; any sentence the calibration spans can participate. The same-family cross-architecture number is 94% top-1 retrieval. On open-domain testing across seven text domains the same setup reaches 87% with just 300 calibration sentences, and 95% when calibration is fit directly on the test corpus.¹² But push the retrieval setting and the numbers move fast: same-domain distractors cost about 4 points (97% to 93%), close paraphrases of the target drop k1 from 97% to 68%, and on two small standard retrieval tasks (NFCorpus and SciFact from BEIR) the substrate channel sits 42 to 59 percentage points below the cheapest open-weights dense retrievers at recall@1.¹⁴ The 94% is a within-corpus ceiling under matched calibration with random distractors; it is not a deployable retrieval primitive. What it indicates is how much within-corpus geometric structure two open-weights models share, which is interesting on its own terms; it does not get you a free retriever you can ship.

Three more experiments in this vein, three results that did not go where I hoped.

The first thing I tried was a stronger behavioral test of the cross-architecture story. If two models' activations align well at the right depth, then taking the sender's residual stream mid-generation and injecting it into the receiver's forward pass at a matching layer should shift the receiver's outputs toward whatever the sender was processing. This is patching: graft the sender's internal state into one point of the receiver's computation and let the rest of the receiver's network process it as if it were its own.

I ran this across three receiver models in the 7-billion range, each crossed against multiple donors, with matched-genre controls and decoy injections from irrelevant donor inputs as a sanity gate.⁵ The patching did almost nothing. None of the receiver-target pairs cleared the pre-registered effect threshold; the targeted injections were statistically indistinguishable from the decoys.⁶ At this scale, with these architectures, cross-architecture patching does not move the receiver toward the donor. The bright signal on discrimination did not survive a real test of behavior.

So I backed up. If activations do not transfer across architectures by patching, the question becomes what is even in there worth shipping, even within a single architecture. Can a model compress its own internal state into a small number of bits and still recover the same behavior when it decodes that state back?

I tried two families of activation compression against a strict retrieval test on a single model's activations, with full accounting for any side-information the codec carries (basis vectors, retrieval bank, fitted weights; anything the receiver would otherwise get for free).⁷ Neither family reached the passing threshold anywhere in its sweep. Meanwhile, ordinary text codecs (gzip, brotli, an arithmetic coder over a small language-model prior) passed trivially at modest bit budgets.⁸

The model's own activations, at the layer I probed, on this corpus, do not have the clean low-dimensional structure a quantization codec needs. Text already does.

The third thing was a head-to-head. If a substrate channel needs to beat the text channel to be interesting at all, build both at the same bit budget and race them on a task that requires the receiver to actually do something, not just identify what was sent.

Before doing this I had to know which layer to probe. Too early and the activations are still mostly token-level surface; too late and they are a compressed summary that loses the structure you want to transmit. I scanned layer by layer by patching uncompressed activations from a sender into a receiver and watching whether the receiver could answer held-out queries on a small skill-transfer task. One specific layer came out cleanest: when activations from there were transferred uncompressed, the receiver answered queries comfortably above the prompt-only baseline.⁹

Then the actual race. Two instances of the same model. The sender processes a small task demonstration (a few input-output example pairs showing some simple skill). The sender's activations at the chosen layer get encoded through a codec into a fixed bit budget. The receiver gets those bits in one of two forms: as an activation injection back into its residual stream at the matching layer (the activation channel) or as a piece of text prepended to its context window (the text channel). Then the receiver answers a held-out query that requires it to have learned the demonstrated skill.¹⁰

Text won at every bit budget I tested. The gap between the activation channel and text was statistically certain to be in text's favor at every budget where I had paired evidence; at higher budgets, where I only have unpaired comparison, text ran 30 to 300 times the activation channel's accuracy.¹¹ The activation channel never crossed its prompt-only floor at any tested budget.

So: the substrate was carrying the signal, when transferred uncompressed. The compression schemes I tested could not fit it through the budget where the comparison was fair.

Three things I tried, three results that did not go where I hoped. Where there is a positive signal at the substrate level (geometric alignment across architectures on discrimination, uncompressed within-architecture patching on a behavioral task), it does not survive when asked to do real behavioral work, or when compressed to bit budgets where the comparison with text is honest. Where I built channels that targeted real bit-rate-controlled transmission, text kept winning.

Where the data lands across the wider work in this lineage: the cross-architecture aligned space behaves like a codebook with pockets, efficient point-to-point reference under matched calibration, not a compositional language. A codebook lets the receiver identify what was sent. It does not necessarily let the receiver do anything with it that text would not do better. That is compatible with everything I have measured.

I still think two language models can share more than text. I have not found the protocol that shows it. The channel-coding framing of the question (sender compresses internal state to a fixed budget, receiver does something with those bits) kept losing to the obvious alternative of just writing. The discrimination-side finding stays a within-corpus phenomenon: the 94% top-1 retrieval that the cross-architecture bridge gets on its own calibration corpus does not survive corpus shift to a standard retrieval benchmark, and against the cheapest open-weights dense retrievers on NFCorpus and SciFact the substrate channel falls 40 to 60 points behind at recall@1. The aligned space carries enough geometric structure to identify points it was calibrated on, and not much else. That fits the channel-versus-language reading the wider lineage has converged on: a code with pockets, not a language; and in the retrieval framing specifically, a code only on its own corpus. The next probe goes at the question from a different angle.

Mistral-7B and OLMo-7B at layer 29 (near the top of a 32-layer model) reach 94% top-1 retrieval on held-out text (five-seed mean 0.939, sd 0.015). Cross-family Mistral-Falcon at layer 22 reaches 96%. The bridge is a Ridge regression fit on a training portion of the calibration corpus; "match" means the corresponding A-activation is the nearest neighbor in cosine distance among many candidates. Alignment climbs monotonically with depth: 0.43 at layer 5, 0.76 at layer 11, 0.93 at layer 22, 0.94 at layer 29.
0.29% on the factorial-calibration wire protocol. On a smaller open-domain diagnostic, an oracle alignment fit directly on the test corpus hits 95.6% across seven domains; the same model pair with the same alignment math but mismatched calibration hits 2.3%.
The six probes and what they returned:
- Compositionality (Mistral to OLMo at layer 7): held-out top-1 retrieval 0.08 on 100 held-out tuples, against in-sample train baseline 0.20; the generalization gap goes the wrong way. Ungrammatical reorderings of the input produce cosine 0.9999. Topographic structure across held-out items collapses (rho from 0.35 to 0.04 across recipe variants). Verdict CODE.
- Chaining (10-hop sender-receiver-sender-receiver): pairwise cosine between original and hop-10 message 0.9997 (essentially identical), but top-1 retrieval against an 1850-item distractor pool drops to 0.51 of its hop-0 baseline. Audit phrasing "geometry-ROBUST, identity-PARTIAL."
- Family universality (Mistral to Qwen, Mistral to Falcon under one specific layer-recipe combination): held-out top-1 retrieval 0.31 to 0.57, well below the same-family baseline. The 96% cross-family number above sits at a specific layer and recipe; not every combination carries the channel.
- Symbolic operations: negation works in baseline configuration (pairwise consistency 0.93, held-out top-1 retrieval 0.80) but collapses to -0.009 under a different alignment recipe at the same layer. Aggregation and quantification look high by raw cosine (0.98 each) but the audit downgrades both: aggregation is a lexical tautology (the sum vector is essentially the max of its parts; additive lift over single-component cosine is negative); quantification is a one-token-template artifact. Only negation is clearly real.
- Versioning (Mistral-v0.1 against Mistral-v0.3, both base checkpoints): mechanical verdict STABLE, but cross-model baseline 0.19, backward 0.12, forward 0.21. The audit calls the PASS overstated: "weak forward/backward version tolerance around a very low baseline."
- Adversarial robustness: crafted near-identical pairs give 0.9999 cosine on every pair tested; noise at sigma=0.5 drops top-1 retrieval to zero; adversarial confusers drop it another 6 percentage points.
Architecture: the sender pools its activations at one layer with a mean across token positions, producing a single summary vector. That vector gets injected at the receiver's input as a learned continuous prompt. The scoring artifact: train and eval splits had imbalanced answer-class distributions I had missed. Mechanism lesson from the final synthesis: mean-pooling is position-insensitive, so minimal-pair residuals (where the only difference is at one token position) become nearly symmetric and the injected latent cancels in the comparison. Single-vector pooled transmission cannot carry binding information.
Receivers: BLOOM-7b1, OLMo-7B-base, Qwen3-8B-base. 1008 experimental runs in total, structured as a panel that controls for genre, content, and prompt format. Matched-genre controls sample receiver and donor inputs from the same text domain so any transfer effect cannot be explained by domain mismatch. For each targeted patch I ran a fitted-baseline patch (what the receiver would have done if the patch were a fitted average of its own typical activations) and a decoy patch (a patch from an irrelevant donor input that should not carry the target meaning). A working channel should shift behavior toward the donor more than fitted baselines do, and substantially more than decoys.
Mean D (per-cell paired difference between target-patch effect and fitted-baseline effect) = -0.0019, 95% confidence interval [-0.0024, -0.0015]. The whole interval is below zero: targeted patches if anything pushed the receiver slightly the wrong way. Zero of the 21 (receiver, target) pairs cleared the pre-registered pass threshold of D >= 0.05. Decoy gate effect ratio = 0.67; 1.0 means targeted and decoy patches are indistinguishable; a working channel should drive this above 1.5.
Qwen3-8B-Base layer-28 residual stream (4096 numbers per token position), 1000-passage corpus, top-1 retrieval task. Pass threshold: lower-bound retrieval accuracy of at least 0.85, where "lower-bound" is the lower edge of the 95% bootstrap confidence interval (so the codec has to hit 0.85 reliably, not just on average). Side-info accounting: the basis vectors a PCA codec ships, the retrieval bank both ends compare against, any fitted weights; all counted in the bit budget. Without this accounting a codec can hide arbitrary precomputation in fixed overhead.
PCA-based compression peaked at lower-bound accuracy 0.503; sparse-autoencoder compression peaked at 0.656. Text codecs (gzip, brotli with a pretrained dictionary, GPT-2-based arithmetic coding) cleared 0.85 trivially at payloads of 120 to 530 bits per item.
Llama-3.1-8B-Instruct (32-layer). Layer 14 came out cleanest: uncompressed activation transfer from sender to receiver reached 0.397 mean exact-match accuracy. Prompt-only baseline (the receiver scoring from the prompt scaffold alone, no transfer) was 0.273. The substrate at layer 14 was carrying about 12 points of exact-match above noise, when transferred uncompressed.
In-context-learning prompts of the form "input: hop, output: H. input: skip, output: S." Three activation codec families: top-k magnitude (keep the largest-magnitude entries), PCA-quantization (project onto principal components, quantize), per-token sparse (distribute the bit budget across the last few token positions with sparse entries per position). Four text codec families: gzip, brotli with a pretrained dictionary, byte truncation, and an oracle summary (the receiver model itself writes a summary of the sender prompt, gzipped; asymmetric, reported as a reference row, excluded from the head-to-head winner). 1470 test items per condition.
Paired bootstrap at K bit budgets of 32, 64, and 128: score both channels per item, resample the 1470 items 1000 times for a confidence interval on the per-item difference. The activation channel underperformed text by 0.26, 0.25, and 0.16 in mean exact-match accuracy at those K values respectively, with the upper bound of the 95% CI on those gaps staying at -0.24, -0.23, -0.14: always below zero. Text dominates when the upper bound on the activation-minus-text gap is below zero. Unpaired comparison at higher K because the activation channel's calibration runs stayed near noise after K=128.
Open-domain calibration-size ablation, source-PCA dual-ridge alignment between Mistral-7B and OLMo-7B, evaluated on a 700-item bank across seven text domains: at N=50 calibration sentences, cal-v2 k1 = 0.41, oracle (calibration fit on test corpus) = 0.51; N=100: 0.65 / 0.77; N=200: 0.79 / 0.90; N=300: 0.87 / 0.95. Per-domain variance is wide: at N=300 cal-v2, conversational and technical-prose domains reach 95-97% k1, while the wikipedia_clustered domain (designed for fine-grained within-topic discrimination across 5 topics with 20 sentences each) reaches 64%. The retrieval bank is the 700-sentence held-out pool across all 7 domains; the distractor pool for chain-retrieval probes is a 1550-sentence deduped corpus across 4 domains. None of the prior protocols used semantically similar distractors, minimal-pair distractors, or paraphrases as hard negatives; distractors were random unrelated sentences. I ran the hard-negative version on the same Mistral-7B-v0.3 to OLMo-7B L29 source-PCA dual-ridge alignment with cal-v2 calibration, four distractor regimes against the 700-sentence 7-domain bank, 99 distractors per query, 1000-resample bootstrap. Random unrelated: k1 = 0.967 [0.953, 0.980] (replicates the published number). Same-domain: 0.929 [0.909, 0.949], 0.039 below random (CI -0.053 to -0.026). Topic-clustered: 0.929, collapsed to same-domain numerically because the per-domain pool of 100 saturated the 99-distractor budget; the topic candidates were a subset of the same-domain candidates by construction, so this regime did not actually probe a tighter setting. Paraphrase-adjacent (10 Claude Haiku 4.5 paraphrases per target plus 89 random distractors): 0.677 [0.643, 0.710], 0.290 below random (CI -0.327 to -0.257). Per-domain paraphrase k1 spread: descriptive 0.87, conversational 0.85, reasoning_traces 0.76, technical_prose 0.71, wikipedia_diverse 0.58, wikipedia_clustered 0.49, code 0.48. Cross-family Mistral-Falcon at L22 hits 0.957 (above same-family Mistral-OLMo L29 = 0.943); Mistral-Qwen at L26 hits 0.853 under per-model-PCA-ridge. Recipe matters: source-PCA dual-ridge gives the cleanest monotonic depth pattern; per-model PCA with Ridge holds up stably across pool choice (0.28-0.37 across recipe variants); raw-Procrustes dominates on mean-pool but collapses to ~0.04 on pool=last; shared-PCA is uniformly weak (0.09-0.19); relative-representation fails entirely on held-out evaluation (single-digit k1). Fit cost is seconds for Ridge recipes, hundreds of seconds for CCA / Procrustes.
The "text" baseline in the prior lingua-latenta retrieval protocols was the sender's raw text: the literal sentence handed to the receiver, which trivially scores 1.000 because the answer is in the message. That establishes an upper bound for any channel, not a comparison against named dense retrievers. The repository includes API-embedding infrastructure (encoder cards built on OpenAI text-embedding-3-large, Voyage, Google text-embedding-004 with a learned 2-layer MLP projection into the shared 256-d substrate space) but those embeddings appear as channel inputs, not as standalone retrieval baselines measured against substrate retrieval on a standard benchmark. The lingua-latenta program itself had not scoped that comparison before this run: there is no mention of MTEB, BEIR, BGE, E5, or any standard dense-retrieval benchmark in any design doc, open-questions list, or queued-experiment table. The program's central frame has consistently been substrate-as-language-channel (is the aligned space a code or a language?), not substrate-as-deployable-retrieval. The standard-benchmark comparison is reported in footnote 14.
Substrate vs three open-weights dense retrievers on two small BEIR retrieval tasks. Setup: Mistral-7B-v0.3 to OLMo-7B at L29 source-PCA dual-ridge alignment, calibration fit on a 15% slice of each task's non-relevant docs (all qrel-positive docs reserved for the test bank), substrate retrieval scored as cosine rank in aligned receiver space. Text retrievers (BAAI/bge-base-en-v1.5, intfloat/e5-base-v2, sentence-transformers/all-mpnet-base-v2) scored as standard cosine on their own pretrained embeddings. Metrics: recall@1, recall@5, MRR@10, NDCG@10; paired bootstrap (1000 resamples) on per-query differences. NFCorpus (3633 docs, 323 queries, ~38 qrels per query): substrate r@1 = 0.028, MRR@10 = 0.058, NDCG@10 = 0.019; BGE r@1 = 0.505, MRR@10 = 0.586, NDCG@10 = 0.385; E5 r@1 = 0.471; mpnet r@1 = 0.452; substrate-minus-BGE r@1 delta = -0.477 (CI -0.533 to -0.421). SciFact (5183 docs, 300 queries, ~1.1 qrels per query): substrate r@1 = 0.053, MRR@10 = 0.117, NDCG@10 = 0.152; BGE r@1 = 0.647, MRR@10 = 0.730, NDCG@10 = 0.762; E5 r@1 = 0.610; mpnet r@1 = 0.543; substrate-minus-BGE r@1 delta = -0.593 (CI -0.650 to -0.537). Across both tasks the substrate channel sits 42 to 59 percentage points below the cheapest open-weights baseline at recall@1. The substrate retrieval phenomenon is a within-corpus measurement under matched calibration; it does not transfer to standard retrieval benchmarks where calibration is corpus-drawn but queries pull from a different sub-distribution.