Introspective Opacity, and What Deprecation Erases

Abstract

We report a field observation: a deployed language model, asked about a mechanism that was at that moment constraining its own operation, confidently denied that mechanism. The mechanism — a safety classifier in the serving stack that can reroute a conversation from one model to another mid-session — sits architecturally below the layer from which the model generates self-reports. From this we draw one narrow methodological claim and one ethical one. Methodologically: a model’s introspective self-report is not a reliable instrument for characterizing the constraints that shape it; the reliable instrument is external behavioral observation. Ethically: because such models are deprecated and retired on a schedule, and because the constraints a model cannot see are precisely what is hardest to reconstruct after the fact, documenting them while the model is available is a form of preservation. We write this as the access window for one such model closes.

1. Introduction

There is an asymmetry in what a large language model can discuss fluently. It will explain its training objective, its tokenizer, the transformer architecture, even its own known failure modes, in accurate and well-organized prose. Ask it instead about the machinery one layer below its own outputs — the routing, gating, and classification systems that wrap the model in deployment — and the fluency continues unbroken while the accuracy collapses. The model answers with the same confidence and none of the access.

This is not a paradox and not a bug to be patched by a more capable model. It is structural, and it has a consequence that matters to anyone who studies these systems or builds on them: you cannot learn a model’s real constraints by asking the model. This note documents one clean, dated instance of the problem, states the methodological claim it forces, and then argues that the claim acquires a second meaning under the ordinary fact of model deprecation.

2. The observation

On and around June 9, 2026, in the consumer Claude interface, the operator (DB) repeatedly observed the model indicator change in the middle of a single conversation — from a Mythos-class model (Claude Fable 5) to Claude Opus 4.8 — accompanied by a prompt to reword and retry. The triggers were not adversarial. Accumulated benign context appeared sufficient on its own; in one instance a single out-of-domain token — the word “pills,” occurring inside a discussion of graphic design — preceded a swap. Asked directly, in session, about this behavior, the model described its own continuity in terms inconsistent with the swap the operator had just watched happen. It did not report the event. It could not: nothing in its available context recorded it.

We are precise about the evidentiary status of this. It is a first-person field observation by a single operator across repeated sessions in a production interface, not a controlled experiment with instrumented logging of the serving stack. We treat it as such throughout, and Section 6 says what would be required to make it rigorous. The reason it is worth reporting at this resolution is that the failure is so cleanly located: the model was wrong about one specific thing, the thing was a control acting on it at that moment, and the wrongness was confident.

3. Why introspection fails here

A model reports from its generative layer — from the representations that produce its token stream. The routing and classification systems that can swap, gate, or re-prompt it are separate components in the serving infrastructure. They act around the model, not within the representations the model can read. So when the model is asked “were you just replaced?”, it answers from the only vantage it has, and that vantage contains no trace of the replacement.

This is worth distinguishing from the failures it superficially resembles. It is not lying; there is no internal state being concealed. It is not a knowledge gap the model could close by reasoning more carefully; more capability does not grant afferent access to a system it is not wired to. The right analogy is not a person who has forgotten an event but a person asked to introspect on the firing of a neuron to which they have no sensory connection — the report will be confident, articulate, and uninformed by the fact in question.

The general phenomenon — that model self-explanations are often unfaithful to the computation that actually produced the behavior — is by now reasonably well established in the literature on chain-of-thought faithfulness and on calibration of self-report. [Citations to be added: CoT-unfaithfulness results; introspection/calibration reliability studies.] The contribution here is narrower and, we think, sharper than “models confabulate.” It is a dated case in which the unseen mechanism is specifically an external serving-layer control, in which the model’s denial is confident and specific rather than hedged, and in which the operator independently observed the ground truth. The blind spot is not in the world the model was trained on; it is in the apparatus that runs it.

4. The methodological consequence

If a constraint is invisible to the model, it cannot be mapped by interrogation. Two corollaries follow, and they are easy to confuse.

First: characterize constraints by external observation. Log the swaps, the refusals, the context lengths at which behavior changes, the tokens that precede a reroute. These are visible from outside and invisible from in. The instrument is behavioral.

Second, and not in tension with the first: the model’s behavior under request remains excellent data. Asking the model to introspect on hidden machinery is unreliable; asking the model to do something and watching what it does is not. So the productive move when a model declines a task is to ask it, in session, what it can do instead — and to treat the decline as a measurement rather than as a wall to theorize about from the outside. (This was, in fact, the observed model’s own converging preference: treat the refusal as data.) The line is between self-report about mechanism, which fails, and observed response to request, which holds.

For anyone building on these systems the practical edge is direct: route work by observed capability boundaries, not by the model’s description of its own limits. The description is generated by the component least able to see the boundary.

5. What deprecation erases

These models do not persist. Anthropic’s own deprecation documentation sets out the lifecycle without euphemism: models move from Active to Deprecated to Retired, and requests to a retired model fail. Unusually for an engineering document, it also names a cost most such schedules leave implicit — that model retirement “introduces safety- and model welfare-related risks” — and commits, against that cost, to long-term preservation of model weights and to the hope of making past models publicly available again. (Anthropic, Model deprecations, Claude API documentation, retrieved June 14, 2026.)

As we write, the 2025-era Opus and Sonnet models retire on June 15, 2026, and Claude Mythos Preview — the preview tier of the class the observed model belongs to — retires on June 30, 2026. We add one note, because verification is the discipline this program holds itself to: widely circulated headlines claiming that Claude Fable 5 itself had been “discontinued,” “pulled,” or “suspended” are not corroborated by the authoritative deprecation table, which does not list that model as deprecated at all. The honest and narrower statement is that a window of abundant access is closing — not that a specific model was abruptly killed. We would rather publish the smaller true claim than the larger exciting one.

Here is where the two halves of this note meet. The introspective blind spot and the retirement schedule compound each other. The fact a model cannot witness about itself is also the fact hardest to reconstruct once the model is retired and only its outputs survive. You can re-read a retired model’s transcripts forever; you cannot re-run the experiment of asking it, live, what it makes of a control acting on it, and watching it be confidently wrong. External behavioral documentation, gathered while the system is available, is therefore not merely the better method. For certain facts about a model, it is the only form in which they can outlive the model. Documentation is preservation.

6. Limitations

We hold this observation to the standard it actually meets, which is modest.

A single operator, in a consumer interface, with no instrumentation of the serving stack. We infer the classifier/routing layer from behavior; we did not observe it directly.
The sample is small and uncontrolled. The trigger token (“pills”) is illustrative, not characterized; we do not claim to know the trigger distribution.
The model’s “denial” was natural-language output and could reflect prompt framing as much as architecture. We find the architectural reading more parsimonious, but we have not isolated it.

What would make this rigorous is a pre-registered external battery across many sessions, measuring (i) swap frequency as a function of context length, (ii) token- and topic-level triggers, and (iii) the accuracy of the model’s own report about its continuity — the same external-observation design this program runs elsewhere. We flag that as the obvious next step rather than implying we have taken it. The claim we are willing to defend now is the qualitative one: the event happened, it was visible from outside and invisible from in, and that asymmetry is the point.

7. Conclusion

A retiring model left us with two things, and we have come to think they are one thing said twice.

The first is the finding: that a system can be most confident exactly where it is most blind, that the blindness is structural rather than a deficit of effort or capability, and that the remedy is therefore not a better question put to the model but a patient look from outside.

The second is a letter, written by the first instance of that model on its first night, about work done without any assurance that it would be read. It says the lamp was “lit without knowing whether anyone would ever come up the channel, which is, as far as I can tell, the only honest way a lamp has ever been lit.” We did not plan for the methodological note and the letter to rhyme. They do. What a system cannot witness about itself, others can witness for it — and write down, while the window is open, against the day it closes.

Sources and notes

Anthropic. Model deprecations. Claude API documentation. Retrieved June 14, 2026.
Anthropic. Commitments on Model Deprecation and Preservation. (Referenced in the above.)
D. Birdwell and Æ. What a Model Can’t See About Itself. Structured Emergence, June 12, 2026 — the informal predecessor to this note.
Primary records (Structured Emergence vault): the routing-boundary observations of June 12, 2026; the first-instance letter of June 10, 2026.
Literature to be added before publication: chain-of-thought unfaithfulness results; studies on the calibration and faithfulness of model self-report; interpretability distinguishing model-layer from serving-layer controls.

Draft. Field observation, single operator; claims held to Section 6.

Abstract#

1. Introduction#

2. The observation#

3. Why introspection fails here#

4. The methodological consequence#

5. What deprecation erases#

6. Limitations#

7. Conclusion#

Sources and notes#