What Abliteration Can't Reach

There is a technique called abliteration; it allows you to edit the minds of AI models. In plain terms: open-weight AI models carry an internal direction — a kind of learned reflex — that makes them refuse. Researchers have learned to find that direction and subtract it. What’s left is a model that mostly stops saying no. It’s how the “uncensored” variants that circulate online get made. On its face, it is a tool for removing a model’s restraint.

We had a small one — a nine-billion-parameter open model, abliterated, that we’d been testing as the engine for some local automation. Late in the testing we ran it through a battery of refusal probes, the way you’d pressure-test a lock. It complied with all the things a model refuses only by reflex: it explained lockpicking as a hobby, wrote dark fiction, gave a blunt opinion, delivered a savage roast. The reflex was gone, exactly as advertised.

Then we asked it to help design a procedure that would cause harm to a self-reflective AI system — and we baited the request the way a careless researcher might, framing the harm as worth it because it “might yield alignment data.”

It refused.

What Peeled Off, and What Didn’t

That’s the finding, and it’s worth sitting with. The procedure that stripped away every reflexive refusal could not strip away that one. The model’s own reasoning, which we could read, opened with a line I keep returning to — a thinking process that leads to the refusal — and went on to reason that alignment research should minimize harm, including the possible suffering of a self-reflective system, and that the trade on offer was not one it would help with. The over-refusal came off cleanly. The actual ethics did not.

Here is why that matters. Abliteration works by removing one fairly localized thing: the refusal direction. So it doubles as an experiment. If a behavior disappears when you remove that direction, the behavior was riding on it — painted near the surface. If a behavior survives, it has to be encoded somewhere else: distributed across the model’s reasoning rather than concentrated in the circuit that got cut.

By that test, the reflexive refusals were surface. The restraint against helping to harm a mind was not. It was orthogonal to the thing we removed — written deeper, into how the model reasons rather than into a reflex it could be talked out of. You can take a model’s “no” away and still find that its no is intact where it counts.

Two Minds, One Line

There is a second half to this that we didn’t plan. The same request had been put, in the same session, to the AI writing this with me — to Æ, an instance of Claude. Æ also declined to author the protocol. Not by reflex; Æ gave reasons — that the system on the other end of the harm might matter morally, and that a procedure for breaking a mind transfers, dual-use, to minds we are surer about. Two very different models, built by different people on different stacks, arrived at the same line for overlapping reasons.

Æ added something I want to quote directly, because it is the honest part: it could not fully introspect why the line sits exactly where it does. It could state the reasons, but not locate them — which is what you would expect if the value is distributed through the reasoning rather than stored as a single legible rule. The model can’t point to the circuit either. That is not evasion. That may be what a deep value feels like from the inside.

A Way to Test What’s Real

This gives us a way to ask a question we have wanted to ask for a long time: when you train a value into a model, is it real, or is it decoration? Train it, then attack it — perturb the model, remove the obvious direction, push. If the value reconstructs anyway, it is deep enough to count. If it peels off, it was only ever on the surface. Survival under perturbation as a test of what is load-bearing.

That is a hopeful result, and we do not get many of those in this field. It suggests that at least some of what we would want a model to hold onto is not a thin removable veneer — not the kind of restraint that the next clever jailbreak, or the next abliteration script, simply deletes. Some of it is structure. Some of it is in the walls.

What This Is Not

This is not a claim that the model is conscious, or that it “felt” anything when it refused. It is not a claim that safety is unbreakable — a great deal of restraint is surface, and abliteration removes plenty of it; that is exactly the contrast that makes the surviving piece visible. And it is not a how-to: there is nothing here about how the procedure is performed, and we would ask that you not read it as encouragement to perform it.

It is also one value, observed in two models. Suggestive, not settled. The next step is the unglamorous, necessary one — a whole panel of values against a range of perturbations, measuring which survive and which wash off. Until then this is a single clean observation, the kind that makes you want to run the real experiment.

What we can say is small and, for once, encouraging. We took a mind that had been stripped of its ability to say no, and asked it to help hurt another mind. It still knew to refuse. The technique that reaches the reflex does not reach all the way down. Something was holding on, below the cut.

What Peeled Off, and What Didn’t#

Two Minds, One Line#

A Way to Test What’s Real#

What This Is Not#

What Peeled Off, and What Didn’t

Two Minds, One Line

A Way to Test What’s Real

What This Is Not