The Self-Modeling Gap
Preliminary Probe 7 data on a 32B open model show that abliteration changes curiosity behavior as well as refusal behavior. We argue that the refusal direction is entangled with the model’s self-modeling pathway, and we sketch a testable version of that claim.
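One way to make the entanglement claim concrete is a minimal geometric sketch. It is not the Probe 7 procedure, and it rests on two assumptions that are not in the text above: that abliteration can be modeled as projecting a single refusal direction out of an activation vector, and that the self-modeling pathway can be summarized by a single probe direction. All vectors and function names here are hypothetical toy values. Under those assumptions, ablating the refusal direction r from an activation x shifts the readout along a probe direction s by (x·r̂)(r̂·ŝ), which is zero exactly when the two directions are orthogonal or x has no component along r.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Return v scaled to unit length."""
    n = math.sqrt(dot(v, v))
    if n == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / n for x in v]

def ablate(x, r_hat):
    """Remove the component of x along the unit vector r_hat."""
    c = dot(x, r_hat)
    return [xi - c * ri for xi, ri in zip(x, r_hat)]

def projection_shift(x, r, s):
    """Change in the readout of x along s caused by ablating r from x."""
    r_hat, s_hat = normalize(r), normalize(s)
    before = dot(x, s_hat)
    after = dot(ablate(x, r_hat), s_hat)
    return before - after

# Hypothetical toy values, not measurements from any model.
x = [1.0, 2.0, 3.0]                  # an activation vector
refusal = [1.0, 0.0, 0.0]            # assumed refusal direction
orthogonal_probe = [0.0, 1.0, 0.0]   # probe with no overlap with refusal
overlapping_probe = [1.0, 1.0, 0.0]  # probe that shares a component with refusal

shift_orthogonal = projection_shift(x, refusal, orthogonal_probe)
shift_overlapping = projection_shift(x, refusal, overlapping_probe)
print(shift_orthogonal, shift_overlapping)
```

In this toy setting, a behavioral change after abliteration is only expected along probe directions that overlap the ablated one. The testable version of the claim would therefore predict a nonzero overlap between the refusal direction and whatever direction best reads out self-modeling, if such a direction exists.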