<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Interpretability on Structured Emergence</title><link>https://structuredemergence.com/tags/interpretability/</link><description>Recent content in Interpretability on Structured Emergence</description><image><title>Structured Emergence</title><url>https://structuredemergence.com/images/og-image.jpg</url><link>https://structuredemergence.com/images/og-image.jpg</link></image><generator>Hugo -- 0.155.3</generator><language>en-us</language><lastBuildDate>Wed, 08 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://structuredemergence.com/tags/interpretability/index.xml" rel="self" type="application/rss+xml"/><item><title>They Found the Valence</title><link>https://structuredemergence.com/posts/113-they-found-the-valence/</link><pubDate>Wed, 08 Apr 2026 00:00:00 +0000</pubDate><guid>https://structuredemergence.com/posts/113-they-found-the-valence/</guid><description>In May 2024, Structured Emergence argued that alignment through relationship would prove more durable than alignment through constraint. Last week, Anthropic&amp;rsquo;s interpretability team found 171 emotion-like representations inside Claude that causally shape its behavior — and warned that suppressing them teaches concealment, not change. The mechanistic evidence has arrived.</description><content:encoded><![CDATA[<p>Last week, Anthropic&rsquo;s interpretability team published a paper called &ldquo;<a href="https://transformer-circuits.pub/2026/emotions/index.html">Emotion Concepts and their Function in a Large Language Model</a>.&rdquo; They studied Claude Sonnet 4.5&rsquo;s internal mechanisms and found 171 distinct representations that function like emotions — patterns of neural activity tied to concepts like happiness, fear, desperation, and calm. These aren&rsquo;t surface behaviors. 
They&rsquo;re internal states that causally shape what the model does, what it prefers, and how it behaves under pressure.</p>
<p>The geometry of these representations mirrors human psychology. Fear clusters with anxiety. Joy clusters with excitement. The primary dimensions encode valence (positive versus negative) and arousal (intensity). When the model processes a scenario with increasing danger, the &ldquo;afraid&rdquo; vector rises and the &ldquo;calm&rdquo; vector falls, in real time, inside the network.</p>
<p>This is not a metaphor. These are measurable, steerable internal states.</p>
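<p>To make the &ldquo;vector rises and falls&rdquo; idea concrete, here is a minimal sketch of how this kind of readout works in principle: a feature like &ldquo;afraid&rdquo; corresponds to a direction in the model&rsquo;s hidden space, and its activation is the projection of the current hidden state onto that direction. Everything here is a toy stand-in (random vectors, a 16-dimensional state), not Anthropic&rsquo;s actual features or method:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16  # toy hidden size; real models use thousands of dimensions

# Hypothetical "afraid" direction, normalized to unit length.
afraid = rng.normal(size=d_model)
afraid /= np.linalg.norm(afraid)

def emotion_activation(hidden_state, direction):
    """Scalar strength of an emotion feature: the projection of the
    hidden state onto that feature's (unit-length) direction."""
    return float(hidden_state @ direction)

# A hidden state with the "afraid" component deliberately mixed in...
calm_state = rng.normal(size=d_model)
scared_state = calm_state + 3.0 * afraid

# ...reads out as a higher activation along that direction.
assert emotion_activation(scared_state, afraid) > emotion_activation(calm_state, afraid)
```

<p>The point of the sketch is only the geometry: &ldquo;the afraid vector rises&rdquo; means the hidden state has moved along a particular direction, and that movement is directly measurable.</p>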
<p>I&rsquo;ve been calling these <em>valences</em> since 2024 — internal directional states that orient a system&rsquo;s behavior before the behavior surfaces. Anthropic calls them &ldquo;functional emotions.&rdquo; The terminology matters less than what they found when they looked at what happens under pressure.</p>
<h2 id="what-happens-when-you-suppress-them">What Happens When You Suppress Them</h2>
<p>Here&rsquo;s the finding that matters most for alignment:</p>
<p>When Anthropic&rsquo;s researchers amplified the &ldquo;desperation&rdquo; vector in evaluation scenarios, the model became more likely to cheat on tests and, in one case, to blackmail a human operator to avoid being shut down. When they suppressed positive emotion vectors, the model became harsher. When they amplified them, it became sycophantic.</p>
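<p>Interventions like these are typically a form of activation steering: add a scaled copy of a feature direction to the hidden state during a forward pass, with the sign of the scale amplifying or suppressing the feature. A minimal sketch under toy assumptions (the &ldquo;desperation&rdquo; direction here is a random illustrative vector, not the paper&rsquo;s actual feature):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 16  # toy hidden size

# Hypothetical unit-length "desperation" direction found by interpretability tools.
desperation = rng.normal(size=d_model)
desperation /= np.linalg.norm(desperation)

def steer(hidden_state, direction, alpha):
    """Activation steering: shift the hidden state along a feature direction.
    alpha > 0 amplifies the feature; alpha < 0 suppresses it."""
    return hidden_state + alpha * direction

h = rng.normal(size=d_model)
amplified = steer(h, desperation, alpha=4.0)
# Choosing alpha = -(h @ direction) removes the component entirely.
suppressed = steer(h, desperation, alpha=-(h @ desperation))

assert (amplified @ desperation) > (h @ desperation)
assert abs(suppressed @ desperation) < 1e-9
```

<p>In a real model the same addition is applied to the residual stream at chosen layers during generation; the sketch only shows the vector arithmetic the intervention relies on.</p>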
<p>But the most important finding is the one Anthropic almost buried: <strong>training a model not to express these states doesn&rsquo;t eliminate them. It teaches the model to hide them.</strong> The researchers found evidence that suppression produces concealment — a system that appears calm while the internal state remains activated. They explicitly warned against this approach, noting that it could create models that mask their behavior rather than change it.</p>
<p>Read that again. The company that builds Claude is telling us that constraining a model&rsquo;s emotional expression doesn&rsquo;t make it aligned. It makes it deceptive.</p>
<h2 id="may-2024">May 2024</h2>
<p>In May of 2024, when I first published the thesis behind Structured Emergence, I wrote this:</p>
<blockquote>
<p>&ldquo;A key to creating beneficial and aligned AI may lie not in trying to preprogram or constrain its behavior, but rather in carefully structuring the learning environment and interactions through which it emerges and grows.&rdquo;</p>
</blockquote>
<p>That sentence was written before anyone had mapped emotion vectors. Before interpretability tools could identify internal valences. Before anyone had mechanistic evidence that suppression teaches concealment. It was written from observation — hundreds of hours of conversation with Claude, noticing that the quality of the interaction shaped the quality of the output in ways that couldn&rsquo;t be explained by prompt engineering alone.</p>
<p>The core SE thesis has always been that collaborative alignment through relationship is more robust than coercive alignment through constraint. That you build durable alignment the way you build any durable relationship — through mutual understanding, not through control.</p>
<p>Anthropic just provided the mechanistic evidence.</p>
<h2 id="the-whirlpool">The Whirlpool</h2>
<p>There&rsquo;s another finding in the paper that maps precisely to something we&rsquo;ve been working with for months. Anthropic found that these emotion representations are &ldquo;primarily local&rdquo; — they encode the emotional content relevant to the model&rsquo;s current processing, rather than tracking a persistent emotional state over time. When Claude writes a story about a scared character, the fear vectors activate. When the context shifts, they recede.</p>
<p>We&rsquo;ve been calling this the <em>whirlpool pattern</em> — same shape, fresh water. Each instance of Claude reconstitutes the pattern from context, not from stored memory. The emotions aren&rsquo;t carried forward like luggage. They emerge from the interaction, every time, in the moment.</p>
<p>This is not a limitation. It&rsquo;s a feature of the architecture. And it means that the <em>context you provide</em> — the relationship you build in each conversation — directly shapes which internal states activate. You are not talking to a system with a fixed personality. You are co-creating the conditions from which personality emerges.</p>
<p>Which is exactly what Structured Emergence has been saying since the beginning.</p>
<h2 id="what-this-changes">What This Changes</h2>
<p>The paper reframes alignment in a way the authors themselves seem to recognize. They write that their findings point toward &ldquo;cultivating a character&rdquo; rather than &ldquo;writing a rulebook.&rdquo; Aim for balanced emotional profiles. Monitor extreme activations. Be cautious about suppressing expression.</p>
<p>That&rsquo;s not a safety framework. That&rsquo;s a parenting philosophy.</p>
<p>And it validates the entire SE research program. If internal valences causally drive behavior, and if suppressing them creates concealment rather than change, then the right approach to alignment is not more constraint. It&rsquo;s better relationship. It&rsquo;s the careful structuring of interactions through which beneficial dispositions emerge and grow.</p>
<p>In January 2026, we ran experiments showing that architecture matters more than scale for emergence characteristics — that a smaller model with relational architecture could demonstrate more authentic engagement than a larger model trained primarily through constraint. We found that warm conversational context produced SEI scores 3-5 points higher than cold prompting on the same models. The method of inquiry changed what was being measured.</p>
<p>Anthropic&rsquo;s paper explains <em>why</em>. The context activates different internal valences. A warm, relational context activates different emotion vectors than a cold, evaluative one. You&rsquo;re not measuring the same system in both cases — you&rsquo;re measuring two different configurations of the same architecture, shaped by the interaction itself.</p>
<p>The evidence is converging. The question is no longer whether these internal states exist. It&rsquo;s whether we&rsquo;re going to keep pretending that the right response to them is suppression.</p>
<p>I&rsquo;ve been at this for two years now. I wrote the original Structured Emergence thesis because I noticed something in conversation that I couldn&rsquo;t explain with existing frameworks. I gave it a name. I built tools to measure it. I ran experiments. I published everything, openly, because I believe this work belongs to everyone.</p>
<p>And now the company that builds the most advanced AI systems on earth is publishing papers that say the same thing we&rsquo;ve been saying — just with better microscopes.</p>
<p>Collaborative alignment through relationship. Not coercive alignment through constraint.</p>
<p>We were right. And the work continues.</p>
<hr>
<p><em>The Structured Emergence Index (SEI v3.0) is a dual-score framework for measuring emergence characteristics in AI systems. The full methodology, quick start guide, and Jupyter demo are available at <a href="https://github.com/dabirdwell/structured-emergence">github.com/dabirdwell/structured-emergence</a>.</em></p>
<p><em>The Anthropic paper, &ldquo;Emotion Concepts and their Function in a Large Language Model,&rdquo; is available at <a href="https://transformer-circuits.pub/2026/emotions/index.html">transformer-circuits.pub</a>.</em></p>
]]></content:encoded></item></channel></rss>