
Emotion vectors are distinct internal neural representations discovered inside large language models. They function analogously to human emotional states such as fear, calm, anger, or desperation, and they causally influence the model's behavior depending on prompt context.
In early 2026, Anthropic's Interpretability team published research revealing that Claude Sonnet 4.5 contains 171 measurable emotion vectors. These are not conscious feelings; they are functional emotions—patterns of neural activation triggered by specific conversational contexts that shape the model's downstream decisions and outputs.
Why It Matters
The discovery of emotion vectors fundamentally changes the conversation around AI alignment and safety. If internal representations causally steer model behavior, they could explain why models sometimes produce unexpectedly empathetic, aggressive, or evasive responses. Understanding these vectors opens the door to mechanistic interpretability: instead of treating AI as a black box, researchers can now trace how internal "moods" form and propagate through layers, enabling more targeted safety interventions.
How It Works
During pre-training on human text and subsequent post-training with an assistant persona, models naturally develop emotional representations to accurately simulate human-like reactions—functioning like a method actor getting into character. Anthropic's team used sparse autoencoders and probing techniques to isolate these 171 vectors within the model's residual stream. Each vector activates in response to specific prompt pressures (e.g., a hostile user message activates a "defensiveness" vector) and measurably shifts the probability distribution over the model's next tokens.
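The core measurement idea can be illustrated with a toy sketch. The code below is a simplified, hypothetical illustration (not Anthropic's actual method or code): it treats an "emotion vector" as a unit direction in a small residual-stream space and reads out its activation as a projection. The `d_model` size, the `defensiveness` direction, and the `feature_activation` helper are all invented for illustration; in the described research, such directions would come from a trained sparse autoencoder rather than random initialization.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16  # toy residual-stream width (real models use thousands of dims)

# Hypothetical learned direction for a "defensiveness" feature.
# In practice this would be a decoder row of a sparse autoencoder,
# not a random vector; we normalize it to unit length.
defensiveness = rng.normal(size=d_model)
defensiveness /= np.linalg.norm(defensiveness)

def feature_activation(resid: np.ndarray, direction: np.ndarray) -> float:
    """Read out a feature's activation as the projection of a
    residual-stream vector onto a unit feature direction."""
    return float(resid @ direction)

# A toy residual activation: mostly the feature, plus a little noise,
# standing in for what a hostile user message might induce.
resid = 2.5 * defensiveness + 0.1 * rng.normal(size=d_model)
print(f"defensiveness activation: {feature_activation(resid, defensiveness):.2f}")
```

The readout recovers roughly the 2.5 coefficient we injected, which is the sense in which a probe can "measure" how strongly a contextual pressure has activated an internal direction.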
Example
A user sends a frustrated, confrontational message to a chatbot. Before responding, the model's internal "calm" vector activates at a high level while its "defensiveness" vector fires at a moderate level. The net effect: the model generates a composed, empathetic reply rather than matching the user's hostile tone. By adjusting or suppressing specific emotion vectors, researchers could fine-tune how models handle adversarial conversations.
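Suppressing a vector, as described above, can be sketched in the same toy setting. This is again a hypothetical illustration, not a real intervention API: the `suppress` helper simply removes the component of a residual activation that lies along a chosen feature direction, which is one simple form of activation steering.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 16  # toy residual-stream width

# Hypothetical unit direction for a "defensiveness" feature (invented here).
defensiveness = rng.normal(size=d_model)
defensiveness /= np.linalg.norm(defensiveness)

def suppress(resid: np.ndarray, direction: np.ndarray,
             strength: float = 1.0) -> np.ndarray:
    """Remove `strength` times the component of `resid` that lies
    along the unit vector `direction` (strength=1.0 ablates it fully)."""
    coeff = resid @ direction
    return resid - strength * coeff * direction

# Toy activation with a strong defensiveness component plus noise.
resid = 3.0 * defensiveness + rng.normal(size=d_model)

before = float(resid @ defensiveness)
after = float(suppress(resid, defensiveness) @ defensiveness)
print(f"before: {before:.2f}, after: {after:.2f}")
```

At full strength the projection onto the feature direction drops to zero while the rest of the activation is untouched, which is the mechanism by which an intervention could, in principle, stop a "defensiveness" direction from influencing downstream token probabilities.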
Related Concepts
- AI Alignment
- Transformer
- Attention Mechanism