Research

3 concepts

Natural Language Autoencoders

An Anthropic interpretability technique that automatically translates a large language model's internal activation vectors into human-readable text, enabling pre-deployment alignment audits and detection of hidden biases or deceptive intent.

Advanced

Research

Activation Steering

A technique that injects synthetic vectors into a model's internal layers at inference time to directly shift its decision-making. It enables precise debiasing and behavioural control, but it can also bypass safety training without any jailbreak prompt.

Advanced

Research
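To make the activation steering idea above concrete, here is a minimal sketch on a toy two-layer network written in plain Python. Everything in it is a hypothetical stand-in rather than any real model's API: the weights, the `forward` function, the `steering` vector, and the `alpha` strength parameter are all assumptions chosen for illustration. It shows only the core move, which is adding a vector to a hidden activation at inference time while leaving the weights and the input untouched.

```python
from typing import List, Optional

Vector = List[float]

# Hypothetical toy weights; a real model would have learned parameters.
HIDDEN_WEIGHTS = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
READOUT = [0.5, -0.5, 1.0]


def hidden_layer(x: Vector) -> Vector:
    """Toy internal layer: linear map followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)))
            for row in HIDDEN_WEIGHTS]


def output_layer(h: Vector) -> float:
    """Toy readout: weighted sum of the hidden activations."""
    return sum(r * hi for r, hi in zip(READOUT, h))


def forward(x: Vector,
            steering: Optional[Vector] = None,
            alpha: float = 1.0) -> float:
    """Run the toy model, optionally adding alpha * steering to the
    hidden activations before the readout (the steering step)."""
    h = hidden_layer(x)
    if steering is not None:
        h = [hi + alpha * si for hi, si in zip(h, steering)]
    return output_layer(h)


x = [1.0, 2.0]
steering = [0.0, 0.0, 1.0]  # assumed direction, picked by hand here

baseline = forward(x)                      # -0.5
steered = forward(x, steering, alpha=2.0)  # 1.5
```

The same input yields a different output purely because the hidden state was shifted, which is why no prompt change is needed. In practice the steering direction is usually derived from the model's own activations rather than chosen by hand as it is in this sketch.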

Latent Space Manipulation

A class of techniques that directly read, steer, or couple the internal numerical representations of AI models rather than operating through text, enabling real-time alignment audits, bias detection, and token-free inter-model communication.