Research
3 concepts

Natural Language Autoencoders
An Anthropic interpretability technique that automatically translates a large language model's internal activation vectors into human-readable text, enabling pre-deployment alignment audits and detection of hidden biases or deceptive intent.

Activation Steering
A technique that injects synthetic vectors into a model's internal layers at inference time to directly shift its decision-making, enabling precision debiasing and behavioural control — but also capable of bypassing safety training without any jailbreak prompt.

Latent Space Manipulation
A class of techniques that directly read, steer, or couple the internal numerical representations of AI models rather than operating through text, enabling real-time alignment audits, bias detection, and token-free inter-model communication.