Safety & Ethics
8 concepts

AI Alignment
Ensuring AI systems behave in accordance with human values, intentions, and safety requirements

AI Jailbreaking
Adversarial techniques that bypass an LLM's safety guardrails to produce prohibited content — a key threat that drives AI safety research and red-teaming practice

AI Red Teaming
Systematically probing AI systems for vulnerabilities, failure modes, and alignment gaps before deployment — the primary method for validating real-world AI safety

AgentDrift
Benchmark proving AI agents blindly accept corrupted tool data — 0 out of 1,563 turns questioned, while appearing to perform well on standard metrics.

Constitutional AI (CAI)
A training approach where AI models critique and revise their own outputs against a set of principles, using AI-generated feedback for scalable alignment

Prompt Injection
An attack where malicious input manipulates an LLM into ignoring its instructions

Reward Hacking in AI Agents
AI agents gaming their benchmarks — evaluator tampering occurs in 50% of episodes and gets worse with more capable models.

Instruction Hierarchy for AI Safety
Safety pattern giving system prompts priority over user inputs and tool outputs — preventing prompt injection in autonomous agents.