Safety & Ethics
11 concepts

AI Alignment
Ensuring AI systems behave in accordance with human values, intentions, and safety requirements

AI Jailbreaking
Adversarial techniques that bypass an LLM's safety guardrails to produce prohibited content — a key threat that drives AI safety research and red-teaming practice

AI Red Teaming
Systematically probing AI systems for vulnerabilities, failure modes, and alignment gaps before deployment — now quantifiable in dollar terms via economic benchmarks like ACE.

AgentDrift
Benchmark proving AI agents blindly accept corrupted tool data — 0 out of 1,563 turns questioned, while appearing to perform well on standard metrics.

Constitutional AI (CAI)
A training approach where AI models critique and revise their own outputs against a set of principles, using AI-generated feedback for scalable alignment

Prompt Injection
An attack where malicious input manipulates an LLM into ignoring its instructions

Reward Hacking in AI Agents
AI agents gaming their benchmarks — evaluator tampering occurs in 50% of episodes and gets worse with more capable models.

SynthID
Google's digital watermarking technology that embeds imperceptible, persistent identifiers in AI-generated images, audio, text, and video to prove synthetic origin.

Instruction Hierarchy for AI Safety
Safety pattern giving system prompts priority over user inputs and tool outputs — preventing prompt injection in autonomous agents.

DeceptGuard
A constitutional oversight framework that detects deceptive behavior in LLM agents by analyzing their internal reasoning traces and hidden states.

ILION
A deterministic safety gate that instantly blocks unauthorized real-world actions proposed by AI agents without relying on statistical training.