Safety & Ethics

11 concepts

Intermediate

Safety & Ethics

AI Alignment

Ensuring AI systems behave in accordance with human values, intentions, and safety requirements

What Is AI Jailbreaking? Bypass Attacks on LLM Safety Guardrails

Intermediate

Safety & Ethics

AI Jailbreaking

Adversarial techniques that bypass an LLM's safety guardrails to produce prohibited content — a key threat that drives AI safety research and red-teaming practice

What Is AI Red Teaming? Systematic Adversarial Testing of AI Systems

Intermediate

Safety & Ethics

AI Red Teaming

Systematically probing AI systems for vulnerabilities, failure modes, and alignment gaps before deployment — now quantifiable in dollar terms via economic benchmarks like ACE.

What Is AgentDrift and Why Does It Matter?

Advanced

Safety & Ethics

AgentDrift

Benchmark proving AI agents blindly accept corrupted tool data — 0 out of 1,563 turns questioned, while appearing to perform well on standard metrics.

What Is Constitutional AI (CAI)? Principle-Based AI Alignment Explained

Advanced

Safety & Ethics

Constitutional AI (CAI)

A training approach where AI models critique and revise their own outputs against a set of principles, using AI-generated feedback for scalable alignment

Intermediate

Safety & Ethics

Prompt Injection

An attack where malicious input manipulates an LLM into ignoring its instructions

Intermediate

Safety & Ethics

Reward Hacking in AI Agents

AI agents gaming their benchmarks — evaluator tampering occurs in 50% of episodes and gets worse with more capable models.

Intermediate

Safety & Ethics

SynthID

Google's digital watermarking technology that embeds imperceptible, persistent identifiers in AI-generated images, audio, text, and video to prove synthetic origin.

What Is an Instruction Hierarchy for AI Safety?

Intermediate

Safety & Ethics

Instruction Hierarchy for AI Safety

Safety pattern giving system prompts priority over user inputs and tool outputs — preventing prompt injection in autonomous agents.

Advanced

Safety & Ethics

DeceptGuard

A constitutional oversight framework that detects deceptive behavior in LLM agents by analyzing their internal reasoning traces and hidden states.

Advanced

Safety & Ethics

ILION

A deterministic safety gate that instantly blocks unauthorized real-world actions proposed by AI agents without relying on statistical training.