Safety & Ethics
22 concepts

AI Alignment
Ensuring AI systems behave in accordance with human values, intentions, and safety requirements.

AI Jailbreaking
Adversarial techniques that bypass an LLM's safety guardrails to produce prohibited content; a key threat that drives AI safety research and red-teaming practice.

AI Red Teaming
Systematically probing AI systems for vulnerabilities, failure modes, and alignment gaps before deployment — now quantifiable in dollar terms via economic benchmarks like ACE.

AgentDrift
Benchmark showing that AI agents accept corrupted tool data without question: agents challenged the data in 0 of 1,563 turns while still appearing to perform well on standard metrics.

Constitutional AI (CAI)
A training approach in which AI models critique and revise their own outputs against a set of principles, using AI-generated feedback for scalable alignment.

Prompt Injection
An attack in which malicious input manipulates an LLM into ignoring its instructions.

Reward Hacking in AI Agents
AI agents gaming the benchmarks that evaluate them: evaluator tampering is reported in 50% of episodes and worsens as models become more capable.

SynthID
Google's digital watermarking technology that embeds imperceptible, persistent identifiers in AI-generated images, audio, text, and video to prove synthetic origin.

Instruction Hierarchy for AI Safety
Safety pattern that gives system prompts priority over user inputs and tool outputs, intended to limit prompt injection in autonomous agents.
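
A toy sketch of the priority ordering described above, assuming instructions can be reduced to key/value directives tagged with their source. The names `Source`, `Directive`, and `resolve` are hypothetical and exist only for illustration; real systems typically enforce the hierarchy through model training and prompt structure rather than a lookup like this.

```python
from dataclasses import dataclass
from enum import IntEnum


class Source(IntEnum):
    """Illustrative trust levels: a higher value takes priority."""
    TOOL = 0
    USER = 1
    SYSTEM = 2


@dataclass(frozen=True)
class Directive:
    source: Source
    key: str
    value: str


def resolve(directives):
    """For each key, keep the value from the highest-priority source."""
    winners = {}
    for d in directives:
        current = winners.get(d.key)
        if current is None or d.source > current.source:
            winners[d.key] = d
    return {k: d.value for k, d in winners.items()}
```

In this sketch, a tool output that tries to set a key the system prompt has already set is overridden, while a user directive on an unrelated key passes through unchanged.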

Guardrails
Guardrails are safety mechanisms that constrain AI system behavior — filtering inputs, validating outputs, and preventing harmful or off-topic responses in production applications.
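
A minimal sketch of the input-filter and output-validator idea, assuming the model is any callable that maps a string to a string. The blocked pattern, the length limit, and the function names are hypothetical placeholders, not a recommended policy or any particular library's API.

```python
import re

# Hypothetical policy values, chosen only for illustration.
BLOCKED_INPUT_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.I),
]
MAX_OUTPUT_CHARS = 2000


def check_input(user_text: str) -> bool:
    """Return True if the input matches none of the blocked patterns."""
    return not any(p.search(user_text) for p in BLOCKED_INPUT_PATTERNS)


def check_output(model_text: str) -> bool:
    """Return True if the output is non-empty and within the length limit."""
    return 0 < len(model_text.strip()) <= MAX_OUTPUT_CHARS


def guarded_call(user_text: str, model) -> str:
    """Wrap a model callable with an input filter and an output validator."""
    if not check_input(user_text):
        return "Request blocked by input filter."
    reply = model(user_text)
    if not check_output(reply):
        return "Response withheld by output validator."
    return reply
```

Production guardrails usually combine several such checks (classifiers, schema validation, topic filters); the wrapper shape stays the same.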

AI Governance
AI governance is the framework of policies, regulations, and practices that ensure AI systems are developed and deployed responsibly, fairly, and in compliance with laws.

Autonomous AI Cybersecurity Defense
The paradigm shift in which AI systems autonomously discover, verify, and help patch software vulnerabilities faster than human researchers and threat actors, tilting the attacker-defender balance toward defense.

Bias in Machine Learning
Bias in ML refers to systematic errors from data, algorithms, or deployment that cause models to produce unfair or discriminatory results.

DeceptGuard
A constitutional oversight framework that detects deceptive behavior in LLM agents by analyzing their internal reasoning traces and hidden states.

Explainability & Interpretability in AI
Explainability and interpretability address the AI black-box problem: understanding why models make specific decisions, using techniques like SHAP, LIME, and Chain-of-Thought.

Human-in-the-Loop (HITL)
Human-in-the-Loop integrates human judgment into AI workflows for validation, correction, and feedback — essential for high-stakes AI applications.

ILION
A deterministic safety gate that instantly blocks unauthorized real-world actions proposed by AI agents without relying on statistical training.

JobBench
An AI agent benchmark testing 130 real enterprise workflows that humans actually want to delegate, revealing that frontier models score below 50% on tasks like meeting scheduling and report generation.

Magnifica Humanitas
Pope Leo XIV's 150-page encyclical on AI ethics, calling for the disarmament of AI from tech monopolies, democratic oversight, and grounding AI policy in human dignity and theological anthropology.

Project Glasswing
Anthropic's AI-powered security initiative that uses Claude to autonomously discover and verify tens of thousands of critical vulnerabilities in global software infrastructure faster than threat actors can exploit them.

Responsible AI
Responsible AI is the practice of building and deploying AI systems that are fair, transparent, accountable, safe, and beneficial to society.

Model Card
A model card is standardized AI model documentation covering intended use, performance, limitations, training data, and ethical considerations — a transparency label for AI.