What is Explainability in AI? | AI Dictionary

What is Explainability & Interpretability?

Interpretability is the degree to which a human can understand the cause of a model's decision. Explainability is the ability to describe a model's decision-making process in human-understandable terms. Together, they address the "black box" problem — the inability to understand why AI systems make the decisions they do.

Why It Matters

When an AI denies a loan, diagnoses a disease, or flags content for removal, stakeholders need to understand why. Regulations such as the EU AI Act, along with GDPR provisions on automated decision-making that are often described as a "right to explanation", increasingly require transparency for high-risk AI systems. Beyond compliance, explainability builds trust, aids debugging, and helps detect bias.

How It Works

Intrinsically interpretable models:

Decision trees, linear regression, rule-based systems
Decisions can be traced through clear logic
Limited in what they can learn (simpler patterns)
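
To make "decisions can be traced through clear logic" concrete, here is a minimal sketch of a linear scoring model whose prediction decomposes exactly into per-feature contributions. The feature names, weights, and bias are hypothetical values chosen for illustration, not taken from any real credit model.

```python
# Hypothetical weights for a linear credit-scoring model (illustrative only).
WEIGHTS = {"income_k": 0.8, "debt_ratio": -50.0, "years_of_credit": 2.0}
BIAS = 10.0


def predict(features):
    """Linear score: bias plus the weighted sum of feature values."""
    return BIAS + sum(WEIGHTS[name] * value for name, value in features.items())


def explain(features):
    """Each feature's additive contribution to the score (weight * value)."""
    return {name: WEIGHTS[name] * value for name, value in features.items()}


# Hypothetical applicant.
applicant = {"income_k": 50.0, "debt_ratio": 0.5, "years_of_credit": 3.0}
score = predict(applicant)
contributions = explain(applicant)

# List contributions from most to least influential.
for name, value in sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name}: {value:+.1f}")
print(f"score = {score:.1f}")
```

Because the model is a plain weighted sum, the contributions plus the bias add up to the score exactly. No separate explanation method is needed, which is what makes such a model intrinsically interpretable.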

Post-hoc explanation methods (for black-box models):

Feature attribution:

SHAP (SHapley Additive exPlanations): attributes a prediction to individual features using Shapley values from cooperative game theory
LIME (Local Interpretable Model-agnostic Explanations): fits a simple interpretable model to the black-box model's behavior around a single prediction
Integrated Gradients: attributes the output to input features by accumulating gradients along a path from a baseline input to the actual input
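
The game-theoretic idea behind SHAP can be sketched by computing exact Shapley values for a tiny model. This is a brute-force illustration, not the SHAP library's implementation, which relies on approximations and model-specific algorithms to stay tractable. The function `toy_model`, the feature names, and the all-zero baseline are assumptions made for the example.

```python
from itertools import permutations


def shapley_values(model, instance, baseline):
    """Exact Shapley values: average each feature's marginal contribution
    over every possible ordering in which features are switched from their
    baseline value to their actual value.

    Cost grows factorially with the number of features, so this only works
    for a handful of features.
    """
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        previous = model(current)
        for name in order:
            current[name] = instance[name]
            value = model(current)
            totals[name] += value - previous
            previous = value
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_model(x):
    # Hypothetical scoring function with an interaction between a and c.
    return 2.0 * x["a"] + 1.0 * x["b"] + 4.0 * x["a"] * x["c"]


instance = {"a": 1.0, "b": 1.0, "c": 1.0}
baseline = {"a": 0.0, "b": 0.0, "c": 0.0}
phi = shapley_values(toy_model, instance, baseline)
print(phi)
```

The attributions sum to the difference between the model's output on the instance and on the baseline, and the credit for the `a * c` interaction is split evenly between the two features involved.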

Attention visualization (for transformers):

Show which tokens the model attended to when making a decision
Useful but can be misleading: a high attention weight does not necessarily mean the token drove the output (attention ≠ explanation)
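
As a rough illustration of what gets visualized, the sketch below computes scaled dot-product attention weights for one query over a few tokens and prints them as a text bar chart. The tokens and the 2-dimensional vectors are made up for the example. In practice the weights would be read from a trained model, which typically has many heads and layers.

```python
import math


def attention_weights(query, keys):
    """Softmax of scaled dot products between one query vector and each key vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


tokens = ["the", "loan", "was", "denied"]
# Hypothetical 2-dimensional key vectors, one per token.
keys = [[0.1, 0.0], [1.0, 0.5], [0.0, 0.1], [0.9, 1.2]]
query = [1.0, 1.0]

weights = attention_weights(query, keys)
for token, weight in zip(tokens, weights):
    print(f"{token:>7} {'#' * round(weight * 40)} {weight:.2f}")
```

The bars show where the query "looks", but they say nothing about how the attended values were then used, which is one reason attention maps should be read with caution.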

Concept-based explanations:

Explain decisions in terms of human-meaningful concepts
"This image was classified as 'bird' because of: beak (0.3), wings (0.4), feathers (0.3)"

For LLMs:

Chain-of-thought reasoning: the model writes out intermediate reasoning steps before its answer
Logprobs: per-token log-probabilities, a rough signal of model confidence rather than an explanation
Mechanistic interpretability: reverse-engineering what individual neurons and circuits compute (an active research area at Anthropic and other labs)
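
Logprobs are the easiest of these signals to work with directly. The sketch below assumes a list of (token, logprob) pairs of the kind an LLM API might return. The values are invented for illustration, and the helper names are hypothetical rather than part of any particular API.

```python
import math

# Hypothetical (token, logprob) pairs; real values would come from an LLM API.
token_logprobs = [("The", -0.05), ("loan", -0.20), ("is", -0.10), ("risky", -1.60)]


def token_probabilities(pairs):
    """Convert each log-probability back to a probability in (0, 1]."""
    return [(token, math.exp(logprob)) for token, logprob in pairs]


def sequence_logprob(pairs):
    """Log-probability of the whole sequence: the sum of per-token logprobs."""
    return sum(logprob for _, logprob in pairs)


def least_confident(pairs):
    """The token the model assigned the lowest probability to."""
    return min(pairs, key=lambda pair: pair[1])


for token, prob in token_probabilities(token_logprobs):
    print(f"{token:>6}: {prob:.2f}")
print("sequence logprob:", round(sequence_logprob(token_logprobs), 2))
print("least confident token:", least_confident(token_logprobs)[0])
```

A low-probability token flags where the model was uncertain, but it does not say why the model chose it, so logprobs complement rather than replace the other methods.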

Trade-offs:

More interpretable models are often less powerful on complex tasks
Post-hoc explanations approximate but don't perfectly capture model reasoning
LLM explanations may be confabulated (the model rationalizes rather than truly explains)

Example

A bank uses SHAP to explain loan decisions: "Your application was denied primarily because of: debt-to-income ratio (45% of the total attribution), short credit history (30%), and a recent missed payment (25%)." An explanation like this can help the bank meet regulatory requirements and gives the applicant actionable feedback.
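
Percentages like those in the example are not produced by SHAP directly. One plausible way to derive them, sketched below, is to normalize the absolute attribution values so they sum to 100%. The raw attribution values and feature names here are hypothetical, and a real system would obtain them from an attribution method run on the bank's actual model.

```python
# Hypothetical per-feature attributions for one denied application.
# Negative values push the score toward denial.
attributions = {
    "debt_to_income_ratio": -0.9,
    "short_credit_history": -0.6,
    "recent_missed_payment": -0.5,
}


def percentage_shares(values):
    """Each feature's share of the total absolute attribution, in percent."""
    total = sum(abs(v) for v in values.values())
    return {name: 100.0 * abs(v) / total for name, v in values.items()}


shares = percentage_shares(attributions)
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {share:.0f}%")
```

Normalizing this way discards the sign of each attribution, so a customer-facing explanation should also state whether a factor helped or hurt the application.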
