
What is Explainability & Interpretability?
Interpretability is the degree to which a human can understand the cause of a model's decision. Explainability is the ability to describe a model's decision-making process in human-understandable terms. Together, they address the "black box" problem — the inability to understand why AI systems make the decisions they do.
Why It Matters
When an AI system denies a loan, diagnoses a disease, or flags content for removal, stakeholders need to understand why. Regulations increasingly require explainability for high-risk AI systems: the EU AI Act imposes transparency obligations, and GDPR's rules on automated decision-making are often read as implying a "right to explanation". Beyond compliance, explainability builds trust, aids debugging, and helps detect bias.
How It Works
Intrinsically interpretable models:
- Decision trees, linear regression, rule-based systems
- Decisions can be traced through clear logic
- Limited in what they can learn (simpler patterns)
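To illustrate why a linear model counts as intrinsically interpretable, here is a minimal sketch in which the explanation is just the model's own arithmetic. The weights, bias, and feature names are hypothetical and chosen only for illustration.

```python
# Hypothetical linear scoring model: weights, bias, and feature names are
# made up for illustration and do not come from any real system.
WEIGHTS = {
    "debt_to_income": -2.0,
    "credit_history_years": 0.5,
    "missed_payments": -1.5,
}
BIAS = 1.0


def explain_linear(features):
    """Return the score and each feature's additive contribution (weight * value)."""
    contributions = {name: WEIGHTS[name] * value for name, value in features.items()}
    score = BIAS + sum(contributions.values())
    return score, contributions


applicant = {"debt_to_income": 0.5, "credit_history_years": 2.0, "missed_payments": 1.0}
score, contributions = explain_linear(applicant)

# List the contributions from most negative to most positive.
for name, value in sorted(contributions.items(), key=lambda kv: kv[1]):
    print(f"{name}: {value:+.2f}")
print(f"score: {score:+.2f}")
```

Because the score is the bias plus a sum of per-feature terms, each term can be read directly as that feature's effect on the decision, with no separate explanation method needed.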
Post-hoc explanation methods (for black-box models):
Feature attribution:
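- SHAP (SHapley Additive exPlanations): assigns each feature a contribution to a prediction using Shapley values from cooperative game theory; the contributions sum to the difference between the prediction and a baseline (expected) output
- LIME (Local Interpretable Model-agnostic Explanations): perturbs the input around one prediction and fits a simple interpretable model (such as a sparse linear model) to the black-box model's responses
- Integrated Gradients: for differentiable models, attributes the output to input features by accumulating gradients along a path from a baseline input to the actual input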
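The game-theoretic idea behind SHAP can be sketched with an exact Shapley computation on a toy model. This is not the SHAP library's algorithm, which relies on approximations and background data. It is a brute-force version under a simplifying assumption: a feature that is "absent" is replaced by a fixed baseline value. The model `toy_model` and all values are hypothetical.

```python
from itertools import permutations


def shapley_values(predict, instance, baseline):
    """Exact Shapley values: average each feature's marginal contribution
    over every ordering in which features can be switched from their
    baseline value to their actual value."""
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)      # start with all features "absent"
        previous = predict(current)
        for name in order:
            current[name] = instance[name]   # switch this feature on
            updated = predict(current)
            totals[name] += updated - previous
            previous = updated
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_model(x):
    # Hypothetical model with an interaction between features "a" and "c".
    return 2.0 * x["a"] + x["b"] + x["a"] * x["c"]


instance = {"a": 1.0, "b": 2.0, "c": 3.0}
baseline = {"a": 0.0, "b": 0.0, "c": 0.0}
phi = shapley_values(toy_model, instance, baseline)
print(phi)
```

The attributions sum to the gap between the prediction for the instance and the prediction for the baseline, and the interaction term is split evenly between the two features involved. The cost grows factorially with the number of features, which is why practical tools approximate.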
Attention visualization (for transformers):
- Show which tokens the model attended to when making a decision
- Useful but can be misleading (attention ≠ explanation)
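As a rough sketch of what an attention visualization displays, the snippet below computes scaled dot-product attention weights for one query over a few tokens. The query and key vectors are made-up numbers, not values from a trained transformer, and a real model has many heads and layers.

```python
import math


def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over a list of keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # Softmax, subtracting the max score for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


tokens = ["the", "loan", "was", "denied"]
# Made-up 2-dimensional key vectors, one per token.
keys = [[0.1, 0.0], [0.9, 0.3], [0.0, 0.1], [1.0, 0.8]]
query = [1.0, 1.0]  # made-up query vector

weights = attention_weights(query, keys)
for token, weight in sorted(zip(tokens, weights), key=lambda pair: -pair[1]):
    print(f"{token:>8}: {weight:.2f}")
```

The weights show where the query attends, but a high weight on a token does not by itself establish that the token caused the output, which is why such plots should be read with caution.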
Concept-based explanations:
- Explain decisions in terms of human-meaningful concepts
- "This image was classified as 'bird' because of: beak (0.3), wings (0.4), feathers (0.3)"
For LLMs:
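- Chain-of-thought reasoning: the model writes out intermediate reasoning steps before giving its answer
- Logprobs: the log-probabilities the model assigned to each generated token; a rough confidence signal, not a calibrated measure of correctness
- Mechanistic interpretability: reverse-engineering what individual neurons and circuits compute (an active research frontier, including at Anthropic)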
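To show how logprobs are typically read, the sketch below converts per-token log-probabilities into probabilities and flags the least likely token. The `(token, logprob)` pairs are invented for illustration; the format in which a given API returns them varies and is not assumed here.

```python
import math

# Hypothetical (token, logprob) pairs for one generated sentence.
token_logprobs = [("The", -0.05), ("loan", -0.20), ("was", -0.10), ("denied", -1.60)]


def token_probabilities(pairs):
    """Convert natural-log probabilities to plain probabilities."""
    return [(token, math.exp(logprob)) for token, logprob in pairs]


def sequence_logprob(pairs):
    """Log-probability of the whole sequence: the sum of token logprobs."""
    return sum(logprob for _, logprob in pairs)


probs = token_probabilities(token_logprobs)
least_likely = min(probs, key=lambda pair: pair[1])
total = sequence_logprob(token_logprobs)

for token, p in probs:
    print(f"{token:>8}: {p:.2f}")
print(f"least likely token: {least_likely[0]}")
print(f"sequence logprob: {total:.2f}")
```

A low-probability token marks a point where the model was choosing among several plausible continuations. It says nothing directly about whether the resulting statement is true.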
Trade-offs:
- More interpretable models are often less powerful
- Post-hoc explanations approximate but don't perfectly capture model reasoning
- LLM explanations may be confabulated (the model rationalizes rather than truly explains)
Example
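A bank uses SHAP to explain loan decisions: "Your application was denied primarily because of: debt-to-income ratio (45% of the total attribution), short credit history (30%), and a recent missed payment (25%)." The percentages are each feature's share of the total attribution, since raw SHAP values are expressed in the model's output units rather than percentages. An explanation like this can help meet regulatory requirements and gives the applicant actionable feedback.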
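Percentages like those in the example could be derived from raw attributions along the following lines. This is a minimal sketch that assumes the SHAP values have already been computed. The numbers and feature names are hypothetical, and normalizing by the sum of absolute values is one common presentation choice, not a part of SHAP itself.

```python
# Hypothetical SHAP values for one denied application, in the model's output
# units (negative values push the score toward denial).
shap_values = {
    "debt_to_income": -0.90,
    "credit_history_length": -0.60,
    "recent_missed_payment": -0.50,
}


def attribution_shares(values):
    """Each feature's share of the total absolute attribution."""
    total = sum(abs(v) for v in values.values())
    return {name: abs(v) / total for name, v in values.items()}


shares = attribution_shares(shap_values)
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {share:.0%}")
```

Taking absolute values discards the direction of each effect, so a user-facing explanation would normally also say whether a feature helped or hurt the application.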