Models & Architecture
9 concepts

LoRA (Low-Rank Adaptation)
An efficient fine-tuning method that freezes the pretrained weights and trains only small low-rank adapter matrices added to them, instead of updating the full model
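A minimal sketch of the LoRA idea, assuming a single linear layer with hypothetical sizes: the frozen weight W is augmented with a trainable low-rank product B·A, scaled by alpha/r, and B is zero-initialized so the adapter starts as a no-op.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, r = 16, 16, 4             # hypothetical layer sizes and LoRA rank
alpha = 8                              # LoRA scaling hyperparameter

W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight (never updated)
A = rng.normal(size=(r, d_in)) * 0.01  # trainable low-rank factor
B = np.zeros((d_out, r))               # trainable, zero-init: adapter starts as identity

def lora_forward(x):
    # Adapted layer: W x + (alpha / r) * B A x — only A and B are trained
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# Before any training B is zero, so the adapted layer equals the original
assert np.allclose(lora_forward(x), W @ x)
```

The adapter adds only r·(d_in + d_out) trainable parameters per layer, far fewer than the d_in·d_out in W.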

Model Distillation
Training a smaller 'student' model to replicate a larger 'teacher' model's capabilities at a fraction of the cost and latency
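The soft-target loss used in classic distillation can be sketched as follows — a KL divergence between the teacher's and student's temperature-softened output distributions (the logits and temperature here are illustrative values, not from any real model):

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, T=2.0):
    # Soft-target loss: KL(teacher_T || student_T), scaled by T^2 so
    # gradients stay comparable across temperatures
    p = softmax(np.asarray(teacher_logits) / T)
    q = softmax(np.asarray(student_logits) / T)
    return T * T * np.sum(p * (np.log(p) - np.log(q)))

# A student matching the teacher incurs ~zero loss; a diverging one does not
assert distillation_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1]) < 1e-9
assert distillation_loss([2.0, 1.0, 0.1], [0.1, 1.0, 2.0]) > 0
```

The temperature softens both distributions so the student also learns from the teacher's relative probabilities over wrong answers, not just the argmax.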

Perplexity in NLP
A common intrinsic metric for language model quality — the exponentiated average negative log-likelihood per token, measuring how well a model predicts held-out text; lower values indicate better prediction
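The definition above can be computed directly, assuming you have per-token log-probabilities from a model:

```python
import math

def perplexity(token_logprobs):
    # exp of the average negative log-likelihood per token
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A model that assigns probability 0.25 to every token has perplexity 4:
# it is "as confused" as a uniform choice among 4 options per token
assert abs(perplexity([math.log(0.25)] * 10) - 4.0) < 1e-9
```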

Quantization
Reducing the precision of model weights (and sometimes activations) from 16/32-bit floats to 8/4-bit integers to shrink memory footprint and speed up inference
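A minimal sketch of symmetric per-tensor int8 quantization — one of the simplest schemes; real deployments typically use per-channel or group-wise variants:

```python
import numpy as np

def quantize_int8(w):
    # Symmetric quantization: map floats to int8 via a single scale factor
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.2, 0.03, 0.9], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Round-trip error is bounded by half the quantization step
assert np.max(np.abs(w - w_hat)) <= scale / 2 + 1e-6
```

Storage drops 4× versus float32, at the cost of the small reconstruction error shown above.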

RAG (Retrieval-Augmented Generation)
A technique that combines LLMs with external knowledge retrieval to improve accuracy and reduce hallucinations
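The retrieve-then-prompt flow can be sketched end to end. Everything here is a toy stand-in: a real system would use embedding similarity instead of word overlap, and would send the prompt to an actual LLM.

```python
# Toy RAG pipeline: score documents against the query, keep the top-k,
# and inject them into the prompt as grounding context.
def score(query, doc):
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q)

def retrieve(query, corpus, k=1):
    return sorted(corpus, key=lambda doc: score(query, doc), reverse=True)[:k]

def build_prompt(query, corpus):
    context = "\n".join(retrieve(query, corpus))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

corpus = [
    "LoRA trains low-rank adapter matrices.",
    "The KV cache stores attention keys and values.",
]
prompt = build_prompt("what does the KV cache store", corpus)
assert "keys and values" in prompt
```

Grounding the answer in retrieved text is what reduces hallucination: the model is asked to answer from the context rather than from its parameters alone.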

RLHF (Reinforcement Learning from Human Feedback)
A training technique that uses human preference ratings to align LLM behavior with human values
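One core ingredient of RLHF is a reward model trained on pairwise human preferences. A minimal sketch of the Bradley–Terry pairwise loss it typically minimizes (the reward values here are illustrative scalars):

```python
import math

def preference_loss(r_chosen, r_rejected):
    # Bradley–Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).
    # Minimizing it pushes the reward model to score the human-preferred
    # response above the rejected one.
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Loss shrinks as the preferred response is ranked further above the rejected one
assert preference_loss(2.0, 0.0) < preference_loss(0.5, 0.0)
assert preference_loss(0.0, 2.0) > preference_loss(2.0, 0.0)
```

The trained reward model then supplies the reward signal that a policy-gradient method (commonly PPO) uses to fine-tune the LLM.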

Transformer
The neural network architecture underlying virtually all modern LLMs, using self-attention to process all tokens in a sequence in parallel

Attention Mechanism
The mathematical mechanism that allows transformers to dynamically focus on the most relevant parts of the input when processing each token
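The core computation is scaled dot-product attention, softmax(QKᵀ/√d_k)V — a sketch with small random matrices standing in for real query/key/value projections:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Each query token takes a weighted average of the value vectors,
    # weighted by (scaled, softmaxed) query-key similarity
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
out, w = scaled_dot_product_attention(Q, K, V)
# Each token's attention weights form a probability distribution over the input
assert np.allclose(w.sum(axis=-1), 1.0)
```

The √d_k scaling keeps the dot products from growing with dimension, which would otherwise saturate the softmax.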

KV Cache
A memory optimization that stores previously computed key and value tensors in transformer attention layers — avoiding redundant computation during autoregressive generation and often speeding it up severalfold, with the gain growing as the sequence lengthens
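A toy single-head decoder step showing why the cache helps: at each step only the new token's key and value are computed and appended, while all earlier entries are reused. The projection matrices are random stand-ins for trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

K_cache, V_cache = [], []

def decode_step(x_t):
    q = Wq @ x_t
    K_cache.append(Wk @ x_t)   # one new key per step — older ones are reused
    V_cache.append(Wv @ x_t)   # likewise for values
    K, V = np.stack(K_cache), np.stack(V_cache)
    s = K @ q / np.sqrt(d)     # attend over the full cached history
    w = np.exp(s - s.max()); w /= w.sum()
    return w @ V

tokens = rng.normal(size=(5, d))
outputs = [decode_step(x) for x in tokens]
assert len(K_cache) == 5       # cache grows by exactly one entry per token
```

Without the cache, step t would recompute keys and values for all t previous tokens, making generation quadratic in sequence length instead of linear per step.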