Machine Learning & AI Interview Study Booklet
From Classical ML to Modern LLMs
In late 2025, I needed a compact, reliable reference to prepare for a concentrated set of AI interviews. The material I relied on was scattered across papers, blog posts, and codebases; this booklet consolidates that body of knowledge into a single, practical resource that preserves mathematical clarity while emphasizing real-world implementation choices.
This guide is designed with a practitioner’s mindset: learn the theory, but also be able to work at the intersection of theory and practice. That means being ready to explain not just why an algorithm works mathematically, but also how it shows up in real systems — which library or framework you used, what profiling revealed, how you measured and traded off latency versus quality, why you chose one distributed strategy over another, and what the production constraints were. Each chapter pairs clean derivations and core intuitions with concrete implementation context so you can move from explanation to reproducible action quickly.
This is not a textbook or a step-by-step tutorial. It is a living synthesis for intermediate-to-senior practitioners: compact derivations, focused implementation notes, interview prompts, and production considerations collected in one place. The aim is to make it straightforward to answer both the analytical question ("Derive the gradient of multi-head attention") and the practical question ("How did you parallelize this across GPUs? What was the memory bottleneck?").
How This Booklet Is Organized
The material is organized to mirror the modern ML/AI development pipeline:
Part I: Mathematical & Classical ML Foundations establishes the statistical and mathematical foundations underlying all modern ML. We cover bias-variance tradeoffs, linear models, exponential families, probability distributions, sampling methods, and classical sequential models. This foundation is essential for understanding why modern architectures work.
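As a pointer to the kind of derivation Part I works through: for squared error with y = f(x) + ε, Var(ε) = σ², and an estimator f̂ trained on a random dataset D, the bias-variance decomposition at a point x is

```latex
\mathbb{E}_{D,\varepsilon}\big[(y - \hat{f}_D(x))^2\big]
  = \underbrace{\big(\mathbb{E}_D[\hat{f}_D(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_D\Big[\big(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)]\big)^2\Big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

The cross terms vanish because ε is independent of D with zero mean, which is the one step interviewers most often ask candidates to justify.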
Part II: Deep Learning Fundamentals bridges classical ML to deep learning, starting with logistic regression as the fundamental building block of neural networks. We then cover the optimization algorithms that make training possible and the initialization strategies that ensure stable convergence.
Part III: Preprocessing – Text to Numbers addresses how raw text becomes numerical representations that models can process. Tokenization strategies (BPE, WordPiece, Unigram) and embedding techniques (Word2Vec, GloVe, contextual embeddings) form the essential preprocessing pipeline for all NLP systems. We discuss both the algorithms and the libraries (HuggingFace Tokenizers, SentencePiece) that implement them.
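To make the BPE idea concrete, here is a toy pure-Python sketch of the merge-learning loop (the function name and corpus are illustrative, not from any library; production systems use HuggingFace Tokenizers or SentencePiece):

```python
from collections import Counter

def bpe_merges(corpus, num_merges):
    """Learn BPE merges from a list of words (toy sketch).

    Each word starts as a tuple of characters; at every step the most
    frequent adjacent symbol pair in the corpus is merged into one
    new symbol, and the learned merge is recorded.
    """
    words = Counter(tuple(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Re-segment every word with the new merge applied.
        merged = {}
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        words = Counter(merged)
    return merges

# On this tiny corpus, "lo" is merged first, then "low".
merges = bpe_merges(["low", "lower", "lowest", "low"], num_merges=2)
```

Real implementations add pre-tokenization, byte-level fallback, and special-token handling, but the greedy pair-merge loop above is the core interviewers expect you to reproduce.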
Part IV: Pre-Training – Modern LLM Architectures dives into the attention mechanism and transformer architectures that revolutionized NLP. We cover vanilla transformers, modern improvements (RoPE, GQA, Flash Attention), alternative architectures (linear attention, SSMs, RWKV), mixture-of-experts, and production models from BERT to GPT to LLaMA and beyond — with attention to implementation tradeoffs and when each architectural choice matters.
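Since scaled dot-product attention anchors everything in Part IV, here is a minimal single-query-set sketch in plain Python (no batching, masking, or multiple heads; the function name and toy inputs are mine):

```python
import math

def scaled_dot_product_attention(Q, K, V):
    """Toy attention: softmax(Q K^T / sqrt(d)) V on lists of row vectors."""
    d = len(Q[0])
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        # Numerically stable softmax over the scores.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Output is the attention-weighted average of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# One query aligned with the first key: output leans toward V's first row.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
attn = scaled_dot_product_attention(Q, K, V)
```

The max-subtraction trick in the softmax is the same numerical-stability idea that Flash Attention carries into its tiled, online computation.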
Part V: Post-Training – Alignment & Optimization covers how pre-trained models are refined for downstream tasks and aligned with human preferences. This includes reinforcement learning techniques (RLHF, PPO, DPO), parameter-efficient fine-tuning methods (LoRA, QLoRA), and evaluation methodologies — paired with practical notes on tooling (TRL, PEFT) and measurement.
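The LoRA idea in Part V reduces to one line of math, y = x(W + (α/r)·AB), which the following self-contained sketch illustrates (shapes and names are my own; real fine-tuning goes through PEFT):

```python
def matmul(X, Y):
    """Plain-Python matrix multiply for small illustrative matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_forward(x, W, A, B, alpha=16, r=1):
    """y = x @ W + (alpha / r) * x @ A @ B.

    W: (d_in, d_out) frozen pretrained weight.
    A: (d_in, r) and B: (r, d_out) are the only trainable parameters;
    B is initialized to zero so training starts from pretrained behavior.
    """
    scale = alpha / r
    base = matmul(x, W)
    delta = matmul(matmul(x, A), B)
    return [[b + scale * d for b, d in zip(brow, drow)]
            for brow, drow in zip(base, delta)]

# With B = 0 the adapter is a no-op, exactly as at LoRA initialization.
x = [[1.0, 2.0]]
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen identity weight for clarity
A = [[0.5], [0.5]]              # rank r = 1 down-projection
B = [[0.0, 0.0]]                # zero-initialized up-projection
y = lora_forward(x, W, A, B, alpha=16, r=1)
```

Because only A and B receive gradients, the trainable parameter count drops from d_in·d_out to r·(d_in + d_out), which is the tradeoff interviewers usually probe.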
Part VI: Production Engineering addresses the practical challenges of deploying models at scale. We cover distributed training strategies, memory optimization techniques, and inference optimization methods — each section includes not just the theory but also which frameworks implement them (DeepSpeed, FSDP, vLLM), what profiling tools reveal, and how to navigate the engineering tradeoffs.
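A back-of-the-envelope calculation Part VI leans on repeatedly is KV-cache sizing for inference; a hedged sketch (the function and the LLaMA-7B-like shapes are illustrative assumptions, not measured numbers):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch,
                   dtype_bytes=2):
    """Rough KV-cache size: K and V tensors (factor of 2) per layer.

    GQA shrinks n_kv_heads relative to the query-head count, which is
    one reason it matters so much for serving memory.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Hypothetical 7B-class shapes: 32 layers, 32 KV heads, head_dim 128,
# 4096-token context, batch 1, fp16 (2 bytes/element).
gib = kv_cache_bytes(32, 32, 128, 4096, 1) / 2**30
```

Under these assumptions the cache alone costs 2 GiB per sequence, which is why systems like vLLM page and share it rather than preallocating per request.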
How to Use This Booklet
Interview Preparation Strategy:
For ML Research Roles: Focus on Parts I, II, and IV. Deep understanding of fundamentals, optimization, and architectural innovations is crucial.
For ML Engineering Roles: Emphasize Parts III, IV, and VI. Practical knowledge of preprocessing, architectures, and production engineering is key.
For Applied Scientist Roles: Cover all parts, with emphasis on Parts IV and V. Balance architectural knowledge with post-training and evaluation expertise.
For Quick Review: Each chapter contains note boxes highlighting interview-critical concepts and example boxes with concrete implementations.
The chapters are designed to be read sequentially, building on concepts from earlier material. However, each chapter is also self-contained with cross-references to related sections. Key terms are highlighted in red bold when first introduced.
What Makes This Different
This booklet synthesizes material from research papers, production systems, and practical interview questions. Unlike textbooks that focus on theory or tutorials that focus on implementation, this material bridges both:
Mathematical Rigor: Full derivations of key results (bias-variance decomposition, attention gradients, ELBO for variational inference)
Practical Context: Why specific design choices matter (Why RoPE over absolute positional encoding? Why GQA over MHA?)
Production Reality: Real-world considerations (memory costs of MoE, when to use quantization, distributed training tradeoffs)
Interview Focus: Common interview questions embedded throughout, with interviewer perspective notes
Continuous Updates
The field of AI/ML evolves rapidly. This booklet (Version 1.0, January 11, 2026) reflects the state of the art as of early 2026, including recent innovations like DeepSeek Multi-head Latent Attention, Qwen’s Mixture of Experts, agentic frameworks and recursive orchestration patterns, and modern distributed training systems (DeepSpeed ZeRO-3, FSDP2). I intend to keep this a living document: as new architectures, training techniques, and production practices emerge, updates will incorporate them with the same practitioner-focused lens — theory grounded in implementation, tradeoffs made explicit.
Prerequisites
Readers should have:
Solid understanding of calculus (derivatives, chain rule, partial derivatives)
Linear algebra (matrix multiplication, eigenvalues, SVD)
Probability theory (expectations, conditional probability, Bayes’ rule)
Basic programming experience (Python)
Familiarity with basic ML concepts (training/test sets, overfitting)