Chapter 10 checkpoint
You can now reason about running ML and LLM workloads on cloud infrastructure. Recall the chapter, then prove it.
The throughline
- ML behavior = code + data + model, so all three are versioned. Models silently degrade as inputs (data drift) or correct answers (concept drift) shift — a deploy is the start of monitoring, not the end. Training-serving skew (features computed two different ways) poisons accuracy while the model is blameless.
- Three systems give reproducibility: experiment tracking (MLflow, W&B) logs code/data/params/metrics; the model registry is the staged, immutable source of truth that makes rollback a metadata flip; the feature store (Feast) computes each feature once and serves the same values to both training and inference, killing skew. DVC versions big datasets.
- ML pipelines are CI/CD as code with data validation and an eval gate, plus a third leg, CT (continuous training): retraining on a trigger (schedule or detected drift), which closes the monitor → retrain → redeploy loop. Orchestrated by Airflow/Dagster/Prefect/Kubeflow.
- GPUs are the dominant cost. Schedule them on K8s (GPU Operator, tainted node pools), share them (time-slicing/MIG), use spot GPUs for interruptible training, and scale-to-zero for idle serving — an idle GPU is money on fire.
- Serving is online (real-time) vs batch (bulk/scheduled/spot-friendly) vs streaming; model servers (KServe, Triton, BentoML, Seldon, Ray Serve) provide batching (throughput vs latency) and autoscaling.
- LLMOps: continuous batching + KV-cache + quantization (vLLM) for serving; RAG = chunk → embed → vector DB (HNSW/IVF) with a recall vs latency vs cost trade-off; eval as a CI regression gate (LLM-as-judge, RAGAS) not vibes; guardrails for hallucination/PII/prompt-injection; and token/latency/cost observability (LangSmith, Langfuse, Phoenix, Helicone).
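To make the monitor → retrain → redeploy loop from the list above concrete, here is a minimal sketch in plain Python. It assumes data drift is measured with the Population Stability Index (PSI) on a single numeric feature, and that 0.2 is the drift threshold (a common rule of thumb, not something this chapter prescribes). The names `psi`, `ct_step`, `retrain`, and `evaluate` are hypothetical; a real pipeline would run these steps in an orchestrator and promote through a model registry.

```python
import math
from typing import Callable, Sequence


def psi(expected: Sequence[float], actual: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between a reference sample and a live sample.

    Bins are equal-width over the reference range; live values outside that
    range are clamped into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = ((hi - lo) / bins) or 1.0  # avoid zero width for constant features

    def proportions(values: Sequence[float]) -> list:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor empty bins at a tiny value so the logarithm stays defined.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def ct_step(
    reference: Sequence[float],
    live: Sequence[float],
    retrain: Callable[[], object],
    evaluate: Callable[[object], float],
    production_score: float,
    psi_threshold: float = 0.2,  # assumed rule-of-thumb threshold
) -> str:
    """One pass of a continuous-training loop: monitor, retrain, gate."""
    # Monitor: no meaningful drift means no retraining.
    if psi(reference, live) < psi_threshold:
        return "no_action"
    # Retrain: triggered by detected drift.
    candidate = retrain()
    # Eval gate: promote only if the candidate is at least as good as production.
    if evaluate(candidate) >= production_score:
        return "promote"
    return "reject"
```

The eval gate is what keeps a drift-triggered retrain from shipping a worse model: `ct_step` returns `"reject"` whenever the candidate scores below the current production model, so redeployment is never automatic on drift alone.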
Quiz
Chapter 10 — MLOps & LLMOps
That completes Chapter 10: you can explain why ML breaks normal ops, make models reproducible with tracking/registries/feature stores, automate them with CI/CD/CT pipelines, schedule and cost-control the GPUs that dominate the bill, serve them with the right inference pattern, and run the full LLMOps stack: RAG, eval gates, guardrails, and token observability. For how the models themselves work (training, prompting, RAG design, agents, evaluation), cross to the Modern AI Guide; this chapter was the infrastructure beneath it. The final chapter steps back to scale, decisions, and your path forward.