Chapter 10 checkpoint

You can now reason about running ML and LLM workloads on cloud infrastructure. Recall the chapter, then prove it.

The throughline

ML behavior = code + data + model, so all three are versioned. Models silently degrade as inputs (data drift) or correct answers (concept drift) shift — a deploy is the start of monitoring, not the end. Training-serving skew (features computed two different ways) poisons accuracy while the model is blameless.
Three systems give reproducibility: experiment tracking (MLflow, W&B) logs code/data/params/metrics; the model registry is the staged, immutable source of truth that makes rollback a metadata flip; the feature store (Feast) computes each feature once and serves the same values to both training and serving, killing skew. DVC versions big datasets.
ML pipelines are CI/CD as code, with data validation and an eval gate, plus a third leg: CT (continuous training), which retrains on a trigger (schedule or detected drift) and closes the monitor → retrain → redeploy loop. They are orchestrated by Airflow/Dagster/Prefect/Kubeflow.
GPUs are the dominant cost. Schedule them on K8s (GPU Operator, tainted node pools), share them (time-slicing/MIG), use spot GPUs for interruptible training, and scale-to-zero for idle serving — an idle GPU is money on fire.
Serving is online (real-time) vs batch (bulk/scheduled/spot-friendly) vs streaming; model servers (KServe, Triton, BentoML, Seldon, Ray Serve) provide batching (throughput vs latency) and autoscaling.
LLMOps: continuous batching + KV-cache + quantization (vLLM) for serving; RAG = chunk → embed → vector DB (HNSW/IVF) with a recall vs latency vs cost trade-off; eval as a CI regression gate (LLM-as-judge, RAGAS) not vibes; guardrails for hallucination/PII/prompt-injection; and token/latency/cost observability (LangSmith, Langfuse, Phoenix, Helicone).
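The eval gate that appears in both the pipeline and LLMOps points above can be sketched in a few lines. This is a minimal illustration, not any particular tool's API: the function name `eval_gate`, the `GateResult` type, the 0.01 tolerance, and the metric values are all assumptions made for the example. The metric names are borrowed from RAGAS purely as labels, and every metric is assumed to be higher-is-better.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    """Outcome of comparing a candidate run against the baseline (hypothetical type)."""
    passed: bool
    failures: list[str]


def eval_gate(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.01,  # assumed allowance for run-to-run noise
) -> GateResult:
    """Fail if any higher-is-better metric falls more than `tolerance` below baseline."""
    failures: list[str] = []
    for name, base in baseline.items():
        cand = candidate.get(name)
        if cand is None:
            # A metric that disappeared counts as a regression, not a pass.
            failures.append(f"{name}: missing from candidate run")
        elif cand < base - tolerance:
            failures.append(f"{name}: {cand:.3f} is below baseline {base:.3f}")
    return GateResult(passed=not failures, failures=failures)


# Illustrative scores only; a real pipeline would load these from eval runs.
baseline_scores = {"faithfulness": 0.90, "answer_relevancy": 0.85}
candidate_scores = {"faithfulness": 0.84, "answer_relevancy": 0.86}

result = eval_gate(baseline_scores, candidate_scores)
for failure in result.failures:
    print("REGRESSION", failure)
```

In a CI job, the step would exit non-zero when `result.passed` is false, so a prompt, model, or retrieval change that regresses a tracked metric blocks the deploy instead of being judged by eye.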

Quiz

Required checkpoint

Chapter 10 — MLOps & LLMOps


That completes Chapter 10: you can explain why ML breaks normal ops, make models reproducible with tracking/registries/feature stores, automate them with CI/CD/CT pipelines, schedule and cost-control the GPUs that dominate the bill, serve them with the right inference pattern, and run the full LLMOps stack — RAG, eval gates, guardrails, and token observability. For how the models themselves work — training, prompting, RAG design, agents, evaluation — cross to the Modern AI Guide; this chapter was the infrastructure beneath it. The final chapter steps back to scale, decisions, and your path forward.

Next: Chapter 11: Scale, Decisions & Career →
