Chapter 10 · MLOps / LLMOps
:::note Scope
This chapter covers the cloud-infrastructure side of ML and LLMs: how to run these workloads. It does not teach how models work internally, prompting, RAG design, or evaluation. Those belong to the Modern AI Guide, which this chapter cross-links rather than duplicates. Here we build the infrastructure beneath that guide's applications.
:::
Machine learning and large language models are now mainstream cloud workloads, and they stress your infrastructure in ways ordinary web apps don't: enormous datasets, expensive specialized hardware (GPUs), heavy training jobs, and latency-sensitive inference serving. MLOps (and its LLM-flavored cousin LLMOps) is the practice of applying everything in this guide — IaC, containers, CI/CD, observability, FinOps — to the machine-learning lifecycle. It's the premium, fast-growing intersection of cloud and AI, and a major reason cloud-engineering demand keeps climbing.
Why this chapter matters
ML workloads break a lot of assumptions. The "code" includes huge data and model artifacts that must be versioned and reproduced. The compute is GPUs — scarce, expensive, and requiring special scheduling. Training is bursty and long-running; serving is latency-critical and costly to keep warm. Running all this reliably and affordably is a genuine cloud-engineering specialization, and the explosion of LLMs has made it one of the most sought-after skill sets of 2026. This chapter shows how the durable cloud concepts you already have extend to cover it.
The durable idea
MLOps applies cloud-engineering discipline — reproducibility, automation, observability, cost control — to the model and data lifecycle. The principles are the same; the new wrinkles are GPUs, large artifacts, and (for LLMs) serving foundation models efficiently.
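As a minimal sketch of what reproducibility means in this setting, the snippet below derives one fingerprint from the three inputs that determine a trained model's behavior: the code version, the training data, and the training configuration. The function name `run_fingerprint` and the field names are hypothetical, chosen only for illustration; in practice an experiment tracker or data-versioning tool would usually record this for you.

```python
import hashlib
import json


def run_fingerprint(code_version: str, data_bytes: bytes, config: dict) -> str:
    """Return one SHA-256 hex digest covering code, data, and config.

    Hypothetical helper: two runs with the same fingerprint had the same inputs.
    """
    record = {
        "code": code_version,                               # e.g. a git commit hash
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
        "config": config,                                   # hyperparameters, seeds, etc.
    }
    # sort_keys makes the serialized form independent of dict insertion order
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


a = run_fingerprint("abc123", b"rows-v1", {"lr": 0.01, "epochs": 3})
b = run_fingerprint("abc123", b"rows-v1", {"epochs": 3, "lr": 0.01})
c = run_fingerprint("abc123", b"rows-v2", {"lr": 0.01, "epochs": 3})

print(a == b)  # same inputs, different key order
print(a == c)  # the data changed
```

If any of the three inputs changes, the fingerprint changes, which is the property a model registry entry needs in order to be traced back to exactly what produced it.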
The MLOps lifecycle and infrastructure patterns are durable; specific frameworks, model-serving tools, and GPU instance types date quickly.
Lessons in this chapter
- 10.1 — Why ML breaks normal ops. The ML lifecycle vs the software lifecycle: behavior = code + data + model, reproducibility, the silent degradation of drift, and the training-serving skew problem.
- 10.2 — Reproducibility: tracking, registries & feature stores. Experiment tracking (MLflow, W&B), data versioning (DVC), the model registry (versioning + rollback), and feature stores (Feast) that kill skew.
- 10.3 — Pipelines: CI/CD/CT for models. Pipelines as code with data validation and eval gates; continuous training (the new leg), triggered by schedule or drift; orchestration (Airflow/Dagster/Prefect/Kubeflow); and training vs serving infra.
- 10.4 — GPU & accelerator infrastructure. Why GPUs dominate cost, plus scheduling on Kubernetes (GPU Operator, node pools), fractional/shared GPUs (time-slicing, MIG), spot GPUs for training, and scale-to-zero for idle serving.
- 10.5 — Serving & inference patterns. Online vs batch vs streaming inference, model servers (KServe, Triton, BentoML, Seldon, Ray Serve), batching as a throughput/latency dial, autoscaling, and warm-vs-cold trade-offs.
- 10.6 — The LLM serving stack & LLMOps. The LLM twist: inference optimization (vLLM, KV-cache, quantization), the RAG stack (chunking, embeddings, vector DBs, HNSW/IVF), eval gates (LLM-as-judge, RAGAS), guardrails, and token/cost observability.
- 10.7 — Checkpoint. Quiz on the ML lifecycle, reproducibility, pipelines/CT, GPU cost, serving patterns, and the LLMOps stack.
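
The continuous-training idea from lesson 10.3 can be sketched as two small checks: a trigger that fires when a drift score crosses a threshold or the model is older than the schedule allows, and an eval gate that blocks promotion unless the candidate is at least as good as the model currently serving. Every name and threshold here (`should_retrain`, `passes_eval_gate`, a drift threshold of 0.2, a 7-day maximum age) is a hypothetical placeholder, not taken from any particular tool.

```python
from datetime import datetime, timedelta


def should_retrain(
    drift_score: float,
    last_trained: datetime,
    now: datetime,
    drift_threshold: float = 0.2,           # assumed value, tune per model
    max_age: timedelta = timedelta(days=7),  # assumed retraining schedule
) -> bool:
    """Trigger retraining on drift or on schedule, whichever comes first."""
    drifted = drift_score > drift_threshold
    stale = (now - last_trained) >= max_age
    return drifted or stale


def passes_eval_gate(
    candidate_metric: float,
    current_metric: float,
    min_gain: float = 0.0,
) -> bool:
    """Allow promotion only if the candidate matches or beats the current model."""
    return candidate_metric >= current_metric + min_gain


now = datetime(2026, 1, 15)
fresh = datetime(2026, 1, 12)   # trained 3 days ago
old = datetime(2026, 1, 1)      # trained 14 days ago

print(should_retrain(0.05, fresh, now))  # no drift, not stale
print(should_retrain(0.35, fresh, now))  # drift above threshold
print(should_retrain(0.05, old, now))    # past the schedule
print(passes_eval_gate(0.91, 0.90))      # candidate is better
print(passes_eval_gate(0.88, 0.90))      # candidate is worse
```

An orchestrator would run the trigger on a timer and place the gate as the last pipeline step before the registry promotion, so a retrained model that regresses never reaches serving.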
Where this connects
- Across to the Modern AI Guide — for how models, prompting, RAG, agents, and evaluation actually work. This chapter is the infrastructure under that guide's applications.
- Back to Chapters 4, 5, 6, 9 — MLOps is those disciplines (containers, CI/CD, observability, FinOps) specialized for models and GPUs.