Why ML breaks normal ops

Everything earlier in this guide assumed a familiar shape: you write code, you version it in Git, you build it once, and if the tests pass it behaves the same in production as on your laptop. Machine learning quietly breaks every part of that sentence. The "program" is partly learned from data, the data keeps changing under you, and a deployment that passed every test can become wrong a month later without anyone touching it. This lesson explains why ML is different, so that the rest of the chapter (registries, pipelines, GPUs, serving) has a reason to exist.

The software lifecycle vs the ML lifecycle

In ordinary software, the behavior of your system lives entirely in code. Same code in, same behavior out — that determinism is what makes Git, CI, and "it passed tests, ship it" work.

A machine-learning system has three things that jointly determine its behavior, not one:

Code — the training script, the serving logic, the pipeline definitions. (You already know how to version this.)
Data — the dataset the model learned from. Change the data and you get a different model, even with identical code.
Model — the trained artifact (the learned weights) produced by running the code over the data. It is an output, but you ship it like an input.

The durable shift: in software, code is the only thing that determines behavior. In ML, behavior is a function of code and data and the resulting model — so all three must be versioned, tracked, and reproducible.

If you version only the code (the instinct every software engineer brings), you have lost the ability to answer the most basic question in ML operations: "which exact model is in production, and how do I rebuild it?" That question is the spine of this whole chapter.
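To make "version all three" concrete, here is a minimal sketch of a lineage ID derived from the code commit, the data contents, and the training config. The function name `fingerprint`, the commit string, and the sample bytes are all hypothetical, chosen for illustration; real registries record richer metadata than a single hash.

```python
import hashlib
import json


def fingerprint(code_commit: str, data_bytes: bytes, config: dict) -> str:
    """Return a short ID that changes if the code, data, or config changes."""
    data_hash = hashlib.sha256(data_bytes).hexdigest()
    # sort_keys keeps the JSON stable regardless of dict insertion order
    payload = json.dumps(
        {"code": code_commit, "data": data_hash, "config": config},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


base = fingerprint("abc123", b"row1\nrow2\n", {"lr": 0.01, "seed": 7})
same = fingerprint("abc123", b"row1\nrow2\n", {"seed": 7, "lr": 0.01})
new_data = fingerprint("abc123", b"row1\nrow2\nrow3\n", {"lr": 0.01, "seed": 7})

print(base == same)      # True: identical inputs give the same ID
print(base == new_data)  # False: same code, different data
```

The second comparison is the point: a Git commit alone would have called those two models "the same version."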

:::note Term: artifact
An artifact is a concrete file or bundle produced by your process; here, a trained model (its weights and config). MLOps treats large data and model artifacts as first-class versioned things, the way Chapter 4 treated container images.
:::

Reproducibility: rebuilding the exact same model

In Chapter 4 you learned that an immutable image tag means "the exact same bytes every time," which is what makes deploys and rollbacks reliable. ML needs the same guarantee, but it's harder to get. Reproducibility in ML means: given the same code, the same data, and the same configuration (hyperparameters, random seeds, library versions), you can rebuild the same model — or at least an equivalent one — on demand.

Why it's hard, and why it matters:

A model is the result of a long, often non-deterministic training run. Without recording the data version, the code commit, and the settings, "retrain it" can produce a different model.
When a model misbehaves in production, you need to reproduce the exact training conditions to debug it. If you can't rebuild it, you can't fix it with confidence.
Audits, compliance, and incident reviews all ask "how was this model produced?" — an unanswerable question without reproducibility.

The fix is experiment tracking (lesson 10.2): logging every run's inputs and outputs so any model can be traced back to the exact code + data + config that made it.
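As a rough illustration of why the recorded settings matter, the sketch below uses a toy `train` function (a stand-in for a real training run, not any library's API) whose result depends on a random seed. The `manifest` dictionary and its keys are hypothetical; the idea is only that a run replayed from its recorded inputs yields the same weights, while a run with an unrecorded seed does not.

```python
import random


def train(xs, ys, seed, lr=0.01, epochs=3):
    """Toy stand-in for training: fit y ~ w * x with seeded SGD."""
    rng = random.Random(seed)       # private RNG, so the seed fully controls the run
    w = rng.uniform(-1.0, 1.0)      # random initial weight
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)          # random example order each epoch
        for i in order:
            grad = 2 * (w * xs[i] - ys[i]) * xs[i]
            w -= lr * grad
    return w


# Hypothetical run record: everything needed to replay the run.
manifest = {"code_commit": "abc123", "data_version": "v1",
            "seed": 0, "lr": 0.01, "epochs": 3}
xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]

first = train(xs, ys, manifest["seed"], manifest["lr"], manifest["epochs"])
rebuilt = train(xs, ys, manifest["seed"], manifest["lr"], manifest["epochs"])
unrecorded = train(xs, ys, seed=1)  # someone "just retrained it"

print(first == rebuilt)     # True: replaying the manifest rebuilds the model
print(first == unrecorded)  # False: a different seed gives different weights
```

Real training adds further sources of non-determinism (GPU kernels, library versions, data ordering in distributed loaders), which is why the text says "or at least an equivalent one."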

Models drift: a "working" model silently degrades

Here is the failure that catches every team treating ML like normal software. You deploy a model. Every test passes. CI is green. Then, with no code change and no deploy, its accuracy quietly falls month after month. No alert fires, because nothing "broke" in the software sense.

This happens because a model is only as good as the assumption that the world it sees in production resembles the world it was trained on. That assumption decays:

Data drift — the inputs shift. Your fraud model was trained on last year's transaction patterns; this year's spending behavior is different, so the inputs no longer look like the training data.
Concept drift — the relationship between inputs and the right answer shifts. What counted as "fraud" changes as fraudsters adapt, so even unchanged inputs now map to different correct answers.

A web service that returns the right answer today will, barring changes to its code or dependencies, return the right answer next year. A model will not. This is the single biggest trap for teams new to ML: treating a model deploy as "done" when it is the start of something that decays. It is why this chapter covers drift monitoring and continuous training, which ordinary ops has no counterpart for.
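One common way to put a number on data drift is the Population Stability Index (PSI), which compares how a feature is distributed across bins in the training sample versus live traffic. The sketch below is a minimal pure-Python version. The bin edges, the sample values, and the 0.2 alert threshold are assumptions for illustration (0.2 is a widely quoted rule of thumb, not a standard). It detects shifts in inputs only; concept drift generally needs fresh labels to measure.

```python
import math


def psi(reference, live, edges):
    """Population Stability Index between two samples over shared bin edges."""
    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            # the number of edges at or below v is the bin index
            idx = sum(1 for e in edges if v >= e)
            counts[idx] += 1
        # a small floor avoids log(0) for empty bins
        return [max(c / len(values), 1e-6) for c in counts]

    ref, cur = shares(reference), shares(live)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


edges = [25, 50, 75]                               # hypothetical bin edges
train_amounts = [10, 20, 30, 40, 50, 60, 70, 80]   # what the model learned from
live_amounts = [60, 70, 80, 90, 100, 110, 120, 130]  # what production now sees

print(psi(train_amounts, train_amounts, edges))      # 0.0: no shift
print(psi(train_amounts, live_amounts, edges) > 0.2)  # True: inputs have drifted
```

A check like this runs on a schedule against recent production inputs, which is exactly the kind of monitoring that a green CI pipeline never performs.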

Training-serving skew: the same features computed two different ways

The other ML-specific trap is subtle and brutal. A model learns from features — the processed inputs, like "average purchase over the last 7 days." During training, those features are computed one way (say, a batch SQL query over historical data). During serving, the same feature must be computed again, live, for each incoming request — often by different code written by different people under different latency constraints.

If those two computations disagree even slightly — a different time window, a different default for missing values, a different unit — the model sees inputs at serving time that are unlike anything it trained on. Accuracy collapses, and it's maddening to debug because the model is fine; the features are inconsistent. This is training-serving skew.

Training-serving skew: features computed differently in training vs serving means the deployed model is fed inputs it never learned from. The model isn't wrong — the pipeline is.

The durable fix is to compute each feature in one place and serve it to both training and serving from that single source — the job of a feature store (lesson 10.2). Keep this failure in mind; it's the reason feature stores exist at all.
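The sketch below shows how small the disagreement can be. Both functions claim to compute "average purchase over the last 7 days", but they handle missing days differently. The function names, the sample week, and the `find_skew` helper are hypothetical, written only to illustrate the failure.

```python
def avg_purchase_training(amounts):
    """Batch-style version: treats a missing day (None) as a 0.0 purchase."""
    filled = [0.0 if a is None else a for a in amounts]
    return sum(filled) / len(filled)


def avg_purchase_serving(amounts):
    """Live version written separately: silently skips missing days."""
    present = [a for a in amounts if a is not None]
    return sum(present) / len(present) if present else 0.0


def find_skew(rows, training_fn, serving_fn, tolerance=1e-9):
    """Return the rows where the two code paths disagree on the feature."""
    return [r for r in rows if abs(training_fn(r) - serving_fn(r)) > tolerance]


week = [20.0, None, 40.0, None, None, 60.0, 20.0]
full_week = [10.0] * 7

print(avg_purchase_training(week))  # 20.0
print(avg_purchase_serving(week))   # 35.0
print(len(find_skew([week, full_week],
                    avg_purchase_training, avg_purchase_serving)))  # 1
```

The two paths agree on complete data and diverge only when values are missing, which is why skew often survives testing. Having both paths call one shared definition removes the disagreement by construction; a periodic comparison like `find_skew` is a cheap guard until then.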

So what is MLOps?

Put the pieces together and the definition writes itself. MLOps is the practice of applying the cloud-engineering discipline you already have — version control, CI/CD, observability, cost control, IaC — to the machine-learning lifecycle, while adding the things ML specifically needs: versioned data and models, reproducible experiments, consistent features, drift monitoring, and retraining. LLMOps is that same practice specialized for large language models, which we reach at the end of the chapter.

The encouraging part: you are not starting over. The reconciliation loop, immutable versioning, pipelines-as-code, and golden-signal monitoring all carry straight over. MLOps is those disciplines plus a handful of genuinely new wrinkles — and naming the wrinkles, as we just did, is most of the battle.

:::tip Durable vs dated
The lifecycle differences (data+model+code versioning, reproducibility, drift, training-serving skew) are durable; they'll be true of whatever ML looks like in ten years. The specific tools (MLflow, Feast, KServe, vLLM) named in this chapter are dated and churn fast. Learn the problems first; the tools are just current answers to them.
:::

Why it matters

Machine learning breaks the assumption every software engineer relies on: that behavior lives only in code. An ML system's behavior comes from code + data + model, so all three must be versioned and reproducible — without that you can't even say which model is in production or how to rebuild it. Worse, a model that passes every test silently degrades as the world drifts away from its training data (data drift) or the right answers change (concept drift), so a deploy is the start of monitoring, not the end. And training-serving skew — features computed differently in two places — quietly poisons accuracy while the model itself is blameless. MLOps applies your existing cloud discipline to this lifecycle and adds the ML-specific pieces; the rest of the chapter is those pieces, one at a time. (For how the models themselves actually work — training, prompting, evaluation — see the Modern AI Guide; this chapter is the infrastructure beneath it.)

Next: Reproducibility: tracking, registries & feature stores →
