Reproducibility: tracking, registries & feature stores
Lesson 10.1 left us with three demands that ordinary ops never makes: every model must be reproducible, you must always know which model is in production (and be able to roll back), and features must be computed consistently across training and serving. Three pieces of MLOps infrastructure answer exactly those three demands: experiment tracking, the model registry, and the feature store. This lesson is the "remember which model, and rebuild it on demand" layer of the chapter.
Experiment tracking: never lose how a model was made
Training a model is a search. You try a learning rate, a dataset slice, a network size; you measure accuracy; you try again — dozens or hundreds of times. Without discipline, the winning run is a folder called final_v3_actually_final and nobody can say what made it good.
Experiment tracking is the practice of automatically logging, for every training run: the code version (Git commit), the data version, the hyperparameters (the settings you chose, like learning rate), the metrics (accuracy, loss), and the resulting model artifact. Now every run is a queryable record, and any model traces back to the exact ingredients that produced it — that's reproducibility from 10.1, made real.
:::note Term: hyperparameter
A hyperparameter is a configuration value you set before training (learning rate, number of layers, batch size), as opposed to the weights the model learns. Different hyperparameters → different models, which is exactly why they must be tracked.
:::
The standard tools are MLflow (open-source, the de facto baseline) and Weights & Biases (W&B; popular hosted experiment tracking with rich dashboards). You add a few lines to your training script and each run is logged automatically.
import mlflow

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)        # the settings used
    mlflow.log_param("data_version", "v2025-06")   # which data
    # ... train the model ...
    mlflow.log_metric("accuracy", 0.94)            # how it did
    mlflow.log_artifact("model.pkl")               # the model itself
Read it as: "record what I chose, what data I used, how it scored, and the model that came out." Run that 200 times and you have a complete, comparable history instead of a folder of mystery files.
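Once every run is a record, picking the winner becomes a query instead of a folder search. The sketch below is a plain-Python stand-in for a tracking backend, not MLflow's or W&B's API: the `Run` record, its field names, the `best_run` helper, and all sample values are hypothetical and exist only to illustrate the idea.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Run:
    """One training run, holding the items a tracker would log (hypothetical schema)."""
    run_id: str
    git_commit: str
    data_version: str
    params: dict = field(default_factory=dict)
    metrics: dict = field(default_factory=dict)


def best_run(runs, metric, data_version=None):
    """Return the run with the highest `metric`, optionally restricted to one data version."""
    candidates = [
        r for r in runs
        if metric in r.metrics and (data_version is None or r.data_version == data_version)
    ]
    if not candidates:
        raise ValueError("no matching runs")
    return max(candidates, key=lambda r: r.metrics[metric])


# Illustrative values only; a real tracker would fill these in during training.
history = [
    Run("run-001", "a1b2c3d", "v2025-05", {"learning_rate": 0.1}, {"accuracy": 0.91}),
    Run("run-002", "a1b2c3d", "v2025-06", {"learning_rate": 0.01}, {"accuracy": 0.94}),
    Run("run-003", "e4f5a6b", "v2025-06", {"learning_rate": 0.001}, {"accuracy": 0.92}),
]

winner = best_run(history, "accuracy", data_version="v2025-06")
print(winner.run_id, winner.git_commit, winner.params)
```

The winning record carries its own commit, data version, and hyperparameters, which is what makes the run rebuildable rather than just identifiable.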
Versioning the data too: DVC
Code goes in Git. But ML datasets are often gigabytes or terabytes — too big for Git, which is built for small text files. DVC (Data Version Control) solves this: it stores a tiny pointer in Git while the actual large file lives in object storage (Chapter 2), giving you Git-like versions of datasets and model artifacts without bloating the repo. Now "data_version v2025-06" in the tracking log above refers to a real, retrievable, immutable snapshot — completing the "version the data" requirement from 10.1.
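The pointer idea can be sketched with the standard library alone. This is an illustration of content-addressed storage, not DVC's actual file format or commands: the `add_to_store` and `checkout` helpers, the `.ptr.json` suffix, and the local `object-store` directory standing in for remote object storage are all assumptions made for the example.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def add_to_store(data_file: Path, store_dir: Path) -> Path:
    """Copy data_file into a content-addressed store and write a small pointer next to it."""
    content = data_file.read_bytes()  # fine for a sketch; real tools stream large files
    digest = hashlib.sha256(content).hexdigest()
    store_dir.mkdir(parents=True, exist_ok=True)
    (store_dir / digest).write_bytes(content)  # identical bytes always land at the same key
    pointer = data_file.parent / (data_file.name + ".ptr.json")
    pointer.write_text(json.dumps({"path": data_file.name, "sha256": digest, "size": len(content)}))
    return pointer


def checkout(pointer: Path, store_dir: Path) -> bytes:
    """Fetch the exact bytes a pointer refers to, verifying them against the recorded hash."""
    meta = json.loads(pointer.read_text())
    content = (store_dir / meta["sha256"]).read_bytes()
    if hashlib.sha256(content).hexdigest() != meta["sha256"]:
        raise ValueError("stored object does not match pointer")
    return content


with tempfile.TemporaryDirectory() as tmp:
    workdir, store = Path(tmp) / "repo", Path(tmp) / "object-store"
    workdir.mkdir()
    dataset = workdir / "train.csv"
    dataset.write_text("user_id,amount\n1,9.99\n2,25.00\n")

    pointer_file = add_to_store(dataset, store)   # only this small file would go into Git
    pointer_size = pointer_file.stat().st_size
    restored = checkout(pointer_file, store)
    original = dataset.read_bytes()
```

Because the pointer records a hash of the content, committing the pointer pins the dataset: a given commit can only ever resolve to one set of bytes.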
The model registry: which model is in production, and rollback
Experiment tracking records all your runs. But of the hundreds of models you trained, one is serving production traffic — and you need a clear, governed answer to "which one, and how do I revert?" That's the model registry.
A model registry is a versioned catalog of model artifacts plus their metadata, with explicit stages — typically Staging → Production → Archived. It is to models what a container registry (Chapter 4) is to images, and it buys you the same things:
- Versioning: fraud-model:7, fraud-model:8, each immutable, each linked back to its training run, data, and metrics.
- A single source of truth: "what is in production?" has one authoritative answer, not tribal knowledge.
- Instant rollback: a bad model? Promote the previous version back to Production. Because every version is retained and immutable, rollback is a metadata change, not a frantic retrain.
- Governance: promotion can require approvals, passing eval gates, or sign-off (we wire eval into CI in lesson 10.6).
MLflow includes a registry; cloud platforms (SageMaker, Vertex AI, Azure ML) each provide their own. The gap to avoid: without a registry, deployments aren't reproducible and there is no safe path to roll back. You're back to the final_v3 folder, guessing what's live and hoping the previous file still exists.
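To make "rollback is a metadata change" concrete, here is a toy in-memory registry. It is a sketch under stated assumptions, not the interface of MLflow or any cloud registry: the `ModelRegistry` class, its method names, and the run IDs are hypothetical, and a real registry would also store the artifacts and enforce approvals.

```python
class ModelRegistry:
    """Toy registry: immutable numbered versions plus a movable production pointer."""

    def __init__(self):
        self._versions = {}    # version number -> metadata, never modified after registration
        self._production = []  # history of promoted versions, most recent last

    def register(self, run_id, metrics):
        """Record a new version linked to the training run that produced it."""
        version = len(self._versions) + 1
        self._versions[version] = {"run_id": run_id, "metrics": dict(metrics)}
        return version

    def promote(self, version):
        """Make an already registered version the production model."""
        if version not in self._versions:
            raise KeyError(f"unknown version {version}")
        self._production.append(version)

    def rollback(self):
        """Point production back at the previously promoted version."""
        if len(self._production) < 2:
            raise RuntimeError("no earlier production version to roll back to")
        self._production.pop()
        return self._production[-1]

    @property
    def production(self):
        return self._production[-1] if self._production else None


registry = ModelRegistry()
older = registry.register("run-041", {"accuracy": 0.94})
newer = registry.register("run-057", {"accuracy": 0.95})
registry.promote(older)
registry.promote(newer)          # suppose this version misbehaves on live traffic
reverted_to = registry.rollback()
print(registry.production)
```

Nothing is retrained or re-uploaded during `rollback`; only the pointer moves, which is why retaining every version matters.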
Feature stores: one definition, served to training and serving
Recall training-serving skew from 10.1: the same feature computed two different ways in training and serving silently wrecks accuracy. A feature store is the durable fix. It is a central system that computes each feature once and serves it to both worlds:
- An offline store serves large historical feature sets to training (high throughput, batch).
- An online store serves the same feature values to serving with low latency, per request.
Because both pull from a single feature definition, "average purchase over the last 7 days" means exactly the same thing in training and in production — skew eliminated by construction. Feature stores add real benefits beyond that: features become reusable across teams and models (define customer_age once, everyone uses it), and feature values are versioned so a training set is reproducible.
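The "single definition" idea can be shown in a few lines. This is a minimal sketch, not Feast's API or how a production feature store is built (real systems precompute and store values rather than recomputing per request): the function names, the `(timestamp, amount)` data layout, and the sample purchases are assumptions for illustration.

```python
from datetime import datetime, timedelta


def avg_purchase_7d(purchases, as_of):
    """The one feature definition: mean purchase amount in the 7 days before `as_of`.

    `purchases` is a list of (timestamp, amount) pairs for a single customer.
    """
    window_start = as_of - timedelta(days=7)
    amounts = [amount for ts, amount in purchases if window_start <= ts < as_of]
    return sum(amounts) / len(amounts) if amounts else 0.0


def build_training_rows(purchases_by_customer, label_times):
    """Offline path: compute the feature in batch, as of each historical label time."""
    return {
        customer: avg_purchase_7d(purchases_by_customer[customer], as_of)
        for customer, as_of in label_times.items()
    }


def online_lookup(purchases_by_customer, customer, now):
    """Online path: compute the same feature for one customer at request time."""
    return avg_purchase_7d(purchases_by_customer[customer], now)


t = datetime(2025, 6, 15, 12, 0)
purchases = {
    "c1": [
        (t - timedelta(days=10), 100.0),  # outside the 7-day window, ignored
        (t - timedelta(days=3), 20.0),
        (t - timedelta(days=1), 40.0),
    ],
}

offline_value = build_training_rows(purchases, {"c1": t})["c1"]
online_value = online_lookup(purchases, "c1", t)
print(offline_value, online_value)
```

Both paths call the same function, so a change to the window or the aggregation lands in training and serving together. Passing `as_of` explicitly also keeps the offline path from peeking at purchases made after the label time.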
The open-source standard is Feast; the big clouds offer managed feature stores too. A feature store is overkill for a single small model — but the moment multiple models and a serving path are involved, it's the clean answer to skew.
:::tip Durable vs dated
Tracking, a registry, and a feature store are durable roles: most mature ML platforms fill all three, whatever the labels. The specific products (MLflow, W&B, DVC, Feast) are the part that dates. When you join a team, don't ask "do they use MLflow?" Ask "where do experiments get tracked, what's the source of truth for the production model, and how are features kept consistent?" Those questions are forever.
:::
Why it matters
Three systems turn the reproducibility demands of 10.1 into reality. Experiment tracking (MLflow, W&B) logs every run's code, data, hyperparameters, and metrics, so any model traces back to exactly what made it, and DVC versions the large datasets that won't fit in Git. The model registry gives one authoritative, staged, immutable catalog of model versions, so "what's in production?" has a real answer and rollback is a metadata flip, not a retrain; without it, deployments aren't reproducible and you can't safely revert. The feature store computes each feature once and serves it to both training and serving, killing training-serving skew by construction while making features reusable and versioned. With models now reproducible and traceable, the next question is automation: how do we retrain and redeploy them as pipelines, including the ML-only idea of retraining on a trigger?