Serving & inference patterns

A registered, reproducible model sitting in storage does nothing. Serving — exposing the model so applications can get predictions from it — is where it earns its keep, and it's a distinct engineering problem with its own patterns and tools. This lesson is the "make the model answer requests in production" core, building straight on the serving half of the training-vs-serving split (10.3) and the GPU cost levers (10.4).

Three ways to run inference

Inference is just the act of getting a prediction out of a trained model. There isn't one way to serve it; there are three, chosen by how the predictions are consumed:

  • Online (real-time) inference — the model answers one request at a time, synchronously, with a tight latency budget. A user submits a transaction; you need a fraud score in 50 ms. This is an always-on, low-latency service (Chapter 4's deployment + service pattern).
  • Batch inference — you score a large dataset offline, on a schedule, optimizing for throughput and cost, not latency. "Every night, predict churn for all 10 million customers." It's a job (lesson 10.3), and it can ride cheap spot GPUs because it's interruptible.
  • Streaming inference — you score an unbounded stream of events continuously as they arrive — e.g., scoring every event off a message queue in near-real-time.

The durable choice: don't ask "how do I serve my model?" — ask "how are the predictions consumed?" Real-time use → online; bulk/scheduled → batch (cheaper, spot-friendly); event stream → streaming.
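The rule above can be written down as a tiny decision function. This is only a sketch: the two boolean questions and the names `InferenceMode` and `choose_mode` are illustrative assumptions, not part of any serving framework.

```python
from enum import Enum


class InferenceMode(Enum):
    ONLINE = "online"
    BATCH = "batch"
    STREAMING = "streaming"


def choose_mode(caller_waits_for_result: bool, input_is_event_stream: bool) -> InferenceMode:
    """Pick a serving mode from how predictions are consumed (hypothetical helper)."""
    if caller_waits_for_result:
        # Someone is blocked on the answer: a tight latency budget applies.
        return InferenceMode.ONLINE
    if input_is_event_stream:
        # Unbounded events scored continuously as they arrive.
        return InferenceMode.STREAMING
    # Bounded dataset scored on a schedule: optimize for throughput and cost.
    return InferenceMode.BATCH


print(choose_mode(caller_waits_for_result=True, input_is_event_stream=False).value)
print(choose_mode(caller_waits_for_result=False, input_is_event_stream=False).value)
print(choose_mode(caller_waits_for_result=False, input_is_event_stream=True).value)
```

Real systems often mix modes (for example, an online endpoint for checkout plus a nightly batch job for reporting), so treat this as a per-use-case question rather than a per-model one.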

A common and costly mistake is forcing batch-shaped work through an always-on online endpoint — paying for a warm GPU service to do what a nightly job on spot GPUs would do for a fraction of the cost.
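A back-of-the-envelope comparison makes the gap concrete. Every number below (the hourly prices, the two-hour nightly run, 730 hours per month) is a made-up assumption for illustration; substitute your own provider's rates.

```python
def monthly_cost_always_on(price_per_gpu_hour, replicas=1, hours_per_month=730):
    """Cost of keeping `replicas` GPUs warm around the clock."""
    return price_per_gpu_hour * replicas * hours_per_month


def monthly_cost_nightly_batch(price_per_gpu_hour, gpus, hours_per_run, runs_per_month=30):
    """Cost of a scheduled job that only holds GPUs while it runs."""
    return price_per_gpu_hour * gpus * hours_per_run * runs_per_month


# Hypothetical prices: $2.00/h on-demand, $0.60/h spot.
online = monthly_cost_always_on(price_per_gpu_hour=2.00)
batch = monthly_cost_nightly_batch(price_per_gpu_hour=0.60, gpus=1, hours_per_run=2.0)

print(f"always-on endpoint: ${online:,.2f} per month")
print(f"nightly spot job:   ${batch:,.2f} per month")
print(f"ratio: {online / batch:.0f}x")
```

The exact ratio depends entirely on the assumed prices and run time; the point is that the always-on endpoint pays for every idle hour, while the job pays only for the hours it works.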

Model servers: don't write the serving layer yourself

You could wrap a model in a hand-rolled web app. You shouldn't, because production serving needs a lot of non-trivial machinery: efficient request handling, batching, GPU scheduling, autoscaling, hosting multiple models, versioned rollouts, and metrics. A model server is purpose-built software that provides all of that. The names to know:

  • NVIDIA Triton Inference Server — high-performance, multi-framework, multi-model serving with strong GPU batching; the heavyweight for raw inference performance.
  • KServe — Kubernetes-native model serving with a standard interface, autoscaling, and scale-to-zero (10.4) built in. The cloud-native default on K8s.
  • BentoML — packages a model plus its serving code into a deployable "bento" with a great developer experience; pairs with its deployment layer.
  • Seldon Core — Kubernetes-native serving with advanced deployment topologies (multi-step inference graphs, canary, A/B).
  • Ray Serve — model serving built on Ray (a distributed-Python framework), strong for Python-heavy and compositional serving.

They overlap heavily. The role — "production-grade model serving so I don't reinvent batching, autoscaling, and metrics" — is what's durable; which one a team picks is ecosystem-driven.
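One practical consequence of the overlap is a shared wire format: Triton and KServe both accept the Open Inference Protocol (often called the "V2" protocol), where a prediction is a `POST` to `/v2/models/<model>/infer` with a JSON list of named tensors. The sketch below only builds the request path and body with the standard library and sends nothing; the model name `fraud-scorer` and input name `features` are hypothetical, and a real model's input names, shapes, and datatypes come from its own metadata.

```python
import json


def infer_path(model_name):
    """Path of the V2 inference endpoint for a model."""
    return f"/v2/models/{model_name}/infer"


def build_v2_infer_request(input_name, rows):
    """Build a V2 request body for a 2-D float input (one row per example)."""
    n_cols = len(rows[0])
    if any(len(row) != n_cols for row in rows):
        raise ValueError("all rows must have the same length")
    return {
        "inputs": [
            {
                "name": input_name,              # hypothetical tensor name
                "shape": [len(rows), n_cols],    # batch of len(rows) examples
                "datatype": "FP32",
                # Tensor data flattened in row-major order.
                "data": [float(x) for row in rows for x in row],
            }
        ]
    }


body = build_v2_infer_request("features", [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
print(infer_path("fraud-scorer"))
print(json.dumps(body, indent=2))
```

Because the request shape is the same across servers that implement the protocol, client code written this way is less tied to whichever model server the team ends up picking.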

Batching: the throughput lever

The single most important serving optimization is batching: instead of running the GPU once per request, the server briefly collects several incoming requests and runs them through the model together in one GPU pass. Because GPUs are massively parallel, a pass over 16 requests typically takes far less than 16 times as long as a pass over one, provided the batch fits in GPU memory and a single request was not already saturating the GPU. Batching therefore multiplies throughput, which means fewer GPUs for the same traffic (a direct cost win from 10.4).

The trade-off is latency: waiting a few milliseconds to fill a batch adds delay to each request. So batching is a throughput-vs-latency dial you tune to your latency budget. (For LLMs this gets a powerful upgrade — continuous batching — in lesson 10.6.)
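The dial usually has two knobs: a maximum batch size and a maximum time to wait for the batch to fill. Below is a minimal, single-threaded sketch of that collection loop using only the standard library. The names `collect_batch` and `fake_model` are hypothetical, and real model servers implement this far more carefully (per-model queues, padding, priorities), so read it as an illustration of the idea rather than how any particular server works.

```python
import queue
import time


def collect_batch(requests, max_batch_size, max_wait_s):
    """Block for the first request, then gather more until the batch is full
    or max_wait_s has elapsed since the first request was picked up."""
    batch = [requests.get()]                  # wait as long as needed for request #1
    deadline = time.monotonic() + max_wait_s  # latency budget spent on filling the batch
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                             # budget used up: run with what we have
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break                             # no more traffic arrived in time
    return batch


def fake_model(batch):
    """Stand-in for one GPU forward pass over the whole batch."""
    return [x * 2 for x in batch]


pending = queue.Queue()
for value in range(10):
    pending.put(value)

first = collect_batch(pending, max_batch_size=4, max_wait_s=0.05)
print(first, fake_model(first))
```

Raising `max_batch_size` or `max_wait_s` favors throughput; lowering them favors latency. Under heavy traffic the batch fills before the timer matters, so the wait is mostly paid when traffic is light.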

Autoscaling, scale-to-zero, and keeping models warm

Serving infrastructure must track traffic, and this is where serving meets GPU cost head-on. Autoscaling adds replicas under load and removes them as traffic falls — the Chapter 4 idea, now over expensive GPU replicas, so each scaling decision has real money attached. Scale-to-zero (10.4) drops an idle endpoint to zero GPUs and zero cost.

The tension is the warm-vs-cold trade-off, and it's sharper for models than for web apps because model artifacts are large and loading them into GPU memory is slow:

  • Keep it warm (always ≥1 replica) → instant responses, but you pay for an idle GPU during quiet periods.
  • Scale to zero → $0 when idle, but the first request after idle eats a cold start while a GPU spins up and the model loads.

Diagram: Request volume → Autoscaler. Busy → add GPU replicas (batch requests together). Idle, can tolerate cold start → scale to zero ($0). Idle, latency-critical → keep ≥1 warm (pay to stay fast).

You decide per endpoint: latency-critical, steady traffic → keep warm; spiky, internal, or dev → scale to zero. This is the everyday cost/latency judgment of model serving.
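In most autoscalers the warm-vs-cold choice comes down to one setting, the minimum replica count. Here is a simplified sketch of the replica calculation; the function name, the requests-per-second capacity figure, and the limits are assumptions for illustration, and real autoscalers add smoothing windows and cooldowns on top.

```python
import math


def desired_replicas(requests_per_s, per_replica_capacity_rps, min_replicas, max_replicas):
    """Replicas needed to serve current traffic, clamped to the endpoint's limits."""
    if per_replica_capacity_rps <= 0:
        raise ValueError("per_replica_capacity_rps must be positive")
    needed = math.ceil(requests_per_s / per_replica_capacity_rps)
    return max(min_replicas, min(max_replicas, needed))


# Hypothetical endpoint: each GPU replica handles about 50 requests/s with batching.
for rps in (0, 20, 120, 1000):
    warm = desired_replicas(rps, 50, min_replicas=1, max_replicas=8)
    cold = desired_replicas(rps, 50, min_replicas=0, max_replicas=8)
    print(f"{rps:>5} req/s -> keep-warm: {warm} replicas, scale-to-zero: {cold} replicas")
```

The two policies only differ when traffic is zero: `min_replicas=1` keeps paying for one idle GPU, while `min_replicas=0` pays nothing and accepts a cold start on the next request.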

:::tip Durable vs dated
The patterns (online vs batch vs streaming, batching as a throughput/latency dial, autoscaling and warm-vs-cold) are durable. The model servers (Triton, KServe, BentoML, Seldon, Ray Serve) are dated and crowded; new ones appear yearly. Choose by "which gives me production-grade batching, autoscaling, and metrics in my stack," not by name recognition.
:::

Why it matters

Serving turns a stored model into something that answers in production, and the first decision is how predictions are consumed: online (real-time, low-latency, always-on service), batch (bulk, scheduled, throughput-optimized, spot-friendly job), or streaming (continuous over an event stream). Forcing batch work onto an online endpoint quietly wastes GPU money. Don't hand-roll serving; model servers (Triton, KServe, BentoML, Seldon Core, Ray Serve) provide batching, autoscaling, multi-model hosting, and metrics. Batching is the key throughput lever: running many requests in one GPU pass cuts the GPU count, at the price of a few milliseconds of added latency. Autoscaling and scale-to-zero control GPU cost, balanced against cold-start latency per endpoint. All of this is general ML serving; the final lesson specializes it for the workload reshaping the field, large language models, and the LLMOps stack around them.
