GPU & accelerator infrastructure

If you remember one thing about ML infrastructure, make it this: the GPU is the dominant cost, and almost every ML-infra decision is really a cost decision about GPUs. A single high-end accelerator can cost more per hour than a rack of ordinary servers, and the cardinal sin — leaving expensive GPUs idle — is shockingly common. This lesson is the "manage the scarce, expensive accelerator" core of the chapter, and it leans directly on Chapter 4 (Kubernetes) and Chapter 9 (FinOps).

Why GPUs need special handling at all

A GPU (graphics processing unit) is a massively parallel processor. ML training and inference are mostly huge matrix multiplications, and GPUs do those vastly faster than ordinary CPUs — which is why modern ML is feasible at all. (TPUs and other accelerators are the same idea; "GPU" stands in for all of them here.) But three properties make them an infrastructure problem, not just a fast chip:

They're scarce and expensive. Supply is constrained and the per-hour price is high. Waste is extremely visible on the bill.
They're not automatically shared. Unlike CPU and memory, which Kubernetes slices finely across many pods, a GPU is by default an all-or-nothing resource — one pod claims the whole card. A tiny job can hog an entire expensive GPU.
They need special plumbing. Drivers, runtimes, and device awareness must be installed before the scheduler can even see a GPU as something to assign.

The durable constraint: GPUs are the scarce, expensive resource at the center of ML infrastructure. The whole game is keeping them busy with useful work and off the clock when idle.

Scheduling GPUs on Kubernetes

Chapter 4 taught that the Kubernetes scheduler places pods on nodes by their resource requests (CPU, memory). GPUs extend that model, but with extra setup. The pieces:

The NVIDIA GPU Operator automates installing the GPU drivers, the NVIDIA container toolkit, and the Kubernetes device plugin on GPU nodes, so Kubernetes can see GPUs and schedule pods onto them. Without those components (installed by the operator or by hand), the scheduler doesn't know GPUs exist.
A pod then requests a GPU much as it requests CPU or memory, except that GPUs are set under limits and only in whole numbers:

resources:
  limits:
    nvidia.com/gpu: 1   # this pod needs one whole GPU

GPU node pools. GPU machines are expensive, so you isolate them into their own node pool — a group of nodes with a particular machine type. CPU workloads run on cheap CPU nodes; only GPU workloads land on the costly GPU nodes. Taints and tolerations (a Kubernetes mechanism to repel pods from nodes unless they explicitly tolerate it) keep ordinary pods off the expensive GPU nodes, so you never pay GPU prices to run a web server.
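
As a rough sketch of how the node pool, the taint, and the GPU request fit together, the manifest below shows a pod that is allowed onto tainted GPU nodes. The pod name, the `pool: gpu` label, the `nvidia.com/gpu` taint key, and the image are illustrative assumptions, not values from this lesson; substitute whatever your cluster actually uses.

```yaml
# Sketch only: names, labels, taint key, and image are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-inference-example          # hypothetical pod name
spec:
  nodeSelector:
    pool: gpu                          # assumed label on nodes in the GPU pool
  tolerations:
    - key: nvidia.com/gpu              # assumed taint key applied to GPU nodes
      operator: Exists
      effect: NoSchedule               # matches a NoSchedule taint on those nodes
  containers:
    - name: model-server
      image: registry.example.com/ml/model-server:1.0   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1            # one whole GPU
```

The toleration only permits the pod to land on tainted nodes; it does not attract it there. The GPU request (and, here, the node selector) is what steers the pod to the GPU pool, while the taint keeps pods without the toleration off those nodes.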

Sharing one GPU: fractional GPUs

If a single small inference service only uses 10% of a GPU but claims the whole card, you're wasting 90% of an expensive resource. Fractional GPUs let multiple workloads share one physical GPU:

Time-slicing — the GPU rapidly switches between workloads, giving each a turn. Simple; workloads aren't truly isolated (they compete for memory and compute).
MIG (Multi-Instance GPU) — on supported NVIDIA hardware, the GPU is partitioned into several smaller, hardware-isolated GPUs, each safely assignable to a different pod.

The point is the same as bin-packing in Chapter 4: pack more useful work onto each expensive card so you buy fewer of them. For many small models or light inference services, fractional GPUs are a large, direct cost win.
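
The two sharing modes above might look roughly like this in configuration. The first fragment follows the time-slicing config format used by the NVIDIA device plugin; the replica count of 4 is an arbitrary assumption. The second is a pod resource fragment requesting one MIG slice; the profile name `mig-1g.5gb` is only an example, since the available profiles depend on the GPU model and on how the cluster exposes MIG devices.

```yaml
# Fragment 1: device plugin time-slicing config (sketch; replicas is an assumption).
# Each physical GPU is advertised as 4 schedulable nvidia.com/gpu resources.
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4
---
# Fragment 2: container resources requesting one MIG slice (example profile name).
# This resource name is exposed only when MIG devices are advertised per profile.
resources:
  limits:
    nvidia.com/mig-1g.5gb: 1
```

With time-slicing, the four "GPUs" are still one card with no memory isolation between pods, so the replica count is a packing decision rather than a guarantee. With MIG, each slice has its own dedicated share of memory and compute.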

Spot GPUs for training, scale-to-zero for serving

Now we apply Chapter 9's FinOps levers specifically to the most expensive resource you own — and the right lever differs for training vs serving (the split from lesson 10.3).

Training → spot/preemptible GPUs. Training is a bursty, interruptible job. That's the perfect fit for spot (a.k.a. preemptible) instances: spare cloud capacity at a steep discount that the provider can reclaim at any moment. Because training jobs checkpoint (periodically save progress), a preemption just means resuming from the last checkpoint on another GPU — you tolerate the interruption to capture the discount. Training on on-demand GPUs when spot would do is one of the biggest avoidable ML bills.
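
A training job that tolerates preemption could be sketched as a Kubernetes Job pinned to spot nodes. Everything specific here is an assumption for illustration: the node label shown is the one GKE puts on Spot nodes (other providers use different labels), and the image, the `--checkpoint-dir` and `--resume` flags, and the PVC name are hypothetical. The training code itself must implement the checkpoint-and-resume behavior.

```yaml
# Sketch only: label, image, flags, and PVC name are assumptions.
apiVersion: batch/v1
kind: Job
metadata:
  name: train-example                  # hypothetical job name
spec:
  backoffLimit: 20                     # generous retry budget, since interrupted pods may count as failures
  template:
    spec:
      restartPolicy: OnFailure
      nodeSelector:
        cloud.google.com/gke-spot: "true"   # GKE Spot label; provider-specific
      containers:
        - name: trainer
          image: registry.example.com/ml/trainer:1.0   # placeholder image
          # Hypothetical flags: write checkpoints to /ckpt and resume from the latest one.
          args: ["--checkpoint-dir=/ckpt", "--resume"]
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: ckpt
              mountPath: /ckpt
      volumes:
        - name: ckpt
          persistentVolumeClaim:
            claimName: training-checkpoints   # hypothetical PVC that outlives any one node
```

The design point is that checkpoints live on storage that survives the node. When the provider reclaims the spot machine, the Job controller starts a replacement pod, which picks up from the last saved checkpoint instead of from step zero.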

Serving → autoscaling and scale-to-zero. A serving endpoint can't use spot the same way (it's serving users), so its cost lever is not running GPUs you don't need. Autoscaling adds replicas under load and removes them when traffic falls. The crucial ML-specific move is scale-to-zero: when an endpoint gets no traffic, drop it to zero GPU replicas — and zero cost — then cold-start a replica when the next request arrives.

Scale-to-zero is the highest-leverage GPU cost control there is: an idle GPU endpoint left running 24/7 is pure waste. Scaling it to zero turns idle time into $0.

The trade-off is the cold start: the first request after scale-to-zero waits while a GPU spins up and the (often large) model loads into GPU memory. So scale-to-zero is ideal for spiky or dev/internal endpoints and a poor fit for latency-critical, steady traffic — a judgment call you make per endpoint. Serving frameworks like KServe (lesson 10.5) provide scale-to-zero out of the box.
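
For a concrete picture, a scale-to-zero endpoint in KServe can be sketched as below. This assumes KServe is running in its serverless (Knative-backed) mode, which is what makes zero replicas possible; the service name, model format, and storage URI are placeholders.

```yaml
# Sketch only: name, model format, and storage URI are placeholders.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sentiment-example              # hypothetical service name
spec:
  predictor:
    minReplicas: 0                     # allow the endpoint to scale to zero when idle
    maxReplicas: 3                     # cap on GPU replicas under load
    model:
      modelFormat:
        name: pytorch                  # assumed model format
      storageUri: s3://example-bucket/models/sentiment   # placeholder location
      resources:
        limits:
          nvidia.com/gpu: 1
```

Setting `minReplicas` to 1 instead keeps one warm replica and removes the cold start, at the cost of paying for that GPU around the clock. That single field is the per-endpoint judgment call described above.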

The gap teams fall into

The dominant ML-cost failure is simply idle GPUs: training on full-price on-demand when spot would do, giving each tiny service a whole card instead of a fraction, and — worst of all — leaving GPU endpoints running 24/7 serving almost no traffic. Because GPUs are the single largest line item, these mistakes don't cost a little extra; they can multiply the bill. Treat every idle GPU-second as money on fire.

:::tip Durable vs dated
"GPUs are scarce, expensive, and must be kept busy or scaled to zero," and the levers (scheduling, fractional sharing, spot for jobs, scale-to-zero for serving) are durable. The specific instance types, MIG profiles, and per-hour prices are intensely dated; they change quarterly. Reason about the levers; look up today's prices when you actually provision.
:::

Why it matters

The GPU is the dominant cost in ML infrastructure and behaves unlike CPU/memory: it's scarce, expensive, and by default claimed whole by a single pod. Running GPU workloads on Kubernetes needs the GPU Operator (so the scheduler can see GPUs), GPU node pools with taints (so you never pay GPU prices for ordinary work), and fractional GPUs (time-slicing or MIG) to pack many small workloads onto one card. The cost levers split by workload: spot/preemptible GPUs for interruptible, checkpointed training jobs, and autoscaling plus scale-to-zero for serving, where an idle endpoint left running is the most common and expensive mistake there is. With the hardware understood, the next lesson is how you actually serve a model on it — the inference patterns and model servers.

