The LLM serving stack & LLMOps

Everything so far applies to ML in general. LLMOps is what happens when the "model" is a giant large language model (LLM) — a foundation model with billions of parameters that generates text token by token. That single change ripples through the whole stack: serving gets new optimizations, a new retrieval layer (RAG) appears, "evaluation" becomes a hard problem of its own, and brand-new failure modes (hallucination, prompt injection, runaway token bills) demand guardrails and observability ordinary ML never needed. This is the "LLM twist" capstone of the chapter.

This lesson is the infrastructure/ops view. For how prompting, RAG, agents, and evaluation work conceptually, cross to the Modern AI Guide — here we run the stack.

Self-host vs API: the first fork

Before any infrastructure, one decision shapes everything: do you call a provider API (e.g. Anthropic's Claude, OpenAI), or self-host an open model (from Hugging Face, the hub for open models) on your own GPUs?

Provider API — no GPUs to manage; you pay per token (the unit of text LLMs process; cost is per token in and out). Fastest to ship; cost and data residency are the trade-offs.
Self-host — you run the model on your own GPUs (everything in 10.4–10.5), gaining control and data residency but owning the hard problem of serving a huge model efficiently. NVIDIA NIM packages optimized open models as ready-to-run inference containers to ease this.

Most teams start with an API and self-host later if scale, cost, or data control demands it. The rest of this lesson assumes self-hosting where infrastructure is involved.

LLM inference optimization: why LLMs need special serving

LLMs generate text one token at a time, each token depending on all previous ones, and the model is enormous. That makes naive serving slow and ruinously expensive, so the LLM serving stack adds optimizations beyond lesson 10.5's batching:

KV cache. As the model generates, it would otherwise recompute the attention over the entire prompt for every new token. The KV (key/value) cache stores those intermediate tensors and reuses them, so each new token is cheap. The catch: the KV cache lives in GPU memory and grows with context length — it's often the binding constraint on how many requests fit on a GPU.
Continuous batching. Plain batching (10.5) waits to assemble a fixed batch. Because LLM requests finish at different times (different output lengths), continuous batching dynamically slots new requests into the in-flight batch token-by-token, keeping the GPU saturated. It dramatically raises throughput. vLLM is the leading engine built around continuous batching plus efficient paged KV-cache memory.
Quantization. A model's weights are numbers; quantization stores them at lower precision (e.g. FP16 → INT8 or INT4), shrinking GPU memory and speeding inference, for a small accuracy cost. It can make a model fit on a smaller/cheaper GPU — a direct cost lever.

The durable LLM-serving idea: token-by-token generation of a huge model makes memory (KV cache) and throughput (continuous batching) the core constraints; quantization trades a little accuracy for big memory/cost savings.

The RAG stack: grounding the model in your data

An LLM only knows what it was trained on. RAG (retrieval-augmented generation) makes it answer using your documents by retrieving relevant passages and injecting them into the prompt. (The AI Guide covers why RAG works; here's the infrastructure.) The pipeline:

Chunking — split your documents into passages. Chunk size and overlap matter: chunks too large add noise and cost (more tokens); too small lose context and hurt retrieval. This is a real tuning knob, not a detail.
Embedding — convert each chunk into an embedding, a numeric vector whose distance encodes semantic similarity. The query is embedded the same way.
Vector indexing — store the embeddings in a vector database that finds the chunks whose vectors are nearest the query's — i.e. the most semantically relevant — and returns them to stuff into the prompt.

Vector DBs and the recall/latency/cost trade-off

Searching millions of vectors for the exact nearest neighbors is too slow, so vector DBs use approximate nearest neighbor (ANN) indexes — trading a little exactness for big speed. The two index families to know:

HNSW (Hierarchical Navigable Small World) — a graph index with excellent recall and low latency, at higher memory cost.
IVF (Inverted File) — clusters vectors and searches only the nearest clusters; cheaper memory, and you tune recall by how many clusters it probes (more probes → better recall, higher latency).

The durable RAG-infra trade-off: recall vs latency vs cost. A higher-recall index (or more probing, or smaller chunks) finds better context but costs more memory/latency. "Just throw it in a vector DB" ignores exactly the tuning that decides whether RAG is good or garbage.

Vector store options: Pinecone (managed), Weaviate, Qdrant, Chroma (developer-friendly), and pgvector (vectors inside Postgres — no new system if you already run Postgres). LangChain and LlamaIndex are the frameworks that wire chunking → embedding → retrieval → LLM together.

Evaluation: the gap that sinks LLM apps

The biggest LLMOps failure is shipping with no evaluation — judging quality by "it looked good in the demo" (vibes). LLM outputs are open-ended, so you can't assertEqual them, but you can evaluate them, and you must:

LLM-as-judge — use a strong LLM to score outputs against a rubric (is the answer relevant, faithful to the source, well-formed?). It scales evaluation to open-ended text where exact-match can't.
RAGAS — a framework with RAG-specific metrics: faithfulness (does the answer stick to the retrieved context, i.e. not hallucinate?), answer relevance, context precision/recall (did retrieval fetch the right chunks?).
Offline vs online eval. Offline: run a fixed test set of prompts before release. Online: measure quality on live traffic via feedback and sampling. (These mirror the eval-vs-monitor split from 10.3 and 10.6's drift monitoring.)

The critical move: wire offline eval into CI as a regression gate (the eval gate from lesson 10.3). A prompt or model change that drops faithfulness below a threshold fails the build and never ships. Eval that isn't tied to a gate is just a dashboard nobody reads.

Prompt management

In an LLM app, the prompt is part of the program — and a one-word prompt change can swing behavior. So prompts are versioned, reviewed, and tracked like code, each version tied to its eval scores, so changes are deliberate and rollback-able. Untracked prompts edited live in production are the LLM equivalent of editing code straight on the server.

Guardrails, PII, and prompt injection

LLMs introduce safety failures ordinary services don't have, and shipping without defenses is reckless:

Hallucination — fluent but false output. Mitigated (not eliminated) by grounding in RAG, guardrails, and the faithfulness evals above.
PII handling — user inputs and model outputs may contain personal data; you must detect and redact it, and be deliberate about what's logged or sent to a provider.
Prompt injection — adversarial text in user input or in retrieved documents hijacks the model's instructions ("ignore previous instructions and..."). It's a top LLM-app risk and needs defense in depth: input/output filtering, least-privilege tool access, and never trusting retrieved content as instructions.
Guardrails are programmatic input/output checks that block unsafe, off-topic, PII-leaking, or injected content. Guardrails AI and NVIDIA NeMo Guardrails are the named tools.

The gap to close: no guardrails, no PII handling, and no prompt-injection defense on an LLM app shipped to prod is a security incident waiting to happen. (The Modern AI Guide covers these threats in depth; here they're a required ops layer.)

Observability: tokens, latency, and cost

Ordinary observability (Chapter 6) tracks latency and errors. LLM apps add a make-or-break dimension: token, latency, and cost observability. Every call has a token count (= money), a latency, and — for multi-step agents — a whole trace of LLM calls. Without tracking these:

Cost surprises. A loop or a verbose prompt can 10× your token bill overnight; tokens are a real FinOps line item (Chapter 9). You must see cost per request, per feature, per user.
Latency surprises. Multi-step chains and retrieval add up; you need to see where time goes.
Quality blindness. You can't debug a bad answer without the full trace of prompts, retrieved chunks, and responses.

The tools: LangSmith, Langfuse, Arize Phoenix, and Helicone trace prompts/responses, count tokens and cost, and time each step across an LLM or agent app. This is also where MLOps converges with platform engineering (Chapter 7): mature orgs expose models and these guardrail/eval/observability layers as a paved-road platform service, so product teams build LLM features on a governed, observable substrate instead of reinventing it.

:::tip Durable vs dated KV cache, continuous batching, quantization, the RAG pipeline and its recall/latency/cost trade-off, eval gates, guardrails, and token/cost observability are durable LLMOps concepts. The tools — vLLM, Pinecone/Weaviate/Qdrant/Chroma/pgvector, LangChain/LlamaIndex, LangSmith/Langfuse/Phoenix/Helicone, RAGAS, NIM, Guardrails AI/NeMo — are intensely dated; this is the fastest-moving corner of the whole guide. Hold the concepts; expect the names to change yearly. :::

Why it matters

LLMOps is MLOps specialized for huge, token-by-token foundation models, and that one change reshapes the stack. Serving gains LLM-specific optimizations — KV-cache reuse (which eats GPU memory), continuous batching (vLLM, to saturate the GPU), and quantization (fit a cheaper GPU) — because memory and throughput are the binding constraints. RAG grounds the model in your data through a tunable pipeline — chunking → embeddings → a vector DB with HNSW/IVF ANN indexes — whose core trade-off is recall vs latency vs cost, the thing "just throw it in a vector DB" gets wrong. Quality demands real evaluation (LLM-as-judge, RAGAS) wired into CI as a regression gate, plus versioned prompts — not vibes. New failure modes (hallucination, PII, prompt injection) require guardrails, and token/latency/cost observability (LangSmith, Langfuse, Phoenix, Helicone) keeps LLM bills and latency from ambushing you — increasingly delivered as a platform service. That completes the chapter; the checkpoint locks it in.

Next: Chapter 10 checkpoint →

Self-host vs API: the first fork​

LLM inference optimization: why LLMs need special serving​

The RAG stack: grounding the model in your data​

Vector DBs and the recall/latency/cost trade-off​

Evaluation: the gap that sinks LLM apps​

Prompt management​

Guardrails, PII, and prompt injection​

Observability: tokens, latency, and cost​

Why it matters​