Glossary

Every key term from the guide, defined in plain English. Terms are grouped roughly by area. Where a term has different brand names across AWS, GCP, and Azure, the durable concept is defined here — see Chapter 1's translation table for the brand names.

Foundations

Cloud computing — Renting computers, storage, and networking from a provider on demand over the internet, paying only for what you use.
On-premises ("on-prem") — Running your own physical hardware in your own (or a rented) facility, instead of renting from a cloud provider.
Virtualization — Software that lets one physical computer act as many independent virtual ones; the technology that makes renting servers by the second possible.
Hypervisor — The virtualization software that creates and isolates virtual machines on a physical host.
IaaS / PaaS / SaaS — The three service models, by how much the provider manages: Infrastructure (raw VMs/disks/networks), Platform (runs your code, no servers to manage), and Software (finished apps you just use).
Capex vs opex — Capital expense (big up-front purchase, e.g. buying servers) vs operating expense (pay-as-you-go); the cloud shifts spending from capex to opex.
Shared-responsibility model — The split where the provider secures the cloud (hardware, network, virtualization) and the customer secures what's in the cloud (data, access, configuration). The line slides across IaaS/PaaS/SaaS.
Region — A geographic cluster of data centers; your choice drives latency, data-residency law, and price.
Availability zone (AZ) — One or more isolated data centers within a region, with independent power, cooling, and networking; spreading workloads across AZs survives the failure of a single facility.
Edge / point of presence (PoP) — A large tier of small caching locations near users; used by a CDN.
CDN (Content Delivery Network) — A network of edge locations that caches content near users for low latency.
Console / CLI / API — Three ways to control the cloud (web clicks / typed commands / programmatic requests); all call the same underlying API.
ClickOps — Provisioning infrastructure by manually clicking the console; fine for learning, an anti-pattern for production.
SDK (Software Development Kit) — A library that lets your application code call a cloud's API.

Core services

Compute — The primitive that runs your code: VMs, containers, or serverless functions.
Virtual machine (VM) — A complete virtualized computer with its own OS; most control, most to manage.
Container — A package of an app plus its dependencies (no full OS), sharing the host kernel; lightweight, portable, "runs the same everywhere."
Serverless / Functions as a Service (FaaS) — Upload only your code; the provider runs it on demand, scales to zero, and bills per execution.
Cold start — The added latency when a serverless function runs after being idle and must initialize.
Object storage — Files stored by name over HTTP in buckets; effectively unlimited in scale, very durable, cheap; the cloud's storage workhorse.
Block storage — A raw virtual disk attached to one VM, used like a physical drive.
File storage — A shared network filesystem many machines can mount at once.
Storage tier / class — Pricing levels trading retrieval speed against storage cost (hot vs cold/archive).
Bucket — A container for objects in object storage.
Database — Software for storing, querying, and updating structured data reliably.
SQL / relational database — Stores data in related tables, queried with SQL; strong ACID consistency and an enforced schema.
NoSQL — Non-relational databases (document, key-value, etc.) trading some guarantees for flexibility and horizontal scale.
ACID — Atomicity, Consistency, Isolation, Durability; the transaction guarantees that keep relational data correct.
Managed service — One the provider operates for you (patching, backups, failover), e.g. a managed database.
VPC (Virtual Private Cloud) — Your own isolated private network inside the provider.
Subnet — A sub-range of a VPC's IP address space; public subnets have a route to the internet, private subnets don't.
Security group — A firewall attached to a resource that lists the allowed inbound and outbound traffic; anything no rule allows is denied (default-deny).
Load balancer — Distributes incoming traffic across multiple healthy servers and gives users one stable address.
DNS (Domain Name System) — Translates human domain names into the IP addresses machines use.
IAM (Identity and Access Management) — The system controlling who (or what) can do what to which resource.
Authentication vs authorization — Proving identity vs deciding what that identity is allowed to do.
Policy — A document granting or denying actions on resources to an identity.
Role / service account — A machine identity; software assumes a role to get permissions, avoiding long-lived keys.
Least privilege — Granting each identity only the minimum permissions it needs, to shrink the blast radius of a compromise.
Blast radius — How much damage a single compromised identity or component can cause.
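
As a rough illustration of how the Policy and Least privilege entries above fit together, here is a minimal Python sketch of policy evaluation. The Statement type, the is_allowed function, the wildcard matching, and the ordering "explicit deny beats allow, no match means denied" are simplifying assumptions made for this example; real providers evaluate more inputs (conditions, permission boundaries, organization-level guardrails).

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Statement:
    """One hypothetical policy statement: an effect on (action, resource) patterns."""
    effect: str      # "Allow" or "Deny"
    action: str      # e.g. "storage:GetObject" or "storage:*"
    resource: str    # e.g. "bucket/reports/*"


def is_allowed(statements, action, resource):
    """Simplified evaluation: explicit deny wins, then any allow, else deny."""
    matching = [
        s for s in statements
        if fnmatchcase(action, s.action) and fnmatchcase(resource, s.resource)
    ]
    if any(s.effect == "Deny" for s in matching):
        return False  # an explicit deny overrides every allow
    # With no matching allow, any() returns False: denied by default.
    return any(s.effect == "Allow" for s in matching)


# A least-privilege policy: read-only on one prefix, with one carve-out.
report_reader = [
    Statement("Allow", "storage:GetObject", "bucket/reports/*"),
    Statement("Deny", "storage:*", "bucket/reports/salary/*"),
]

print(is_allowed(report_reader, "storage:GetObject", "bucket/reports/q1.csv"))
print(is_allowed(report_reader, "storage:PutObject", "bucket/reports/q1.csv"))
print(is_allowed(report_reader, "storage:GetObject", "bucket/reports/salary/x.csv"))
```

The three calls cover the three outcomes: an explicit allow, a request nothing allows, and a request that an allow matches but an explicit deny overrides. The narrower the allow patterns, the smaller the blast radius if this identity is compromised.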

Infrastructure as Code

Infrastructure as Code (IaC) — Defining infrastructure in version-controlled text files managed by a tool, instead of clicking or one-off commands.
Declarative vs imperative — Describing the desired end state (declarative) vs step-by-step instructions (imperative); modern IaC is declarative.
Idempotent — Running the same operation repeatedly converges on the same end state without duplicating work.
Terraform / OpenTofu — The dominant declarative IaC tool (OpenTofu is its open-source fork); configurations are written in HCL.
HCL (HashiCorp Configuration Language) — The declarative language used to write Terraform configurations.
Provider (Terraform) — A plugin that teaches Terraform how to talk to a specific platform's API.
Resource (Terraform) — A declared piece of infrastructure (a VM, bucket, network).
plan / apply — plan is a dry run showing exactly what will change; apply executes it.
State — Terraform's record mapping your code's resources to real cloud resource IDs; powerful, but dangerous if lost, corrupted, or leaked, since it can contain secrets in plain text.
Remote state & locking — Storing state centrally so a team shares one source of truth, with locks preventing concurrent corruption.
Drift — When real infrastructure changes outside the code (e.g. a manual console tweak), diverging from the declared state.
Module — A reusable, parameterized package of infrastructure with inputs and outputs; infrastructure's version of a function.
Pulumi / CDK — IaC tools that use general-purpose programming languages instead of a config language.
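
The ideas of declarative configuration, plan / apply, and idempotency defined above can be sketched in a few lines of Python. This is a toy model, not how Terraform is implemented: the plan and apply functions, the resource names, and the use of a single dictionary to stand in for both the state file and the real cloud are all assumptions made for illustration.

```python
def plan(desired, state):
    """Dry run: list the changes needed to make `state` match `desired`."""
    changes = []
    for name, config in sorted(desired.items()):
        if name not in state:
            changes.append(("create", name))
        elif state[name] != config:
            changes.append(("update", name))
    for name in sorted(state):
        if name not in desired:
            changes.append(("delete", name))
    return changes


def apply(desired, state):
    """Execute the plan against `state` (a stand-in for real cloud API calls)."""
    for op, name in plan(desired, state):
        if op == "delete":
            del state[name]
        else:
            state[name] = dict(desired[name])
    return state


# Hypothetical resources: the code declares the end state, not the steps.
desired = {"bucket.logs": {"versioning": True}, "vm.web": {"size": "small"}}
state = {"vm.web": {"size": "large"}, "vm.old": {"size": "small"}}

print(plan(desired, state))   # one create, one update, one delete
apply(desired, state)
print(plan(desired, state))   # idempotent: a second plan finds nothing to do
```

Because the input describes an end state, running apply twice is harmless. A manual change to state after the apply (drift) would simply show up as a new entry in the next plan.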

Containers & Kubernetes

Image — A static, read-only blueprint containing an app and everything it needs to run.
Dockerfile — A text recipe of build instructions for assembling an image.
Layer — One cached, shareable step of an image build; makes builds and pulls fast.
Registry — A central store for distributing images; you push to it and pull from it (e.g. Docker Hub, ECR/Artifact Registry/ACR).
Tag — A label identifying an image version (name:tag); should be immutable, not latest, in production.
Orchestrator — Software that manages containers across many machines (scheduling, healing, scaling, networking).
Kubernetes (K8s) — The standard open-source container orchestrator; it exposes the same API on every major cloud, so workloads stay largely portable.
Cluster / node — A set of machines running Kubernetes; each machine is a node (control-plane or worker).
Pod — The smallest deployable unit in Kubernetes (≈ one running app instance); disposable and ephemeral.
Deployment — Declares which image and how many replicas; reconciles to give self-healing, scaling, and rolling updates.
Replica — One of the identical pods a deployment maintains.
Rolling update / rollback — Gradually replacing pods with a new version (or reversing to a previous one).
Service (Kubernetes) — A stable name and address that load-balances over the current healthy pods.
Ingress — Rules routing outside HTTP traffic to services inside the cluster.
ConfigMap — Holds non-sensitive configuration injected into pods.
Secret (Kubernetes) — Holds sensitive values; base64-encoded by default (not encrypted) — needs encryption at rest / a secret manager.
Control plane — The cluster's brain: API server (front door), etcd (memory/desired state), scheduler (placement), controllers (reconciliation).
kubelet — The agent on each worker node that runs pods and reports their health.
Reconciliation loop — The continuous compare-desired-vs-actual-and-act loop behind Kubernetes (and Terraform, and GitOps).
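
The Secret (Kubernetes) entry above notes that base64 is an encoding, not encryption. A short Python sketch makes the point concrete; the secret value is hypothetical, and the snippet uses only the standard-library base64 module rather than any Kubernetes client.

```python
import base64

secret_value = "s3cr3t-db-password"  # hypothetical sensitive value

# What a Secret manifest's `data` field holds: plain base64 text.
encoded = base64.b64encode(secret_value.encode("utf-8")).decode("ascii")

# Anyone who can read the manifest can reverse it; no key is involved.
decoded = base64.b64decode(encoded).decode("utf-8")

print(encoded)
print(decoded == secret_value)
```

Since decoding needs no key, protecting the value depends on access control, encryption at rest for the cluster's datastore, or an external secret manager.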

Delivery, operations, security, FinOps, ML & scale

Chapters 5–11 introduce a large, fast-moving vocabulary; the terms below are merged into one roughly alphabetical list across CI/CD & GitOps, observability & SRE, platform engineering, cloud security, FinOps, MLOps/LLMOps, and scaling.

ABAC — Attribute-Based Access Control: access decided from attributes/tags evaluated at request time; scales to many teams with one rule, harder to audit.
Action item — A concrete, owned, tracked follow-up task from a postmortem (a fix, a guardrail, a new alert or runbook) that turns a lesson into durable change.
Admission control — A Kubernetes deploy-time gate that rejects anything not meeting policy (e.g. an unsigned image or a root pod).
Adoption rate — The fraction of teams/services actually using the platform or golden path; the single most important platform metric because non-adoption is the #1 failure mode.
Alert fatigue — The desensitization that comes from too many noisy, non-actionable alerts, causing responders to ignore or miss the alert that actually matters.
Anomaly detection (cost) — Automated flagging of unusual spend (a spike from a misconfiguration, leaked credential, or runaway pipeline) so a surprise bill is caught in hours, not at month-end.
Approximate nearest neighbor (ANN) — Search that trades exactness for speed to find the most-similar vectors quickly; tuned via the recall-vs-latency-vs-cost trade-off.
Architecture Decision Record (ADR) — A short, numbered, append-only document capturing one significant decision: its context, the choice, and the trade-offs/consequences.
Argo CD — An application-centric GitOps controller for Kubernetes with a rich web UI showing desired-vs-live state.
Argo Rollouts — A Kubernetes controller (paired with Argo CD) providing canary and blue-green deploys with automated metric analysis and abort.
Artifact — The built, versioned output of a pipeline (e.g. a container image); built once and promoted unchanged through every environment.
Artifact signing — Cryptographically signing an artifact so anyone can verify it is authentic and untampered (e.g. Sigstore/Cosign, often keyless).
Asynchronous processing — Doing slow or non-urgent work in the background (via a queue and workers) instead of making the user wait inline.
Auto-instrumentation — Capturing telemetry without manually editing application code, e.g. via a language agent or eBPF probe that observes the running process.
Automated metric analysis — Comparing a new version's metrics against SLO thresholds during a rollout to automatically promote or abort it.
Automatic abort — Shifting traffic back to the old version automatically when a canary regresses past its thresholds, without waiting for a human.
Autoscaling — Automatically adding capacity when load rises and removing it when load falls, so you pay for what you use and still survive spikes.
AWS Certified DevOps Engineer - Professional (DOP-C02) — An advanced AWS certification covering CI/CD, automation, and operations on AWS.
Backstage — The open-source CNCF developer-portal framework (originally from Spotify); the de-facto open standard for building a portal, providing a catalog, software templates, and TechDocs.
Batch inference — Running predictions over a large dataset offline on a schedule, optimized for throughput and cost rather than latency.
Blameless postmortem — A written retrospective after an incident that focuses on systemic causes and fixes rather than blaming individuals, so the organization actually learns.
Blue-green deployment — Running two full environments (blue=current, green=new) and flipping all traffic at once, with instant rollback by flipping back.
Budget alert — A per-team/service spend threshold that notifies owners as they approach or exceed it, catching runaway cost early.
Build cache — Storing the expensive, rarely-changing parts of a build (downloaded dependencies, compiled outputs, image layers) keyed on a fingerprint and restoring them on the next run, so only what actually changed is rebuilt; the key must change exactly when the cached thing should.
Build-vs-deploy distinction — Building an artifact happens once in CI; deploying is just moving that exact artifact, never remaking it.
BuildKit — The modern container-build engine under today's docker build/buildx: it parallelizes independent steps, caches layers intelligently (including across CI machines), and produces more efficient, reproducible images.
Cloud Native Buildpacks — A way to build a container image directly from source code with no Dockerfile: the buildpack detects the language, assembles a secure-by-default image, and can "rebase" a patched OS layer under many app images without rebuilding them (e.g. Paketo, Heroku).
Composite action — A reusable unit that packages several pipeline steps as one callable step (GitHub Actions); a building block for DRY pipeline-as-code alongside reusable workflows.
Burn rate — How fast you are consuming the error budget relative to the SLO window; a burn rate of 1 spends the whole budget exactly over the window, higher means faster.
Burn-rate alerting — Alerting on how fast the error budget is being consumed (fast burn = page now, slow burn = ticket), which fires on user-visible severity instead of arbitrary thresholds.
Cache — A fast, temporary store holding the result of an expensive operation so it can be served again without redoing the work; trades data freshness for speed.
Cache invalidation — Deciding when a cached copy is stale and must be refreshed or discarded after the underlying data changes; a famously hard problem.
Canary deployment — Sending a small percentage of real traffic to the new version, watching its metrics, and widening or aborting based on health; the safest general-purpose strategy.
CAP theorem — During a network partition a distributed datastore can preserve only Consistency or Availability, not both; partition tolerance itself is unavoidable.
Capacity planning — Estimating how much compute, memory, storage, and throughput a system needs for expected load, plus headroom, before that load arrives.
Cardinality — The number of distinct values a label/dimension can take; high-cardinality labels (like user_id or request_id) multiply the number of time series and can explode storage cost.
CDN (content delivery network) — A network of edge locations that cache static assets physically close to users for speed.
Change failure rate (CFR) — A DORA metric: the fraction of deployments that cause a failure needing remediation.
Chargeback — Internally billing each team's own budget for its spend, hitting its P&L; stronger accountability than showback but heavier and more political.
Chunking — Splitting source documents into passages before embedding them for RAG; chunk size and overlap trade retrieval recall against context noise and cost.
CI/CD — Continuous Integration (auto build/test every commit) and Continuous Delivery/Deployment (auto path to production).
CIEM — Cloud Infrastructure Entitlement Management: tooling that finds over-permissioned identities and privilege-escalation paths and recommends trims.
Circuit breaker — A wrapper that trips open after repeated failures to fail fast for a cooldown, giving a downed dependency room to recover before cautiously retrying.
CKA / CKAD / CKS — Hands-on, performance-based Kubernetes certifications: Administrator, Application Developer, and Security Specialist.
Cloud Engineer — The generalist role designing, building, and operating cloud infrastructure (networking, compute, storage, IaC).
Cluster Autoscaler — Adds underlying nodes (machines) when pending pods, such as those the HPA wants, don't fit on the existing ones, and removes nodes that sit underused.
CNAPP — Cloud-Native Application Protection Platform: consolidates CSPM + CWPP + CIEM (+ supply-chain) to see risk that chains across layers (Wiz, Prisma Cloud, Microsoft Defender for Cloud).
CNI — Container Network Interface: the cluster's networking plugin (e.g. Cilium) that must support NetworkPolicies for them to take effect.
Cognitive load — The total amount a person must hold in their head to do their job; platform engineering aims to drive accidental (plumbing) load toward zero so developers can spend their budget on intrinsic (business) work.
Committed Use Discount (CUD) — GCP's commitment-based discount program (Azure's equivalent is Reservations), trading a one- or three-year usage promise for a lower rate.
Concept drift — The real-world relationship between inputs and the correct output changing over time, so a once-accurate model becomes wrong.
Container registry — A storage and distribution service for container images (e.g. ECR, Artifact Registry, ACR, Harbor); other artifact types use repositories like Artifactory or Nexus.
Container/image scanning — Scanning a built image for CVEs in OS packages and app dependencies (Trivy, Grype, Snyk).
Content-addressable — Identified by a hash of the content itself, so the identifier changes if any byte changes.
Continuous batching — An LLM-serving technique that dynamically merges incoming requests into in-flight GPU batches token-by-token, dramatically raising throughput (e.g. vLLM).
Continuous compliance — Turning each compliance control into an automated, continuously-evaluated check that emits evidence on demand and detects drift, instead of a quarterly snapshot.
Continuous Delivery (CD) — Every change that passes the pipeline is automatically made release-ready; shipping to production is a single safe manual click.
Continuous Integration (CI) — Merging work into a shared main branch frequently, with every merge automatically built and tested, to catch integration problems early and small.
Continuous profiling — Continuously sampling where a running program spends CPU/memory (the "fourth pillar"), to find the exact function or line responsible for resource use in production.
Continuous training (CT) — Automatically retraining and redeploying a model on a trigger such as a schedule, new data, or detected drift; the ML-specific third leg added to CI/CD.
Cost allocation — Attributing every dollar of spend to a meaningful owner (team, service, environment, customer); an unowned bill is an unmanaged bill.
Cost as a non-functional requirement — Treating cost like latency, reliability, or security: a property you design and estimate for up front, weighing architectural alternatives before building.
Cortex — An older CNCF multi-tenant, horizontally-scalable Prometheus backend (the ancestor of Mimir); a long-term metrics-storage option still seen in multi-tenant platforms.
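Coverage (FinOps) — The share of eligible usage billed under a commitment-based discount (reservations, savings plans, CUDs) rather than at on-demand rates; a core metric of FinOps, the discipline of bringing financial accountability to variable cloud spend, making cost a first-class, shared engineering concern run as a continuous inform→optimize→operate loop.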
Crossplane — A tool that uses Kubernetes as a control plane to provision and continuously reconcile cloud infrastructure; its Composition bundles many real cloud resources into one simple custom resource developers request.
CSPM — Cloud Security Posture Management: tooling that continuously scans the cloud for misconfigurations and drift (detective).
Custom Resource Definition (CRD) — An extension that teaches the Kubernetes API a new kind of object (e.g. a PostgresDatabase) so developers can declare it like any built-in resource.
CVE — Common Vulnerabilities and Exposures: the public catalogue of known security flaws; "40 CVEs in an image" means 40 known holes.
CVE management — The ongoing practice of scanning against SBOMs, prioritizing by severity/exploitability, and patching continuously.
CWPP — Cloud Workload Protection Platform: protection for the workloads/containers themselves, including runtime.
Data drift — The input data distribution in production shifting away from the training distribution, degrading model quality even though the code is unchanged.
Decentralized cost ownership — The principle that the engineers who create cost own it (they alone can fix it well), enabled — not gatekept — by a central FinOps function that supplies visibility and tooling.
Default deny — The IAM rule that an action with no matching allow is denied; permissions are opt-in.
Defense in depth — Multiple independent layers of control so one failure doesn't cause a breach.
Deploy vs release — Deploying is putting code on servers; releasing is turning the behavior on for users; feature flags make these separate events.
Deployment frequency — A DORA metric: how often you deploy to production; higher (smaller, more frequent) is better.
Designing for failure — Building a system on the assumption that components will constantly fail, so the system as a whole keeps working invisibly to users.
DevEx (developer experience) — A frame broadening DORA to developer flow, feedback loops, and cognitive load.
DevEx survey — A short, recurring questionnaire measuring how the platform feels to developers (ease, unblocking, pain points); a leading indicator of adoption.
DevOps Engineer — A role emphasizing the delivery pipeline and automation, bridging development and operations.
DevSecOps — Folding security into DevOps so checks are automated and continuous, not a separate end-stage gate.
DevSecOps / Cloud Security Engineer — A role weaving security through the pipeline and infrastructure (shift-left, compliance, IAM, supply chain).
Disaster recovery (DR) — The plan for recovering from a major outage or data-loss event, defined by RTO and RPO.
Distroless image — A minimal base image containing only the app and its runtime — no shell or package manager — reducing CVEs and attacker tooling.
DORA metrics — Four research-backed software-delivery measures (deployment frequency, lead time for changes, change-failure rate, time to restore service) used as a platform's outcome scoreboard.
Drift (GitOps) — Any divergence between the desired state declared in Git and the actual state running in the cluster.
Drift detection and correction — A GitOps agent continuously noticing drift and either auto-reverting it (self-heal) or flagging it (OutOfSync); never silently tolerating manual changes.
Durable vs dated — The career strategy of investing deeply in long-lived fundamentals (Linux, networking, distributed systems, declarative/IaC mindset, decision-making) while holding volatile specifics (tool names, UIs, versions) loosely.
DVC — Data Version Control: Git-like versioning for datasets and model artifacts that keeps large files out of the Git repo while tracking their versions.
Dynamic secret — A fresh short-lived credential generated on demand and auto-revoked (e.g. a 1-hour database login from Vault), so no permanent secret exists.
eBPF — A Linux kernel technology that safely runs sandboxed programs in the kernel; used by observability tools (e.g. Grafana Beyla, Pixie) to auto-instrument apps with zero code changes.
Egress — Data flowing out of a cloud provider (to the internet, or across availability zones/regions), charged per GB; one of the most commonly overlooked cost drivers. Ingress (data in) is usually free.
Egress control — Restricting outbound traffic (what a workload can connect to), used to contain a breach.
Embedding — A numeric vector representation of text (or other data) whose distance encodes semantic similarity; the basis of vector search in RAG.
Encryption at rest — Keeping stored data encrypted on disk so a stolen disk or snapshot is useless without the key; does not stop a public-access misconfiguration.
Environment — A complete, isolated running copy of a system (dev, staging, production); promotion advances a change up this ladder.
Ephemeral environment (preview environment) — A complete, temporary copy of an app spun up on demand (often per pull request), used, then automatically destroyed so it costs nothing afterward.
Error budget — The allowed amount of failure implied by an SLO (100% minus the target, e.g. 0.1% for a 99.9% SLO); a quantified budget you can spend on releases, experiments, and risk.
Error-budget policy — An agreed rule for what happens when the budget is healthy (ship features) vs exhausted (freeze risky changes and prioritize reliability).
Ephemeral build runner — A fresh, throwaway machine spun up per CI job and destroyed afterward, guaranteeing a clean identical starting state every run and a tidy security boundary between builds.
Escalation policy — The predefined chain of who gets paged next, and after how long, if the primary on-call doesn't acknowledge or resolve an incident.
Eventual consistency — A model where replicas converge to the same value over time, so a read may briefly see stale data; the availability-favoring (AP) instinct.
Exemplar — A pointer attached to a metric data point that links it to a specific example trace, letting you jump from "latency spiked" on a graph to the actual slow request.
Experiment tracking — Recording every training run's code, data, hyperparameters, and metrics so results are reproducible and comparable (e.g. MLflow, Weights & Biases).
Explicit deny — An IAM rule that overrides any allow; how absolute exceptions are enforced.
Exponential backoff — Waiting progressively longer after each failed retry (1s, 2s, 4s…) to avoid hammering a struggling dependency.
External Secrets Operator (ESO) — A pattern where Git holds only a reference and an in-cluster operator fetches the real secret from an external secrets manager.
Fallacies of distributed computing — The false assumptions (network is reliable, fast, free, etc.) that trip up engineers new to distributed systems.
Feature flag (feature toggle) — A runtime switch that decouples deploying code from releasing behavior, letting you turn a feature on gradually without redeploying.
Feature store — A central system that computes, stores, and serves model input features consistently to both training (offline) and serving (online), preventing training-serving skew (e.g. Feast).
Federation (OIDC) — Configuring the cloud to trust an external identity provider's signed tokens instead of storing a separate static key.
FinOps Certified Practitioner — A certification for the cloud cost-management discipline (FinOps).
FinOps lifecycle — The continuous three-phase loop: Inform (visibility & allocation), Optimize (rightsizing, commitments, anomaly detection), Operate (governance, budgets, culture).
Flagger — A progressive-delivery operator (paired with Flux) that automates canary/blue-green with metric analysis over a mesh or ingress.
Flux CD — A Git-native, composable set of GitOps controllers for Kubernetes.
FOCUS — The FinOps Open Cost and Usage Specification, an open standard normalizing billing data from AWS, GCP, Azure, and SaaS into one common schema.
Fractional GPU — Sharing one physical GPU across multiple workloads (via time-slicing or NVIDIA MIG partitioning) so small jobs don't each waste a whole expensive card.
GitFlow — A branching model with many long-lived branches (develop, feature, release, hotfix), suited to infrequent versioned releases and at odds with continuous integration.
GitOps — A model where the declarative desired state of a system lives in Git and an in-cluster agent continuously reconciles the live system to match it.
Golden path (paved road) — An opinionated, well-supported, secure-by-default, self-service workflow for a common task that makes the right thing the easy thing; a road, not a wall (off-road is allowed but unsupported).
Goldilocks — A tool that runs the VPA in recommendation mode to show, per workload, what Kubernetes requests/limits should be set to.
GPU — A graphics processing unit; the massively parallel accelerator that makes ML training and inference feasible, and the dominant cost constraint in ML infrastructure.
Graceful degradation — Shedding a non-critical feature when its dependency fails so the core system keeps working, rather than crashing entirely.
Guardrail (org policy) — An organization-wide rule no account can exceed (e.g. AWS SCP, Azure Policy, GCP Org Policy).
Guardrails — Programmatic input/output checks on an LLM app that block unsafe, off-topic, PII-leaking, or policy-violating content (e.g. Guardrails AI, NeMo Guardrails).
Hallucination — An LLM producing fluent but false or unsupported content; mitigated with grounding (RAG), guardrails, and evaluation rather than eliminated.
Harness — A commercial continuous-delivery platform that bundles deployment, feature flags, and AI-assisted canary verification as a managed product; the build-vs-buy counterpart to running Argo/Spinnaker yourself.
HashiCorp Certified: Terraform Associate — A certification validating core Infrastructure-as-Code skills with Terraform.
Head sampling — Deciding whether to keep a trace at its start, before the outcome is known; cheap but blind, so it can drop the rare error traces you most wanted.
Honeycomb — A commercial observability backend specialized in high-cardinality, wide-event debugging and fast exploratory querying.
Headroom — A safety margin of spare capacity (commonly 20–50%) reserved for spikes and the unexpected.
Helm — A Kubernetes templating-and-packaging tool whose parameterized charts are filled in per environment via values files.
HNSW — Hierarchical Navigable Small World, a graph-based ANN index with excellent recall and low latency at higher memory cost.
Horizontal Pod Autoscaler (HPA) — The Kubernetes autoscaler that adds or removes pods based on a metric such as CPU or request rate, automating horizontal scaling.
Horizontal scaling (scaling out) — Gaining capacity by adding more machines and spreading load across them; effectively uncapped and resilient, at the cost of coordination complexity.
Hypervisor — The software layer that slices one physical machine into many virtual machines; part of the provider's responsibility.
IaC scanning — Scanning Terraform/manifests for misconfigurations (e.g. a public bucket) before apply (Checkov, tfsec, Terrascan).
Humanitec — The canonical commercial "platform orchestrator": it reads a workload spec plus the target environment and generates the environment-specific deployment configuration, so the same app lands correctly in dev/staging/prod without per-environment copy-paste.
Idempotency — Designing an operation so that performing it multiple times has the same effect as performing it once; what makes retries safe.
Idempotency key — A unique ID attached to a request so the server can detect and ignore duplicate executions.
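Identity (IAM) — A principal (a human user, a group, or a workload) that authenticates and is then granted permissions; not to be confused with a policy, the document that grants or denies a set of (action, resource) pairs to an identity, optionally under conditions.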
Image digest — A content-addressable identifier (a hash of the image's exact bytes, e.g. sha256:…) that refers to one immutable image forever; the durable rule is deploy by digest.
Image provenance — Proof that a running image is exactly the one your pipeline built and signed.
Incident — An unplanned disruption or degradation of a service that requires an urgent, coordinated response.
Incident commander (IC) — The single person who coordinates an incident response — directing work and communication — without necessarily doing the hands-on fixing.
Incident leadership — Calmly coordinating a response when production is down; a soft skill that strongly influences seniority and promotion.
Inference server — A specialized server optimized to run model inference efficiently (batching, GPU scheduling, multi-model hosting), e.g. NVIDIA Triton or vLLM.
Infracost — A tool that estimates the monthly cost delta of an Infrastructure-as-Code change and comments it directly in the pull request, shifting cost visibility left to design/PR time.
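Internal Developer Platform (IDP) — The internal self-service layer that packages cloud complexity so product teams can ship safely without becoming cloud experts. Whether to build or buy one is decided by org size, constraint uniqueness, and TCO: assembling it yourself gives control and fit but takes months to set up and years to maintain, while adopting a product gives speed at the cost of flexibility and a vendor dependency.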
Internal Developer Portal — The UI surface developers open to discover and use what the platform offers (catalog, templates, docs) — the "storefront"; not the platform itself.
IRSA — IAM Roles for Service Accounts (AWS EKS): binds a pod's Kubernetes service account to an IAM role via OIDC, with no stored key.
IVF — Inverted File index, an ANN method that clusters vectors and searches only the nearest clusters; cheaper memory, tunable recall via how many clusters it probes.
Jaeger — A long-standing CNCF distributed-tracing backend with its own UI; a common standalone trace store.
Jitter — Adding a small random delay to retries so many clients don't retry in lockstep; de-synchronizes a thundering herd.
Job — A concrete unit of work within a stage; jobs in the same stage often run in parallel.
Just-in-time access (JIT) — Granting elevated access only for a short window that auto-expires, instead of standing permanent access.
JWT — JSON Web Token: a cryptographically signed JSON blob asserting an identity's claims.
k6 — A modern, developer-friendly load-testing tool whose tests are written in JavaScript and run easily in CI.
Karpenter — A Kubernetes node autoscaler that provisions, bin-packs, and right-types nodes (including cheaper/spot instances) to fit pending pods, tearing down emptied nodes.
KMS — Key Management Service: a managed service that creates and guards encryption keys and performs encrypt/decrypt without exposing raw key material.
Known-unknowns vs unknown-unknowns — Failures you anticipated and built dashboards/alerts for (monitoring) vs novel failures you never predicted and must investigate by querying high-dimensional data (observability).
KServe — A Kubernetes-native model-serving framework providing standardized inference, autoscaling, and scale-to-zero.
Kubeflow — A Kubernetes-native platform for ML pipelines, training, and serving.
Kubernetes cost allocation — Splitting a shared cluster's single bill across namespaces/teams/pods using each pod's requests and usage, while accounting for idle/unallocated capacity.
Kubernetes GPU Operator — Automation that installs and manages GPU drivers, runtime, and device plugins so Kubernetes can schedule GPU workloads.
Kustomize — A patch-and-overlay tool (built into kubectl) that produces per-environment manifests from a shared base without a templating language.
KV cache — The cached key/value attention tensors an LLM reuses across generated tokens so it doesn't recompute the whole prompt each step; a major driver of GPU memory use and serving speed.
Kyverno — A Kubernetes-native policy-as-code engine whose policies are written as ordinary Kubernetes YAML.
Layer 4 (L4) load balancing — Routing raw TCP/UDP connections by IP and port without inspecting content; very fast and protocol-agnostic but content-blind.
Layer 7 (L7) load balancing — Application-layer routing that understands HTTP, so it can route by URL path, headers, cookies, or hostname (and terminate TLS); smarter but slightly slower than L4.
Lead time for changes — A DORA metric: how long it takes a commit to reach production; shorter is better.
Least privilege — Granting each identity the minimum permissions it needs and no more, to shrink the blast radius of any compromise.
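Limits (Kubernetes) — The hard ceiling on the CPU and memory a container may use: CPU above the limit is throttled, and memory above the limit gets the container killed. Not to be confused with requests, which are the capacity a pod reserves on a node whether or not it uses it; requests are effectively what you pay for and are chronically set far above real usage.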
LLM observability — Tracing prompts, responses, tokens, latency, and cost across an LLM/agent app to debug quality and control spend (e.g. LangSmith, Langfuse, Arize Phoenix, Helicone).
LLM-as-judge — Using a strong LLM to score another model's outputs against a rubric, enabling automated, scalable evaluation of open-ended responses.
LLMOps — MLOps specialized for large language models: serving foundation models, RAG pipelines, prompt versioning, eval gates, guardrails, and token/latency/cost observability.
Load balancer — A single front door that distributes incoming requests across many healthy backend servers, health-checking them and routing around failures.
Load testing — Throwing simulated traffic at a system in a safe environment to discover where it breaks before real users do.
Locust — A load-testing tool whose tests are written in Python, with a live web dashboard of the run.
Log — A timestamped, immutable record of a discrete event ("what happened, when"); richest when structured (key-value/JSON) so it can be filtered and correlated.
Loki — A lightweight, Prometheus-style log store that indexes labels rather than full text; chosen for cheap logs that join cleanly to metrics.
Long-lived static key — A durable, portable, bearer cloud credential that works indefinitely from anywhere; the #1 source of cloud credential leaks.
Long-term metrics storage — A horizontally-scalable, object-storage-backed tier placed behind Prometheus to overcome its single-node and short-retention limits, giving a global query view and durable, long-range history (Thanos, Mimir, Cortex, VictoriaMetrics).
Main / trunk — The primary branch a team ships from; in trunk-based development everyone works off this one line of code.
Managed vs self-hosted — The decision to use a provider-operated service or run it yourself; default to managed to offload patching, backups, and failover.
Merge hell (integration hell) — The big, conflict-ridden merge that results when a branch lives apart from main too long.
Merge queue (merge train) — A queue that re-runs the pipeline against the real future state of main before merging each change, keeping main always green.
Message queue — A buffer that holds units of work so they can be processed asynchronously by separate workers; absorbs traffic spikes, decouples components, and re-delivers failed work.
Meter — Anything a cloud provider counts and bills (VM-seconds, GB stored, GB transferred, API calls); a cloud bill is the sum of every meter multiplied by its unit rate.
Metric — A numeric measurement of some aspect of a system sampled over time (e.g. request rate, error count, latency), cheap to store and ideal for trends, dashboards, and alerts.
Microservices — An architecture of many small, independently-deployed services; solves an organizational scaling problem at the cost of distributed-systems complexity.
Mimir — A horizontally-scalable, Prometheus-compatible long-term metrics backend from Grafana Labs, designed from the ground up for very high cardinality and huge series counts.
Misconfiguration — A customer-side setting left wrong (public bucket, over-permissive role, public database) that causes the large majority of real cloud breaches.
MLOps — The practice of applying cloud-engineering discipline (reproducibility, automation, CI/CD, observability, cost control) to the machine-learning lifecycle of data, models, and code.
MLOps / LLMOps — Applying cloud-engineering discipline to the machine-learning (and LLM) lifecycle: data/model pipelines, GPU scheduling, and inference serving.
Model registry — A versioned catalog of trained model artifacts plus their metadata (data, code, metrics, stage), giving you reproducible deploys and instant rollback.
Model serving — Exposing a trained model behind an API so applications can request predictions; handled by model servers like KServe, BentoML, Seldon Core, or NVIDIA Triton.
Monitoring — Watching a system against predefined questions and known failure modes (dashboards and alerts you set up in advance); answers "known-unknowns" but not novel problems.
Monolith — A single deployable application; simpler to build, test, deploy, and debug, and the sensible default for small teams and modest scale.
mTLS — Mutual TLS: both sides of a connection present and verify certificates, mutually authenticating and encrypting traffic.
MTTR (Mean Time To Restore) — A DORA metric: how fast you recover when something breaks; shorter is better.
MTTR / MTTD — Mean Time To Resolve / Mean Time To Detect; common measures of how quickly incidents are caught and fixed.
Multi-AZ — Running across several availability zones within one region to survive a single data center failure; the low-cost, low-latency default for production.
Multi-region — Running across geographically distant regions to survive a whole-region outage or serve global users; dramatically more complex and expensive.
Multi-tenancy — Running many tenants (teams, apps, environments) on shared platform infrastructure; requires deliberate isolation and guardrails to avoid noisy neighbors, blast radius, and runaway cost.
Mutable tag — An image label (like :1.4.0 or :latest) that can be re-pointed to a different image, breaking reproducibility.
Network partition — A failure where some machines in a cluster cannot communicate with others.
NetworkPolicy — A Kubernetes object declaring which pods may talk to which (and on which ports); the durable pattern is default-deny then allow.
Non-root container — A container configured to run as an unprivileged user, so an app breakout doesn't hand the attacker root.
NVIDIA NIM — Packaged, optimized inference microservices for deploying foundation models as containers.
NVIDIA Triton — An open-source inference server that hosts many models across frameworks with batching and GPU scheduling.
Observability — The ability to understand a system's internal state by examining the data it emits (logs, metrics, traces), so you can ask new, unanticipated questions about its behavior — not just watch predefined dashboards.
Offline evaluation — Testing a model or prompt against a fixed labeled dataset before release, ideally wired into CI as a regression gate.
OIDC federation — Keyless cloud authentication where the pipeline proves its identity (a signed token scoped to repo/branch) and receives a short-lived credential, so no long-lived key is stored.
On-call — A rotation in which an engineer is responsible for responding to production alerts during a shift, with defined escalation if they can't resolve it.
On-demand pricing — Paying the full published rate per second with no commitment, startable and stoppable anytime; the flexible, most-expensive baseline for unpredictable or short-lived workloads.
Online evaluation — Measuring model quality on live production traffic via metrics, feedback, and sampling, after release.
Online inference — Serving model predictions synchronously in real time, one request at a time, with tight latency budgets.
OPA / Gatekeeper — Open Policy Agent, a general policy engine, with Gatekeeper integrating it into Kubernetes admission.
OpenCost — The open, CNCF-backed standard for Kubernetes cost allocation (per-namespace/team/pod cost including idle); Kubecost is built on it.
OpenTelemetry (OTel) — A vendor-neutral, open standard and set of SDKs for generating and exporting telemetry (traces, metrics, logs), so you instrument once and can send data to any backend.
OpenTelemetry Collector — A standalone service that receives, processes (batches, filters, samples), and exports telemetry, decoupling your apps from any specific observability backend.
Operator — A custom Kubernetes controller that reconciles a CRD; operational expert knowledge encoded as a reconciliation loop (provision, back up, heal a resource and keep it correct).
OTLP (OpenTelemetry Protocol) — The standard wire protocol OpenTelemetry uses to transmit telemetry from apps and the Collector to backends.
Over-engineering — Building for a scale, team size, or failure mode you don't have (e.g. premature microservices, Kubernetes, or multi-cloud); the dominant failure mode in modern cloud.
Overlay — A thin per-environment patch on top of a shared base configuration; the alternative to forked per-environment pipelines.
Permission boundary — A policy defining the maximum permissions an identity can ever have; effective permissions are the intersection of granted and boundary.
Pipeline — An automated assembly line that turns a commit into a release through a fixed series of staged gates; a failure at any stage stops the change.
Platform Engineer — A role building the internal developer platform (paved roads, self-service tooling) as a product for other engineers; the fastest-rising role.
Platform engineering — The discipline of building an internal self-service layer (an IDP) that packages cloud complexity so product teams can ship safely without becoming cloud experts; the operating model that makes "you build it, you run it" survivable at scale.
Platform orchestrator — The layer between the developer portal and the infrastructure that takes a developer's intent plus the target environment and dynamically resolves and provisions the right concrete resources; distinct from the portal (UI) above and the provisioning tools (Terraform, Crossplane) it drives below (canonical example: Humanitec).
Platform team — A Team Topologies team that builds and runs the internal platform as a product to reduce stream-aligned teams' cognitive load; an enabler relating by pull and feedback, not a gatekeeper.
Platform-as-product — Treating an internal platform as a product with developers as its customers: a roadmap, user research, a feedback loop, and success measured by adoption and satisfaction.
Policy-as-code — Organizational rules expressed as code and checked automatically (e.g. "no public databases," "must set resource limits"), letting a policy engine reject unsafe self-service requests before they're created.
Portfolio — A set of hands-on projects (e.g. IaC + containers + CI/CD + observability + an ADR) that proves cloud skills, often more persuasively than certifications.
Postmortem — A blameless write-up of an incident to learn from failure.
Premature optimization — Tuning performance for a load you don't have, before measuring; adds complexity and bugs for no proven benefit.
Private endpoint — A path to reach a service over the cloud's private network only, with no public internet route.
Privilege escalation — A chain by which a seemingly limited identity (e.g. one that can edit IAM) reaches full control.
Progressive delivery — Reducing release risk by exposing a new version gradually, measuring real metrics, and aborting if it misbehaves.
Promotion — Advancing one immutable artifact up the environment ladder (dev → staging → prod), gaining confidence at each rung.
Prompt injection — An attack where adversarial text in user input or retrieved content hijacks an LLM's instructions; a top LLM-app security risk requiring defense in depth.
Prompt management — Versioning, reviewing, and tracking prompts as first-class artifacts (like code) so prompt changes are deliberate, testable, and rollback-able.
Provenance / attestation — Signed metadata stating how and where an artifact was built (which commit, builder, steps), used to refuse anything not built by your trusted pipeline.
Pull-based reconciliation — A deployment model where an agent inside the cluster pulls desired state from Git and applies it itself, so no external system holds cluster credentials.
Push-based CD — A deployment model where the CI pipeline reaches out and applies changes to the cluster, requiring the pipeline to hold cluster credentials.
Quantization — Reducing the numeric precision of model weights (e.g. FP16 to INT8/INT4) to shrink memory and speed inference, trading a little accuracy for large cost savings.
RAG — Retrieval-augmented generation: fetching relevant documents from a knowledge base and injecting them into an LLM's prompt so answers are grounded in your data.
RAGAS — A framework for evaluating RAG pipelines on metrics like faithfulness, answer relevance, and context precision/recall.
Ray / Ray Serve — A distributed-computing framework (Ray) and its scalable model-serving library (Ray Serve) for Python ML workloads.
RBAC (Role-Based Access Control) — Granting permissions by role on a least-privilege basis so each tenant acts only within its own scope and cannot touch another tenant's resources.
Read replica — A copy of a database that serves read queries to take load off the primary; may lag slightly behind (replication lag).
Reconciliation loop — The core platform primitive: a controller continuously compares declared desired state to actual state and acts to close the gap, healing drift forever; declarative, self-healing, and continuous.
RED method — A dashboard/alerting recipe for request-driven services: track Rate (requests/sec), Errors (failed requests), and Duration (latency distribution).
Redundancy — Having more than one of every critical component so the loss of any one isn't fatal; the foundation of reliability.
Renovate / Dependabot — Tools that automatically open pull requests to update dependencies (especially security fixes), each verified through the full pipeline.
Replication — Keeping multiple copies of a database (a primary plus replicas) for redundancy and read scaling.
Replication lag — The short delay before a replica reflects the latest writes from the primary; a source of eventual consistency.
Reserved Instance (RI) — A commitment to a specific instance type/region for 1–3 years in exchange for a steep discount (~30–60%); biggest discount, least flexible.
Resource quota — A cap on how much CPU, memory, or how many objects a tenant may consume; defuses noisy-neighbor and cost-bomb risks in self-service.
Restore-tested backup — A backup proven recoverable by regularly restoring from it and verifying the data; an untested backup is only a hope.
Reusable workflow — A whole CI pipeline defined once and invoked by many repos (GitHub Actions uses:, GitLab include:/extends:, Jenkins shared libraries), so shared build/test/scan/deploy logic lives in one governed place and fixes propagate everywhere at once.
RFC (Request for Comments) — A longer proposal document circulated for feedback to reach a significant decision collaboratively before building it.
Rightsizing — Adjusting a resource's allocated capacity to match its actual measured utilization (plus headroom), correcting chronic overprovisioning.
Rollback — Returning to the last known-good version; in GitOps, reverting the Git commit so the agent reconciles production back to the prior immutable digest.
Rolling deployment — Replacing old instances with new ones a few at a time; Kubernetes' default, but it does not watch metrics to decide whether to proceed.
Root cause vs contributing factors — The triggering condition of an incident vs the multiple systemic weaknesses that let it become an outage; mature postmortems address the contributing factors, not just one "root cause."
Rotation — Replacing a secret with a fresh value on a schedule so a leaked-but-unnoticed secret stops working.
RPO (Recovery Point Objective) — The maximum acceptable amount of data loss, measured in time (e.g. 5 minutes), which sets backup/replication frequency.
RTO (Recovery Time Objective) — The maximum acceptable time a system can be down before it must be restored.
Runbook — A documented, step-by-step procedure for diagnosing and resolving a known operational situation, so any on-call engineer can handle it.
Runtime threat detection — Watching the behavior of running workloads to catch a live attacker that static scans can't see (Falco, KubeArmor).
SAST — Static Application Security Testing: scans your source code for insecure patterns.
Savings Plan — An AWS commitment to a steady dollars-per-hour of spend (rather than a specific instance) for 1–3 years, automatically applying the discount across instance families as the fleet changes.
SBOM (Software Bill of Materials) — A complete machine-readable inventory of every component in an artifact, used to answer "are we affected?" instantly.
SCA — Software Composition Analysis (dependency scanning): finds known CVEs in libraries you import (Snyk, Trivy, Grype).
Scale-to-zero — Automatically dropping a service to zero running replicas (and zero cost) when idle, then cold-starting on demand; critical for expensive idle GPUs.
Score — An open, platform-agnostic workload specification where a developer describes what a workload needs (container, route, database) without saying how or where it runs; the platform translates it to the target.
Sealed Secrets — A pattern where a secret is encrypted before being committed to Git and an in-cluster controller is the only thing that can decrypt it.
Secret manager — A dedicated service that stores secrets encrypted, controls access via IAM/RBAC, audits reads, rotates, and delivers at runtime (Vault, AWS Secrets Manager, etc.).
Secret scanning — CI scanning of commits/PRs for keys and tokens about to be committed (Gitleaks, TruffleHog).
Secure by default — A design where the safe configuration is what you get by doing nothing special, so you must go out of your way to be insecure rather than to be secure.
Segmentation — Dividing a network into isolated zones so a compromise in one can't freely reach others.
Self-heal — A GitOps mode where the agent automatically reverts unauthorized manual changes back to match Git.
Self-hosted runner — A CI runner operated on your own infrastructure (rather than the provider's hosted fleet), used when you need private-network access, special hardware/GPUs, compliance/residency control, or cheaper sustained throughput — at the cost of owning patching, scaling, and isolation.
Self-service — Getting a needed resource (database, environment, secret, new service) directly and immediately through the platform, without filing a ticket and waiting for a human.
Semantic versioning (SemVer) — A versioning scheme MAJOR.MINOR.PATCH: patch for fixes, minor for compatible features, major for breaking changes.
Separation of config from code — Keeping environment-specific settings outside the immutable artifact so the same image runs everywhere, configured per environment.
Severity (SEV) level — A ranking of an incident's impact (e.g. SEV1 = major outage, SEV3 = minor) that drives how much response and escalation it gets.
Shard key — The field used to decide which shard a piece of data lives on; a poor choice creates hot spots.
Sharding — Splitting data across multiple independent databases by a shard key to scale writes and storage; powerful but complex and a last resort.
Shared-responsibility model — The agreement dividing security work between provider (security of the cloud: hardware, hypervisor, network) and customer (security in the cloud: data, access, configuration); "if you can configure it, you own it."
Shift-left — Moving security checks earlier in the pipeline (to commit/PR) instead of only scanning the finished artifact.
Short-lived branch — A branch that merges back within a day or two, keeping integrations tiny and frequent.
Short-lived credential — A credential that auto-expires within minutes and is issued on demand, leaving nothing durable to steal.
Short-lived credentials — Credentials minted on demand that expire in minutes, shrinking the exposure window versus long-lived stored keys.
Showback — Showing each team what it spends without moving money between budgets; informational, and on its own reduces waste ~15–20% because people optimize what they can see.
Signing / verification — Cryptographically stamping an artifact (Cosign/Sigstore) so consumers can prove it came from you and wasn't tampered with.
Single-cloud vs multi-cloud — The decision to use one cloud provider or several; default to single-cloud, since multi-cloud roughly doubles operational complexity.
SLA (Service Level Agreement) — A contractual promise to customers about reliability, with consequences (refunds/penalties) if missed; usually set looser than the internal SLO.
SLI (Service Level Indicator) — A carefully chosen measured number that reflects user-experienced health, e.g. the fraction of requests served successfully and quickly.
SLI / SLO / error budget — A measured indicator / a target for it / the allowed amount of failure you can "spend" on shipping risk.
SLO (Service Level Objective) — An internal target for an SLI over a window (e.g. 99.9% of requests succeed over 28 days); the definition of "reliable enough."
SLO tooling — Tools that define SLOs and generate the recording rules and burn-rate alerts behind them, e.g. Sloth, OpenSLO, Nobl9.
SLSA — Supply-chain Levels for Software Artifacts: a framework grading build trustworthiness in levels (L1 provenance exists, L2 signed on a hosted builder, L3 hardened and isolated).
Software catalog (service catalog) — A single inventory of everything an org runs (services, libraries, sites, systems) with owners, links, docs, and dependencies; the portal's backbone.
Software supply chain — The full tree of dependencies an artifact is built from; a prime attack target, so its integrity must be proven.
Software template (scaffolder) — The "create a new service" generator: pick a template and a name, and it produces a ready-to-go repo (code, Dockerfile, manifests, pipeline, docs) to the org's standard — a clickable golden path.
Solutions Architect certification — A cloud-provider certification (AWS/Azure/GCP) validating broad cloud architecture and design knowledge.
SOPS — A tool that encrypts the values inside a config file (keys readable, values ciphertext) so the file is safe to commit in GitOps.
Span — One unit of work within a trace (e.g. a single service handling part of a request), with a start/end time, attributes, and a parent — the building block of a trace.
Spinnaker — The open-source, multi-cloud continuous-delivery platform (originally Netflix) that orchestrates complex, pipeline-centric deployments across clouds with built-in deployment strategies and automated canary analysis; a heavyweight enterprise CD orchestrator that pushes deployments rather than reconciling like GitOps.
Spot / preemptible — Cheap, interruptible spare capacity.
Spot instance — Heavily discounted (~60–90%) spare provider capacity that can be reclaimed at any time with little warning (preemptible on GCP, Spot VMs on Azure); safe only for fault-tolerant, interruption-handling workloads.
SRE (Site Reliability Engineering) — A role treating reliability as engineering (SLOs, error budgets, incident response, on-call, toil reduction); the most operations- and on-call-heavy.
Stage — A phase of a pipeline that must pass before the next begins (e.g. build, then test, then scan).
Stateful — A component that retains data in its own memory/storage between requests (e.g. a database); the hard part to scale, because copies must agree on the data.
Statelessness — A design where app servers keep no important data in their own memory between requests, so any server can handle any request; the unlock that makes horizontal scaling and disposable instances work.
Storage lifecycle policy — An automated rule that transitions data to colder, cheaper tiers as it ages (hot→cool→archive) and deletes it at end of retention, preventing data from rotting in the expensive hot tier.
Stream-aligned team — A Team Topologies term for a product/feature team aligned to a stream of work; the platform's customers.
Streaming inference — Scoring an unbounded event stream continuously as records arrive.
Strong consistency — A model where every read always returns the latest written value; the consistency-favoring (CP) instinct.
Structured logging — Emitting logs as machine-parseable key-value data (e.g. JSON) instead of free-form text, so they can be filtered, aggregated, and correlated by field.
Symptom-based alerting — Alerting on what users actually experience (errors, slow responses) rather than internal causes (high CPU), so pages are meaningful and actionable.
Tag (label) — A key–value pair attached to a resource (e.g. team:payments) that lets you slice the bill by owner; the foundation of cost allocation.
Tail sampling — Deciding whether to keep a trace after it completes, so you can keep all errors and slow requests and drop boring fast ones; smarter but needs buffering.
TechDocs — Backstage's docs-as-code system: Markdown docs living in each service's repo, rendered in the portal next to that service so they stay current.
Telemetry — The signals a system emits about itself (metrics, logs, traces, profiles) that are collected and analyzed to understand its behavior.
Tempo — A cheap, object-storage-backed distributed-tracing store designed to pair with Grafana.
Thanos — A long-term metrics-storage tool that bolts onto existing Prometheus servers (via a sidecar) to give a global query view across all of them plus unlimited object-storage retention.
Threat model — A structured answer to "what can go wrong and who could make it go wrong" for a system.
Three pillars — Logs (event records), metrics (numbers over time), and traces (a request's path across services).
Thundering herd (retry storm) — Many clients retrying simultaneously and overwhelming a recovering service, turning a blip into an outage.
Toil — Manual, repetitive, automatable operational work that scales with service size and adds no lasting value; SRE aims to cap and automate it away.
Tool-chasing — The career-limiting habit of jumping to trending tools without a durable foundation, leaving you stranded when those tools are replaced.
Total Cost of Ownership (TCO) — The full lifetime cost of a choice (up-front plus ongoing maintenance), central to the build-vs-buy IDP decision; building a platform is months to set up and years to maintain.
Trace — The end-to-end record of one request's journey across all the services it touches, assembled from spans, showing where time was spent and where it failed.
Trace context propagation — Passing a shared trace ID and span ID across service boundaries (usually via request headers) so spans emitted by different services can be stitched into one trace.
Training-serving skew — A silent bug where features are computed differently during training than during serving, so the deployed model sees inputs unlike anything it was trained on.
Trigger — The event that starts a pipeline run (a push, a pull request, a schedule, or a manual click).
Trunk-based development (TBD) — A branching strategy where everyone works off one trunk via short-lived branches that merge within a day or two; the model that pairs with CI/CD.
Unallocated cost — Spend that no tag claims (or shared/idle cost belonging to no single owner); a key FinOps health metric to drive toward zero.
Unit economics — Tying spend to a unit of business value (cost per customer, per request, per feature) instead of looking at the total bill; reveals whether a rising bill is healthy growth or creeping waste.
USE method — A recipe for resources (CPU, disk, queues): track Utilization, Saturation (how full/queued), and Errors.
Utilization (commitment) — The fraction of a purchased commitment that is actually used; low utilization means you committed to more than you needed and are paying for idle promises.
vCluster — A virtual Kubernetes cluster giving each tenant their own control plane inside a shared host cluster; a middle ground stronger than namespaces but cheaper than separate clusters.
Vector database — A store optimized for similarity search over embeddings using approximate-nearest-neighbor indexes (e.g. Pinecone, Weaviate, Qdrant, Chroma, pgvector).
Vendor lock-in — Dependence on a specific provider's services; real costs are usually manageable, so "avoid lock-in at all costs" is generally wrong.
Vertical Pod Autoscaler (VPA) — A Kubernetes autoscaler that right-sizes the CPU/memory requested by each pod (vertical scaling automated).
Vertical scaling (scaling up) — Increasing capacity by making one machine bigger (more CPU/memory/disk); simple but capped, often requires downtime to resize, and leaves a single point of failure.
VictoriaMetrics — A fast, storage-efficient Prometheus replacement whose clustered mode also serves as a scalable, long-retention long-term metrics store.
vLLM — A high-throughput LLM inference engine using continuous batching and paged KV-cache memory.
Workload identity — Keyless auth that binds a workload's identity to a cloud role for short-lived credentials (AWS IRSA; GCP/Azure Workload Identity).
Zero trust — A security model that never trusts a request because of its network location; every request is authenticated and authorized, even internal ones.
Zombie resource — A resource that is running and billing but doing nothing useful (idle VM, orphaned disk, old snapshot, unattached IP); the fix is simply to find and turn it off.
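
The entries for Idempotency key, Jitter, and Thundering herd describe mechanics that are easier to see in code. The following is a minimal Python sketch, not a production implementation: `backoff_delays`, `PaymentServer`, and the key `"order-1234"` are hypothetical names invented for illustration, and "full jitter" (a random delay between zero and a capped exponential ceiling) is only one of several common jitter strategies.

```python
import random


def backoff_delays(attempts, base=0.1, cap=5.0, rng=random.random):
    """Return one randomized ("full jitter") delay in seconds per retry attempt."""
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))  # exponential growth, capped
        delays.append(rng() * ceiling)             # random point in [0, ceiling)
    return delays


class PaymentServer:
    """Hypothetical server that deduplicates requests by idempotency key."""

    def __init__(self):
        self._results = {}  # idempotency key -> stored response
        self.charges = 0    # how many times the side effect actually ran

    def charge(self, idempotency_key, amount):
        if idempotency_key in self._results:
            # Duplicate request: replay the stored response, do not charge again.
            return self._results[idempotency_key]
        self.charges += 1
        result = {"status": "charged", "amount": amount}
        self._results[idempotency_key] = result
        return result


if __name__ == "__main__":
    server = PaymentServer()
    key = "order-1234"  # hypothetical; real clients often generate a UUID
    first = server.charge(key, 50)
    retry = server.charge(key, 50)  # same key, e.g. resent after a timeout
    print(first == retry, server.charges)
    print(backoff_delays(4))
```

Because the server remembers the response for each key, a client that retries after a timeout cannot trigger a second charge, and because each client draws its own random delay, a crowd of retrying clients spreads out instead of arriving in lockstep.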

That's the vocabulary of cloud engineering. If a term ever feels fuzzy, return to the chapter that introduced it — every one is defined on first use in context.
