Decisions, written down: ADRs, RFCs & DORA

A cloud engineer's real job isn't typing commands — it's making decisions under uncertainty and constraints, and being able to defend them later. This lesson does three things: gives you a concrete rule for the architecture questions that recur in every job, shows you how to write decisions down so the reasoning survives the people who made it, and hands you the durable yardstick — DORA — for telling whether all this engineering is actually making things better. Tools change yearly; the ability to decide well and record why is a career-long skill.

The recurring decisions, with a rule for each

Across every cloud role, the same handful of "should we…?" questions come back. None has a universal answer, but each has a default and a trigger to deviate. Internalize these and you can reason your way through most architecture arguments.

Build vs buy. Should you build this capability yourself or use an existing product/service?

Default: buy (or use a managed service) for anything that isn't your core differentiator. Build only what makes your product uniquely valuable. Your auth, your email sending, your payment processing, your monitoring — buy them; companies whose entire job is that problem do it better than you will. Building undifferentiated infrastructure is how teams sink months into reinventing what they could have rented. Build when it's genuinely core to your value, or when no product fits and the need is durable.

Managed vs self-hosted. Run a service yourself (e.g. your own database on a VM) or use the cloud's managed version (e.g. RDS)?

Default: managed. You offload patching, backups, replication, and failover to the provider, which is an enormous operational saving (the recurring theme from Chapter 2). Self-host only when you have a hard requirement managed can't meet (a specific version, a regulatory constraint, extreme cost at scale) and the expertise to operate it well. Self-hosting to "save money" usually loses once you price in the engineer-hours.
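To make "price in the engineer-hours" concrete, here is a minimal arithmetic sketch. Every figure in it (the infrastructure bills, the hours, the hourly rate) is a made-up placeholder chosen for illustration, not a real price; substitute your own numbers.

```python
def monthly_total_cost(infra_cost: float, ops_hours: float, hourly_rate: float) -> float:
    """Infrastructure bill plus the cost of engineer time spent operating it."""
    return infra_cost + ops_hours * hourly_rate


# Hypothetical inputs: none of these are real prices.
HOURLY_RATE = 100.0  # assumed loaded cost of one engineer-hour

self_hosted = monthly_total_cost(infra_cost=300.0, ops_hours=20.0, hourly_rate=HOURLY_RATE)
managed = monthly_total_cost(infra_cost=900.0, ops_hours=2.0, hourly_rate=HOURLY_RATE)

print(f"self-hosted: {self_hosted:.0f} per month")
print(f"managed:     {managed:.0f} per month")
```

With these placeholder inputs the managed option has the larger bill but the smaller total. The comparison flips only if the operational hours are genuinely small or the infrastructure gap is very large, which is the "extreme cost at scale" trigger.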

Monolith vs microservices. One deployable app, or many small independently-deployed services?

Default: monolith (ideally a clean, modular one). It's simpler to build, test, deploy, and debug. Move to microservices when the pain is organizational — multiple teams blocking each other on one codebase — not merely "it feels more modern." Microservices buy independent team velocity at the cost of distributed-systems complexity (lesson 11.2). Pay that cost only when team scale demands it.

Single-cloud vs multi-cloud. One provider, or several?

Default: single-cloud. Pick one of the big three (Chapter 1) and go deep. Multi-cloud roughly doubles operational complexity: two of every tool, skill set, and failure mode. (Lesson 11.5's career framing and the tip at the end of this section explain why "avoid lock-in at all costs" is usually wrong.) Go multi-cloud only with a concrete, weighed reason: a regulatory mandate, a specific best-of-breed service, or genuine region-level resilience requirements.

Notice the shape: every default is the simpler, lower-operational-cost option, and you deviate only on a specific, articulated trigger. That's lesson 11.3's "match complexity to need" applied to decisions. The discipline that makes it real is writing the trigger down — which is the next section.
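The "default plus trigger" shape can be sketched as a small data structure. This is an illustration only: the `DecisionRule` class and the `RULES` table are hypothetical names, and the trigger strings simply paraphrase the rules above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionRule:
    question: str
    default: str
    alternative: str
    triggers: tuple  # articulated reasons that justify leaving the default

    def decide(self, observed: set) -> tuple:
        """Return (choice, matched_triggers); no matched trigger means the default."""
        matched = [t for t in self.triggers if t in observed]
        return (self.alternative if matched else self.default, matched)


RULES = {
    "architecture": DecisionRule(
        question="Monolith vs microservices",
        default="monolith",
        alternative="microservices",
        triggers=("multiple teams blocking each other on one codebase",),
    ),
    "cloud": DecisionRule(
        question="Single-cloud vs multi-cloud",
        default="single-cloud",
        alternative="multi-cloud",
        triggers=(
            "regulatory mandate",
            "specific best-of-breed service",
            "region-level resilience requirement",
        ),
    ),
}

# "It feels more modern" is not a listed trigger, so the default stands.
choice, why = RULES["architecture"].decide({"it feels more modern"})
print(choice, why)

# A listed trigger justifies the deviation and is returned as the reason.
choice2, why2 = RULES["cloud"].decide({"regulatory mandate"})
print(choice2, why2)
```

The useful part is the second return value: a deviation always comes with the trigger that justified it, which is exactly what belongs in the written record.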

:::tip Multi-cloud is not a free "best practice"
A common myth is that multi-cloud is automatically wiser because it "avoids lock-in." In reality, multi-cloud means you can only use the lowest common denominator of features, you double your operational burden, and you often increase fragility (more moving parts). Real lock-in costs are usually manageable and worth the leverage of going deep on one platform. Treat multi-cloud as an expensive, deliberate choice with a written justification, never a default.
:::

Writing it down: Architecture Decision Records (ADRs)

Here's the gap that quietly sinks teams: decisions get made in a meeting or a Slack thread, and six months later nobody remembers why. Someone "fixes" the weird-looking choice, breaks something the original decision deliberately prevented, and the lesson is re-learned the hard way. The durable fix is the Architecture Decision Record (ADR) — a short, plain document capturing one significant decision: what you decided, why, and what you traded away.

A standard ADR is small on purpose:

# ADR 0012: Use a managed Postgres (RDS) instead of self-hosting

## Status
Accepted — 2026-06-24

## Context
We need a relational database for the orders service. The team is 4 engineers
with no dedicated DBA. We expect read-heavy traffic and need point-in-time
recovery for an RPO of 5 minutes.

## Decision
Use AWS RDS for PostgreSQL (managed), multi-AZ, with read replicas.

## Consequences
+ Provider handles patching, backups, failover; meets our RPO out of the box.
+ Frees the team from DB operations to focus on product.
- Higher monthly cost than a self-hosted instance.
- Some version/extension choices constrained by RDS.
- Mild AWS coupling (accepted; see ADR 0003 single-cloud).

ADRs are durable knowledge: cheap to write, numbered and append-only (you supersede an old one with a new one rather than editing history), and they make the reasoning, not just the outcome, survive turnover. A team with a folder of ADRs can answer "why is it built this way?" in minutes rather than through code archaeology. The trade-offs you wrote down in lessons 11.2 and 11.3 belong in ADRs.
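The "numbered and append-only" convention is easy to automate. The sketch below assumes ADR files are named `NNNN-slug.md`; that naming scheme, the example filenames, and the helper names (`next_adr_number`, `slugify`, `render_adr`) are all hypothetical. The status line format is likewise one possible choice.

```python
import re
from datetime import date
from typing import List, Optional


def next_adr_number(existing_filenames: List[str]) -> int:
    """Append-only numbering: one more than the highest number already used."""
    numbers = [
        int(m.group(1))
        for name in existing_filenames
        if (m := re.match(r"(\d{4})-", name))
    ]
    return max(numbers, default=0) + 1


def slugify(title: str) -> str:
    """Lowercase the title and collapse every non-alphanumeric run into a hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def render_adr(number: int, title: str, context: str, decision: str,
               consequences: List[str], decided_on: date,
               supersedes: Optional[int] = None) -> str:
    """Render the four-part ADR skeleton: Status, Context, Decision, Consequences."""
    status = f"Accepted ({decided_on.isoformat()})"
    if supersedes is not None:
        # History is never edited: the new ADR points back at the one it replaces.
        status += f"; supersedes ADR {supersedes:04d}"
    lines = [
        f"# ADR {number:04d}: {title}", "",
        "## Status", status, "",
        "## Context", context, "",
        "## Decision", decision, "",
        "## Consequences", *consequences,
    ]
    return "\n".join(lines) + "\n"


# Hypothetical contents of an ADR folder; non-ADR files are ignored.
existing = ["0003-single-cloud.md", "0011-use-queues.md", "README.md"]
number = next_adr_number(existing)
filename = f"{number:04d}-{slugify('Use managed Postgres (RDS)')}.md"
text = render_adr(
    number,
    "Use a managed Postgres (RDS) instead of self-hosting",
    context="Team of 4 engineers, no dedicated DBA, RPO of 5 minutes.",
    decision="Use AWS RDS for PostgreSQL (managed), multi-AZ, with read replicas.",
    consequences=["+ Provider handles patching, backups, failover.",
                  "- Higher monthly cost than a self-hosted instance."],
    decided_on=date(2026, 6, 24),
)
print(filename)
print(text)
```

Deriving the number from the folder contents, rather than asking the author to pick one, keeps the log gap-free and makes it awkward to reuse an old number.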

RFCs: deciding together, before you build

An ADR records a decision; an RFC ("Request for Comments") is how you reach a bigger one collaboratively before committing. It's a longer proposal document — here's the problem, here are the options I considered, here's what I propose and why — circulated to teammates for feedback. For a significant change (a new service, a database migration, a platform choice), an RFC surfaces objections and better ideas while they're still cheap, builds shared understanding, and creates a written trail. The flow is natural: RFC to debate and align → decision made → ADR to record the outcome tersely.

Both express the same durable instinct: make trade-offs explicit and written, not implicit and forgotten. The act of writing forces clearer thinking, and the artifact pays dividends for years. This is also a soft-skill multiplier: the engineer who writes the clear RFC drives the decision and gains visibility, which matters for the leveling discussed in the next lesson.

DORA: is our engineering actually getting better?

You can make great architecture decisions and still not know whether your delivery is healthy. The durable, research-backed answer is the DORA metrics (from the multi-year DevOps Research and Assessment program). They are four numbers measuring the two things that matter, speed and stability, and the deep finding is that the best teams are not forced to trade one for the other; they get both.

Deployment Frequency — how often you ship to production. Elite teams deploy many times a day.
Lead Time for Changes — how long from a code commit to it running in production. Shorter = a faster, healthier pipeline.
Change Failure Rate (CFR) — what fraction of deployments cause a failure needing remediation. Lower = more stability.
Time to Restore Service (MTTR) — when something breaks, how fast you recover. This is the operational payoff of the SRE practices in Chapter 6.
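The four definitions above can be computed from a plain log of deployments. The sketch below is one simplified reading, not the DORA program's official methodology: the `Deployment` record and its field names are hypothetical, restore time is measured from the failing deploy rather than from incident detection, and the sample data is invented.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import List, Optional


@dataclass
class Deployment:
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it need remediation?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_metrics(deployments: List[Deployment], window_days: int) -> dict:
    """Compute the four metrics over a non-empty list of deployments."""
    if not deployments:
        raise ValueError("need at least one deployment")
    failures = [d for d in deployments if d.failed]
    restore_times = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deployment_frequency_per_day": len(deployments) / window_days,
        "median_lead_time": median(d.deployed_at - d.committed_at for d in deployments),
        "change_failure_rate": len(failures) / len(deployments),
        "mean_time_to_restore": (
            sum(restore_times, timedelta()) / len(restore_times) if restore_times else None
        ),
    }


# Invented sample: four deployments over a two-day window, one of which failed.
t0 = datetime(2026, 1, 1, 9, 0)
h = timedelta(hours=1)
log = [
    Deployment(committed_at=t0, deployed_at=t0 + 1 * h),
    Deployment(committed_at=t0 + 2 * h, deployed_at=t0 + 5 * h, failed=True,
               restored_at=t0 + 5 * h + timedelta(minutes=30)),
    Deployment(committed_at=t0 + 24 * h, deployed_at=t0 + 26 * h),
    Deployment(committed_at=t0 + 30 * h, deployed_at=t0 + 34 * h),
]

metrics = dora_metrics(log, window_days=2)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

The median is used for lead time so that one unusually slow change does not dominate the number; a mean or a percentile would be an equally defensible choice as long as it is applied consistently over time.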

Why DORA is durable and not just another dashboard: it ties your concrete technical work to outcomes. Better CI/CD (Chapter 5) lifts deployment frequency and lead time; better testing and progressive delivery lower change failure rate; better observability and incident response (Chapter 6) lower MTTR. DORA is the frame that lets you say "this investment made us measurably better," and it's the language platform and engineering leaders use. (A newer companion frame, DevEx, broadens this to developer experience — flow, feedback loops, cognitive load — but DORA remains the durable backbone.)

:::tip Use DORA to justify the boring work
Reliability work, paying down tech debt, improving the pipeline: these are hard to sell because they're invisible. DORA makes them visible: "our lead time dropped from 3 days to 2 hours and change-failure rate halved." Tie your technical proposals (in those RFCs) to a DORA metric and they stop being "engineering wants to tinker" and become "this improves how fast and safely we ship." That translation is a senior skill.
:::

Why it matters

A cloud engineer's value is in deciding well and recording why. The recurring calls — build vs buy, managed vs self-hosted, monolith vs microservices, single- vs multi-cloud — each default to the simpler, lower-operational-cost option, and you deviate only on a specific, articulated trigger (and not, for multi-cloud, on the lock-in myth). The discipline that makes good decisions durable is writing them down: an ADR captures one decision (what, why, trade-offs) so the reasoning survives turnover, and an RFC aligns the team before a big change. Finally, DORA's four metrics — deployment frequency, lead time, change failure rate, and MTTR — are the durable, research-backed frame for whether your engineering is actually improving, and the language to justify the unglamorous reliability work. With decision-making and measurement in hand, the final lesson maps where these skills lead: the cloud-engineering career.

Next: 11.5 The cloud-engineering career →
