March 30, 2026
· 10 min read

The CAP Theorem Lied to You (And Its Inventor Admitted It)
You've been drawing the CAP triangle in interviews for years. But in 2012, Eric Brewer himself said the pick-two framing is misleading. This post covers what the original theorem gets wrong, what PACELC adds, and how real systems — ATMs, CRDTs, Sagas — answer the hard questions CAP never could.

TL;DR
- The CAP pick-two framing is useful but incomplete — Brewer himself said so in 2012.
- All three CAP properties are spectrums, not booleans. Real systems live inside the triangle.
- CAP only describes behavior during partitions. It says nothing about the majority of time when everything is fine.
- PACELC fills the gap: even without a partition, you're trading Latency vs Consistency.
- Design for three states: normal, partition, and recovery — not just two.
- CRDTs mathematically eliminate the C vs A trade-off for certain data types.
- The Saga pattern (1987) is the formal answer for cross-service operations.
Why CAP Misleads You
You've drawn the triangle. You said "CAP theorem." You picked two corners. The interviewer nodded.
That answer is built on a framing that the inventor of CAP himself described as misleading. Not wrong — incomplete.
In 2012, twelve years after introducing CAP, Eric Brewer published a follow-up paper identifying three specific ways the pick-two model leads engineers to build worse systems than they need to.
Problem 1: All three properties are treated as binary.
Availability isn't on or off. Production systems target 99.9%, 99.95%, 99.99% — different points on a spectrum. Consistency is also a spectrum, from linearizability (every operation appears instantaneous, globally ordered) to eventual consistency (writes propagate everywhere, but no bound on when). Even partition tolerance isn't clean — a 200ms hiccup vs. 10 seconds depends entirely on your timeout. You don't detect partitions. You define them.
The triangle collapses all of this nuance into three dots. There aren't three valid system positions. There are thousands.
Problem 2: CAP only describes behavior during partitions.
Partitions are rare. The vast majority of your uptime, the network is fine. CAP says nothing about those hours, days, and weeks. Brewer's updated framing: maximize both during normal operation, and make a trade-off only when a partition is actually detected.
That shifts CAP from a permanent identity ("we're AP") into a runtime decision made when specific conditions occur.
Problem 3: The CA corner is a trap for distributed systems.
Calling your distributed system CA implies you've given up partition tolerance. But network failures aren't a design parameter you choose — they happen. Labelling your system CA just means when a partition arrives (and eventually one will), you'll be making decisions under pressure that you could have made deliberately.
The Missing Variable: Latency
Something is entirely absent from the CAP theorem: latency.
The essence of CAP plays out during a timeout. Node A writes data, sends a confirmation request to Node B, and waits. Node B is slow — high load, congestion, or a real failure. From Node A's side, there's no way to know which.
At some point, Node A has to decide:
- Cancel or keep waiting → Availability suffers
- Proceed without confirmation → Consistency is at risk
There's no option C. Waiting indefinitely is just choosing consistency with infinite latency. A partition isn't a separate failure category — it's what you call it when you've decided to stop waiting. It's a timeout with a name.
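That decision point can be sketched in a few lines (a toy model of one write awaiting replication, not any real replication library; all names here are illustrative):

```python
import queue

def replicate_write(confirm_queue, timeout_s, on_partition="availability"):
    """Node A wrote locally and now waits for Node B's confirmation.

    The timeout *is* the partition decision: once it fires, we either
    refuse the write (choose consistency) or proceed without
    confirmation (choose availability, risking divergence).
    """
    try:
        ack = confirm_queue.get(timeout=timeout_s)  # wait for Node B
        return {"committed": True, "confirmed": True, "ack": ack}
    except queue.Empty:
        if on_partition == "availability":
            # Proceed unconfirmed: the replicas may now diverge.
            return {"committed": True, "confirmed": False, "ack": None}
        # Choose consistency: refuse the write rather than risk divergence.
        return {"committed": False, "confirmed": False, "ack": None}
```

Note there is no branch for "wait forever" — that would just be the consistency branch with infinite latency, which is the point.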
CAP's silence on latency is what motivated Daniel Abadi to propose PACELC, first sketched in 2010 and formalized in 2012:
If Partition → choose between Availability and Consistency
Else (normal operation) → choose between Latency and Consistency

That second line is what CAP was missing. Even when everything is healthy, there's still a trade-off between how fast you respond and how confident you are in data freshness.
DynamoDB makes this unavoidably literal: strongly consistent reads cost twice as many capacity units and are slower. Amazon is charging you for consistency. That's PACELC expressed as a line item on your cloud bill.
| System | Trade-off | Example Setting |
|---|---|---|
| DynamoDB | Strong consistency costs 2x RCUs | ConsistentRead: true |
| MongoDB | Read preference controls staleness | readPreference: "secondaryPreferred" |
| Cassandra | Consistency level per query | QUORUM vs ONE vs LOCAL_ONE |
| PostgreSQL | Sync vs async replication | synchronous_commit = on/off |
Every read preference, every Cassandra consistency level, every async replication choice — you've been making PACELC decisions without the vocabulary for it.
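The DynamoDB row can be made concrete with a small cost model (a sketch of the published pricing rule — one RCU covers a strongly consistent read of up to 4 KB, and an eventually consistent read costs half — not an AWS API call):

```python
import math

def read_capacity_units(item_size_bytes, consistent_read):
    """Estimate DynamoDB read cost: one RCU per strongly consistent
    read of up to 4 KB; an eventually consistent read costs half."""
    units = math.ceil(item_size_bytes / 4096)  # round up to 4 KB chunks
    return units if consistent_read else units / 2
```

For a 6 KB item, a strongly consistent read costs 2 RCUs and an eventually consistent one costs 1 — the Latency/Consistency trade-off with a price tag attached.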
Design for Three States, Not Two
Most distributed systems are engineered for exactly two states:
- Normal operation — load balancers, read replicas, caches, health checks, all tuned for the common case
- Broken state — retry logic, generic errors, an on-call page
What's missing is a third state: not generic failure, but a specific, designed partition mode with its own rules.
⚠️ Warning: Recovery mode is not a return to normal mode. Both sides have been running independently. Their states have diverged. Some of those operations may have already been shown to users. You can't quietly undo them.
To design partition mode intentionally, go through your operations one by one and ask: what should happen if this runs without access to the rest of the system?
| Operation | During Partition | Reasoning |
|---|---|---|
| Read user profile | ✅ Serve from local cache | Nothing breaks |
| Register new email | ❌ Block | Must be globally unique, can't verify |
| ATM withdrawal | ✅ Allow up to bounded limit | Risk is bounded and reversible |
| Charge credit card | 🕐 Queue for after recovery | External action, can't undo |
| Increment view counter | ✅ Allow locally, merge later | Commutative, no ordering needed |
Writing this table for your actual system almost always surfaces decisions your team has been making implicitly for years. That's usually worth more than the document itself.
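One minimal way to make that table executable — a policy map your request handler consults during a partition, instead of returning a generic error (operation names and policies here are illustrative):

```python
# Hypothetical policy map: the operations table above, expressed as data.
PARTITION_POLICY = {
    "read_user_profile":    "allow_local",    # serve from cache
    "register_email":       "block",          # uniqueness unverifiable
    "atm_withdrawal":       "allow_bounded",  # bounded, reversible risk
    "charge_credit_card":   "queue",          # external action, can't undo
    "increment_view_count": "allow_merge",    # commutative, merge later
}

def handle_during_partition(operation):
    """Return the pre-designed behavior; default to the safe choice."""
    return PARTITION_POLICY.get(operation, "block")
```

The value isn't the dictionary itself — it's that every entry forced an explicit decision that would otherwise be made implicitly at 3 a.m.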
The ATM: Bounded Risk in Production
When was the last time an ATM refused your withdrawal because it couldn't reach the bank? Probably never. That's not luck — it's a deliberate architecture.
An ATM has one core invariant: your balance should never go below zero. Strict consistency says: whenever the ATM can't verify your balance, refuse the transaction. For a machine that exists to give you convenient cash access, that's a problem.
Instead, ATMs enter stand-in mode:
- Stop trying to achieve full consistency
- Apply a simpler rule: allow withdrawals up to a bounded limit
- Record every transaction locally
- Keep serving
When the bank connection is restored, the ATM uploads its log. The bank reconciles. If a balance went negative, an overdraft fee is charged.
💡 Tip: The bank doesn't prevent every mistake. It bounds how large a mistake can happen, detects violations on reconciliation, and compensates afterward. The overdraft fee is the architecture.
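The whole loop — bound, log, reconcile, compensate — fits in a short sketch (a toy model; the limit and fee values are illustrative, not any bank's actual rules):

```python
class AtmStandIn:
    """Stand-in mode: allow bounded withdrawals offline, log everything
    locally, reconcile when the bank connection returns."""

    def __init__(self, offline_limit=200):
        self.offline_limit = offline_limit
        self.log = []  # local transaction log

    def withdraw(self, card, amount):
        # Can't verify the balance, so bound the risk instead.
        if amount > self.offline_limit:
            return False
        self.log.append((card, amount))
        return True

    def reconcile(self, balances, overdraft_fee=35):
        """Upload the log; compensate where the invariant was violated."""
        fees = {}
        for card, amount in self.log:
            balances[card] -= amount
            if balances[card] < 0:
                fees[card] = fees.get(card, 0) + overdraft_fee
        self.log.clear()
        return fees
```

Note what `reconcile` does not do: it never prevents the negative balance. It detects it after the fact and charges for it.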
You can see the same bet in every consumer app:
- Ride-sharing confirms your booking before the driver accepts
- E-commerce accepts payment before inventory is confirmed reserved
- Airlines issue boarding passes before downstream systems catch up
- Food delivery confirms your order before the restaurant accepts
Accept the operation. Bound the risk. Detect the problem. Make it right. That's the design.
CRDTs: When the Math Eliminates the Trade-off
For certain kinds of data, you can design the data structure so concurrent updates from multiple nodes always converge to the same result — regardless of order, without coordination.
The keyword is converge. Not agree in advance. Not lock. Converge after the fact, always.
Example: A YouTube-style view counter.
Multiple servers in different regions incrementing simultaneously, no locking, no coordination, yet it never loses a count. Why? Addition is commutative. Order doesn't matter. Each node tracks its own increments. Merge by summing — the race condition is mathematically eliminated.
For a data structure to guarantee convergence:
- Commutative — order doesn't matter
- Idempotent — duplicate messages don't corrupt state
Guarantee both, and the C vs A trade-off disappears for that data type.
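A G-Counter — the building block behind the PN Counter in the table below — shows both properties in about a dozen lines (a minimal sketch, not a production CRDT library):

```python
class GCounter:
    """Grow-only counter CRDT: each node tracks only its own increments.
    Merging takes the per-node maximum, so merges are commutative,
    associative, and idempotent; the value is the sum across nodes."""

    def __init__(self):
        self.counts = {}  # node_id -> increments seen from that node

    def increment(self, node_id, n=1):
        self.counts[node_id] = self.counts.get(node_id, 0) + n

    def merge(self, other):
        for node, n in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), n)

    def value(self):
        return sum(self.counts.values())
```

Two regions can increment concurrently, merge in either order, and even merge the same state twice — the result is identical every time. That is the convergence guarantee, not a locking protocol.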
| CRDT Type | Description | Example Use Case |
|---|---|---|
| PN Counter | Increment/decrement counter via two G-Counters | View counts, likes, inventory deltas |
| G-Set | Grow-only set, no deletions | Unique visitor tracking |
| LWW Register | Last-Write-Wins with timestamp | User profile last update |
| OR-Set | Observed-Remove Set, supports delete | Shopping cart, tag lists |
⚠️ Warning: Figma's canvas is often described as CRDT-based — but it's more accurate to say it's CRDT-influenced. Their server is a central authority; they borrow LWW register semantics and conflict resolution ideas, not full decentralized convergence. True CRDTs (Redis Enterprise for geo-replication, GoodNotes using Automerge) require no central coordinator.
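The LWW register semantics mentioned above can be sketched as follows (a toy illustration of the general technique, not Figma's implementation; ties are broken by node id so all replicas resolve conflicts identically):

```python
class LWWRegister:
    """Last-Write-Wins register: keep the value with the highest
    (timestamp, node_id) pair, so every replica converges to the
    same winner even for concurrent writes."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.value = None
        self.stamp = (0, node_id)

    def set(self, value, timestamp):
        self._apply(value, (timestamp, self.node_id))

    def merge(self, other):
        self._apply(other.value, other.stamp)

    def _apply(self, value, stamp):
        if stamp > self.stamp:  # tuple comparison: timestamp, then node id
            self.value, self.stamp = value, stamp
```

The cost of this simplicity is visible in the name: the losing write is silently discarded, which is acceptable for a profile field and unacceptable for a bank balance.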
Sagas: The 1987 Answer to Cross-Service Failures
The ATM pattern — accept, bound, log, compensate — has a formal name. It's been running in distributed systems for decades. Garcia-Molina and Salem described it in 1987. They called it the Saga.
Traditional database transactions assume you can hold a lock on everything until the whole operation commits or rolls back. Across multiple services over a network, that creates a reliability problem:
Service A holds lock → waiting on B → B waiting on C → C is slow → everything stalls

Under production traffic, this cascades.
The Saga breaks this apart. Instead of one distributed transaction, you run a sequence of small local transactions, each committing immediately — no distributed locks. When a step fails, compensating transactions run backward through the steps already completed. The correctness guarantee isn't all-or-nothing — it's that the net effect, including compensations, leaves the system in an acceptable state.
A refund is not the same as never having been charged. But for the business, it's equivalent enough.
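A minimal orchestration sketch of the pattern (the step and compensation names in any real system would be service calls; here they are illustrative callables):

```python
def run_saga(steps):
    """Run (action, compensate) pairs in order. Each action commits
    immediately; if one fails, compensations for the completed steps
    run in reverse order, newest first."""
    done = []
    for action, compensate in steps:
        try:
            action()
            done.append(compensate)
        except Exception:
            for comp in reversed(done):
                comp()       # undo the net effect
            return False     # saga failed, but state is acceptable
    return True
```

A production orchestrator adds persistence, retries, and idempotent compensations on top of this loop, but the shape — forward steps paired with backward compensations — is the 1987 design.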
Every e-commerce checkout, every loan approval across multiple services, every subscription sign-up that creates accounts, charges cards, and provisions access — these are all Sagas. The pattern predates microservices by decades because the problem predates microservices by decades.
Local-First: Partition Mode as the Default
What if partition mode wasn't the edge case you designed for — but the state your app always runs in?
Most applications treat the server as the source of truth. Every read, every write, every operation goes through it. The network is in the critical path for everything. When the network is slow, the app is slow. When it's gone, the app is broken.
Local-first software inverts this:
- Data lives on your device
- The server is a sync coordinator, not an authority
- Every read and write hits a local database — no round trip, no spinner, instant
- The application works completely offline because there's no difference between offline and normal
When devices reconnect, CRDT-based sync handles diverged state in the background. Recovery becomes routine infrastructure, not an incident.
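The inversion can be sketched as a store that always writes locally and queues changes for background sync (a toy model; real local-first apps layer CRDT merging, as in Y.js or Automerge, on top of the sync step):

```python
class LocalFirstStore:
    """Every write lands in the local database immediately; a pending
    queue records what still needs syncing. The server (not modeled
    here) would merge drained changes in the background."""

    def __init__(self):
        self.data = {}     # the local database: always readable
        self.pending = []  # writes awaiting background sync

    def write(self, key, value):
        self.data[key] = value           # instant, no round trip
        self.pending.append((key, value))

    def read(self, key):
        return self.data.get(key)        # never blocked on the network

    def sync(self, send):
        """On reconnect, drain the queue through a transport callback."""
        while self.pending:
            send(self.pending.pop(0))
```

Notice there is no offline branch anywhere: offline and normal operation are the same code path, and sync is just housekeeping.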
Tools like Y.js, Automerge, and purpose-built sync engines are built for exactly the partition recovery problem Brewer was describing in 2012, now solved at the library level.
Your notes app working at 35,000 feet isn't a pleasant UX feature. That's partition recovery built into the architecture from the start.
Production Checklist
- Stop treating CAP as a permanent identity. "We're AP" labels one property under one failure mode. Design your system to maximize both during normal operation.
- Write the operations table. For every critical operation: what happens during a partition? Allow, block, queue, or bound?
- Design recovery mode explicitly. When a partition heals, nodes have divergent state. Recovery is not a return to normal — it's a merge. Design the merge before the partition occurs.
- Apply PACELC to every replication setting. Every consistency level, read preference, and sync/async choice is a latency vs. consistency trade-off. Name it deliberately.
- Identify CRDT candidates. Counters, sets, and append-only logs often don't need coordination. Mathematical convergence eliminates the trade-off entirely.
- Model compensating transactions. For cross-service workflows, every forward step should have a defined compensation. If you can't define the compensation, reconsider the operation.
- Never call your distributed system CA. Partitions aren't optional. Design for them or discover them at 3 a.m.
Conclusion
The CAP theorem is still useful. The triangle still belongs in system design interviews. But knowing what it describes — and what it doesn't — makes you a better engineer than the label alone ever could.
Saying "we're AP" describes one property of your system under one failure mode. It doesn't answer what operations run during a partition, how you recover, which invariants you protect, or what happens when something inconsistent has already reached a user.
Consistency isn't a global dial. Cassandra lets you set it per query. DynamoDB prices it per operation. A user's profile photo doesn't need the same guarantee as a financial write. Your architecture should reflect that granularity.
The CAP theorem forces you to ask the right questions. They don't have clean answers — but it's far better to work through them in a design review than to discover the answers during an incident.
References: Brewer, E. (2012). "CAP Twelve Years Later: How the 'Rules' Have Changed." IEEE Computer. Abadi, D. (2012). "Consistency Tradeoffs in Modern Distributed Database System Design." IEEE Computer. Garcia-Molina, H. & Salem, K. (1987). "Sagas." ACM SIGMOD.
FAQ
What is the CAP theorem in simple terms?
CAP states that a distributed system can guarantee at most two of three properties: Consistency, Availability, and Partition Tolerance. Most engineers learn to pick CP or AP at design time.
Why is the CAP theorem considered misleading?
Eric Brewer himself published a follow-up in 2012 noting that all three properties are treated as binary when they're actually spectrums, CAP only describes behavior during partitions (which are rare), and it ignores latency entirely.
What is PACELC?
PACELC extends CAP by adding a second trade-off: even when no partition exists, you must choose between Latency and Consistency. This is the trade-off behind every read preference, replication setting, and consistency level you configure.
What is a CRDT and when should I use it?
A CRDT (Conflict-free Replicated Data Type) is a data structure where concurrent updates from multiple nodes always converge to the same result without coordination. Use it for counters, sets, and collaborative editing where mathematical commutativity and idempotency can eliminate the C vs A trade-off entirely.
What is the Saga pattern in distributed systems?
A Saga replaces a single distributed transaction with a sequence of small local transactions, each committing immediately. When a step fails, compensating transactions run in reverse. It's the formal solution to long-running cross-service workflows — described in a 1987 paper, foundational to modern microservices.
What does partition mode mean in system design?
Partition mode is an intentional third operational state — separate from normal operation and full failure — where a node is unreachable and your system applies different, pre-designed rules per operation: allow, block, queue, or bound.
Is the CA corner of CAP valid for distributed systems?
No. Calling a distributed system CA implies you've given up partition tolerance — but network partitions aren't a design parameter you choose. CA on a distributed system just means you'll be making trade-off decisions under pressure instead of deliberately.