If you’ve spent any time in the trenches of backend engineering, you’ve probably heard of Jepsen. Or maybe you just remember that catchy Carly Rae Jepsen song from 2012. For a solid decade, those two things have been inextricably linked in the world of distributed systems. Honestly, it started as a bit of a joke by distributed-systems researcher Kyle Kingsbury, better known by his handle "aphyr", but it quickly turned into the most feared and respected testing suite in the software industry.
The premise was simple: "I just met you, and this is crazy, but here’s my data, so persist it maybe?"
Kingsbury wasn't just trolling. He was systematically breaking the world's most popular databases. He took systems that claimed to be "CP" (Consistent and Partition-tolerant) under the CAP theorem and showed that, in reality, they were often neither. From MongoDB to Redis, and Cassandra to Hazelcast, almost nobody survived the first round of the Jepsen Call Me Maybe series without some embarrassing data loss.
The Day the Marketing Died
Before Jepsen, we mostly took a database vendor's word for it. If a manual said a write was "guaranteed," we believed it. Then Kingsbury started introducing "The Nemesis."
The Nemesis is a specialized part of the Jepsen framework that does exactly what you'd fear in a production environment. It cuts network cables (virtually). It kills nodes. It creates "split-brain" scenarios where two parts of a cluster both think they're the boss. It causes clock skew so bad that "now" becomes a matter of opinion rather than a fact.
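To get a feel for what the Nemesis does, here is a minimal sketch of the simplest fault it injects: a two-sided network partition. This is not Jepsen's actual Clojure implementation; it assumes disposable test VMs you can reach over key-based SSH, and the node IPs are placeholders.

```python
# Minimal Nemesis-style partition sketch (illustration only, not Jepsen's code).
# Assumes disposable test VMs reachable over key-based SSH; IPs are placeholders.
import subprocess
import time

NODE_A = "10.0.0.11"  # one side of the partition (placeholder)
NODE_B = "10.0.0.12"  # the other side (placeholder)

def run_on(node: str, command: str) -> None:
    """Run a shell command on a test node over SSH."""
    subprocess.run(["ssh", f"root@{node}", command], check=True)

def partition(a: str, b: str) -> None:
    """Drop all traffic between two nodes: a classic split-brain setup."""
    run_on(a, f"iptables -A INPUT -s {b} -j DROP")
    run_on(b, f"iptables -A INPUT -s {a} -j DROP")

def heal(a: str, b: str) -> None:
    """Remove the DROP rules so the two halves can talk to each other again."""
    run_on(a, f"iptables -D INPUT -s {b} -j DROP")
    run_on(b, f"iptables -D INPUT -s {a} -j DROP")

if __name__ == "__main__":
    partition(NODE_A, NODE_B)
    time.sleep(30)            # keep clients writing into both halves meanwhile
    heal(NODE_A, NODE_B)      # then see which acknowledged writes survived
```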
When he applied these faults to MongoDB in his 2013 "Call Me Maybe" post, the results were catastrophic. Despite the documentation's promises, the system dropped writes like they were hot. Specifically, under a network partition, the primary node would keep accepting writes that it could never replicate to the rest of the cluster. Once the network healed, those writes were just... gone. Rolled back into the ether.
It wasn't just Mongo.
Redis Sentinel, which was supposed to handle failover, happily let an isolated primary keep accepting writes that were silently discarded once a new leader took over. Even PostgreSQL, the gold standard for many, had a window where a commit could succeed on the server while the acknowledgment was lost in flight, leaving the client with no idea whether its transaction actually happened.
Why "Maybe" Matters
You might think, "Well, I don't have network partitions every day."
You're wrong. You definitely do.
In a distributed system, a "partition" isn't always a backhoe cutting a fiber line. It can be a long Garbage Collection (GC) pause that makes a node stop responding for 10 seconds. It can be a congested top-of-rack switch. To the rest of the cluster, a slow node is a dead node.
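To see why a pause and a crash look identical from the outside, consider a toy timeout-based failure detector. The five-second window and the names are illustrative, not taken from any particular system:

```python
import time

HEARTBEAT_TIMEOUT_S = 5.0  # illustrative; real systems tune this carefully

class FailureDetector:
    """Toy detector: no heartbeat inside the window means the node is 'dead',
    whether it actually crashed or is just sitting in a long GC pause."""

    def __init__(self) -> None:
        self.last_heartbeat: dict[str, float] = {}

    def record_heartbeat(self, node: str) -> None:
        self.last_heartbeat[node] = time.monotonic()

    def is_alive(self, node: str) -> bool:
        last = self.last_heartbeat.get(node)
        return last is not None and (time.monotonic() - last) < HEARTBEAT_TIMEOUT_S
```

A node stuck in a 10-second GC pause misses the window, gets declared dead, and the cluster may elect a new leader. When the old node wakes up, it still believes it is the primary, and you have the split-brain from the previous section.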
The Jepsen Call Me Maybe series proved that most databases handle these "slight" delays incredibly poorly. They default to performance over safety. Most people use the default settings, which means most people are running systems that will silently lose data the moment the network gets a little jittery.
The Anatomy of a Jepsen Test
How does this actually work? Kingsbury uses Clojure to build a control node that manages a cluster of "worker" nodes (usually Debian VMs).
- The Client: A set of processes that perform operations like "put" and "get" on a register or a set.
- The History: Jepsen records every single operation—when it started, what it tried to do, and whether the database said "OK" or "I failed."
- The Nemesis: While the clients are working, the Nemesis is busy breaking things.
- The Checker: This is the magic. After the chaos is over, Jepsen uses a library called Knossos to check the history against a consistency model like "linearizability."
If the database said a write was successful, but a subsequent read can't find it, the test fails. It sounds basic, but you’d be surprised how many million-dollar database engines failed this "basic" test.
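Knossos does full linearizability checking over the complete history, which is far subtler than this. But as a drastically simplified sketch, here is the core idea in code: flag any acknowledged write that no later read ever observes. The dict-based history format is made up for illustration, and a real checker must also account for legitimate overwrites.

```python
# Toy stand-in for Jepsen's checker; real verification uses Knossos and full
# linearizability. This only catches the most blatant "lost write" case.
def find_lost_writes(history: list[dict]) -> list:
    lost = []
    for i, op in enumerate(history):
        if op["f"] == "write" and op["ok"]:
            seen_later = any(
                later["f"] == "read" and later["ok"] and later["value"] == op["value"]
                for later in history[i + 1:]
            )
            if not seen_later:
                lost.append(op["value"])
    return lost

history = [
    {"f": "write", "value": 1, "ok": True},
    {"f": "write", "value": 2, "ok": True},  # acknowledged during the partition...
    {"f": "read",  "value": 1, "ok": True},  # ...but never seen by any read again
]
print(find_lost_writes(history))  # -> [2]
```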
It’s Not Just About Failing
The most important impact of the Jepsen series wasn't just making developers look bad. It actually forced the industry to get better.
After the initial "Call Me Maybe" posts, database teams started hiring Kingsbury to audit their systems. This led to massive improvements in the safety of systems like etcd (the heart of Kubernetes), CockroachDB, and TiDB. They didn't want to be the next headline on aphyr.com showing a graph of 50% data loss.
For example, when Jepsen tested etcd and Consul in 2014, it exposed stale reads and pushed both projects toward properly quorum-checked reads. It also pushed MongoDB to tighten its defaults over the years, from "unacknowledged" writes (which is basically throwing data into a black hole), to "acknowledged", and eventually to "majority."
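Defaults aside, you can opt into the safe behavior explicitly. Here is a minimal sketch with the official PyMongo driver, assuming a three-node replica set; the hostnames, database, and collection names are placeholders.

```python
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern
from pymongo.read_concern import ReadConcern

# Placeholders: point this at your own replica set.
client = MongoClient(
    "mongodb://db1.example.com,db2.example.com,db3.example.com/?replicaSet=rs0"
)

# Ask for majority-acknowledged writes and majority-committed reads explicitly,
# instead of trusting whatever the server or driver version defaults to.
orders = client.shop.get_collection(
    "orders",
    write_concern=WriteConcern(w="majority", wtimeout=5000),
    read_concern=ReadConcern("majority"),
)
orders.insert_one({"order_id": 42, "status": "paid"})
```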
How to Protect Your Own Data
So, what does this mean for you, the person actually building apps? You can't just assume the database "works."
First, read the Jepsen reports for the technology you use. They are dense, and honestly, a bit intimidating if you aren't into formal logic, but they contain the specific configurations you need to stay safe. If you're using Cassandra, you need to understand that "Consistency Level: ONE" is a gamble. If you're using Redis, you need to know that its replication is asynchronous, so acknowledged writes can vanish during a failover.
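For Cassandra, the fix is to request a quorum per statement rather than accepting ONE. A minimal sketch with the DataStax Python driver follows; the contact points, keyspace, and table are placeholders.

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Placeholders: point this at your own cluster and keyspace.
cluster = Cluster(["cass1.example.com", "cass2.example.com", "cass3.example.com"])
session = cluster.connect("shop")

# ONE acknowledges after a single replica: fast, but a gamble under partitions.
# QUORUM waits for a majority of replicas, so an acked write survives the loss
# of a minority of nodes.
insert = SimpleStatement(
    "INSERT INTO orders (order_id, status) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(insert, (42, "paid"))
```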
Second, check your defaults. Almost every database ships with "unsafe" defaults because they make the benchmarks look better. They want to show 100,000 writes per second. They don't mention that 1,000 of those might vanish if a switch hiccups.
Finally, embrace the reality of the network. The network is not reliable. Latency is not zero. Bandwidth is not infinite. This is the "Fallacies of Distributed Computing" 101, yet we forget it every time we see a shiny new database UI.
Actionable Steps for Reliability
If you want to ensure your system isn't just a "Call Me Maybe" disaster waiting to happen, start here:
- Audit your "Write Concern": Ensure you are writing to a quorum (N/2 + 1 nodes) if you care about durability.
- Verify your "Read Concern": In many systems, reading from a "Primary" doesn't guarantee you're seeing the latest data unless you use specific flags.
- Run Chaos Tests: You don't need the full Jepsen suite to start. Tools like Chaos Mesh or even simple iptables scripts can help you see how your app behaves when the database becomes unreachable (see the sketch after this list).
- Use Consensus where it counts: For critical metadata, use systems built on Raft or Paxos that have been vetted by the community and Jepsen audits.
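To tie the chaos test and the write-concern audit together, here is a crude, database-agnostic durability check. The put and get functions are placeholders for whatever client you actually use; run the audit while a chaos tool (Chaos Mesh, the iptables sketch above, or similar) disturbs the network, then verify after it heals.

```python
# Crude durability audit (sketch). `put` and `get` are placeholders you wire up
# to your real client, using your strongest write and read settings.
import uuid
from typing import Optional

def put(key: str, value: str) -> bool:
    """Placeholder: write through your real client; return True only if acknowledged."""
    raise NotImplementedError

def get(key: str) -> Optional[str]:
    """Placeholder: read through your real client after the network has healed."""
    raise NotImplementedError

def durability_audit(num_writes: int = 1000) -> list[str]:
    acknowledged = []
    for _ in range(num_writes):
        key = str(uuid.uuid4())
        try:
            if put(key, "payload"):
                acknowledged.append(key)  # the database said "OK"
        except Exception:
            pass  # timeouts and errors are expected while the chaos runs
    # Every key the database acknowledged must still be readable afterwards.
    return [key for key in acknowledged if get(key) is None]
```

Any key that comes back from this function was confirmed by the database and then lost, which is exactly the failure mode the "Call Me Maybe" series kept finding.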
The era of "maybe" should be over. By understanding the failures exposed by Jepsen, we can finally start building systems that actually do what they say on the tin.
For further reading, check out the official Jepsen archives to see if your favorite database has already been put through the wringer.
Next Step: You should review your current production database configuration to see if "Majority" write concerns are enabled for your most critical transactions.
Key Insight: Distributed consistency is a spectrum, not a toggle; always trade performance for safety explicitly rather than by accident.