Distributed Locks Are Dead, Long Live Distributed Locks

"Distributed locks aren't real," "Anyone who's trying to sell you a distributed lock is selling you sawdust and lies." This may sound rather bleak, but it doesn't say that locking, itself, is impossible in a distributed system: It's just that some like to remind us that all of the system's components must participate in the protocol. This blog post is the story of how we implemented a distributed locking protocol that gives your components a straightforward way of joining in.

The coordinating component of the protocol is an object that extends the semantics of java.util.concurrent.locks.Lock. We call it FencedLock, following the naming used in Martin Kleppmann's 2016 post "How to Do Distributed Locking."

Lessons Learned From 2+ Years of Nightly Jepsen Tests

Since the pre-1.0 betas of CockroachDB, we've been using Jepsen to test the correctness of the database in the presence of failures. We have re-run these tests every night as a part of our nightly test suite. Last fall, these tests found their first post-release bug. This blog post is a more digestible walkthrough of that discovery (many of the links here point to specific comments in that issue's thread to highlight the most important moments).

Two Years of Jepsen Testing

Running Jepsen tests in an automated fashion has been somewhat challenging. The tests run in a complex environment with network dependencies, spawn multiple cloud VMs, and have many potential points of failure, so they turn out to be rather flaky. If you search our issue tracker for Jepsen failures you'll see a lot of issues, but before the bug we're discussing here, they were all benign — failures of our automation setup, not an inconsistency or bug in CockroachDB itself. By now, though, we've worked out the kinks and have the tests running reliably.