How Stability Targets Keep Your Apps on the Mark

Should applications be released when we don’t know how stable the user experience is? The obvious answer is the right one: of course not.

Application stability is the most important metric that software companies can use. Yet, many organizations don’t measure stability at all. Or, if they track errors, this information isn’t used by engineering and product organizations to make informed development decisions.

Lessons Learned From 2+ Years of Nightly Jepsen Tests

Since the pre-1.0 betas of CockroachDB, we've been using Jepsen to test the correctness of the database in the presence of failures. We have re-run these tests every night as a part of our nightly test suite. Last fall, these tests found their first post-release bug. This blog post is a more digestible walkthrough of that discovery (many of the links here point to specific comments in that issue's thread to highlight the most important moments).

Two Years of Jepsen Testing

Running Jepsen tests in an automated fashion has been somewhat challenging. The tests run in a complex environment with network dependencies, spawn multiple cloud VMs, and have many potential points of failure, so they turn out to be rather flaky. If you search our issue tracker for Jepsen failures you'll see a lot of issues, but before the bug we're discussing here, they were all benign — failures of our automation setup, not an inconsistency or bug in CockroachDB itself. By now, though, we've worked out the kinks and have the tests running reliably.