Approximation Algorithms for Your Database

In an earlier blog post, I wrote about how breaking problems down into a MapReduce style approach can give you much better performance. We’ve seen that Citus is orders of magnitudes faster than single node databases when we’re able to parallelize the workload across all the cores in a cluster. And while count (*) and avg is easy to break into smaller parts, I immediately got the question what about count distinct or the top from a list or median.

Exact distinct count is admittedly harder to tackle in a large distributed setup because it requires a lot of data shuffling between nodes. Count distinct is indeed supported within Citus, but at times can be slow when dealing with especially larger datasets. Median across any moderate to large size dataset can become completely prohibitive for end users. Fortunately, for nearly all of these, there are approximation algorithms that provide close-enough answers and do so with impressive performance characteristics.