Fun With SQL: Text and System Functions

SQL by itself is great and powerful, and Postgres supports a broad array of more modern SQL including things like window functions and common table expressions. But rarely do I write a query where I don't want to tweak or format the data I'm getting back out of the database. Thankfully, Postgres has a rich array of functions to help with converting or formatting data. These built-in functions save me from having to do the logic elsewhere or write my own functions. In other words, I do less work because Postgres has already done it for me, which I'm always happy about.

We've covered a set of functions earlier, and today we're going to look at some different categories of functions to dive deeper.Image title

Thinking in MapReduce, but With SQL

For those considering Citus, if your use case seems like a good fit, we often are willing to spend some time with you to help you get an understanding of the Citus database and what type of performance it can deliver. We commonly do this in a roughly two-hour pairing session with one of our engineers. We'll talk through the schema, load up some data, and run some queries. If we have time at the end, it is always fun to load up the same data and queries into single node Postgres and see how we compare. After seeing this for years, I still enjoy seeing performance speed ups of 10 and 20x over a single node database, and in cases as high as 100x

And the best part is it didn't take heavy re-architecting of data pipelines. All it takes is just some data modeling and parallelization with Citus. 

Approximation Algorithms for Your Database

In an earlier blog post, I wrote about how breaking problems down into a MapReduce style approach can give you much better performance. We’ve seen that Citus is orders of magnitudes faster than single node databases when we’re able to parallelize the workload across all the cores in a cluster. And while count (*) and avg is easy to break into smaller parts, I immediately got the question what about count distinct or the top from a list or median.

Exact distinct count is admittedly harder to tackle in a large distributed setup because it requires a lot of data shuffling between nodes. Count distinct is indeed supported within Citus, but at times can be slow when dealing with especially larger datasets. Median across any moderate to large size dataset can become completely prohibitive for end users. Fortunately, for nearly all of these, there are approximation algorithms that provide close-enough answers and do so with impressive performance characteristics.

Contributing to Postgres

About once a month, I get this question: “How do I contribute to Postgres?.” PostgreSQL is a great database with a solid code base, and for many of us, contributing back to open source is a worthwhile cause. The thing about contributing back to Postgres is you generally don’t just jump right in and commit code on day one. So figuring out where to start can be a bit overwhelming. If you’re considering getting more involved with Postgres, here are a few tips that you may find helpful.

Follow What’s Happening

The number one way to familiarize yourself with the Postgres development and the code community is to subscribe to the mailing lists. Even if you’re not considering contributing back, the mailing lists can be a great place to level up your knowledge and skills around Postgres. Fair warning: the mailing lists can be very active. But that’s ok, as you don’t necessarily need to read every email as it happens — daily digests work just fine. There is a long list of mailing lists you can subscribe to, but here are a few I think you should know about: