Apache Druid: Making 1000+ QPS for Analytics Look Easy

Analytics use cases are evolving, with high-volume, low-latency queries on the rise. But scaling analytics for high queries per second (QPS) requires careful consideration. If your queries retrieve single rows from tables with few columns or rows, or aggregate small amounts of data, then virtually any database can meet your QPS requirements.

But things start getting hard if you have an analytics application (or plan to build one) that executes lots and lots of aggregations and filters across high-dimensional, high-cardinality data at scale: the kind of application where many users should be able to ask any question and get their answers instantly, without constraints on the type of queries or the shape of the data.
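To make that concrete, here is a minimal sketch of the kind of query such an application fires repeatedly at high QPS, sent over Druid's SQL HTTP API using Python's requests library. The endpoint, datasource (clickstream), and column names are hypothetical placeholders, not part of the original article.

```python
import requests

# Hypothetical Druid router endpoint; adjust host/port for your cluster.
DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql/"

# A typical high-QPS analytics query: a filtered aggregation over
# high-cardinality dimensions, scoped to a recent time window.
query = """
SELECT country, device_type, COUNT(*) AS events, SUM(revenue) AS revenue
FROM clickstream
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '15' MINUTE
  AND campaign_id = 'summer-sale'
GROUP BY country, device_type
ORDER BY revenue DESC
LIMIT 20
"""

# Druid's SQL API accepts a JSON body with the query text and
# returns result rows as a JSON array.
response = requests.post(DRUID_SQL_URL, json={"query": query}, timeout=5)
response.raise_for_status()
for row in response.json():
    print(row)
```

An application serving thousands of concurrent users issues queries like this continuously, each one sliced by different filters, which is exactly the workload that separates a high-QPS analytics database from a general-purpose one.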

Things to Consider When Scaling Analytics for High QPS

For some, the thought of pairing analytics with high QPS (queries per second) may seem unlikely. After all, we typically think of analytics as the occasional report or dashboard of business metrics.

But analytics use cases are evolving, with high-volume, low-latency queries on the rise. Companies like Confluent, Target, and Pinterest use analytics for much more than weekly executive summaries. They're making analytics available across their organizations; their teams are exploring high-dimensional raw data in a free-flowing, ad hoc manner; and they're powering analytics applications and data products for thousands to millions of external users and customers.

Constructing Real-Time Analytics: Fundamental Components and Architectural Framework — Part 2

In Part 1, I discussed the growing demand for real-time analytics in today's fast-paced world, where instant results and immediate insights are crucial. I compared real-time analytics with traditional analytics, highlighting data freshness and speed to insight as key differentiators, and emphasized the importance of selecting the appropriate data architecture, raising considerations such as events per second, latency, dataset size, query performance, query complexity, data stream uptime, joining multiple event streams, and integrating real-time and historical data. I also teased this Part 2, which delves into designing an appropriate architectural solution for real-time analytics.

Building Blocks 

A powerful database is only part of the equation for effectively leveraging real-time analytics. The process begins with the capacity to connect, transport, and manage real-time data, which introduces our first foundational component: event streaming.
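As a rough illustration of that component in practice, here is a minimal sketch of consuming an event stream, assuming a Kafka cluster and the confluent-kafka Python client. The broker address, consumer group, and topic name are placeholders for this example.

```python
import json
from confluent_kafka import Consumer

# Placeholder broker address, group id, and topic; substitute your own.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "realtime-analytics-demo",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["user-events"])

try:
    while True:
        msg = consumer.poll(1.0)  # wait up to 1s for the next event
        if msg is None:
            continue
        if msg.error():
            print(f"stream error: {msg.error()}")
            continue
        # Each message is one event; downstream, this stream would feed
        # a real-time analytics database such as Druid.
        event = json.loads(msg.value())
        print(event)
finally:
    consumer.close()
```

The point of the sketch is the shape of the pipeline: events arrive continuously, one at a time, and the architecture's job is to move them from producers to an analytics database with minimal delay.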

Constructing Real-Time Analytics: Fundamental Components and Architectural Framework — Part 1

The old adage "patience is a virtue" seems to have lost its relevance in today's fast-paced world. In an era of instant gratification, nobody is inclined to wait. If Netflix takes too long to buffer, users won't hesitate to switch platforms. If the closest Lyft seems too far away, users will readily opt for an alternative.

The demand for instant results is pervading the realm of data analytics as well, particularly when dealing with large datasets. The capacity to provide immediate insights, make swift decisions, and respond to real-time data without any delay is becoming crucial. Companies like Netflix, Lyft, Confluent, and Target, along with thousands of others, have managed to stay at the forefront of their industries partially due to their adoption of real-time analytics and the data architectures that facilitate such instantaneous, analytics-driven operations.

A Data Warehouse Alone Won’t Cut It for Modern Analytics

The world of data warehouses as we know it can be traced back to 1993, when Dr. E.F. Codd conceived the idea of OLAP. A marvel for its time, this then-new processing paradigm for decision support systems paved the way for modern data warehouses. Codd's influence established the idea of a purpose-built data system for processing analytics queries and aggregations on large datasets, and it became clear that keeping transactions and analytics in separate relational databases made sense.

Since OLAP came about, the data warehouse market has seen quite an evolution. Over the last 20 years, the market adopted columnar storage with Vertica, then became cloudy with Snowflake, and is now beginning to morph into lakehouses. Despite all this change in technology, however, these systems all address the same use case: the classic business intelligence and reporting workflow.

Analytics Apps That Will Take Center Stage

Developers are increasingly at the forefront of analytics innovation, driving an evolution beyond traditional BI and reporting to modern analytics applications. Fueled by the digitization of businesses, these applications are being built for real-time observability at scale for cloud products and services, next-gen operational visibility for security and IT, revenue-impacting insights and recommendations, and analytics extended to external customers. Apache Druid has become the database of choice for analytics applications, trusted by developers at 1,000+ companies, including Netflix, Confluent, and Salesforce.
