Steps To Industry-Leading Query Speed: Evolution of the Apache Doris Execution Engine

What makes a modern database system? The three key modules are the query optimizer, the execution engine, and the storage engine. Among them, the execution engine is to the DBMS what the chef is to a restaurant. This article focuses on the execution engine of the Apache Doris data warehouse, explaining the secret to its high performance.

To illustrate the role of the execution engine, let's follow the execution process of an SQL statement: 
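For a sense of what gets traced, here is a minimal sketch of such a statement being sent to Doris. The tables, connection details, and the pymysql client (Doris speaks the MySQL protocol) are illustrative, not from the original post:

```python
import pymysql  # Doris is MySQL-protocol compatible, so a stock MySQL client works

# Illustrative connection details; 9030 is the default query port of the Doris frontend.
conn = pymysql.connect(host="127.0.0.1", port=9030, user="root",
                       password="", database="demo", autocommit=True)

# The frontend parses and plans a statement like this; the execution engine
# then runs the scan -> join -> aggregate pipeline on the backend nodes.
with conn.cursor() as cur:
    cur.execute("""
        SELECT c.region, SUM(o.amount) AS revenue
        FROM orders o JOIN customers c ON o.customer_id = c.id
        GROUP BY c.region
        ORDER BY revenue DESC
    """)
    for region, revenue in cur.fetchall():
        print(region, revenue)
```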

Another Lifesaver for Data Engineers: Apache Doris Job Scheduler for Task Automation

Job scheduling is an important part of data management, as it enables regular data updates and cleanups. In a data platform, it is often undertaken by workflow orchestration tools like Apache Airflow and Apache DolphinScheduler. However, adding another component to the data architecture also means investing extra resources in management and maintenance. That's why Apache Doris 2.1.0 introduces a built-in Job Scheduler, which is tailored to Apache Doris and brings higher scheduling flexibility and architectural simplicity.

The Doris Job Scheduler triggers pre-defined operations at specific time points or intervals, thus allowing for efficient and reliable task automation. Its key capabilities include:
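Among those capabilities is recurring SQL execution. For a taste of the syntax, here is a minimal sketch following the CREATE JOB statement documented for Doris 2.1; the table names, the schedule, and the connection details are made up:

```python
import pymysql  # Doris is MySQL-protocol compatible

conn = pymysql.connect(host="127.0.0.1", port=9030, user="root",
                       password="", database="demo", autocommit=True)

# A recurring job that rolls raw orders up into a daily summary every night.
with conn.cursor() as cur:
    cur.execute("""
        CREATE JOB nightly_rollup
        ON SCHEDULE EVERY 1 DAY STARTS '2024-07-01 03:00:00'
        DO
        INSERT INTO sales_daily
        SELECT order_date, SUM(amount) FROM orders_raw GROUP BY order_date
    """)
```

Because the job lives inside Doris, there is no external scheduler to deploy, monitor, or upgrade.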

Breaking Down Data Silos With a Unified Data Warehouse: An Apache Doris-Based CDP

The data silos problem is like arthritis for online businesses: almost every one of them develops it as they age. Businesses interact with customers via websites, mobile apps, H5 pages, and end devices. For one reason or another, it is tricky to integrate the data from all these sources. Data stays where it is and cannot be interrelated for further analysis. That's how data silos form. The bigger your business grows, the more diversified your customer data sources will be, and the more likely you are to be trapped by data silos.

This is exactly what happened to the insurance company I'm going to talk about in this post. By 2023, they had served over 500 million customers and signed 57 billion insurance contracts. When they started to build a customer data platform (CDP) to accommodate data of that size, they used multiple components.

A Financial Anti-Fraud Solution Based on the Apache Doris Data Warehouse

Financial fraud prevention is a race against time. Implementation-wise, it relies heavily on data processing power, especially with large datasets. Today, I'm going to share with you the use case of a retail bank with over 650 million individual customers. They compared analytics components including Apache Doris, ClickHouse, Greenplum, Cassandra, and Kylin. After five rounds of deployment and comparison based on 89 custom test cases, they settled on Apache Doris because they witnessed a six-fold increase in writing speed and faster multi-table joins compared to the mighty ClickHouse.

I will get into the details of how the bank built its fraud risk management platform on Apache Doris and how it performs.

How Inverted Index Accelerates Text Searches by 40 Times

As an open-source real-time data warehouse, Apache Doris provides a rich choice of indexes to speed up data scanning and filtering. Based on user involvement, they can be divided into built-in smart indexes and user-created indexes. The former are automatically generated by Apache Doris on data ingestion, such as the ZoneMap index and prefix index, while the latter are indexes users create for various use cases, including the inverted index and NGram BloomFilter index.

This post is a deep dive into the inverted index and NGram BloomFilter index, providing a hands-on guide to applying them for various queries.
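As a preview, here is a hedged sketch of an inverted index in action. The schema, properties, and connection details are illustrative; the MATCH_ANY predicate follows the Doris inverted-index syntax:

```python
import pymysql  # Doris is MySQL-protocol compatible

conn = pymysql.connect(host="127.0.0.1", port=9030, user="root",
                       password="", database="demo", autocommit=True)

with conn.cursor() as cur:
    # A log table whose message column carries an inverted index
    # with an English tokenizer.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS app_logs (
            ts DATETIME,
            msg STRING,
            INDEX idx_msg (msg) USING INVERTED PROPERTIES("parser" = "english")
        )
        DUPLICATE KEY(ts)
        DISTRIBUTED BY HASH(ts) BUCKETS 10
        PROPERTIES ("replication_num" = "1")
    """)
    # MATCH_ANY returns rows containing any of the tokens, answered from
    # the index rather than a full scan of every message.
    cur.execute("SELECT count(*) FROM app_logs WHERE msg MATCH_ANY 'timeout refused'")
    print(cur.fetchone())
```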

Fast, Secure, and Highly Available Real-Time Data Warehousing Based on Apache Doris

This is a whole-journey guide for Apache Doris users, especially those from the financial sector, which requires a high level of data security and availability. If you don't know how to build a real-time data pipeline and make the most of the Apache Doris functionalities, start with this post; you will be loaded with inspiration after reading.

This is the best practice of a non-banking payment service provider that serves over 25 million retailers and processes data from 40 million end devices. Data sources include MySQL, Oracle, and MongoDB. They were using Apache Hive as an offline data warehouse but felt the need to add a real-time data processing pipeline. After introducing Apache Doris, they increased their data ingestion speed by 2 to 5 times, ETL performance by 3 to 12 times, and query execution speed by 10 to 15 times.

Apache Doris Speeds Up Data Reporting, Tagging, and Data Lake Analytics

As much as we say Apache Doris is an all-in-one data platform capable of various analytics workloads, it is always more compelling to demonstrate that with real use cases. That's why I would like to share this user story with you. It is about how one user leverages the capabilities of Apache Doris in reporting, customer tagging, and data lake analytics and achieves high performance.

This fintech service provider is a long-term user of Apache Doris, with almost 10 production clusters, hundreds of Doris backend nodes, and thousands of CPU cores. The total data size is nearly 1 PB. Every day, they run hundreds of workflows simultaneously, receive almost 10 billion new data records, and respond to millions of data queries.

From Elasticsearch to Apache Doris: Upgrading an Observability Platform

Observability platforms are akin to the immune system. Just as immune cells are everywhere in the human body, an observability platform patrols every corner of your devices, components, and architectures, identifying potential threats and proactively mitigating them. Perhaps I have gone too far with that metaphor, because to this day we have never built a system as sophisticated as the human body; but we can always make advancements.

The key to upgrading an observability platform is to increase data processing speed and reduce costs, for two reasons:

Fewer Components, Higher Performance

This post is about building a unified OLAP platform. An insurance company set out to build a data warehouse that could undertake all of its customer-facing, analyst-facing, and management-facing data analysis workloads. The main tasks include:

  • Self-service insurance contract query: This is for insurance customers to check their contract details by contract ID. It should also support filters such as coverage period, insurance type, and claim amount (a query sketch follows this list).
  • Multi-dimensional analysis: Analysts develop reports based on whatever data dimensions they need, so they can extract insights that facilitate product innovation and anti-fraud efforts.
  • Dashboarding: This is to create a visual overview of the insurance sales trends and the horizontal and vertical comparison of different metrics.
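Below is a hypothetical sketch of the self-service contract query; the schema, filter values, and connection details are made up to show the query shape:

```python
import pymysql  # Doris is MySQL-protocol compatible

conn = pymysql.connect(host="127.0.0.1", port=9030, user="root",
                       password="", database="insurance", autocommit=True)

# Point lookup by contract ID plus optional filters, as a customer-facing
# service would issue it; the table and column names are hypothetical.
with conn.cursor() as cur:
    cur.execute("""
        SELECT contract_id, insurance_type, coverage_start, coverage_end, claim_amount
        FROM insurance_contracts
        WHERE contract_id = %s
          AND insurance_type = %s
          AND coverage_start >= %s
    """, ("C2023-000123", "property", "2023-01-01"))
    print(cur.fetchall())
```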

Component-Heavy Data Architecture

The user started with the Lambda architecture, splitting their data pipeline into a batch processing link and a stream processing link. For real-time data streaming, they applied Flink CDC; for batch import, they combined Sqoop, Python, and DataX to build their own data integration tool named Hisen.

Auto-Synchronization of a Whole MySQL Database for Data Analysis

Flink-Doris-Connector 1.4.0 allows users to ingest a whole database (MySQL or Oracle) that contains thousands of tables into Apache Doris, a real-time analytic database, in one step.

With built-in Flink CDC, the Connector can directly synchronize the table schema and data from the upstream source to Apache Doris, which means users no longer have to write a DataStream program or pre-create mapping tables in Doris. 
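Here is a sketch of how such a sync job is submitted, based on the connector's mysql-sync-database mode. The paths, hosts, and credentials are illustrative, so check the connector documentation for your Flink and Doris versions:

```python
import subprocess

# Submit the whole-database sync job that ships with Flink-Doris-Connector 1.4.0.
# Run from FLINK_HOME; every value below is illustrative.
subprocess.run([
    "bin/flink", "run",
    "-c", "org.apache.doris.flink.tools.cdc.CdcTools",
    "lib/flink-doris-connector-1.16-1.4.0.jar",
    "mysql-sync-database",
    "--database", "doris_db",                    # target database in Doris
    "--mysql-conf", "hostname=127.0.0.1",
    "--mysql-conf", "username=root",
    "--mysql-conf", "password=****",
    "--mysql-conf", "database-name=mysql_db",    # source database in MySQL
    "--including-tables", ".*",                  # sync every table
    "--sink-conf", "fenodes=127.0.0.1:8030",
    "--sink-conf", "username=root",
    "--sink-conf", "password=****",
    "--sink-conf", "jdbc-url=jdbc:mysql://127.0.0.1:9030",
    "--sink-conf", "sink.label-prefix=sync",
], check=True)
```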

Migrating From ClickHouse to Apache Doris: What Happened?

Migrating from one OLAP database to another is a huge undertaking. Even if you're unhappy with your current data tool and have found some promising candidates, you might still hesitate to perform major surgery on your data architecture because you're uncertain about how things will work out. That's when you need experience shared by someone who has walked the path.

Luckily, a user of Apache Doris has written down their migration process from ClickHouse to Doris, including why they needed the change, what had to be taken care of, and how they compared the performance of the two databases in their environment.

Choosing an OLAP Engine for Financial Risk Management: What To Consider?

From a data engineer's point of view, financial risk management is a series of data analysis activities on financial data. The financial sector imposes its unique requirements on data engineering. This post explains them with a use case of Apache Doris and provides a reference for what you should take into account when choosing an OLAP engine in a financial scenario. 

Data Must Be Combined

The financial data landscape is evolving from standalone systems to distributed, heterogeneous ones. In this use case, the fintech service provider needs to connect the various transaction processing (TP) systems (MySQL, Oracle, and PostgreSQL) of its partnering banks. Before adopting an OLAP engine, they used Kettle to collect data. The ETL tool did not support join queries across different data sources, and it could not store data. The ever-growing data size at the source end was pushing the system toward latency and instability. That's when they decided to introduce an OLAP engine.

Is Your Latest Data Really the Latest? Check Your Data Update Mechanism

In databases, a data update means adding, deleting, or modifying data. Timely data updates are an important part of high-quality data services.

Technically speaking, there are two types of data updates: you either update a whole row (Row Update) or just part of the columns (Partial Column Update). Many databases support both, but in different ways. This post is about one of them, which is simple in execution and efficient in guaranteeing data quality.
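As a preview of the mechanism, here is a hedged sketch of a partial column update on a merge-on-write Unique Key table, using the session variables documented for Doris 2.x; the table, columns, and connection details are made up:

```python
import pymysql  # Doris is MySQL-protocol compatible

conn = pymysql.connect(host="127.0.0.1", port=9030, user="root",
                       password="", database="demo", autocommit=True)

with conn.cursor() as cur:
    # Switch the INSERT into partial-update mode (documented for Doris 2.x;
    # both settings are session-scoped).
    cur.execute("SET enable_unique_key_partial_update = true")
    cur.execute("SET enable_insert_strict = false")
    # Only last_login is written; the untouched columns of user 1001
    # keep their previous values under the Unique Key model.
    cur.execute("""
        INSERT INTO user_profiles (user_id, last_login)
        VALUES (1001, '2024-07-01 10:00:00')
    """)
```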

For Entry-Level Data Engineers: How To Build a Simple but Solid Data Architecture

This article aims to provide a reference for non-tech companies seeking to empower their business with data analytics. You will learn the basics of building an efficient and easy-to-use data system, and I will walk you through every aspect of it with a use case of Apache Doris, an MPP-based analytic data warehouse.

What You Need

This case is about a ticketing service provider who wants a data platform with quick processing, low maintenance costs, and ease of use, and I think they speak for the majority of entry-level database users.

A prominent feature of the ticketing business is the periodic spike in orders right before shows go on. So, from time to time, the company has a huge amount of new data rushing in that requires real-time processing, so they can make timely adjustments during the short sales window. At other times, they don't want to spend too much energy and money on maintaining the data system. Furthermore, a beginner in digital operations who only requires basic analytic functions is better off with a data architecture that is easy to grasp and user-friendly. After research and comparison, they came to the Apache Doris community, and we helped them build a Doris-based data architecture.

Hot-Cold Data Separation: What, Why, and How?

Apparently, hot-cold data separation is hot now. But first of all:

What Is Hot/Cold Data?

In simple terms, hot data is frequently accessed data, while cold data is the data you seldom visit but still need. Normally in data analytics, data is "hot" when it is new and gets "colder" and "colder" as time goes by.
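In Doris terms, the split is declared with a storage policy: data starts on local (hot) storage and cools down to remote (cold) storage after a TTL. Here is a hedged sketch; the property names vary across versions, and the S3 resource remote_s3 is assumed to have been created beforehand:

```python
import pymysql  # Doris is MySQL-protocol compatible

conn = pymysql.connect(host="127.0.0.1", port=9030, user="root",
                       password="", database="demo", autocommit=True)

with conn.cursor() as cur:
    # Move data to the (pre-created) remote_s3 resource once it is 30 days old.
    cur.execute("""
        CREATE STORAGE POLICY cool_after_30d
        PROPERTIES (
            "storage_resource" = "remote_s3",
            "cooldown_ttl" = "2592000"
        )
    """)
    # Attach the policy to a table so its aging data cools down automatically.
    cur.execute("""ALTER TABLE events SET ("storage_policy" = "cool_after_30d")""")
```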

Database Dissection: How Are Fast Data Queries Implemented?

In data analytics, fast query performance is more of a result than a guarantee. What's more important than the result itself is the architectural design and mechanisms that enable it. This is exactly what this post is about. I will put you into context with a typical use case of Apache Doris, an open-source MPP-based analytic database.

The user in this case is an all-category Q&A website. As a billion-dollar listed company, they have their own data management platform. What Doris does is support the data filtering, packaging, analyzing, and monitoring workloads of that platform. Given their huge data size, the user demands quick data loading and quick responses to queries.

Say Goodbye to OOM Crashes

What guarantees system stability in large data query tasks? It is an effective memory allocation and monitoring mechanism. It is how you speed up computation, avoid memory hotspots, promptly respond to insufficient memory, and minimize OOM errors. 

From a database user's perspective, how does bad memory management hurt? Here is a list of things that used to bother our users:

Understanding Data Compaction in 3 Minutes

What is compaction in the database? Think of your disks as a warehouse: The compaction mechanism is like a team of storekeepers (with genius organizing skills like Marie Kondo) who help put away the incoming data. 

In particular, the data (the inflowing cargo in this metaphor) comes in on a "conveyor belt" that does not allow cutting in line. This is how the LSM-Tree (Log-Structured Merge-Tree) works: in data storage, data is written into MemTables in an append-only manner, and the MemTables are then flushed to disks to form files. (These files go by different names in different databases; in my community, we call them rowsets.) Just like putting small boxes of cargo into a large container, compaction means merging multiple small rowset files into a big one, but it does much more than that. As I said, the compaction mechanism is an organizing magician:
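To make the merge step concrete, here is a toy model of compaction in plain Python. It assumes rowsets of key-ordered (key, version, value) tuples and latest-version-wins semantics, as on a Unique Key table; real compaction adds size-tiering, scheduling, and much more:

```python
import heapq

# Each "rowset" is what a MemTable flush leaves on disk: rows sorted by key,
# each carrying the version at which it was written.
rowset_a = [("k1", 1, "v1"), ("k3", 1, "stale")]
rowset_b = [("k2", 2, "v2"), ("k3", 2, "fresh")]

def compact(*rowsets):
    """Merge sorted rowsets into one, keeping only the latest version per key."""
    latest = {}
    for key, version, value in heapq.merge(*rowsets):
        if key not in latest or version > latest[key][0]:
            latest[key] = (version, value)
    return [(k, ver, val) for k, (ver, val) in sorted(latest.items())]

print(compact(rowset_a, rowset_b))
# [('k1', 1, 'v1'), ('k2', 2, 'v2'), ('k3', 2, 'fresh')]
```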