Improving Query Speed to Make the Most Out of Your Data

The world is getting more and more value out of data, as exemplified by the much-talked-about ChatGPT, which I see as a robotic data analyst. However, what matters even more than the data itself is the ability to quickly locate the information you want amid the overflow. So in this article, I will talk about how I improved overall data processing efficiency by optimizing the choice and usage of data warehouses.

Too Much Data on My Plate

The choice of data warehouses was never high on my worry list until 2021. I have been working as a data engineer for a Fintech SaaS provider since its incorporation in 2014. In the company's infancy, we didn't have too much data to juggle. We only needed a simple tool for OLTP and business reporting, and the traditional databases would cut the mustard.
Data Sources, Data Warehouse and Application
But as the company grew, the data we received became overwhelmingly large in volume and increasingly diversified in sources. Every day, we had tons of user accounts logging in and sending myriad requests. It was like collecting water from a thousand taps to put out a million scattered fires in a building, except that you had to bring exactly the amount of water each fire spot needed. Also, we got more and more emails from our colleagues asking if we could make data analysis easier for them. That's when the company assembled a big data team to tackle the beast.

The first thing we did was to revolutionize our data processing architecture. We used DataHub to collect all our transactional and log data and ingest it into an offline data warehouse for data processing (analyzing, computing, etc.). Then the results would be exported to MySQL and forwarded to QuickBI to display the reports visually. We also replaced MongoDB with a real-time data warehouse for business queries.


Data Ingestion, ETL/ELT and Application

This new architecture worked, but there remained a few pebbles in our shoes:
  • We wanted faster responses. MySQL could be slow in aggregating large tables, but our product guys requested a query response time of less than five seconds. So first, we tried to optimize MySQL. Then we also tried to skip MySQL and directly connect the offline data warehouse with QuickBI, hoping that the combination of the former's query acceleration and the latter's caching would do the magic. Still, that five-second goal seemed unreachable. There was a time when I believed the only perfect solution was for the product team to hire people with more patience.
  • We wanted less pain in maintaining dimension tables. The offline data warehouse conducted data synchronization every five minutes, making it unsuitable for scenarios with frequent data updates or deletions. If we needed to maintain dimension tables in it, we would have to filter and deduplicate the data regularly to ensure data consistency. Out of our trouble-averse instinct, we chose not to do so.
  • We wanted support for point queries of high concurrency. The real-time database that we previously used required up to 500ms to respond to highly concurrent point queries in both columnar storage and row storage, even after optimization. That was not good enough.

Hit It Where It Hurts Most

In March 2022, we started our hunt for a better data warehouse. To our disappointment, there was no one-size-fits-all solution. Most of the tools we looked into were only good at one or a few of our tasks, and if we assembled the best performer for each usage scenario, we would end up with a heavy and messy toolkit, which went against our instincts.

So we decided to solve our biggest headache first: slow response, as it was hurting both the experience of our users and our internal work efficiency. To begin with, we tried moving the largest tables from MySQL to Apache Doris, a real-time analytical database that supports the MySQL protocol. That reduced the query execution time by a factor of eight. Then we gradually moved more data into Doris.
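
Because Doris speaks the MySQL protocol, switching the reporting queries over required few client-side changes. The snippet below is only a rough sketch of that idea, assuming the stock MySQL JDBC driver is on the classpath; the host, port, credentials, database, and the transactions table are placeholders rather than our real setup.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DorisReportingQuery {
    public static void main(String[] args) throws Exception {
        // Doris exposes the MySQL wire protocol, so the ordinary MySQL JDBC driver can talk to it.
        // 9030 is the usual FE query port; host, credentials, and schema below are placeholders.
        String url = "jdbc:mysql://doris-fe-host:9030/reporting?useSSL=false";
        try (Connection conn = DriverManager.getConnection(url, "analytics_user", "secret");
             Statement stmt = conn.createStatement();
             // The same aggregation that was slow on MySQL can be issued unchanged.
             ResultSet rs = stmt.executeQuery(
                     "SELECT merchant_id, SUM(amount) AS total " +
                     "FROM transactions GROUP BY merchant_id ORDER BY total DESC LIMIT 20")) {
            while (rs.next()) {
                System.out.printf("%s -> %.2f%n", rs.getString("merchant_id"), rs.getDouble("total"));
            }
        }
    }
}
```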

SAP S/4HANA, Microsoft SQL Integration and Hard Deletion Handling

This article demonstrates the integration of heterogeneous systems and the building of a BI system, focusing mainly on DELTA load issues and how to overcome them. How can we compare the source and target tables when we cannot find a proper way to identify changes in the source table using the SSIS ETL tool?
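
One common way to approach this, when the source offers no reliable change marker, is a full primary-key comparison between the source and target tables. The sketch below only illustrates that idea, not the exact solution described later: the connection strings, the SALES_ORDERS table, and the ORDER_ID column are hypothetical placeholders, and the Microsoft JDBC driver for SQL Server is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.HashSet;
import java.util.Set;

public class HardDeleteDetector {
    // Loads the primary keys returned by the given query into a set.
    private static Set<String> loadKeys(Connection conn, String sql) throws Exception {
        Set<String> keys = new HashSet<>();
        try (Statement stmt = conn.createStatement(); ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                keys.add(rs.getString(1));
            }
        }
        return keys;
    }

    public static void main(String[] args) throws Exception {
        // Connection strings and table/column names are illustrative placeholders.
        try (Connection source = DriverManager.getConnection(
                     "jdbc:sqlserver://slt-replica-host;databaseName=SLT", "user", "pass");
             Connection target = DriverManager.getConnection(
                     "jdbc:sqlserver://azure-sql-host;databaseName=DW", "user", "pass")) {

            Set<String> sourceKeys = loadKeys(source, "SELECT ORDER_ID FROM SALES_ORDERS");
            Set<String> targetKeys = loadKeys(target, "SELECT ORDER_ID FROM SALES_ORDERS");

            // Keys present in the target but missing from the source were hard-deleted upstream.
            targetKeys.removeAll(sourceKeys);
            System.out.println("Rows to delete from the target: " + targetKeys);
        }
    }
}
```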

Systems Used

  • SAP S/4HANA is an Enterprise Resource Planning (ERP) software package meant to cover all day-to-day processes of an enterprise, e.g., order-to-cash, procure-to-pay, finance & controlling, request-to-service, and core capabilities. SAP HANA is a column-oriented, in-memory relational database that combines OLAP and OLTP operations into a single system.
  • SAP Landscape Transformation (SLT) Replication is a trigger-based data replication method in the HANA system. It is a perfect solution for real-time or schedule-based data replication from SAP and non-SAP sources.
  • Azure SQL Database is a fully managed platform as a service (PaaS) database engine that handles most of the database management functions, such as backups, patching, upgrading, and monitoring, with minimal user involvement.
  • SQL Server Integration Services (SSIS) is a platform for building enterprise-level data integration and transformation solutions. SSIS is used to build ETL pipelines and solve complex business problems by copying or downloading files, loading data warehouses, and cleansing and mining data.
  • Power BI is an interactive data visualization software developed by Microsoft with a primary focus on business intelligence.

Business Requirement

Let us first talk about the business requirements. We receive more than 20 different Point-of-Sale (POS) data feeds from online retailers such as Target, Walmart, Amazon, Macy's, Kohl's, and JC Penney. Apart from this, the primary business transactions happen in SAP S/4HANA, and business users require BI reports for analysis purposes.

Building a Data Warehouse, Part 5: Application Development Options

In Part I, we looked at the advantages of building a data warehouse independent of cubes/a BI system, and in Part II, we looked at how to architect a data warehouse's table schema. In Part III, we looked at where to put the data warehouse tables. In Part IV, we looked at how to populate those tables and keep them in sync with your OLTP system. Today, in the last part of this series, we will take a quick look at the benefits of building the data warehouse before we need it for cubes and BI by exploring our reporting and other options.

HTAP: One Size Fits All?

An important idea in the database world is that specialized databases will outperform general-purpose databases. Michael Stonebraker, an A. M. Turing Award Laureate and one of the most influential people in the database world, also discussed this in his paper, One Size Fits All: An Idea Whose Time Has Come and Gone.

This is a rational judgment because it's tough enough to build a database that supports either Online Transactional Processing (OLTP) or Online Analytical Processing (OLAP) workloads, let alone one that supports both at the same time. But the dilemma is that, today, many users face growing demands for mixed OLTP and OLAP workloads. How do we crack this, then?

Improving Backend Performance Part 2/3: Using Database Indexes

Database indexes are a developer concern. They have the potential to improve the performance of search and filter features that use an SQL query in the backend. In the second part of this series of articles, I'll show the impact a database index has on speeding up filters, using a Java web application developed with Spring Boot and Vaadin.
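
To make that concrete before diving in, here is a minimal sketch of how an index can be declared on a JPA entity in a Spring Boot application. The Book entity and its title column are hypothetical stand-ins, not the actual schema of the example application from part 1.

```java
// Uses jakarta.persistence (Spring Boot 3+); older Spring Boot versions use javax.persistence instead.
import jakarta.persistence.Column;
import jakarta.persistence.Entity;
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.Id;
import jakarta.persistence.Index;
import jakarta.persistence.Table;

// Declaring the index on the entity lets Hibernate create it when schema generation is enabled;
// in production the same index is usually added through a migration tool such as Flyway or Liquibase.
@Entity
@Table(name = "book", indexes = @Index(name = "idx_book_title", columnList = "title"))
public class Book {

    @Id
    @GeneratedValue
    private Long id;

    // Filter queries in the backend (e.g., WHERE title LIKE ?) are what this index speeds up.
    @Column(nullable = false)
    private String title;

    public Long getId() { return id; }
    public void setId(Long id) { this.id = id; }
    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }
}
```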

Read part 1 of this series if you want to learn how the example application that we'll use here works. You can find the code on GitHub. Also, if you prefer, I recorded a video version of this article.

Comparing Apache Hive and Spark

Introduction

Hive and Spark are two very popular and successful products for processing large-scale data sets. In other words, they do big data analytics. This article focuses on describing the history and various features of both products. A comparison of their capabilities will illustrate the various complex data processing problems these two products can address.

Moving Data From Cassandra (OLTP) to Data Warehousing

Overview

Data should be streamed to analytics engines in real time or near real time in order to incrementally upload transactional data to a data warehousing system. In my case, the OLTP system is Cassandra and the OLAP system is Snowflake. The OLAP system requires data from Cassandra on a periodic basis. The requirements pertaining to this scenario are:

  1. The frequency of the data copy needs to be reduced drastically.
  2. Data has to be consistent; Cassandra and Snowflake should be in sync.
  3. In a few cases, all mutations have to be captured.
  4. Currently, the production cluster data size is in petabytes, and at least 100 gigabytes of data are generated every hour.

With such granularity requirements, we should not copy the data from the OLTP system to the OLAP system directly, as doing so would be invasive to Cassandra's read and write paths and would impinge on its TPS. Thus, we needed a different solution for copying the Cassandra data to Snowflake.

Applying Graph Analytics to Game of Thrones

In this post, we review how organizations are integrating graph transactions and analytic processing and then dive deeper into graph algorithms. We’ll provide examples of using graph algorithms on Game of Thrones data to illustrate how to get started. Note that portions of this content have been taken from our O’Reilly book, Graph Algorithms: Practical Examples in Apache Spark and Neo4j, which you can download for free.
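
As a taste of how to get started, here is a minimal sketch that streams PageRank scores over a Game of Thrones character graph from Java. It assumes the Graph Data Science library is installed, that a projection named 'got' has already been created from the character interaction data, and that the connection details are placeholders.

```java
import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Record;
import org.neo4j.driver.Result;
import org.neo4j.driver.Session;

public class GotPageRank {
    public static void main(String[] args) {
        // Connection details are placeholders; the 'got' projection is assumed to exist already
        // (created beforehand with a gds graph projection over the character interaction data).
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {

            Result result = session.run(
                "CALL gds.pageRank.stream('got') " +
                "YIELD nodeId, score " +
                "RETURN gds.util.asNode(nodeId).name AS character, score " +
                "ORDER BY score DESC LIMIT 10");

            // Print the ten most influential characters by PageRank score.
            while (result.hasNext()) {
                Record record = result.next();
                System.out.printf("%s: %.4f%n",
                        record.get("character").asString(), record.get("score").asDouble());
            }
        }
    }
}
```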

Neo4j provides native graph storage, compute, and analytics in a unified platform. Our goal is to help organizations reveal how people, processes, locations, and systems are interrelated using a connections-first approach. The Neo4j Graph Platform powers applications tackling artificial intelligence, fraud detection, real-time recommendations, and master data.