SQL Extensions for Time-Series Data in QuestDB

In this tutorial, you will learn about QuestDB's SQL extensions, which are particularly useful for working with time-series data. Using sample data sets, you will learn how designated timestamps work and how to use extended SQL syntax to write queries on time-series data.
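
To give a flavor of these extensions before diving in, here is a minimal sketch that sends two QuestDB-specific queries over the database's HTTP /exec endpoint: SAMPLE BY to downsample rows into hourly buckets, and LATEST ON to fetch the most recent row per series. The trades table, its columns, and the localhost address are placeholders for illustration.

```python
import requests

QUESTDB_EXEC = "http://localhost:9000/exec"  # default QuestDB HTTP endpoint (assumed local install)

def run_query(sql: str) -> dict:
    """Send a SQL statement to QuestDB's REST /exec endpoint and return the JSON result."""
    response = requests.get(QUESTDB_EXEC, params={"query": sql})
    response.raise_for_status()
    return response.json()  # response includes column metadata and a dataset of rows

# Downsample a hypothetical 'trades' table into hourly averages with SAMPLE BY.
hourly = run_query("""
    SELECT timestamp, symbol, avg(price) AS avg_price
    FROM trades
    SAMPLE BY 1h
""")

# Fetch the most recent row per symbol with LATEST ON.
latest = run_query("""
    SELECT * FROM trades
    LATEST ON timestamp PARTITION BY symbol
""")

print(hourly["columns"])
print(latest["dataset"])
```

Both queries rely on the table having a designated timestamp column, which is exactly what the rest of this tutorial explains.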

Introduction

Traditionally, SQL has been used for relational databases and data warehouses. However, in recent years there has been an exponential increase in the amount of data that connected systems produce, which has brought about a need for new ways to store and analyze such information. For this reason, time-series analytics have proved critical for making sense of real-time market data in financial services, sensor data from IoT devices, and application metrics.

Data Lakes, Warehouses and Lakehouses. Which is Best?

Twenty years ago, your data warehouse probably wouldn’t have been voted hottest technology on the block. These bastions of the office basement were long associated with siloed data workflows, on-premises computing clusters, and a limited set of business-related tasks (e.g., processing payroll and storing internal documents).

Now, with the rise of data-driven analytics, cross-functional data teams, and most importantly, the cloud, the phrase “cloud data warehouse” is nearly synonymous with agility and innovation.

What Is a Data Reliability Engineer, and Do You Really Need One?

As software systems became increasingly complex in the late 2000s, merging development and operations (DevOps) was a no-brainer. 

One-half software engineer and one-half operations admin, the DevOps professional is tasked with bridging the gap between building performant systems and making them secure, scalable, and accessible. It isn’t an easy job, but someone has to do it.

What is Data Ingestion? The Definitive Guide

What Is Data Ingestion?

Data ingestion is an essential step of any modern data stack. At its core, data ingestion is the process of moving data from various data sources to an end destination where it can be stored for analytics purposes. This data can come in many different formats and be generated by a variety of external sources (e.g., website data, app data, databases, SaaS tools).
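
To make the definition concrete, here is a minimal, hypothetical ingestion step in Python: it extracts JSON records from an imaginary SaaS API and loads them into a local SQLite table standing in for the analytics destination. The URL, field names, and schema are all placeholders.

```python
import sqlite3
import requests

API_URL = "https://api.example.com/v1/orders"   # hypothetical SaaS source
DB_PATH = "warehouse.db"                        # stand-in for the analytics destination

def ingest_orders() -> int:
    """Extract records from the source API and load them into the destination table."""
    records = requests.get(API_URL, timeout=30).json()

    conn = sqlite3.connect(DB_PATH)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (id TEXT PRIMARY KEY, amount REAL, created_at TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO orders (id, amount, created_at) VALUES (?, ?, ?)",
        [(r["id"], r["amount"], r["created_at"]) for r in records],
    )
    conn.commit()
    conn.close()
    return len(records)

if __name__ == "__main__":
    print(f"Ingested {ingest_orders()} records")
```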

Why Is Data Ingestion Important?

The data ingestion process is important because it moves data from point A to point B. Without a data ingestion pipeline, data is locked in the source where it originated, and locked-away data isn’t actionable. The easiest way to understand data ingestion is to think of it as a pipeline: in the same way that oil is transported from the well to the refinery, data is transported from the source to the analytics platform. Data ingestion is important because it gives business teams the ability to extract value from data that would otherwise be inaccessible.

The Internet of Things in Solutions Architecture

As internet connectivity spreads, a growing number of small devices with limited memory and compute capacity are coming online. These sensors connect physical entities such as your home alarm, thermal sensors, and car. The data from millions of these connected devices needs to be collected and analyzed. For example, weather data collected from multiple sensors can be used to forecast conditions for wind energy and farming. There are billions of connected devices in homes, factories, oil wells, hospitals, cars, and thousands of other places; they are fueling digital transformation and generating huge, exponentially growing volumes of data.

As IoT has become very common in the manufacturing industry for handling machine data and optimizing production, the concept of Industrial IoT (IIoT) has emerged. Let’s learn more about this now.

Redshift vs. Snowflake: The Definitive Guide

What Is Snowflake?

At its core, Snowflake is a data platform. It is not tied to any single cloud service, which means it can run on any of the major cloud providers: Amazon Web Services (AWS), Microsoft Azure, or Google Cloud Platform (GCP). As a SaaS (Software-as-a-Service) solution, it helps organizations consolidate data from different sources into a central repository for analytics, supporting Business Intelligence use cases.

Once data is loaded into Snowflake, data scientists, engineers, and analysts can apply business logic to transform and model that data in a way that makes sense for their company. With Snowflake, users can easily query data using simple SQL. This information is then used to power reports and dashboards so business stakeholders can make key decisions based on relevant insights.
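
As a rough, unofficial sketch of what that looks like in practice, the snowflake-connector-python package lets you run plain SQL from Python; every identifier and credential below is a placeholder.

```python
import snowflake.connector

# Placeholder credentials; in practice these would come from a secrets manager.
conn = snowflake.connector.connect(
    account="my_account",       # hypothetical account identifier
    user="analyst",
    password="********",
    warehouse="ANALYTICS_WH",
    database="SALES",
    schema="PUBLIC",
)

cur = conn.cursor()
try:
    # Plain SQL: aggregate a hypothetical ORDERS table for a dashboard.
    cur.execute(
        "SELECT region, SUM(amount) AS revenue "
        "FROM orders GROUP BY region ORDER BY revenue DESC"
    )
    for region, revenue in cur.fetchall():
        print(region, revenue)
finally:
    cur.close()
    conn.close()
```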

Secrets Detection: Optimizing Filter Processes

While increasing both the precision and the recall of our secrets detection engine, we needed to keep a close eye on speed. In a gearbox, if you want to increase torque, you need to decrease speed. So it wasn’t a surprise to find that our engine had the same problem: more power, less speed. At roughly 10,000 public documents scanned every minute, this eventually led to a bottleneck.

In a previous article, we explained how we built benchmarks to keep track of those three metrics: precision, recall, and the most important here, speed. These benchmarks taught us a lot about the true internals of our engine at runtime and led to our first improvements.
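
As a toy illustration of what such a benchmark measures (this is not the engine described above, just a sketch with one made-up detection rule and a three-document labeled corpus), the following computes precision, recall, and throughput for a regex-based detector:

```python
import re
import time

# Toy detector: flags anything that looks like an AWS-style access key ID.
SECRET_PATTERN = re.compile(r"AKIA[0-9A-Z]{16}")

def detect(document: str) -> bool:
    return bool(SECRET_PATTERN.search(document))

# Labeled benchmark corpus: (document, contains_real_secret)
corpus = [
    ("aws_key = AKIAABCDEFGHIJKLMNOP", True),
    ("example_key = AKIAEXAMPLEEXAMPLE00", False),  # placeholder key: flagging it is a false positive
    ("no credentials in this file", False),
]

start = time.perf_counter()
flagged = [detect(doc) for doc, _ in corpus]
elapsed = time.perf_counter() - start

tp = sum(f and label for f, (_, label) in zip(flagged, corpus))
fp = sum(f and not label for f, (_, label) in zip(flagged, corpus))
fn = sum((not f) and label for f, (_, label) in zip(flagged, corpus))

precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
print(f"precision={precision:.2f} recall={recall:.2f} docs/sec={len(corpus) / elapsed:.0f}")
```

Tightening the pattern raises precision but can lower recall (and vice versa), which is why all three numbers are tracked together.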

Should We Target Zero False Positives?

In an ideal world, secret detection tools would spot all leaked secrets and never report false positives.

Unfortunately (or maybe fortunately), we do not live in an ideal world: secret detection tools are not perfect, and sometimes they report false positives. But would it really be better if they did not?

What Is Data Engineering? Skills and Tools Required

In the last decade, as most organizations have undergone digital transformation, data scientist and data engineer have developed into two separate jobs, though with obvious overlap. The business constantly generates data from people and products. Every event is a snapshot of company functions (and dysfunctions) such as revenue, losses, third-party partnerships, and goods received. But if the data isn't explored, no insights will be gained. The purpose of data engineering is to support this process and make data usable for its consumers. In this article, we’ll explore the definition of data engineering, data engineering skills, what data engineers do and their responsibilities, and the future of data engineering.

Data Engineering: What Is It?

In the world of data, a data scientist is only as good as the data they have access to. Most companies store their data in a variety of formats across databases and text files. This is where data engineering comes in. Put simply, data engineering means organizing and designing data, which is what data engineers do: they build pipelines that transform data, organize it, and make it useful. Data engineering is just as important as data science. However, it requires knowing how to get value out of data, as well as the practical engineering skills to move data from point A to point B without corruption.

Shareable Data Analyses Using Templates


Our friend Benn Stancil recently wrote a great post about templates—his term for sharable, pre-built dashboards and reports. Do yourself a favor and read it. The basic idea is that shared, reusable data analyses have been a pipe dream for years and aren't yet on their way:

Even though our data is the same, and our companies are the same, there’s no one-click way to spin out an entire suite of dashboards

Templates do seem inevitable: the concept of reusable code is something software developers have relied on for literally decades. It's fundamental to how all software is built. The data community has been borrowing best practices from the software world since the beginning, from version control in Git to staging environments to testing. But we still can't use their single most powerful technique.

Fast JMS for Apache Pulsar: Modernize and Reduce Costs with Blazing Performance

Written by: Chris Bartholomew

DataStax recently announced the availability of Fast JMS for Apache Pulsar, a JMS 2.0 API. By combining the industry-standard Java Messaging Service (JMS) API with the cloud-native and horizontally scalable Apache Pulsar™ streaming platform, DataStax is providing a powerful way to modernize your JMS infrastructure, improve performance, and reduce costs. Fast JMS is open source and is included in DataStax’s Luna Streaming Enterprise support of Apache Pulsar.

Best Practices for Data Pipeline Error Handling in Apache NiFi

According to a McKinsey report, “the best analytics are worth nothing with bad data.” We as data engineers and developers know this simply as “garbage in, garbage out.” Today, with the success of the cloud, data sources are many and varied. Data pipelines help us to consolidate data from these different sources and work on it. However, we must ensure that the data used is of good quality. As data engineers, we mold data into the right shape, size, and type with high attention to detail.

Fortunately, we have tools such as Apache NiFi, which allow us to design and manage our data pipelines, reducing the amount of custom programming and increasing overall efficiency. Yet, when it comes to creating them, a key and often neglected aspect is minimizing potential errors.

How to Become a Data Engineer: A Hype Profession or a Necessary Thing

“Big Data is the profession of the future” is all over the news. I will say even more: data engineering skills are an urgent need for developers. Every two days, we now create as much data as we had created in total before 2003. Gartner analysts named cloud services and cybersecurity among the top technology trends of 2021.

The trend is easily explained. Huge arrays of Big Data need to be stored securely and processed to obtain useful information. When companies moved to remote work, these needs became even more tangible. E-commerce, healthcare, EdTech — all these industries want to know everything about their online consumers. As long as the data just sits on servers, there is no value in it at all.

Model Experiments, Tracking, and Registration Using MLflow on Databricks and StreamSets

Learn how StreamSets, a modern data integration platform for DataOps, can help expedite operations at some of the most crucial stages of the machine learning lifecycle and MLOps.

Data Acquisition and Preparation

Machine learning models are only as good as the quality of the data and the size of the datasets used to train them. Research has shown that data scientists spend around 80% of their time preparing and managing data for analysis, and that 57% of data scientists regard cleaning and organizing data as the least enjoyable part of their work. This further validates the idea of MLOps and the need for collaboration between data scientists and data engineers.
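
For context on where MLflow fits into that lifecycle, here is a minimal, hypothetical experiment-tracking run (toy scikit-learn model, made-up experiment name) that logs parameters, metrics, and the trained model:

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

mlflow.set_experiment("demo-experiment")  # hypothetical experiment name

# Synthetic data standing in for a prepared training set.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    C = 0.5
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))

    # Record parameters, metrics, and the trained model alongside the run.
    mlflow.log_param("C", C)
    mlflow.log_metric("accuracy", acc)
    mlflow.sklearn.log_model(model, "model")
```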

Reducing Large S3 API Costs Using Alluxio

I. Introduction

Previous Works

There have been numerous articles and online webinars dealing with the benefits of using Alluxio as an intermediate storage layer between S3 data storage and the data processing system used for ingestion or retrieval of data (e.g., Spark, Presto), as depicted in the picture below:

To name a few use cases:

Deep Learning at Alibaba Cloud With Alluxio – Running PyTorch on HDFS

Google’s TensorFlow and Facebook’s PyTorch are two Deep Learning frameworks that have been popular with the open source community. Although PyTorch is still a relatively new framework, many developers have successfully adopted it due to its ease of use.

By default, PyTorch does not support Deep Learning model training directly on HDFS, which poses challenges for users who store their data sets in HDFS. These users must either export the HDFS data at the start of each training job or modify the PyTorch source code to support reading from HDFS. Neither approach is ideal, because both require additional manual work that may introduce further uncertainty into the training job.
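
As the title suggests, the alternative explored here is to put Alluxio between PyTorch and HDFS so that training data can be read like ordinary local files. The sketch below assumes such a mount already exists at a made-up path, and the dataset file layout is also invented; a standard torch.utils.data.Dataset can then read the data without any HDFS-specific code.

```python
import os

import torch
from torch.utils.data import Dataset, DataLoader

class MountedTensorDataset(Dataset):
    """Reads .pt tensor files from a directory, e.g. a mount point exposing HDFS data."""

    def __init__(self, root: str):
        self.paths = sorted(
            os.path.join(root, name)
            for name in os.listdir(root)
            if name.endswith(".pt")
        )

    def __len__(self) -> int:
        return len(self.paths)

    def __getitem__(self, idx: int):
        sample = torch.load(self.paths[idx])  # assumes each file holds {"x": tensor, "y": tensor}
        return sample["x"], sample["y"]

# Hypothetical mount point where HDFS data is exposed as local files.
dataset = MountedTensorDataset("/mnt/alluxio-fuse/training_data")
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)
```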

Accelerated Extract-Load-Transform Data Pipelines

As a columnar database with both strong CPU and GPU performance, the OmniSci platform is well suited for Extract-Load-Transform (ELT) pipelines (as well as the data science workloads we more frequently demonstrate). In this blog post, I’ll demonstrate an example ELT workflow, along with some helpful tips for merging various files with drifting data schemas. If you’re not familiar with the two major data processing workflows, the next section briefly outlines the history and reasoning behind ETL vs. ELT; if you’re just interested in the mechanics of doing ELT in OmniSci, you can skip to the “Baywheels Bikeshare Data” section.
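
Before the OmniSci-specific walkthrough, here is a generic sketch of the "drifting data schemas" problem mentioned above, using pandas rather than OmniSci's own tooling; the file paths and the renamed column are invented for illustration.

```python
import glob

import pandas as pd

# Hypothetical monthly export files whose columns drift over time
# (e.g., a column is added or renamed between months).
files = sorted(glob.glob("data/trips_*.csv"))

RENAMES = {"start_station": "start_station_name"}  # assumed rename between schema versions

frames = []
for path in files:
    df = pd.read_csv(path)
    df = df.rename(columns=RENAMES)  # normalize drifted column names
    frames.append(df)

# Take the union of all columns; months missing a column get NaN instead of failing the load.
combined = pd.concat(frames, ignore_index=True, sort=False)
print(combined.columns.tolist(), len(combined))
```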

A Brief History of ETL vs. ELT for Loading Data

From the first computerized databases in the 1960s, the Extract-Transform-Load (ETL) data processing methodology has been an integral part of running a data-driven business. Historically, storing and processing data was too expensive to accumulate data without knowing what you were going to do with it, so a process such as the following would occur each day: