Optimized File Formats – Reduce Overall System Latency

Because optimized columnar file formats brought SQL query capabilities to the Big Data ecosystem, organizations can now quickly retrain their existing data warehouse and database developers in Big Data technology and migrate their analytics applications to on-premises Hadoop clusters or inexpensive object storage in the cloud.

When columnar file formats were first proposed in the early 2010s, the intention was to enable faster query execution engines on top of the Hadoop file system. Columnar formats were explicitly designed to deliver much better query performance than the row-based file formats used in conventional databases and data warehouses, particularly when only a subset of a table's columns is queried.
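To make the column-pruning benefit concrete, here is a minimal sketch that reads only two columns from a Parquet file with PyArrow; the file path and column names are illustrative assumptions, not taken from the original article.

```python
# Minimal sketch of column pruning with a columnar format (Parquet via PyArrow).
# The file path and column names are placeholders for illustration.
import pyarrow.parquet as pq

# Reading only the columns the query needs avoids scanning the rest of the file,
# which is where columnar formats outperform row-based formats for analytics.
table = pq.read_table("events.parquet", columns=["user_id", "event_time"])
print(table.num_rows, table.schema)
```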

Easier Data Science Development With Prodmodel

Data science development is an experimental and iterative process. It involves a lot of trial and error and it's easy to lose track of what's been tested and what hasn't. The following examples show how Prodmodel — an open-source data engineering tool I developed — helps to solve some of those problems. It works with Python 3.5 or above.

The idea behind Prodmodel is to structure your modeling pipeline as Python function calls. The tool then versions, caches, and reuses the objects returned by these functions. This way you don't have to keep track of the various data files, model files, or pieces of code you're experimenting with.
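To illustrate the general idea (this is a generic sketch, not Prodmodel's actual API), the code below expresses a pipeline as Python function calls whose results are cached on disk, keyed by a hash of the step name and its inputs; all function and file names are hypothetical.

```python
# Generic illustration of a pipeline of Python functions whose results are
# cached so repeated runs reuse earlier work. This is NOT Prodmodel's API;
# function and path names are hypothetical.
import hashlib
import pickle
from pathlib import Path

CACHE_DIR = Path(".pipeline_cache")
CACHE_DIR.mkdir(exist_ok=True)

def cached(step):
    """Cache a step's return value, keyed by the step name and its arguments."""
    def wrapper(*args, **kwargs):
        key = hashlib.sha256(pickle.dumps((step.__name__, args, kwargs))).hexdigest()
        path = CACHE_DIR / key
        if path.exists():
            return pickle.loads(path.read_bytes())
        result = step(*args, **kwargs)
        path.write_bytes(pickle.dumps(result))
        return result
    return wrapper

@cached
def load_data(csv_path):
    # Read raw rows; a real pipeline might return a DataFrame here.
    return [line.rstrip("\n").split(",") for line in open(csv_path)]

@cached
def train_model(rows, learning_rate=0.1):
    # Placeholder "model": just record the configuration and row count.
    return {"learning_rate": learning_rate, "n_rows": len(rows)}

model = train_model(load_data("train.csv"), learning_rate=0.1)
```

Rerunning the script with unchanged inputs reuses the cached objects instead of recomputing them, which is the workflow benefit the tool aims to provide.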

Turn Cloud Storage or HDFS Into Your Local File System for Faster AI Model Training With TensorFlow

Users today have a variety of cost-effective and scalable storage options for their Big Data and machine learning applications, from distributed storage systems like HDFS and Ceph to cloud storage like AWS S3, Azure Blob Storage, and Google Cloud Storage. Each of these storage technologies has its own API, which means developers must constantly learn new storage APIs and write code against them. In some cases, such as machine learning and deep learning workloads, the frameworks do not integrate with all the required storage-level APIs, and a lot of data engineering is needed to move the data around. It has become common practice to copy data sets from the HDFS data lake to the data scientist's local compute instances to achieve data locality and access the data via the local file system.

This article presents a different approach to making distributed file systems like HDFS or cloud storage systems look like a local file system to data processing frameworks: the Alluxio POSIX API. To illustrate the approach, we use the TensorFlow + Alluxio + AWS S3 stack as an example.
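As a hedged illustration of the approach, the sketch below assumes Alluxio has already been FUSE-mounted so that S3-backed training data appears under a local path (the mount point and file names are assumptions); TensorFlow then reads the data through ordinary file-system paths, with no S3-specific code.

```python
# Sketch: once Alluxio is FUSE-mounted (assumed here at /mnt/alluxio), TensorFlow
# can read S3-backed training data through plain local file paths. The mount
# point and file names are illustrative assumptions.
import tensorflow as tf

# Glob TFRecord files from the (assumed) Alluxio mount as if they were local.
train_files = tf.io.gfile.glob("/mnt/alluxio/training_data/*.tfrecord")

dataset = (
    tf.data.TFRecordDataset(train_files)
    .shuffle(buffer_size=1024)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)

for batch in dataset.take(1):
    print(batch.shape)
```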

The Types of Data Engineers

Overview

We all know that in the last few years the position of data engineer, together with that of data scientist, has been in high demand in the market.

However, we can still observe a certain discrepancy in the market regarding the technical profile of a data engineer. I'm speaking specifically about the Latin American region; perhaps elsewhere in the world this is more advanced.

Things to Understand Before Implementing ETL Tools

Data warehouses, databases, data lakes, and data hubs have become key growth drivers for technology-driven businesses of all sizes. Several factors contribute to successfully building and managing each of these data systems, and the ETL (Extract, Transform, Load) strategy is the most important of them. There are now many strong ETL tools on the market that allow businesses to design robust data systems; they fall into open-source and enterprise categories based on how they are implemented. This post does not focus on the best ETL tools on the market, nor does it compare them. What should you expect, then? It aims to build your understanding of ETL processing and the parameters to check before investing in an ETL tool.

Understanding the Basics of ETL Processing

When developing a database, it is important to prepare and store data in comprehensible formats. ETL comprises three distinct functions (Extract, Transform, and Load) integrated into a single tool that aids in the data preparation and storage required for database management.
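As a minimal, hedged illustration of the three stages, the sketch below extracts rows from a CSV file, applies a simple transformation, and loads the result into SQLite; the file name, table name, and columns are assumptions made purely for illustration.

```python
# Minimal ETL sketch: extract rows from a CSV file, transform them, and load
# them into SQLite. File name, table name, and columns are illustrative.
import csv
import sqlite3

def extract(path):
    # Extract: read raw rows from the source file.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: normalize names and cast amounts to float.
    return [(r["name"].strip().title(), float(r["amount"])) for r in rows]

def load(records, db_path="warehouse.db"):
    # Load: write the prepared records into the target database.
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", records)
    conn.commit()
    conn.close()

load(transform(extract("sales.csv")))
```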