How Has COVID-19 Impacted Data Science?

The COVID-19 pandemic disrupted supply chains and brought economies around the world to a standstill. In its wake, businesses need access to accurate, timely data more than ever before, and demand for data analytics is skyrocketing as they try to navigate an uncertain future. However, this sudden surge in demand comes with its own set of challenges. 

Here is how the COVID-19 pandemic is affecting the data industry and how enterprises can prepare for the data challenges to come in 2021 and beyond.  

What Is Chaos Engineering?

In the past, software systems ran in highly controlled on-premises environments, managed by an army of sysadmins. Today, migration to the cloud is relentless, and the stage has completely shifted. Systems are no longer monolithic and localized; they depend on many distributed, loosely coupled systems working in unison, often in the form of ephemeral microservices.

It is no surprise that Site Reliability Engineers have risen to prominence in the last decade. Modern IT infrastructure requires robust systems thinking and reliability engineering to keep the show on the road. Downtime is not an option. In the 2020 ITIC Cost of Downtime survey, 98% of organizations said that a single hour of downtime costs more than $150,000, 88% reported that 60 minutes of downtime costs their business more than $300,000, and 40% of enterprises said that one hour of downtime costs their organizations $1 million to more than $5 million.
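
To make the idea concrete, here is a minimal, hypothetical sketch of chaos-style fault injection in Python: a wrapper that randomly adds latency or raises a simulated outage around a dependency call, so you can watch how the calling code copes. The service name, failure rate, and retry loop are invented for illustration, not taken from any particular chaos tool.

```python
import random
import time

# Minimal chaos-style fault injection: wrap a dependency call and randomly
# inject latency or failure, then observe whether the caller copes.
def chaotic(call, failure_rate=0.2, max_delay=2.0):
    def wrapper(*args, **kwargs):
        time.sleep(random.uniform(0, max_delay))   # inject latency
        if random.random() < failure_rate:         # inject failure
            raise ConnectionError("chaos: simulated dependency outage")
        return call(*args, **kwargs)
    return wrapper

def fetch_inventory():  # hypothetical downstream service call
    return {"widgets": 42}

resilient_fetch = chaotic(fetch_inventory)
for attempt in range(3):  # the caller's retry logic under test
    try:
        print(resilient_fetch())
        break
    except ConnectionError as err:
        print(f"attempt {attempt + 1} failed: {err}")
```

Real chaos experiments run this kind of disruption against staging or production infrastructure under controlled conditions; the point is the same: verify the system degrades gracefully before a real outage forces the question.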

What Is ETLT? Merging the Best of ETL and ELT Into a Single ETLT Data Integration Strategy

Data integration solutions typically advocate that one approach – either ETL or ELT – is better than the other. In reality, both ETL (extract, transform, load) and ELT (extract, load, transform) serve indispensable roles in the data integration space:

  • ETL is valuable when it comes to data quality, data security, and data compliance. It can also save money on data warehousing costs. However, ETL is slow when ingesting unstructured data, and it can lack flexibility. 
  • ELT is fast when ingesting large amounts of raw, unstructured data. It also brings flexibility to your data integration and data analytics strategies. However, ELT sacrifices data quality, security, and compliance in many cases.

Because ETL and ELT present different strengths and weaknesses, many organizations are using a hybrid “ETLT” approach to get the best of both worlds. In this guide, we’ll help you understand the “why, what, and how” of ETLT, so you can determine if it’s right for your use case. 
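
As a rough illustration of the pattern, here is a minimal Python sketch of an ETLT flow, with stubbed-in records and an in-memory “staging table” standing in for a real source system and warehouse. A light pre-load transform handles compliance (hashing PII before it reaches the warehouse), while the heavier modeling transform runs after the load, as it would inside the warehouse itself.

```python
import hashlib

# --- Extract: pull raw records from a source system (stubbed here) ---
def extract():
    return [
        {"user_id": 1, "email": "ada@example.com", "amount": "19.99"},
        {"user_id": 2, "email": "alan@example.com", "amount": "5.00"},
    ]

# --- Transform #1 (lightweight, pre-load): security/compliance only,
#     e.g. hashing PII before it ever reaches the warehouse (the "ETL" half) ---
def mask_pii(record):
    record = dict(record)
    record["email"] = hashlib.sha256(record["email"].encode()).hexdigest()
    return record

# --- Load: write the lightly transformed rows into a staging table ---
def load(records, staging):
    staging.extend(records)

# --- Transform #2 (heavyweight, post-load): modeling and type casting,
#     run on the staged data inside the warehouse (the "ELT" half) ---
def transform_in_warehouse(staging):
    return [{**r, "amount": float(r["amount"])} for r in staging]

staging_table = []
load([mask_pii(r) for r in extract()], staging_table)
analytics_rows = transform_in_warehouse(staging_table)
print(analytics_rows)
```

In practice the second transform would typically be SQL executed by the warehouse, but the division of labor is the same: compliance-critical work before the load, flexible modeling work after it.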

Benefits of Hybrid Cloud for Data Warehouse

In today’s market, reliable data is worth its weight in gold, and having a single source of truth for business-related queries is a must-have for organizations of all sizes. For decades, companies have turned to data warehouses to consolidate operational and transactional information, but many existing data warehouses can no longer keep up with the data demands of the current business climate. They are hard to scale, inflexible, and simply incapable of handling today’s large data volumes and increasingly complex queries.

These days organizations need a faster, more efficient, and modern data warehouse that is robust enough to handle large amounts of data and multiple users while simultaneously delivering real-time query results. And that is where hybrid cloud comes in. As increasing volumes of data are being generated and stored in the cloud, enterprises are rethinking their strategies for data warehousing and analytics. Hybrid cloud data warehouses allow you to utilize existing resources and architectures while streamlining your data and cloud goals.

Learn How Data Mapping Supports Data Transformation and Data Integration

Data mapping is an essential component of data processes. One error in data mapping can ripple through an organization, bringing it to ruin through replicated errors and inaccurate analysis. So, if you fail to understand the significance of data mapping or how it’s implemented, you are diminishing your business’s chances of success. 

In this article, you’ll learn what data mapping is and how it’s done.
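
For a taste of what that looks like, here is a minimal, hypothetical Python sketch of field-level data mapping: a declarative map from source columns to target columns, plus a converter for each field. The column names and conversion rules are invented for illustration.

```python
# A hypothetical field mapping: source column -> (target column, converter).
FIELD_MAP = {
    "cust_name": ("customer_name", str.strip),
    "dob":       ("date_of_birth", lambda v: v.replace("/", "-")),
    "zip":       ("postal_code",   str),
}

def map_record(source_row):
    """Apply the mapping to one source record, producing a target record."""
    target = {}
    for src_col, (dst_col, convert) in FIELD_MAP.items():
        target[dst_col] = convert(source_row[src_col])
    return target

row = {"cust_name": " Ada Lovelace ", "dob": "1815/12/10", "zip": "12345"}
print(map_record(row))
# {'customer_name': 'Ada Lovelace', 'date_of_birth': '1815-12-10', 'postal_code': '12345'}
```

A single wrong entry in a map like this silently corrupts every record that flows through it, which is exactly why mapping errors replicate across an organization.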

Scratching the Surface of Data Virtualization

What Is Data Virtualization?

Data Virtualization is an advanced approach to data integration: an easier way to integrate, federate, and transform data from multiple data sources into a single, unified environment in real time. With Data Virtualization, you’re not physically collecting data from different sources (as ETL, ESB, and other middleware do) but connecting to them, leveraging the Data Warehouses, Big Data lakes, or other data infrastructure already in place.

Data Virtualization can therefore provide a holistic view of business operations and quickly help identify new value opportunities. Note that, through the “virtual” metadata layer, the data can easily be accessed or shared by other applications without being replicated.
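
The following toy Python sketch illustrates the idea (it is not how any particular virtualization product is implemented): a “virtual view” federates queries across several sources on demand, so no data is copied up front. The warehouse and lake sources here are hypothetical stubs returning in-memory rows.

```python
# Toy federation layer: each "source" is just a callable returning rows;
# the virtual view queries them on demand instead of replicating the data.
class VirtualView:
    def __init__(self, sources):
        self.sources = sources  # name -> zero-argument fetch function

    def query(self, predicate):
        # Pull matching rows live from every underlying source.
        for name, fetch in self.sources.items():
            for row in fetch():
                if predicate(row):
                    yield {"source": name, **row}

# Hypothetical sources standing in for a warehouse and a data lake.
warehouse = lambda: [{"customer": "acme", "revenue": 1200}]
data_lake = lambda: [{"customer": "acme", "clicks": 54},
                     {"customer": "zenith", "clicks": 9}]

view = VirtualView({"warehouse": warehouse, "lake": data_lake})
print(list(view.query(lambda r: r["customer"] == "acme")))
```

The consuming application sees one unified environment; where each row physically lives is a detail hidden behind the virtual layer.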

Data Lake vs Data Warehouse: Do You Need Both?

Most enterprises today have a data warehouse in place that is accessed by a variety of BI tools to aid in the decision-making process. These have been in use for several decades now and have served enterprise data requirements quite well. 

However, as the volume and variety of data being collected expand, there’s also a lot more that can be done with that data. Many of these are use cases an enterprise might not even have identified yet, and it won’t identify them until it has had a chance to actually play around with the data. 

Starting a Data Model With Repods

Repods is a data platform that can create and manage data pods. These pods are compact data warehouses with flexible storage, vCores, memory, and all required tooling. You can manage personal data projects, work together in a private team, or collaborate on open data in public data pods.

Before We Start

Before creating a data pod, it is important to be aware of the scope of information that we have and need for our analysis. The goal is to create a data model that closely reflects the business entities of the subject area, without focusing on how reports are going to be created or how we are going to fill this data model with the given data. A good place to start is by answering the following questions:

Navigating Data Marts, Lakes, Warehouses and Vaults

Throughout the past several years, everyone has been talking about big data. Businesses looking to be more data-driven have to incorporate a whole range of different infrastructures. However, it can be difficult to understand where your data lakes and warehouses meet, and why you might even need a data vault.

Quite simply, each of these concepts boils down to finding ways to ingest and manage your data effectively for today's analytics-driven decision-making. Below is a breakdown of the options, how they relate, and what they are used for.

How to Evaluate Data Platforms for Your Organization

Introduction

Companies and organizations generate data and are increasingly using it to generate additional value. While traditionally this was a task for business administration analysts, today data plays an important role in all aspects and divisions of an organization. To enable this change, companies need an efficient, long-term data architecture. Here we discuss the many aspects and technical challenges that must be addressed to build such an architecture.

The motivation for this article came from the observation that data platforms are often reduced to their database component, which is a huge oversimplification of the whole data lifecycle. We want to illustrate the amount of functionality required for a basic, long-term data strategy in a company.

Should a Graph Database Be in Your Next Data Warehouse Stack? [Slideshare]

In our webinar "Should a Graph Database Be in Your Next Data Warehouse Stack?" AnzoGraph's graph database guru Barry Zane and data governance author Steve Sarsfield explore the trend of companies considering multiple analytical engines. First, they talk about how graph databases fit into the data warehouse modernization trend. Then, they explore how certain workloads can be better served with an analytical graph database and wrap up with some insightful Q&A.

Here are the slides from their webinar.

Snowflake Performance Tuning: Top 5 Best Practices

How do you tune the Snowflake data warehouse when there are no indexes, and few options available to tune the database itself?

Snowflake was designed for simplicity, with few performance tuning options. This article summarizes the top five best practices to maximize query performance.
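
By way of illustration, here is a hedged Python sketch of the kinds of knobs Snowflake does expose, using the snowflake-connector-python client. The warehouse and table names are placeholders, the statements are standard Snowflake SQL, multi-cluster warehouses require the Enterprise edition, and these are commonly cited tuning practices rather than necessarily the article's exact five.

```python
import snowflake.connector  # pip install snowflake-connector-python

# Connection details are placeholders; supply your own account credentials.
conn = snowflake.connector.connect(user="ME", password="...", account="my_account")
cur = conn.cursor()

# 1. Right-size the virtual warehouse instead of tuning the database itself.
cur.execute("ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'LARGE'")

# 2. Auto-suspend idle warehouses so you only pay for active compute.
cur.execute("ALTER WAREHOUSE reporting_wh SET AUTO_SUSPEND = 60")

# 3. Scale out with a multi-cluster warehouse for high concurrency.
cur.execute("ALTER WAREHOUSE reporting_wh SET MIN_CLUSTER_COUNT = 1 MAX_CLUSTER_COUNT = 3")

# 4. Add a clustering key so large scans can prune micro-partitions.
cur.execute("ALTER TABLE sales CLUSTER BY (sale_date)")

# 5. Select only the columns you need; with columnar storage this
#    reads far less data than a SELECT *.
cur.execute("SELECT sale_date, amount FROM sales WHERE sale_date >= '2020-01-01'")
```

The common thread: in Snowflake you tune by shaping compute and data layout, not by managing indexes.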

Data Warehouses: Past, Present and Future

In today’s world, data is being generated at a rapid pace, especially as enterprises across virtually every industry undergo digital transformation. We’re also seeing unprecedented demand to equip every business decision maker with access to real-time data so that they can make the best-informed decisions for the business. More than ever, global companies are dispersing virtual teams across the world, empowering them with the ability and tooling to make informed business decisions using all available data. For instance, retailers seek to make purchasing recommendations not just on past purchases and browsing history, but by using all publicly available information about the customer, such as their profession and employer, their viewing and listening interests, sports and hobbies, travel patterns and restaurants frequented. But providing this holistic view of the customer requires bringing a variety of data from a multitude of sources together, and managing and analyzing that data at scale can be challenging.

In order to make data actionable and useful for business, companies need a way to store, label, and analyze it in an efficient and cost-effective way. Enter the data warehouse.

The Process of ETL Testing: How it Maintains Data Integrity and Consistency

First, let's understand what ETL is. The acronym stands for Extract-Transform-Load. In large-scale firms, data is first extracted from the source systems, then transformed into specific data types, and ultimately loaded into a distinct repository. This process should be tested thoroughly to make sure the data is managed properly in the warehouse.

What Does Testing of ETL Refer To?

It is a procedure that tests the extraction of data for further transformation, the validation of data during the transformation stages, and the loading of data into the endpoint.
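
As a minimal illustration of those three stages, here is a self-contained Python sketch of reconciliation-style ETL checks; a real test suite would run the same assertions against live source and target databases rather than in-memory lists, and the records here are invented.

```python
# Minimal, illustrative ETL checks against stubbed source and target rows.
source_rows = [{"id": 1, "price": "10.0"}, {"id": 2, "price": "2.5"}]
target_rows = [{"id": 1, "price": 10.0}, {"id": 2, "price": 2.5}]

# 1. Extraction check: no records were dropped on the way out of the source.
assert len(source_rows) == len(target_rows), "row count mismatch"

# 2. Transformation check: each business rule produced the expected value.
for src, tgt in zip(source_rows, target_rows):
    assert tgt["price"] == float(src["price"]), f"bad transform for id {src['id']}"

# 3. Load check: the endpoint holds exactly the expected keys, no duplicates.
target_ids = [r["id"] for r in target_rows]
assert len(target_ids) == len(set(target_ids)), "duplicate keys loaded"

print("all ETL checks passed")
```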

Agile Data Warehouse Development: Attack of the Clones

One of the greatest data management and data warehouse design challenges I faced was while working as a designer and DBA on a multi-terabyte Oracle project for a Tier 1 investment bank. The project encompassed over a hundred designers, developers, and testers running in three parallel development streams, capped off with several System and User Acceptance Test (UAT) projects in parallel. It was a nightmare to manage.

One of my responsibilities was to help design the procedures to manage fifteen multi-terabyte warehouse environments, ensuring everyone was running with the correct code version, database changes were correctly applied, and every platform was loaded with the correct data.

Top 5 Enterprise ETL Tools

With ever-growing amounts of data, enterprises are creating increasing demand for data warehousing projects and systems for advanced analytics. ETL is their essential element: it ensures successful data integration across various databases and applications. In this ETL tools comparison, we will look at:

  1. Apache NiFi
  2. StreamSets
  3. Apache Airflow
  4. AWS Data Pipeline
  5. AWS Glue

They are among the most popular ETL tools of 2019. Let's compare their pros and cons to find the best solution for your project.
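
To give a flavor of one of these tools, here is a minimal Apache Airflow DAG in Python defining a daily extract-transform-load pipeline; the DAG name and task bodies are placeholder stubs, not part of any of the tools' shipped examples.

```python
# Illustrative Airflow DAG: a daily extract -> transform -> load pipeline.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Stub callables standing in for real extract/transform/load logic.
def extract(): ...
def transform(): ...
def load(): ...

with DAG(
    dag_id="daily_sales_etl",        # hypothetical pipeline name
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Declare ordering: extract runs before transform, which runs before load.
    t_extract >> t_transform >> t_load
```

Each tool in the list expresses pipelines differently (NiFi and StreamSets via visual flows, Glue and Data Pipeline via managed AWS jobs), but the underlying dependency-graph idea is much the same.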

Why a Snowflake Computing Warehouse Should Be Part of Your Next Data Platform

Serverless services, including data warehouses, have gained momentum over the past couple of years for big data and small data alike. Scalable performance, with no infrastructure to set up or manage, has proven attractive, as has the model of paying only for “run-time” resources.

Why Snowflake Warehouse?

When we find products that embrace zero data management and data warehouse-as-a-service, we are in. That is why we have taken a closer look at Snowflake Computing, whose data warehouse offering builds on this industry serverless trend. Here is how Snowflake Computing describes their product: