Why Mutability Is Essential for Real-Time Data Analytics

Successful data-driven companies like Uber, Facebook, and Amazon rely on real-time analytics. Personalizing customer experiences for e-commerce, managing fleets and supply chains, and automating internal operations require instant insights into the freshest data.

To deliver real-time analytics, companies need a modern technology infrastructure that includes three things: a real-time data source, a platform for streaming and processing that data, and an analytics database that can continuously ingest events and answer queries on fresh data.
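A toy sketch of why mutability matters in that last component, using SQLite purely for illustration (the original names no specific database): a late-arriving event must update a row in place, not append a duplicate.

```python
import sqlite3

# In-memory SQLite stands in for a real-time analytics database (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_status (order_id INTEGER PRIMARY KEY, status TEXT)")

# First event: the order is placed.
conn.execute("INSERT INTO order_status VALUES (1, 'placed')")

# A later event mutates the same row in place (an upsert), so queries
# always see the freshest state instead of stale or duplicated facts.
conn.execute(
    "INSERT INTO order_status VALUES (1, 'shipped') "
    "ON CONFLICT(order_id) DO UPDATE SET status = excluded.status"
)

print(conn.execute("SELECT * FROM order_status").fetchall())  # [(1, 'shipped')]
```

Without in-place updates, every correction or status change becomes another appended row, and queries must reconcile the duplicates at read time.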

Benefits of Hybrid Cloud for Data Warehouse

In today’s market, reliable data is worth its weight in gold, and a single source of truth for business-related queries is a must-have for organizations of all sizes. For decades companies have turned to data warehouses to consolidate operational and transactional information, but many existing data warehouses can no longer keep up with the data demands of the current business climate. They are hard to scale, inflexible, and simply incapable of handling today’s large data volumes and increasingly complex queries.

These days organizations need a faster, more efficient, modern data warehouse that is robust enough to handle large amounts of data and many concurrent users while delivering real-time query results. That is where hybrid cloud comes in. As ever more data is generated and stored in the cloud, enterprises are rethinking their strategies for data warehousing and analytics. A hybrid cloud data warehouse lets you keep using existing resources and architectures while advancing your data and cloud goals.

Intelligent Big Data Lake Governance

When data is flowing fast and with great variety into the ecosystem, the biggest challenge is governing it. In traditional data warehouses, where data is structured and the structure is always known, creating governance processes, methods, and frameworks is quite easy. But in a big data environment, where data flows fast and the schema is inferred at run time, governance must also happen at run time.

When I was working with my team to develop an ingestion pipeline, collecting ideas from the team and other stakeholders on how the pipeline should work, one idea was common: could we build a system that analyzes what changed overnight in a feed's structure? The second requirement was detecting patterns in the data, e.g., how could we tell that a data element was an SSN, a first name, and so on, so that we could tag sensitive information at run time?
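A minimal sketch of both requirements, with hypothetical field names and a deliberately naive SSN regex (the article states the requirements, not an implementation):

```python
import re

# Yesterday's and today's inferred feed schemas (hypothetical field names).
schema_yesterday = {"cust_id", "first_name", "ssn", "email"}
schema_today = {"cust_id", "first_name", "ssn", "phone"}

# Requirement 1: report what changed overnight in the feed structure.
added = schema_today - schema_yesterday
dropped = schema_yesterday - schema_today
print(f"added: {sorted(added)}, dropped: {sorted(dropped)}")

# Requirement 2: tag sensitive values at run time by pattern.
# A real classifier would be far more robust; this regex is illustrative.
SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")

def tag_value(value: str) -> str:
    return "SENSITIVE:SSN" if SSN_PATTERN.match(value) else "PUBLIC"

for value in ["123-45-6789", "Alice"]:
    print(value, "->", tag_value(value))
```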

Agile Data Warehouse Development: Attack of the Clones

One of the greatest data management and data warehouse design challenges I faced was while working as a designer and DBA on a multi-terabyte Oracle project for a Tier 1 investment bank. The project encompassed over a hundred designers, developers, and testers running in three parallel development streams, capped off with several System Test and User Acceptance Test (UAT) phases also running in parallel. It was a nightmare to manage.

One of my responsibilities was to help design the procedures for managing fifteen multi-terabyte warehouse environments, ensuring everyone was running the correct code version, database changes were correctly applied, and every platform was loaded with the correct data.
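As a minimal sketch of the kind of check those procedures boil down to (the environment names and version numbers are hypothetical; the real procedures at the bank were far more involved):

```python
# Expected code/schema version for each release stream (hypothetical values).
expected = {"dev_stream_1": "4.2.0", "dev_stream_2": "4.2.0", "uat_1": "4.1.3"}

# Versions actually reported by each warehouse environment.
deployed = {"dev_stream_1": "4.2.0", "dev_stream_2": "4.1.3", "uat_1": "4.1.3"}

# Flag every environment whose deployed version drifts from the plan.
for env, want in expected.items():
    have = deployed.get(env, "missing")
    status = "OK" if have == want else f"DRIFT (expected {want}, found {have})"
    print(f"{env}: {status}")
```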

An Introduction to Data Virtualization and Its Use Cases

Data virtualization addresses several recurring data-access problems, and this type of solution is booming, with strong year-over-year growth. But let's start with a definition first.

What Is It?

Data virtualization is the process of inserting a data-access layer between data sources and data consumers to facilitate access. In practice, the tool is a kind of SQL requester that can query very heterogeneous data sources, ranging from traditional SQL databases to text or PDF files, or a streaming source like Kafka. In short, you have data, you can query it, and you can join across it. You can thus offer a unified and complete view of the data, even if it is "exploded" across several systems. On top of that, a cache and a query optimizer minimize the performance impact on source systems. And, of course, a data catalog helps you find your way through all the data in your IT infrastructure. From this we can deduce two main use cases.
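As a toy illustration of that unified view, here is a pandas sketch (my choice of tooling, not something the article prescribes) that joins a SQL table with a CSV file; real data virtualization products do this declaratively, with caching and query optimization, but the cross-source join is the essence:

```python
import io
import sqlite3
import pandas as pd

# Source 1: a traditional SQL database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])
customers = pd.read_sql_query("SELECT * FROM customers", conn)

# Source 2: a flat text file (an inline CSV stands in for it here).
orders = pd.read_csv(io.StringIO("customer_id,amount\n1,9.99\n1,4.50\n2,12.00"))

# The "virtual" unified view: a join across the two heterogeneous sources.
unified = customers.merge(orders, left_on="id", right_on="customer_id")
print(unified[["name", "amount"]])
```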

What Is Data Redundancy?

Data Redundancy Explained

Data redundancy occurs when the same piece of data is stored in two or more separate places. Suppose you create a database to store sales records, and in each sale's record you enter the customer's address. If you make multiple sales to the same customer, the same address is entered multiple times. The repeatedly entered address is redundant data.
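As a sketch of the scenario, here is that redundant design next to a normalized alternative that stores each address once (SQLite, table names, and sample data are all illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Redundant design: the customer's address is repeated on every sale.
conn.execute("CREATE TABLE sales_flat (sale_id INTEGER, customer TEXT, address TEXT)")
conn.executemany(
    "INSERT INTO sales_flat VALUES (?, ?, ?)",
    [(1, "Alice", "12 Oak St"), (2, "Alice", "12 Oak St")],  # address stored twice
)

# Normalized design: the address lives in one place; sales reference it by key.
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, address TEXT)")
conn.execute("CREATE TABLE sales (sale_id INTEGER, customer_id INTEGER)")
conn.execute("INSERT INTO customers VALUES (1, 'Alice', '12 Oak St')")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 1), (2, 1)])
```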

How Does Data Redundancy Occur?

Data redundancy can be by design; for example, suppose you back up your company’s data nightly. This creates a redundancy. Data redundancy can also occur by mistake. For example, the database designer who created a new record for each sale may not have realized that the design caused the same address to be entered repeatedly. You may also end up with redundant data when you store the same information in multiple systems, for instance, when the same basic employee information lives in Human Resources records and in records maintained for your local site office.

Identifying Data Warehouse Quality Issues During Staging and Loads to the DWH

This is the fourth blog in a series on Identifying Data Integrity Issues at Every DWH Phase.

Before looking into data quality problems during data staging, we need to know how the ETL system handles data rejections, substitutions, cleansing, and enrichment. To ensure success in testing data quality, include as many data scenarios as possible. Typically, data quality rules are defined during design. For example:
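The rules below are an illustrative stand-in, not the ones from the original series; the column names and defaults are hypothetical, covering rejection, substitution, and cleansing of staged rows before the DWH load:

```python
def apply_rules(row):
    # Rejection rule: drop rows that are missing the business key.
    if not row.get("customer_id"):
        return None
    # Substitution rule: default a missing country code.
    row["country"] = row.get("country") or "UNKNOWN"
    # Cleansing rule: trim stray whitespace from the name.
    row["name"] = (row.get("name") or "").strip()
    return row

staged = [
    {"customer_id": 1, "name": " Alice ", "country": None},
    {"customer_id": None, "name": "Bob", "country": "US"},
]
loaded = [r for r in (apply_rules(row) for row in staged) if r is not None]
print(loaded)  # only the first row survives, cleansed and substituted
```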