BI Testing: Identifying Quality Issues During the DWH Design Phase

Decisions in today's organizations have become increasingly data-driven and real-time, so the systems that support those decisions must be of exceptional quality. People sometimes confuse testing the data warehouses that produce business intelligence (BI) reports with backend or database testing, or with testing the BI reports themselves. In reality, data warehouse testing is far more complex and diverse: nearly everything in a BI application revolves around the data that "drives" intelligent decision making.

Data integrity can be compromised at every DWH/BI phase: when data is created, integrated, moved, or transformed. Yet data warehouse testing is usually deferred until late in the cycle. If testing is shortchanged (e.g., due to schedule overruns or limited resources), there's a high risk that critical data integrity issues will slip through verification. Even when thorough testing is performed, it's difficult and costly to fix the issues that late-cycle testing exposes. By that point, the cause of an error can be anything from a data quality problem introduced when the data first enters the warehouse to a processing defect in the business logic spread across the layers of the warehouse and its BI components. Tracking it down is painstaking, tedious work that often consumes considerable resources.
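One way to surface these problems earlier is to automate basic reconciliation checks as soon as the first loads run, rather than waiting for late-cycle verification. The sketch below is illustrative only: it uses in-memory SQLite databases to stand in for a source system and the warehouse, and compares a row count and a simple column checksum between the two copies of a hypothetical orders table.

```python
# Minimal reconciliation check between a source table and its warehouse copy.
# sqlite3 in-memory databases stand in for the real source and DWH connections.
import sqlite3

def table_fingerprint(conn, table, amount_col):
    """Return (row count, rounded sum of a numeric column) for a table."""
    cur = conn.execute(
        f"SELECT COUNT(*), ROUND(COALESCE(SUM({amount_col}), 0), 2) FROM {table}"
    )
    return cur.fetchone()

source = sqlite3.connect(":memory:")
warehouse = sqlite3.connect(":memory:")
for conn in (source, warehouse):
    conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.5), (2, 20.0)])

src_fp = table_fingerprint(source, "orders", "amount")
dwh_fp = table_fingerprint(warehouse, "orders", "amount")
assert src_fp == dwh_fp, f"Reconciliation failed: source={src_fp}, warehouse={dwh_fp}"
print("Row count and amount checksum match:", src_fp)
```

Checks like this won't catch every business-logic defect, but they are cheap to run after every load and help narrow down where an integrity problem was introduced.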

What Are Data Silos?

A data silo is a collection of information in an organization that is isolated from, and not accessible by, other parts of the organization. Removing data silos can help you get the right information at the right time so you can make good decisions, and it can also save money by reducing the storage costs of duplicated information.

How Do Data Silos Occur?

Data silos happen for three common reasons.

Use Materialized Views to Turbo-Charge BI, Not Proprietary Middleware

Query performance has always been an issue in the world of business intelligence (BI), and many BI users would be happy to have their reports load and render more quickly. Traditionally, the best way to achieve that performance (short of buying a bigger database) has been to build and maintain aggregate tables at various levels of granularity, intercepting common groups of queries so the same raw data isn't scanned over and over. Many BI tools also pull data out of databases into their own memory, into “cubes” of some sort, and run analyses off of those extracts.

Downsides of Aggregates and Cubes

Both of these approaches have a major downside: the aggregate or cube must be maintained as new data arrives. In the past that was a daily event, but most warehouses are now stream-fed in near real time. It’s not practical to continuously rebuild aggregate tables or in-memory cubes every time a new row arrives or a historical row is updated.
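A database-side materialized view is one way to get the benefit of an aggregate table while letting the database own the refresh. The sketch below is a minimal example against PostgreSQL using psycopg2; the connection string, the fact_orders table, and the column names are placeholders, not a real schema.

```python
# Sketch: keep a pre-aggregated materialized view in PostgreSQL instead of a
# hand-maintained aggregate table. All identifiers here are placeholders.
import psycopg2

CREATE_VIEW = """
CREATE MATERIALIZED VIEW IF NOT EXISTS daily_sales AS
SELECT order_date,
       product_id,
       SUM(amount) AS total_amount,
       COUNT(*)    AS order_count
FROM fact_orders
GROUP BY order_date, product_id;
"""

# A plain REFRESH rebuilds the aggregate from the base table; with a unique
# index on the view, REFRESH ... CONCURRENTLY can avoid blocking BI readers.
REFRESH_VIEW = "REFRESH MATERIALIZED VIEW daily_sales;"

with psycopg2.connect("dbname=dwh user=bi_etl") as conn:
    with conn.cursor() as cur:
        cur.execute(CREATE_VIEW)
        cur.execute(REFRESH_VIEW)
```

Because the view is defined by a query, bringing the aggregate up to date after new data lands is a single refresh statement rather than hand-written maintenance code.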

How to Make Your Python Workers Scale Dynamically

We’ve recently had an interesting opportunity to experiment with deploying modern Python workers at Rainforest. We used to host most of our stack on Heroku[1], but it was not a good fit for this particular use case. This post explains the challenge we were facing and how we solved it, while also mentioning a bunch of cool tools that make development and deployment much less painful than they were a few years ago.

Dynamic Scaling for Python Workers

So what was our task? We wanted to run Python workers, each running for anywhere from a few minutes to a few hours. We also wanted the flexibility to support thousands of workers running simultaneously during high-traffic periods without paying for that infrastructure in times of low demand – basically, dynamic scaling. When you’re used to building web applications, where a request taking more than a second is considered bad and demand is fairly constant, you have to shift your perspective a bit and probably change the tools you use. Heroku can work great for some things, but dynamic scaling is not really one of them (tools like HireFire can help, but we found we wanted a bit more flexibility).
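The core of the pattern is a stateless worker that pulls long-running jobs from a shared queue, so an orchestrator can add or remove replicas based on demand. The sketch below assumes a Redis-backed queue named jobs and a placeholder run_job function; it illustrates the shape of such a worker, not Rainforest's actual implementation.

```python
# Minimal long-running worker: pull a job, process it (minutes to hours), repeat.
# Because the worker is stateless, any number of replicas can run in parallel
# and an autoscaler can size the fleet from the queue length.
import json
import redis

QUEUE = "jobs"
client = redis.Redis(host="localhost", port=6379, db=0)

def run_job(payload: dict) -> None:
    # Placeholder for the actual work (browser tests, data crunching, ...).
    print("processing", payload)

if __name__ == "__main__":
    while True:
        # BLPOP blocks until a job arrives or the timeout expires.
        item = client.blpop(QUEUE, timeout=30)
        if item is None:
            continue  # nothing to do; an autoscaler could remove this replica
        _, raw = item
        run_job(json.loads(raw))
```

Because each replica only holds the job it is currently processing, scaling down is just a matter of letting a worker finish its current job and not replacing it.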

The Benefits of Combining Google BigQuery and BI

As businesses produce significantly larger amounts of data, it’s important to have the right tools in place to interact with that data and derive the insights you need quickly and effectively. Simply storing and organizing it is not enough; rapidly analyzing millions of data points can be difficult even in the most efficient data structures.

Google BigQuery, the search giant’s data analytics service, is ideal for trawling through billions of rows to find the right data for each analysis. Thanks to its columnar storage and distributed design, it can compute aggregates efficiently across massive compute clusters. When paired with the right BI tool, it can be a powerful asset for any business. These are some of the top reasons to consider Google BigQuery as a backend for your BI tools.
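Getting query results out of BigQuery and into a report is straightforward from Python. The sketch below uses the google-cloud-bigquery client library against one of Google's public datasets; it assumes the package is installed and application default credentials are configured.

```python
# Sketch: run an aggregate query against a BigQuery public dataset and print the
# rows a BI/reporting layer would consume.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT name, SUM(number) AS total
FROM `bigquery-public-data.usa_names.usa_1910_2013`
GROUP BY name
ORDER BY total DESC
LIMIT 10
"""

# result() waits for the job to finish and returns an iterator of rows.
for row in client.query(sql).result():
    print(row["name"], row["total"])
```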

What Is Data Loading?

One of the most important aspects of data analytics is that data is collected and made accessible to the user. Depending on which data loading method you choose, you can significantly speed up time to insight and improve overall data accuracy, especially as data arrives from more sources and in more formats. ETL (Extract, Transform, Load) is an efficient and effective way of gathering data from across an organization and preparing it for analysis.

Data Loading Defined

Data loading refers to the "load" component of ETL. After data is retrieved and combined from multiple sources (extracted) and then cleaned and formatted (transformed), it is loaded into a storage system such as a cloud data warehouse.
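To make the three steps concrete, here is a toy end-to-end ETL pass in Python. The CSV snippet, the sales table, and the SQLite database are stand-ins chosen so the example runs on its own; a real pipeline would extract from operational systems and load into a warehouse.

```python
# Tiny end-to-end ETL: extract rows from a CSV-like source, transform them,
# then load them into a warehouse table (sqlite3 stands in for the warehouse).
import csv
import io
import sqlite3

raw = io.StringIO("id,amount,currency\n1, 10.50 ,usd\n2, 7.25 ,usd\n")

# Extract
rows = list(csv.DictReader(raw))

# Transform: trim whitespace, cast types, normalize currency codes
clean = [(int(r["id"]), float(r["amount"].strip()), r["currency"].upper()) for r in rows]

# Load
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL, currency TEXT)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?, ?)", clean)
warehouse.commit()
print(warehouse.execute("SELECT COUNT(*), SUM(amount) FROM sales").fetchone())
```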

10 Reasons to Learn Python in 2019

If you follow my blog regularly, you may be wondering why I am writing an article telling people to learn Python. Didn’t I ask you to prefer Java over Python a couple of years ago?

Well, things have changed a lot since then. In 2016, Python replaced Java as the most popular language in colleges and universities and has never looked back. 

How to Transition From Excel Reports to Business Intelligence Tools

If you are one of those people manually creating reports in Excel, you know how overwhelming it can be to meet organizational expectations for quality, insight, and velocity. Meeting the business demands of twenty-first-century data analysis with twentieth-century tools is the root of too much pain and frustration.

If you decide to do a little research on the latest and greatest alternatives, you'll quickly see how many tools exist to solve those challenges. Reviewing the options, you think, “Hurray! This is going to be a snap. These tools make it seem so easy!” Next, you decide to take the plunge with a trial of a preferred tool like Tableau, Microsoft Power BI, Looker, Amazon QuickSight, or Google Data Studio. Don't have a tool picked out to trial yet? You can check out business intelligence software options on G2.

What Is Data Consolidation?

To the outside world, your business is a highly organized structure. But on the inside, it's a cauldron of raw material collected from databases, documents, and a multitude of other sources. This material - a.k.a. data - has all the potential in the world to help your business transform and grow, so long as you properly corral it all through a process called data consolidation.

Data Consolidation Defined

Data is generated from many disparate sources and in many different formats. Data consolidation is the process that combines all of that data wherever it may live, removes any redundancies, and cleans up any errors before it gets stored in one location, like a data warehouse or data lake.
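In practice, a consolidation pass often boils down to combining extracts, de-duplicating, and cleaning before a single load. The pandas sketch below illustrates the idea with two tiny made-up extracts (a CRM and a billing system); the frames, column names, and the commented-out load step are all placeholders.

```python
# Sketch of a consolidation pass with pandas: combine records from two sources,
# remove redundant rows, fix obvious errors, and land everything in one table.
import pandas as pd

crm = pd.DataFrame({"customer_id": [1, 2, 2], "email": ["a@x.com", "b@x.com", "b@x.com"]})
billing = pd.DataFrame({"customer_id": [2, 3], "email": ["b@x.com", " c@x.com "]})

combined = (
    pd.concat([crm, billing], ignore_index=True)
      .assign(email=lambda df: df["email"].str.strip().str.lower())  # clean up errors
      .drop_duplicates(subset=["customer_id", "email"])              # remove redundancy
)

# Load the consolidated data into one location, e.g. a warehouse staging table:
# combined.to_sql("customers", warehouse_engine, if_exists="replace", index=False)
print(combined)
```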

How to Use Redis Streams in Your Apps

Data processing has been revolutionized in recent years, and these changes present tremendous possibilities. Consider a variety of use cases, from IoT and artificial intelligence to user activity monitoring, fraud detection, and FinTech: what do they all have in common? They all collect and process high volumes of data that arrive at high velocity, and once that data has been processed, they deliver it to the appropriate consumers.

With the release of version 5.0, Redis introduced an innovative way to manage high-volume streams of data: Redis Streams. Redis Streams is a data structure that, among other functions, can manage data consumption effectively, persist data as a fail-safe while consumers are offline, and create a data channel between many producers and consumers. It allows users to scale the number of consumers in an app, enables asynchronous communication between producers and consumers, and uses main memory efficiently. Ultimately, Redis Streams is designed to meet consumers' diverse needs, from real-time data processing to historical data access, while remaining easy to manage.
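For a feel of the API, here is a minimal producer/consumer sketch using the redis-py client against a Redis 5.0+ server; the stream, group, and consumer names are arbitrary placeholders.

```python
# Minimal Redis Streams producer/consumer sketch with redis-py.
import redis

r = redis.Redis()

# Producer: append events to the stream (Redis assigns the entry IDs).
r.xadd("events", {"sensor": "t-1", "reading": "21.5"})
r.xadd("events", {"sensor": "t-2", "reading": "19.8"})

# Consumer group: lets several consumers share the stream and resume after downtime.
try:
    r.xgroup_create("events", "analytics", id="0", mkstream=True)
except redis.ResponseError:
    pass  # group already exists

# Consumer: read entries assigned to this group and acknowledge them once processed.
entries = r.xreadgroup("analytics", "worker-1", {"events": ">"}, count=10, block=1000)
for stream_name, messages in entries:
    for message_id, fields in messages:
        print(message_id, fields)
        r.xack("events", "analytics", message_id)
```

The consumer group is what provides the fan-out and recovery behavior: each group tracks its own position in the stream, and entries that are delivered but never acknowledged remain pending so they can be picked up again.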

Using Big Data to Improve Forecasting

Technologies such as AI and big data have become increasingly proficient at spotting trends in large data sets. Indeed, they have become so proficient that the University of Toronto's Ajay Agrawal, Joshua Gans, and Avi Goldfarb argue that lowering the cost of prediction will be the main benefit AI delivers in the near term.

A sign of the progress being made comes via a recent paper from researchers at the University of Cordoba, which chronicles their work on an accurate forecasting machine that needs less data than previous models.