data lineage | The Blog Pros

February 13, 2022

GitHub Is Bad for AI: Solving the ML Reproducibility Crisis

There is a crisis in machine learning that is preventing the field from progressing as fast as it could. It stems from a broader predicament surrounding reproducibility that impacts scientific research in general. A Nature survey of 1,500 scientists revealed that 70% of researchers have tried and failed to reproduce another scientist’s experiments, and over 50% have failed to reproduce their own work. Reproducibility, also called replicability, is a core principle of the scientific method and helps ensure the results of a given study aren’t a one-off occurrence but instead represent a replicable observation.

In computer science, reproducibility has a narrower definition: Any results should be documented by making all data and code available so that the computations can be executed again with the same results. Unfortunately, artificial intelligence (AI) and machine learning (ML) are off to a rocky start when it comes to transparency and reproducibility. For example, take this response published in Nature by 31 scientists who are highly critical of a study from Google Health that documented successful trials of AI that detects signs of breast cancer.
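To make that narrow definition concrete, here is a minimal sketch, assuming a toy "experiment" that only shuffles a small list and averages a holdout slice. The function names, the seed, and the data are all hypothetical. The point is only that a rerun with the same data fingerprint and the same seed should return exactly the same result.

```python
import hashlib
import json
import random


def fingerprint(rows):
    """Return a SHA-256 digest of the dataset so a rerun can show it used the same data."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_experiment(rows, seed):
    """Toy experiment: shuffle with a seeded RNG, hold out a quarter of the rows, average them."""
    rng = random.Random(seed)  # local RNG, so no hidden global state leaks in
    shuffled = list(rows)
    rng.shuffle(shuffled)
    holdout = shuffled[: max(1, len(shuffled) // 4)]
    return {
        "data_sha256": fingerprint(rows),
        "seed": seed,
        "holdout_mean": sum(holdout) / len(holdout),
    }


# Hypothetical data; any JSON-serializable list of numbers would do.
rows = [0.5, 1.5, 2.0, 3.25, 4.0, 5.75, 6.5, 8.0]
first = run_experiment(rows, seed=42)
second = run_experiment(rows, seed=42)
print(first == second)
```

Publishing the digest and the seed alongside the code is what lets someone else check that they reran the same computation rather than a similar one.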

November 10, 2021

What is Data Lineage and How Can It Ensure Data Quality?

Introduction

Are you spending too much time tracking down bugs for your C-level dashboards? Are different teams struggling to align on what data is needed throughout the organization? Or are you struggling to get a handle on the impact of a potential migration?

Data lineage could be the answer you need for data quality issues. By improving data traceability and visibility, a data lineage system can improve data quality across your whole data stack and simplify the task of communicating about the data that your organization depends on.

May 16, 2021

The Future of Automated Data Lineage in 2021

As 2021 is now upon us (finally!), businesses are gearing up their strategy based on learnings from the past year. While insights help inform future plans, such as where to place budget and effort, there is one essential tool that each company should have at its disposal. If you’ve read the title, this shouldn’t come as a surprise. We’re speaking about automated data lineage. With the ability to fully understand how data flows from one place to another, data lineage allows business processes to become more efficient and focused.

Data Lineage is Like Oil

In the webinar 'The Essential Guide to Data Lineage in 2021,' Malcolm Chisholm, an expert in the fields of data management and data governance, shares his predictions for the coming year. To kick off the talk, he compares data lineage pathways to an oil refinery (one of our favorite analogies). Without understanding what is flowing through the pipes, we can’t determine how hot the oil is, its pressure levels, or even where it is going. Data lineage works the same way. If companies don’t have a handle on exactly what data is flowing between systems, they won’t be able to explain the numbers that end up in a report. Malcolm Chisholm states that "data lineage is not just an arrow between two boxes, it’s a good deal more complicated than that." The process requires knowledge of the data that the company has acquired and an understanding of how it was stored and any obstacles that it encountered along the way. Additionally, ETL tools do more than move data; there is actual logic happening inside them. Once that logic is accounted for, you can understand data lineage as a whole.
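One way to picture "more than an arrow between two boxes" is a lineage graph whose edges carry the transformation logic as well as the source and target. The sketch below is illustrative only: the table names, the logic strings, and the `trace_upstream` helper are all hypothetical, and a real lineage tool would extract this information from ETL jobs rather than from a hand-written list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    logic: str  # the transformation applied in flight, not just "A -> B"


# Hypothetical pipeline; every name here is made up for illustration.
EDGES = [
    Edge("crm.orders", "staging.orders", "copy as-is"),
    Edge("staging.orders", "warehouse.revenue", "SUM(amount) GROUP BY month"),
    Edge("erp.refunds", "warehouse.revenue", "subtract refunded amounts"),
    Edge("warehouse.revenue", "report.monthly_revenue", "filter to current fiscal year"),
]


def trace_upstream(target, edges):
    """Return every edge that feeds `target`, nearest hop first (breadth-first)."""
    found, seen, queue = [], {target}, [target]
    while queue:
        node = queue.pop(0)
        for edge in edges:
            if edge.target == node:
                found.append(edge)
                if edge.source not in seen:  # guard against cycles and repeats
                    seen.add(edge.source)
                    queue.append(edge.source)
    return found


# Explain a number in the report by listing each hop and the logic applied to it.
for edge in trace_upstream("report.monthly_revenue", EDGES):
    print(f"{edge.source} -> {edge.target}: {edge.logic}")
```

Walking the graph backwards from the report is what lets you explain a number: each hop names both where the data came from and what was done to it on the way.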

December 7, 2020

How to Create Data Lineage With the Tableau GraphQL Metadata API

I love data. The ways it can be used to curate value and express relationships never cease to amaze me. To that end, visualizing data is often one of the most powerful ways to share insights, and Tableau is certainly one of - if not the - most popular data visualization tools on the market. It's extremely simple for non-technical users to develop rich and meaningful graphs with a pretty intuitive UI, and there are some really nice features under the hood that speed up query performance when extracts are stored within Tableau.

My absolute favorite Tableau feature is that you can query your metadata using the same GraphQL API that Tableau itself uses. A portion of the metadata exposed includes the lineage for the fields, sheets, tables, and data stores that exist within your Tableau Site. Exposing the metadata via an extensive API like this is a really forward-thinking idea from the team behind Tableau.
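As a rough sketch of what a lineage query against that API might look like, the snippet below builds a request and flattens a response into table-to-workbook pairs. It makes several assumptions: the endpoint path `/api/metadata/graphql`, the `X-Tableau-Auth` header, and the `workbooks`, `upstreamTables`, and `upstreamDatabases` fields reflect my understanding of the Metadata API and should be checked against your own server's schema. The helper names, server URL, token, and sample response are hypothetical, and nothing is sent over the network.

```python
import json

# GraphQL query for workbook-level lineage. Field names are assumed from the
# Metadata API schema; verify them against your own Tableau server.
LINEAGE_QUERY = """
{
  workbooks {
    name
    upstreamTables { name }
    upstreamDatabases { name }
  }
}
"""


def build_request(server_url, auth_token):
    """Assemble URL, headers, and JSON body for a Metadata API call (no network I/O)."""
    return {
        "url": server_url.rstrip("/") + "/api/metadata/graphql",  # assumed endpoint path
        "headers": {
            "X-Tableau-Auth": auth_token,  # token from a prior REST API sign-in
            "Content-Type": "application/json",
        },
        "body": json.dumps({"query": LINEAGE_QUERY}),
    }


def lineage_edges(response):
    """Flatten a response into (upstream table, workbook) pairs."""
    edges = []
    for workbook in response["data"]["workbooks"]:
        for table in workbook.get("upstreamTables", []):
            edges.append((table["name"], workbook["name"]))
    return edges


# Hand-written sample shaped like the query above, not real server output.
sample = {
    "data": {
        "workbooks": [
            {
                "name": "Sales Overview",
                "upstreamTables": [{"name": "orders"}, {"name": "customers"}],
                "upstreamDatabases": [{"name": "warehouse"}],
            },
            {"name": "Empty Workbook", "upstreamTables": [], "upstreamDatabases": []},
        ]
    }
}

request = build_request("https://tableau.example.com/", "HYPOTHETICAL_TOKEN")
print(request["url"])
print(lineage_edges(sample))
```

Keeping the request building separate from the response parsing means the parsing can be exercised against a saved response before pointing anything at a live server.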