Successful AI Requires Data Governance

As tech applications increasingly include artificial intelligence (AI) components, the people building or using them cannot overlook the need for data governance. It should address questions such as:

  • Where does an AI product's data exist?

Data Governance and Data Management

Introduction

Enterprises that don't embrace data, or are late to the party, face serious consequences compared to early adopters. Yet when talking about good data practices, most people associate the term with only a few of the many practices that constitute a successfully run, data-driven enterprise.

Besides data analysis, data management is what most readily comes to mind. An equally universal, and perhaps even more critical, data practice is data governance.

8 Quick Tips to Improve Decision Making With Better Data Quality

The term "data quality" on the search engine results in six million pages, which clearly expresses the importance of data quality and its crucial role in the decision-making context. However, understanding the data helps classify and qualify it for effective use in the required scenario. 

Understanding the Quality of Data

Good-quality data is accurate, consistent, and scalable, and it should support decision-making, operations, and planning. Poor-quality data, on the other hand, can delay the deployment of a new system, damage a company's reputation, lower productivity, undermine decision-making, and cost revenue. According to a report by The Data Warehousing Institute, poor-quality customer data costs U.S. businesses approximately $611 billion per year. The same research found that 40 percent of firms have suffered losses due to insufficient data quality.
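To make dimensions like accuracy, completeness, and consistency concrete, here is a minimal sketch of rule-based quality checks in Python. The customer table, rules, and thresholds are hypothetical illustrations, not a prescribed standard.

```python
# A minimal sketch of rule-based data quality checks over a
# hypothetical customer table; rules and ranges are illustrative.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@example.com", None, "b@example.com", "not-an-email"],
    "age": [34, 29, 29, -5],
})

checks = {
    # Accuracy: values must fall in a plausible range.
    "age_in_range": customers["age"].between(0, 120).all(),
    # Completeness: required fields must be populated.
    "email_present": customers["email"].notna().all(),
    # Consistency: keys must be unique.
    "unique_ids": customers["customer_id"].is_unique,
}

for name, passed in checks.items():
    print(f"{name}: {'PASS' if passed else 'FAIL'}")
```

Checks like these are cheap to run on every load, which is how bad records get caught before they reach a report rather than after.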

The Future of Automated Data Lineage in 2021

As 2021 is now upon us (finally!), businesses are shaping their strategies around lessons from the past year. While such insights help inform future plans, such as where to direct budget and effort, there is one essential tool every company should have at its disposal. If you've read the title, this shouldn't come as much of a surprise: automated data lineage. By making it possible to fully understand how data flows from one place to another, data lineage allows business processes to become more efficient and focused.

Data Lineage is Like Oil

In the webinar 'The Essential Guide to Data Lineage in 2021,' Malcolm Chisholm, an expert in the fields of data management and data governance, shares his predictions for the coming year. To kick off the talk, he compares data lineage pathways to an oil refinery (one of our favorite analogies). Without understanding what is flowing through the pipes, we can't determine how hot the oil is, its pressure levels, or even where it is going. Data lineage works the same way: if companies don't have a handle on exactly what data is flowing between systems, they won't be able to explain the numbers that end up in a report. As Chisholm puts it, "data lineage is not just an arrow between two boxes, it's a good deal more complicated than that." The process requires knowledge of the data the company has acquired, an understanding of how it was stored, and awareness of any obstacles it encountered along the way. Additionally, ETL tools do more than move data; there is logic happening inside them, and capturing that logic is part of understanding lineage as a whole.
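To make that point concrete, here is a minimal sketch of lineage modeled as a directed graph in which each edge carries the transformation logic rather than being just an arrow between boxes. The table and transformation names are hypothetical.

```python
# A minimal sketch of data lineage as a directed graph; each edge
# records both the downstream target and the ETL logic applied.
from collections import defaultdict

edges = defaultdict(list)  # upstream -> list of (downstream, logic)

def record_hop(source, target, logic):
    """Record that `target` is derived from `source` via some ETL logic."""
    edges[source].append((target, logic))

record_hop("crm.customers", "staging.customers", "deduplicate on email")
record_hop("staging.customers", "mart.revenue_report", "join + aggregate")

def trace(node, indent=0):
    """Walk downstream from a node, printing each hop and its logic."""
    for target, logic in edges[node]:
        print("  " * indent + f"{node} -> {target}  [{logic}]")
        trace(target, indent + 1)

trace("crm.customers")
```

With the logic stored on the edges, a surprising number in a report can be traced back hop by hop to the source system and the transformations that produced it.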

Snapshot: Data Governance and Security Mechanism in Distributed Data Storage System

We are all aware that traditional data storage mechanisms are incapable of holding the massive volumes of data generated at lightning speed, even with vertical scaling. Going forward, we anticipate only one fuel to accelerate rapid growth across all sectors, from business to natural resources to medicine: data. But the question is how to persist this massive volume of data for processing. The answer is to store the data in a distributed manner across a multi-node cluster that can scale linearly on demand. This is made physically achievable by the Hadoop Distributed File System (HDFS), which stores data across a multi-node cluster where the number of nodes can grow linearly as the data grows. Using Hive and HBase, we can organize the HDFS data and make it more meaningful, since the data becomes queryable. The next hurdle on the road to that growth is governing this huge volume of persisted data and addressing its security implications. In a single statement, data governance can be defined as the consolidation of managing data access, accountability, and security. By default, HDFS does not provide any strong security mechanism for complete governance, but combined with the following approaches, we can move toward it.

  • Integration with LDAP – To secure read/write operations on persisted data, appropriate authorization backed by proper authentication is mandatory. Authentication can be achieved in HDFS by integrating with an LDAP server across all the nodes. LDAP is often used as a central repository for user information and as an authentication service. Organizations that have ingested huge amounts of data into Hadoop for analysis can define security policies to avoid data theft, leakage, and misuse, and to ensure the right access to data inside HDFS directories, Hive query execution, and so on. A user or team needs to authenticate via the LDAP server before processing or querying data in the cluster. LDAP integration with Hadoop can be done either through OS-level configuration to read LDAP groups or by explicitly configuring Hadoop to use LDAP-based group mapping.
  • Introducing Apache Knox gateway – Apache Knox provides a single access point for all REST and HTTP interactions with a multi-node Hadoop cluster. It does away with complex client-side configuration and libraries, and besides securing access to data in the cluster, it can also secure job execution in the cluster.
  • Kerberos for authentication – The Kerberos network authentication protocol provides strong authentication for two-tier (client/server) applications. The Kerberos server verifies identities for every request when a client wants to access the Hadoop cluster, and the Kerberos database stores and controls all principals and realms. Kerberos uses secret-key cryptography to provide strong user-to-server authentication. A Kerberos server, usually called a Key Distribution Center (KDC), should be installed on one physical host, and its database contains user and service entries such as a user's principal, maximum ticket validity, maximum renewal time, password expiration, and so on.
  • Apache Ranger for centralized and comprehensive data security – By integrating Apache Ranger with a multi-node Hadoop cluster, many of the requirements mandatory for governance and security can be fulfilled. Ranger can manage all security-related tasks via centralized security administration, either in a central UI or through REST APIs (a sketch of creating a policy through that API follows this list). It can also be used to perform fine-grained authorization for specific actions and to standardize the authorization method across all Hadoop components. Apache Ranger provides dynamic column masking as well as row-level data masking through Ranger-specific policies, protecting sensitive data from being queried out of Hive tables in real time.
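As a rough illustration of the Ranger option, the sketch below creates a fine-grained access policy for a sensitive Hive column through Ranger's public REST API. The endpoint path follows Ranger's documented public v2 API, but the host, service name, credentials, and policy details here are all hypothetical.

```python
# A hedged sketch of creating a fine-grained Hive access policy via
# Apache Ranger's public REST API. Host, service name, credentials,
# and policy contents are hypothetical placeholders.
import requests

RANGER_URL = "http://ranger-admin.example.com:6080"  # hypothetical host

policy = {
    "service": "hadoopdev_hive",   # hypothetical Ranger Hive service name
    "name": "restrict_customer_ssn",
    "resources": {
        "database": {"values": ["customers"]},
        "table": {"values": ["profiles"]},
        "column": {"values": ["ssn"]},
    },
    # Allow only the named user to SELECT the sensitive column.
    "policyItems": [{
        "users": ["analyst"],
        "accesses": [{"type": "select", "isAllowed": True}],
    }],
}

resp = requests.post(
    f"{RANGER_URL}/service/public/v2/api/policy",
    json=policy,
    auth=("admin", "admin"),  # replace with real admin credentials
    timeout=30,
)
resp.raise_for_status()
print("Created policy id:", resp.json().get("id"))
```

Centralizing policies this way means the same rules apply however the data is reached, which is exactly the consolidation of access, accountability, and security that the definition above calls for.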

Enterprise Data Management: Stick to the Basics

Many organizations have ever-increasing volumes of data and are running data management programs to bring order to it. Interestingly, the problems they face are much the same across sectors and industries, and data management helps them shape solutions.

The fundamentals of enterprise data management (EDM), which one uses to tackle these kinds of initiatives, are the same whether one is in the health sector, a telco, a travel company, or a government agency. The fundamental practices one needs to follow to manage data are therefore similar from one industry to another.

AI and BI Projects Get Bogged Down With Data Preparation Tasks

IBM is reporting that data quality challenges are a top reason why organizations are reassessing (or ending) artificial intelligence (AI) and business intelligence (BI) projects.

Arvind Krishna, IBM’s senior vice president of cloud and cognitive software, stated in a recent interview with the Wall Street Journal, “about 80% of the work with an AI project is collecting and preparing data. Some companies are not prepared for the cost and work associated with that going in. And you say: ‘Hey, wait a moment, where’s the AI? I’m not getting the benefit.’ And you kind of bail on it.” [1]

Data Quality Testing Skills Needed For Data Integration Projects

The impulse to cut project costs is often strong, especially in the final delivery phase of data integration and data migration projects. At this late phase of the project, a common mistake is to delegate testing responsibilities to resources with limited business and data testing skills.

Data integrations are at the core of data warehousing, data migration, data synchronization, and data consolidation projects. 
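Much of the testing skill at stake comes down to reconciliation: verifying that what landed in the target matches the source. Below is a minimal sketch of two such checks, using an in-memory SQLite database and illustrative table names; a real project would run equivalent queries against the actual source and target systems.

```python
# A minimal sketch of two reconciliation checks common in data
# integration testing: row counts and column aggregates must match
# between source and target. Tables and data are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_orders (id INTEGER, amount REAL);
    CREATE TABLE tgt_orders (id INTEGER, amount REAL);
    INSERT INTO src_orders VALUES (1, 10.0), (2, 25.5), (3, 7.25);
    INSERT INTO tgt_orders VALUES (1, 10.0), (2, 25.5), (3, 7.25);
""")

def one(sql):
    """Run a query and return its single scalar result."""
    return conn.execute(sql).fetchone()[0]

# Check 1: row counts must match after the load.
assert one("SELECT COUNT(*) FROM src_orders") == \
       one("SELECT COUNT(*) FROM tgt_orders"), "row count mismatch"

# Check 2: column-level aggregates catch truncated or mangled values
# that a bare row count would miss.
assert one("SELECT SUM(amount) FROM src_orders") == \
       one("SELECT SUM(amount) FROM tgt_orders"), "amount total mismatch"

print("Reconciliation checks passed.")
```

Simple as they are, checks like these require someone who understands both the business meaning of the data and how to interrogate it, which is precisely the skill set that gets cut in the final delivery phase.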

Don’t Have Your Data Strategy? That’s a Mistake

The Sins of AI Adopters

Artificial intelligence adoption can be tricky. This technology is different from any you've implemented before. There are rules to follow, and some of them are incomprehensible to someone without extensive AI knowledge. Companies face certain challenges while implementing AI, such as data quality, model errors, and a lack of data science experts; many of them are covered in the article 12 Challenges of AI Adoption. Some of these issues can be prevented, but others require preparation. Many organizations, however, are still dreamers when it comes to AI. There's nothing wrong with having a vision to follow, but the way you follow it matters.

Should a Graph Database Be in Your Next Data Warehouse Stack? [Slideshare]

In our webinar "Should a Graph Database Be in Your Next Data Warehouse Stack?" AnzoGraph's graph database guru Barry Zane and data governance author Steve Sarsfield explore the trend of companies considering multiple analytical engines. First, they talk about how graph databases fit into the data warehouse modernization trend. Then, they explore how certain workloads can be better served with an analytical graph database and wrap up with some insightful Q&A.

Here are the slides from their webinar.

Intelligent Big Data Lake Governance

When you have data, and that data is flowing fast and with variety into the ecosystem, the biggest challenge is governing it. In traditional data warehouses, where data is structured and the structure is always known, creating processes, methods, and frameworks is quite easy. But in a big data environment, where data flows fast and its schema is inferred at run time, the data must be governed at run time as well.

When I was working with my team to develop an ingestion pipeline and collecting ideas from the team and other stakeholders on what the pipeline should look like, one idea was common: could we build a system that analyzes what changed overnight in a feed's structure? The second requirement was finding patterns in the data: for example, how could we detect that a data element was an SSN, a first name, etc., so that we could tag sensitive information at run time? A minimal sketch of such run-time tagging appears below.
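Here is that sketch: rule-based tagging applied to incoming field values during ingestion. The field names and regex rules are purely illustrative; a production classifier would combine patterns with dictionaries and contextual checks.

```python
# A hedged sketch of run-time tagging of sensitive fields during
# ingestion; the patterns below are illustrative, not exhaustive.
import re

PATTERNS = {
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}

def tag_value(value):
    """Return the first sensitive-data tag whose pattern matches, or None."""
    for tag, pattern in PATTERNS.items():
        if pattern.match(value):
            return tag
    return None

# A hypothetical incoming record with schema inferred at run time.
record = {"field_1": "123-45-6789", "field_2": "Jane", "field_3": "jane@example.com"}
tags = {name: tag_value(str(value)) for name, value in record.items()}
print(tags)  # {'field_1': 'ssn', 'field_2': None, 'field_3': 'email'}
```

Tagging at ingestion means downstream consumers inherit the classification automatically, so masking and access policies can key off the tag rather than off brittle column names.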

Tom’s Tech Notes: What You Need to Know About Big Data [Podcast]

Welcome to our latest episode of Tom's Tech Notes! In this episode, we'll hear advice from a host of industry experts on the most important things you need to know about big data. Learn some tips around data quality, big data app development, data governance, and more.

The Tom's Tech Notes podcast features conversations that our research analyst Tom Smith has had with software industry experts from around the world as part of his work on our research guides. We put out new episodes every Sunday at 11 AM EST.

Technically Speaking, What Is Data Governance?

The term data governance has been around for decades, but only in the last few years have we begun to redefine our understanding of what the term means outside the world of regulatory compliance, and to establish data standards. This rapid evolution of data governance can be attributed to businesses looking to leverage massive amounts of data for analytics across the enterprise, while attempting to navigate the increasingly rugged terrain of worldwide regulatory requirements. 

Data governance is a critical data management mechanism. Most businesses today have a data governance program in place. However, according to a recent Gartner survey, “more than 87 percent of organizations are classified as having low business intelligence (BI) and analytics maturity,” highlighting how organizations struggle to develop governance strategies that do more than ensure regulatory compliance. 

Tom’s Tech Notes: The Big Concerns With Big Data [Podcast]

Welcome to our latest episode of Tom's Tech Notes! This week, we'll hear advice from 11 industry experts about their biggest concerns with the modern Big Data ecosystem. From poor governance to bad data quality to the removal of human beings from decision-making, check out what Tom's sources have to say about Big Data.

As a primer and reminder from our initial post, these podcasts are compiled from conversations our analyst Tom Smith has had with experts from around the world as part of his work on our research guides.

6 Dos and Don'ts of Data Governance – Part 1

Set Clear Expectations From the Start

One big mistake I see organizations make when starting out on their data governance journey is forgetting the rationale behind governing data in the first place. Don't just govern for the sake of governing. Whether you need to minimize risks or maximize benefits, link your data governance projects to clear and measurable outcomes. Because data governance is a company-wide initiative rather than a departmental one, you will need to prove its value from the start to convince leaders to prioritize it and allocate resources.

What Is Your "Emerald City"? Define Your Meaning of Success

In The Wonderful Wizard of Oz, the Emerald City is Dorothy's ultimate destination, the end of the famous yellow brick road. In your data governance project, success can take different forms: reinforcing data control, mitigating risks or data breaches, reducing time spent by business teams, or monetizing your data and producing new value from your data pipelines. Meeting compliance standards to avoid penalties is also crucial to consider. Make sure you know where you are headed and what the destination looks like.