Data Lakes, Warehouses and Lakehouses. Which is Best?

Twenty years ago, your data warehouse probably wouldn’t have been voted hottest technology on the block. These bastions of the office basement were long associated with siloed data workflows, on-premises computing clusters, and a limited set of business-related tasks (e.g., processing payroll and storing internal documents). 

Now, with the rise of data-driven analytics, cross-functional data teams, and most importantly, the cloud, the phrase “cloud data warehouse” is nearly synonymous with agility and innovation. 

What Is a Data Reliability Engineer, and Do You Really Need One?

As software systems became increasingly complex in the late 2000s, merging development and operations (DevOps) was a no-brainer. 

One-half software engineer, one-half operations admin, the DevOps professional is tasked with bridging the gap between building performant systems and making them secure, scalable, and accessible. It wasn’t an easy job, but someone had to do it. 

The Rise of the Data Reliability Engineer

With each day, enterprises increasingly rely on data to make decisions. This is true regardless of their industry: finance, media, retail, logistics, etc. Yet, the solutions that provide data to dashboards and ML models continue to grow in complexity. This is due to several reasons, including:

The need to run complex data pipelines with minimal error rates in these modern environments has led to the rise of a new role: the Data Reliability Engineer. Data Reliability Engineering (DRE) addresses data quality and availability problems. Comprising practices from data engineering to system operations, DRE is emerging as its own field within the broader data domain.
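
As a deliberately simplified illustration of the kind of check a Data Reliability Engineer might automate, the Python sketch below tests a data set against a hypothetical freshness SLO. The threshold and the "alerting" behavior are assumptions for illustration, not a prescription for how DRE teams actually work.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical freshness SLO: the data set must have been refreshed within the last 2 hours.
FRESHNESS_SLO = timedelta(hours=2)

def check_freshness(last_updated: datetime, slo: timedelta = FRESHNESS_SLO) -> bool:
    """Return True if the data set meets its freshness SLO."""
    lag = datetime.now(timezone.utc) - last_updated
    if lag > slo:
        # In a real pipeline this would page on-call or open an incident,
        # rather than just printing a message.
        print(f"Freshness SLO violated: data is {lag} old (SLO: {slo})")
        return False
    return True

# Example: a table last refreshed three hours ago breaches the two-hour SLO.
stale_timestamp = datetime.now(timezone.utc) - timedelta(hours=3)
check_freshness(stale_timestamp)
```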

Migrating to Snowflake, Redshift, or BigQuery? Avoid these Common Pitfalls

The Drive to Migrate Data to the Cloud

With data often described as more valuable than oil in recent years, many organizations feel pressure to be innovative and cost-effective when it comes to consolidating, storing, and using data. Although most enterprises are aware of big data opportunities, their existing infrastructure isn’t always capable of handling massive amounts of data.

By migrating to modern cloud data warehouses, organizations can benefit from improved scalability, better price elasticity, and enhanced security. But even with all these benefits, many businesses are still reluctant to make the move.

Using Datafold to Enhance DBT for Data Observability

Modern businesses depend on data to make strategic decisions, but today’s enterprises are seeing an exponential increase in the amount of data available to them. Churning through all this data to get meaningful business insights means there’s very little room for data discrepancies. How do we put in place a robust data quality assurance process?

Fortunately for today’s DataOps engineers, tools like Datafold and dbt (Data Build Tool) are simplifying the challenge of ensuring data quality. In this post, we’ll look at how these two tools work in tandem to bring observability and repeatability into your data quality process.
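
To make the underlying idea concrete before diving in, here is a minimal, hypothetical sketch of the kind of comparison this workflow is built around: diffing a dev build of a table against production. It uses plain pandas and made-up column names purely to illustrate the concept; it is not Datafold's or dbt's actual API.

```python
import pandas as pd

# Two hypothetical versions of the same model's output: production vs. a dev build
# produced by a dbt code change. The tooling automates this kind of comparison;
# this sketch only illustrates the idea.
prod = pd.DataFrame({"order_id": [1, 2, 3], "revenue": [10.0, 25.0, 40.0]})
dev = pd.DataFrame({"order_id": [1, 2, 3], "revenue": [10.0, 27.5, 40.0]})

# Schema and row-count parity are the cheapest checks.
assert list(prod.columns) == list(dev.columns), "schema drift between environments"
print("row count delta:", len(dev) - len(prod))

# A value-level diff keyed on the primary key surfaces changed rows.
merged = prod.merge(dev, on="order_id", suffixes=("_prod", "_dev"))
changed = merged[merged["revenue_prod"] != merged["revenue_dev"]]
print(changed)
```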

8 Quick Tips to Improve Decision Making With Better Data Quality

The term "data quality" on the search engine results in six million pages, which clearly expresses the importance of data quality and its crucial role in the decision-making context. However, understanding the data helps classify and qualify it for effective use in the required scenario. 

Understanding the Quality of Data

Good quality data is accurate, consistent, and scalable. Data should also be helpful in decision-making, operations, and planning. On the other hand, poor-quality data can cause delays in deploying new systems, reputational damage, low productivity, poor decision-making, and lost revenue. According to a report by The Data Warehousing Institute, poor-quality customer data costs U.S. businesses approximately $611 billion per year. The research also found that 40 percent of firms have suffered losses due to insufficient data quality. 

The Essential Data Cleansing Checklist

Data quality issues, such as missing, duplicate, inaccurate, invalid, and inconsistent values, cause headaches in finding and using data sets. A suitable data cleansing procedure handles this bad data and makes it usable for other people and systems.

A good data cleansing process standardizes data, fixes or removes erroneous values, and formats records so they are readable. You get these results from data cleansing when you know your data’s original purpose and can picture the clean data you need to meet new goals. Create a solid foundation and run through the essential data cleansing checklist in this article to achieve your objectives.
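
As a starting point, here is a minimal pandas sketch of a few common checklist items: standardizing values, removing unusable and duplicate records, and normalizing formats. The columns and rules are illustrative assumptions, not a complete cleansing procedure.

```python
import pandas as pd

# A toy customer extract with the usual problems: duplicates, missing values,
# and inconsistent formatting. Column names are made up for the example.
raw = pd.DataFrame({
    "email": ["A@Example.com", "a@example.com", None, "b@example.com"],
    "signup_date": ["2021-01-05", "2021-01-05", "2021-02-10", "02/10/2021"],
    "country": ["US", "us", "USA", "US"],
})

clean = (
    raw
    .assign(email=lambda df: df["email"].str.strip().str.lower())  # standardize casing/whitespace
    .dropna(subset=["email"])                                      # remove unusable records
    .drop_duplicates(subset=["email"])                             # deduplicate on the key field
)
# Normalize dates to one type; values that don't parse cleanly become NaT for later review.
clean["signup_date"] = pd.to_datetime(clean["signup_date"], errors="coerce")
# Map country variants to a single canonical code.
clean["country"] = clean["country"].str.upper().replace({"USA": "US"})
print(clean)
```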

Why Is Fuzzy Matching Software a Key for Deduplication?

Identifying golden and unique records across or within datasets is crucial to prevent identity theft, meet compliance regulations, and improve customer acquisition. Banks, government organizations, healthcare providers, and marketing companies all require matching algorithms to identify and deduplicate redundant entries to enrich their master database.

Fuzzy matching is a well-known family of algorithms for measuring the distance between two similar entities. But certain limitations hinder its effectiveness in quickly finding matches across larger, disparate datasets. 
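
To make the idea concrete, the sketch below scores candidate duplicate pairs with Python's standard-library SequenceMatcher; the records and the threshold are made up for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical customer names with near-duplicate spellings.
records = ["Jonathan Smith", "Jonathon Smith", "Acme Corp.", "ACME Corporation", "Jane Doe"]

def similarity(a: str, b: str) -> float:
    """Normalized edit-based similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Flag pairs above a threshold as candidate duplicates for review or merging.
THRESHOLD = 0.8
for a, b in combinations(records, 2):
    score = similarity(a, b)
    if score >= THRESHOLD:
        print(f"possible duplicate: {a!r} ~ {b!r} (score={score:.2f})")
```

Note that the "Acme Corp." / "ACME Corporation" pair falls below this threshold, and comparing every pair grows quadratically with the number of records — exactly the kind of limitation that hampers simple fuzzy matching on larger, disparate datasets.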

Using Machine Learning to Automate Data Cleansing

According to a Gartner report, 40% of businesses fail to achieve their business targets because of poor data quality. Many data scientists recognize the importance of using high-quality data for analysis, which is why they reportedly spend about 80% of their time on data cleaning and preparation. This means they spend more time on pre-analysis processes than on extracting meaningful insights.

Although it is necessary to achieve a golden record before moving on to data analysis, there must be a better way of fixing the data quality issues in your dataset than correcting each error manually.
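
One such approach (an assumption on my part, not necessarily the specific method the article describes) is to let an anomaly-detection model flag suspect records so humans only review the exceptions. The sketch below uses scikit-learn's IsolationForest on made-up order amounts.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical numeric feature (e.g., order amounts) with a few corrupt entries mixed in.
rng = np.random.default_rng(42)
amounts = np.concatenate([rng.normal(100, 15, 500), [1e6, -500, 9e5]]).reshape(-1, 1)

# An isolation forest learns what "normal" looks like and scores outliers,
# so suspect records can be routed for correction instead of eyeballing every row.
model = IsolationForest(contamination=0.01, random_state=0)
labels = model.fit_predict(amounts)   # -1 = flagged as anomalous, 1 = normal
flagged = amounts[labels == -1].ravel()
print("records flagged for review:", flagged)
```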

Why Your Remote Team Needs a Single Source of Truth

By now, most businesses have adopted the remote work model, but few can sail a smooth ship. I’d know. Having worked for businesses with chaotic processes, I know for a fact that it’s difficult to manage a remote team if you don’t have the right tools, the right resources, and a single source of truth for team members to access.

Plenty has been said about tools and resources, so they won’t be the primary focus of this article. What business leaders need to understand in greater depth is the “single source of truth,” a relatively new term in business jargon, borrowed from information design and theory.

Modern Cloud Data Management

What Is Cloud Data Management?

Cloud data management is the implementation of cloud data management platforms and tools, policies, and procedures that give organizations control of their business data, both in the cloud and in hybrid setups where data is stored or sourced in a combination of on-premises and cloud applications.

The ever-growing list of cloud applications and tools being adopted by enterprises is leading to an exponential growth in data — whether structured, unstructured or semi-structured. Because this data is a critical asset for modern enterprises, managing it has become a strategic imperative, especially as the number of data users increases, the quantity and variety of data grow, and business processes evolve.

Five Aspects That Affect Data Quality on CRM

The central requirement for a successful CRM/ERP undertaking is the quality of the stored data. Data such as customers’ contact details is pivotal in any CRM system. In this blog, we will shed some light on data quality in CRM, its importance, and how you can ensure that the data is worth your time and effort.


CRM Data Quality

Data quality is often understood to be just data accuracy. However, it’s more about the value it adds to your CRM goals. After all the effort of entering the data, it’s pointless if the data is not making a difference to your business. Bad data makes meaningful analysis impossible, eventually sending your CRM plans down the drain.

Issues With Machine Learning in Software Development

Machine learning transparency

To learn about the current and future state of machine learning (ML) in software development, we gathered insights from IT professionals from 16 solution providers. We asked, "What are the most common issues you see when using machine learning in the SDLC?" Here's what we learned:


Data Quality

  • The most common issue when using ML is poor data quality. The adage is true: garbage in, garbage out. It is essential to have good quality data to produce quality ML algorithms and models. To get high-quality data, you must implement data evaluation, integration, exploration, and governance techniques prior to developing ML models. 
  • ML is only as good as the data you provide it, and you need a lot of data. The accuracy of ML is driven by the quality of the data. Other common gaps are lacking a data science team and not designing the product in a way that’s amenable to data science. 
  • 1) Integrating models into the application and spinning up the infrastructure for them. 2) Debugging: people don’t know how to retrace the performance of a model. 3) Deterioration of model performance over time. People don’t think about data upfront: do I have the right data to solve the problem and create a model? 
  • Common issues include lack of good clean data, the ability to apply the correct learning algorithms, black-box approaches, bias in training data/algorithms, etc. Another issue we see is model maintenance. When you think about traditional, hand-coded software, it becomes more and more stable over time; as you detect bugs, you are able to make tweaks to fix it and make it better. With ML being optimized towards outcomes, self-running, and dependent on the underlying data, there can be some model degradation that might lead to less optimal outcomes. Assuming ML will work faultlessly post-production is a mistake, and we need to be laser-focused on monitoring ML performance after deployment as well.

Transparency

  • The most common issue I find is the lack of model transparency. It is often very difficult to make definitive statements about how well a model will generalize in new environments. You often have to ask, “What are the modes of failure, and how do we fix them?”
  • It’s a black box for most people. Developers like to go through the code to figure out how things work. Customers who instrument code with tracing before and after ML decision making can observe program flow around those functions and learn to trust them (a minimal sketch of this kind of tracing follows this list). Are decisions made in a deterministic way? Machine-based tools can modify code, for example Kite’s automated code generation, injecting tracking code. Treat machine-generated code as part of the process and audit it. 
  • As with any AI/ML deployment, the “one-size-fits-all” notion does not apply, and there is no magical “out of the box” solution. Specific products and scenarios will require specialized supervision and custom fine-tuning of tools and techniques. Additionally, assuming ML models use unsupervised and closed-loop techniques, the goal is that the tooling will auto-detect and self-correct. However, we have found AI/ML models can be biased. Sometimes the system may be more conservative in trying to optimize for error handling and error correction, in which case the performance of the product can take a hit. The tendency for certain conservative algorithms to over-correct on specific aspects of the SDLC is an area where organizations will need better supervision.
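
A minimal sketch of the tracing idea mentioned above, assuming a simple callable model and JSON-formatted logs; the field names and the toy model are illustrative only, not any particular vendor's instrumentation.

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ml_trace")

def traced_predict(model, features: dict):
    """Wrap a model call so inputs, output, and latency are logged for later audit."""
    start = time.perf_counter()
    prediction = model(features)  # any callable model
    elapsed_ms = (time.perf_counter() - start) * 1000
    log.info(json.dumps({
        "event": "ml_decision",
        "inputs": features,
        "prediction": prediction,
        "latency_ms": round(elapsed_ms, 2),
    }))
    return prediction

# Example with a stand-in "model": approve if the score clears a threshold.
def toy_model(f):
    return "approve" if f["score"] > 0.7 else "review"

traced_predict(toy_model, {"score": 0.85, "customer_id": "c-123"})
```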

Manpower

  • Having data and being able to use it in a way that does not introduce bias into the model. Organizations have to change how they think about software development and how they collect and use data, and make sure they have enough skill sets in the organization. More software developers are coming out of school with ML knowledge. Provide the opportunity to plan and prototype ideas. 
  • When you use a tool based on ML, you have to take into account the accuracy of the tool and weigh the trust you put in it against the effort required if you miss something. When you are using a technology based on statistics, it can take a long time, say two weeks, to detect and fix an issue. It requires training and dealing with a black box. Building software with ML takes manpower and time to train, and retaining talent is a challenge. Testing is also harder when the product has statistical elements in it; you need to take different approaches to testing products with AI. 
  • This is still a new space. There are always innovators with the skills to pick up these new technologies and techniques to create value. Companies using ML have to do a lot of self-help because the ecosystem is not built out; you will need to figure out how to get work done and get value. Talent is a big issue. The second is training data sets: we need good training data to teach the model, and the value is in the training data sets over time. The third is data availability and the amount of time it takes to get a data set. It takes a Fortune 500 company one month to get a data set to a data scientist. That’s a lot of inefficiency, and it hurts the speed of innovation.

Other

  • The most common issue by far with ML is people using it where it doesn’t belong. Every time there’s some new innovation in ML, you see overzealous engineers trying to use it where it’s not really necessary. This used to happen a lot with deep learning and neural networks. Just because you can solve a problem with complex ML doesn’t mean you should.
  • We have to constantly explain that things not possible 20 years ago are now possible. You have to gain trust, try it, and see that it works.
  • If you have not done this before it requires a lot of preparation. You pull historical data to train the model but then you need a different preparation step on the deployment side. This is a major issue typical implementations run into. The solution is tooling to manage both sides of the equation. 
  • Traceability and reproduction of results are two main issues. For example, an experiment will have results for one scenario, and as things change during the experimentation process, it becomes harder to reproduce the same results. Version control around the specific data used, the specific model, and its parameters and hyperparameters is critical when mapping an experiment to its results (a minimal logging sketch follows this list). Organizations are often running different models on different data with constantly updated parameters, which inhibits accurate and effective performance monitoring. Focusing on the wrong metrics and over-engineering the solution are also problems when leveraging machine learning in the software development lifecycle. The best approach we’ve found is to simplify a need to its most basic construct and evaluate performance and metrics before applying more ML.
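
A minimal sketch of the experiment-logging idea from the last point, assuming a local JSONL file and a training-data file that can be hashed; the fields and the helper name are hypothetical, and real setups would typically use a dedicated experiment-tracking tool instead.

```python
import hashlib
import json
from datetime import datetime, timezone

def log_experiment(params: dict, data_path: str, metrics: dict,
                   out_file: str = "experiments.jsonl") -> None:
    """Append one experiment record so results can be traced back to data and parameters."""
    with open(data_path, "rb") as f:
        data_hash = hashlib.sha256(f.read()).hexdigest()
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "data_sha256": data_hash,  # ties the result to the exact training data
        "params": params,          # model parameters and hyperparameters
        "metrics": metrics,        # observed results for this run
    }
    with open(out_file, "a") as f:
        f.write(json.dumps(record) + "\n")

# Usage (paths and values are illustrative):
# log_experiment({"model": "xgboost", "max_depth": 6, "lr": 0.1},
#                "train.csv", {"auc": 0.91})
```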


3 Takeaways From the 2019 Gartner Market Guide for Data Prep

Gartner has recently released its 2019 Market Guide for Data Preparation ([1]), the fourth edition of a guide first published in the early days of the market, back in 2015, when Data Preparation was mostly intended to support self-service use cases. Compared to Magic Quadrants, the Market Guide series generally covers early, mature, or smaller markets, with less detailed information about competitive positioning between vendors but more information about the market itself and how it evolves over time.

While everyone's first instinct with these kinds of documents might be to check the vendor profiles (where you'll find Talend Data Preparation listed with a detailed profile), I would recommend focusing on the thought leadership and market analysis that the report provides. Customers should consider the commentary delivered by the authors, Ehtisham Zaidi and Sharat Menon, on how to successfully expand the reach and value of Data Preparation within their organization.

Intelligent Big Data Lake Governance

When you have data, and data flowing fast and with variety into the ecosystem, the biggest challenge is governing that data. In traditional data warehouses, where data is structured and the structure is always known, creating processes, methods, and frameworks is quite easy. But in a big data environment, where data flows in fast and the schema is inferred at run time, data needs to be governed at run time as well.

When I was working with my team to develop an ingestion pipeline, collecting ideas from the team and other stakeholders on what the pipeline should look like, one idea kept coming up: can we build a system that analyzes what changed overnight in a feed's structure? The second requirement was detecting patterns in the data, e.g., how could we tell that a data element was an SSN, a first name, etc., so that we could tag sensitive information at run time?
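
Here is a minimal sketch of both requirements, under the assumption that each day's inferred schema is available as a simple name-to-type mapping and that SSN detection can start with a regular expression. The field names and patterns are illustrative only; a production pipeline would need far more robust classification than this.

```python
import re

# Yesterday's and today's inferred schemas for a feed (field name -> type); values are made up.
yesterday = {"id": "int", "first_name": "string", "ssn": "string", "amount": "double"}
today = {"id": "int", "first_name": "string", "ssn": "string", "amount": "string", "channel": "string"}

# What changed overnight in the feed's structure?
added = set(today) - set(yesterday)
removed = set(yesterday) - set(today)
retyped = {k for k in set(today) & set(yesterday) if today[k] != yesterday[k]}
print(f"added={added} removed={removed} type_changed={retyped}")

# Simple run-time pattern detection: tag values that look like SSNs as sensitive.
SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")

def tag_sensitive(sample_values):
    """Return the sample values that match the SSN pattern."""
    return [v for v in sample_values if SSN_PATTERN.match(str(v))]

print(tag_sensitive(["123-45-6789", "John", "987-65-4321"]))
```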

Tom’s Tech Notes: What You Need to Know About Big Data [Podcast]

Welcome to our latest episode of Tom's Tech Notes! In this episode, we'll hear advice from a host of industry experts on the most important things you need to know about big data. Learn some tips around data quality, big data app development, data governance, and more.

The Tom's Tech Notes podcast features conversations that our research analyst Tom Smith has had with software industry experts from around the world as part of his work on our research guides. We put out new episodes every Sunday at 11 AM EST.