Best Practices for Data Pipeline Error Handling in Apache NiFi

According to a McKinsey report, "the best analytics are worth nothing with bad data". We as data engineers and developers know this simply as "garbage in, garbage out". Today, with the success of the cloud, data sources are many and varied. Data pipelines help us to consolidate data from these different sources and work on it. However, we must ensure that the data used is of good quality. As data engineers, we mold data into the right shape, size, and type with high attention to detail.

Fortunately, we have tools such as Apache NiFi, which allow us to design and manage our data pipelines, reducing the amount of custom programming and increasing overall efficiency. Yet, when it comes to creating them, a key and often neglected aspect is minimizing potential errors.

Modern Apache NiFi Load Balancing

Recent versions of Apache NiFi offer a new and improved way to load balance data between the nodes of a cluster. Starting with NiFi 1.8.0, load balancing can be configured directly on any connection between processors, giving you an easy-to-set option for automatically distributing data across your nodes.

The legacy days of using Remote Process Groups to distribute load between Apache NiFi nodes are over. For maximum flexibility, performance, and ease of use, upgrade your existing flows to the built-in Connection Load Balancing.
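To make the idea concrete, here is a minimal Python sketch of how two of the connection load balancing strategies, round robin and partition by attribute, spread FlowFiles across nodes. It is only a simulation under stated assumptions: the function names, the dictionary-based FlowFiles, and the CRC32 hash are illustrative choices and do not reflect how NiFi implements these strategies internally.

```python
import zlib
from itertools import cycle


def round_robin(flowfiles, nodes):
    """Hand FlowFiles to nodes in turn, one after another."""
    assignment = {node: [] for node in nodes}
    for flowfile, node in zip(flowfiles, cycle(nodes)):
        assignment[node].append(flowfile)
    return assignment


def partition_by_attribute(flowfiles, nodes, attribute):
    """Send FlowFiles that share an attribute value to the same node.

    The CRC32 hash is an assumption made for this sketch only.
    """
    assignment = {node: [] for node in nodes}
    for flowfile in flowfiles:
        value = str(flowfile.get(attribute, ""))
        index = zlib.crc32(value.encode("utf-8")) % len(nodes)
        assignment[nodes[index]].append(flowfile)
    return assignment


if __name__ == "__main__":
    nodes = ["node1", "node2", "node3"]  # hypothetical cluster nodes
    flowfiles = [{"id": i, "customer": f"c{i % 2}"} for i in range(6)]
    print(round_robin(flowfiles, nodes))
    print(partition_by_attribute(flowfiles, nodes, "customer"))
```

Round robin evens out the volume per node, while partitioning by attribute keeps related data together on one node, which matters when a downstream processor needs to see every FlowFile for a given key.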

Migrating Apache Flume Flows to Apache NiFi: Kafka Source to Multiple Sinks

The world of streaming is constantly moving... yes I said it. Every few years some projects get favored by the community and by developers. Apache NiFi has stepped ahead and has been the go-to for quickly ingesting sources and storing those resources to sinks with routing, aggregation, basic ETL/ELT, and security. I am recommending a migration from legacy Flume to Apache NiFi. The time is now.

Below, I walk you through a common use case. It's easy to integrate Kafka as a source or sink with Apache NiFi or MiNiFi agents, and we can add HDFS or Kudu sinks as well. All of this comes with full security, SSO, governance, cloud and Kubernetes support, schema support, full data lineage, and an easy-to-use UI. Don't get fluming mad, let's try another great Apache project.

Arm Twisting Apache NiFi

Introduction

Apache NiFi is a software project from the Apache Software Foundation, designed to automate the flow of data between software systems.

Early this year, I created a generic, metadata-driven data offloading framework using Talend. While I was championing that tool, many accounts raised concerns regarding the Talend license. While some were apprehensive about the additional cost, many others questioned the tool itself, because their account already had licenses for competing ETL tools like DataStage and Informatica (to name a few). A few accounts also wanted to know if the same concept of offloading could be made available using NiFi. Therefore, it was most logical to explore NiFi.

Apache NiFi Overview

What Is Apache NiFi?

Apache NiFi is a robust open-source Data Ingestion and Distribution framework and more. It can propagate any data content from any source to any destination.

NiFi is based on a different programming paradigm called Flow-Based Programming (FBP). I'm not going to explain the definition of Flow-Based Programming. Instead, I will explain how NiFi works, and then you can connect it with the definition of Flow-Based Programming.

Integration of Apache NiFi and Cloudera Data Science Workbench for Deep Learning Workflows

Summary

Now that we have shown that it is easy to do standard NLP, next up is Deep Learning. As you can see, NLP, Machine Learning, Deep Learning, and more are all within your reach for building your own AI as a Service using tools from Cloudera. These can run in public or private clouds at scale. Now you can run and integrate machine learning services, computer vision APIs, and anything you have created in-house with your own Data Scientists. To run the pre-trained YOLO model, the Python 3 script downloads the image from the URL to /tmp and processes it. The script will also download the GluonCV model for YOLOv3.

Using Pre-trained Model:

Reading SUDO Logs With Apache NiFi

Log, Log, Log

Sudo logs have a lot of useful information on hosts, users, and auditable actions that may be useful for cybersecurity, capacity planning, user tracking, data lake population, user management, and general security.
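As a rough illustration of what those fields look like once extracted, here is a minimal Python sketch that parses a syslog-style sudo line into a record. It assumes the common `host sudo: user : TTY=... ; PWD=... ; USER=... ; COMMAND=...` layout; the sample line, the `parse_sudo_line` name, and the `run_as` key are hypothetical, and real log layouts vary by system and sudo configuration.

```python
import re

# Header of a syslog-style sudo line: timestamp, host, invoking user.
HEADER = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) sudo(?:\[\d+\])?:\s+"
    r"(?P<user>\S+) : (?P<fields>.*)$"
)

# USER= is the target account, so rename it to avoid clobbering "user".
KEY_NAMES = {"USER": "run_as"}


def parse_sudo_line(line):
    """Return a dict of fields from one sudo log line, or None if no match."""
    match = HEADER.match(line.strip())
    if match is None:
        return None
    record = {key: match.group(key) for key in ("timestamp", "host", "user")}
    # COMMAND= comes last and may itself contain " ; ", so split it off first.
    fields, _, command = match.group("fields").partition("COMMAND=")
    for field in fields.split(" ; "):
        key, sep, value = field.strip().partition("=")
        if sep:
            record[KEY_NAMES.get(key, key.lower())] = value
    record["command"] = command
    return record


if __name__ == "__main__":
    sample = (
        "Mar  3 10:15:42 web01 sudo:    alice : TTY=pts/0 ; "
        "PWD=/home/alice ; USER=root ; COMMAND=/usr/bin/systemctl restart nginx"
    )
    print(parse_sudo_line(sample))
```

A record shaped like this maps naturally onto a schema for downstream use, for example when populating a data lake or feeding a security dashboard.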

Symbol Model 1

Apache NiFi

Using Cloudera Data Science Workbench With Apache NiFi

Using Deployed Models as a Function as a Service

Using Cloudera Data Science Workbench with Apache NiFi, we can easily call functions within our deployed models from Apache NiFi as part of flows. I am working against CDSW on HDP, but it will work for all CDSW regardless of install type.

In my simple example, I built a Python model that uses TextBlob to run sentiment analysis on a sentence passed to it. It returns Sentiment Polarity and Subjectivity, which we can immediately act upon in our flow.
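A model function along these lines might look like the sketch below. This is not the original model code: the function name `sentiment`, the `sentence` input key, and the optional `analyzer` parameter (added here so the function can be exercised without TextBlob installed) are all assumptions made for illustration. The only TextBlob call used is `TextBlob(text).sentiment`, which exposes `polarity` and `subjectivity`.

```python
def sentiment(args, analyzer=None):
    """Return polarity and subjectivity for args["sentence"].

    `args` is assumed to be a dict built from the JSON request body.
    `analyzer` is a hypothetical hook: any callable that takes text and
    returns an object with `polarity` and `subjectivity` attributes.
    """
    if analyzer is None:
        # Imported lazily so the module loads even without TextBlob.
        from textblob import TextBlob

        def analyzer(text):
            return TextBlob(text).sentiment

    result = analyzer(args["sentence"])
    return {
        "polarity": result.polarity,
        "subjectivity": result.subjectivity,
    }


if __name__ == "__main__":
    # Requires `pip install textblob` when no analyzer is supplied.
    print(sentiment({"sentence": "Apache NiFi makes this flow easy to build."}))
```

Returning a plain dictionary keeps the response easy to serialize as JSON, so a flow can parse the two values and route on them.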