Fantastic ML Pipelines and Tips for Building Them

A machine learning (ML) pipeline is an automated workflow that transforms data, feeds it through a model, and evaluates the outcome. To do this, an ML pipeline consists of several steps, such as model training, model evaluation, and post-processing visualization. Each step is crucial to the success of the whole pipeline, not only in the short term but also in the long run. To keep a pipeline sustainable over the long run, ML engineers and organizations need to account for several ML-specific risk factors in the system design. The authors from Google pinpoint risk factors such as boundary erosion, entanglement, hidden feedback loops, undeclared consumers, data dependencies, configuration issues, changes in the external world, and a variety of system-level anti-patterns [1]. In this article, we will dive deep into the root causes of some of these risk factors.

Figure 1: Automated pipeline (source: 123rf)
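To make the picture concrete, here is a minimal sketch of such a pipeline. The choice of scikit-learn and the synthetic dataset are assumptions for illustration; the article does not prescribe any particular library.

```python
# Minimal ML pipeline sketch: transform the data, train a model, evaluate the outcome.
# Library (scikit-learn) and synthetic data are illustrative assumptions, not the article's setup.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data; in practice this would come from an upstream data/ETL step.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Transformation and model bundled as one unit, so training and serving share the same steps.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

pipeline.fit(X_train, y_train)                                 # training step
accuracy = accuracy_score(y_test, pipeline.predict(X_test))    # evaluation step
print(f"Test accuracy: {accuracy:.3f}")
```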

1. Boundary Erosion

If you are handed an ML pipeline and your data team approaches you with a change to an input feature, such as an increase or reduction in its dimensionality, can you guarantee that the change won't ripple through the entire pipeline? In most cases, the answer is no.
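One common mitigation (my own sketch, not a prescription from the article) is to validate the input schema at the pipeline boundary so that an unannounced change in feature dimensionality fails loudly instead of silently propagating downstream. The expected feature count and helper name below are hypothetical.

```python
# Hypothetical boundary check: fail fast when the input no longer matches the pipeline's contract.
import numpy as np

EXPECTED_N_FEATURES = 20  # hypothetical contract agreed upon with the data team

def check_input_boundary(X: np.ndarray) -> np.ndarray:
    """Raise a clear error if the feature matrix violates the agreed input schema."""
    if X.ndim != 2:
        raise ValueError(f"Expected a 2-D feature matrix, got {X.ndim}-D input")
    if X.shape[1] != EXPECTED_N_FEATURES:
        raise ValueError(
            f"Expected {EXPECTED_N_FEATURES} features, got {X.shape[1]}; "
            "an upstream feature change must be reviewed before it reaches the model"
        )
    return X

# Usage: run the check before every fit/predict so dimension changes surface immediately.
# model.predict(check_input_boundary(batch))
```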

4 Ways the IoT Creates Intelligent Pipeline Monitoring

In the U.S. alone, pipeline operators maintain around 3 million miles of natural gas distribution mains and pipelines. They often run through stretches of remote wilderness, across difficult-to-access terrain, or deep underwater. While traditional monitoring tools like SCADA systems can be effective, they also have significant weak points.

American water infrastructure is similar in length, made up of around 2.2 million miles of pipes that can likewise be difficult to monitor because of where they run. Innovations are helping to provide more effective pipeline monitoring solutions, and Internet of Things (IoT) devices could have a major impact on how pipelines are monitored. Why is the IoT the foundation for a new kind of intelligent pipeline monitoring?

Comparing Container Pipelines

Introduction

Containers brought a monumental shift to DevOps by allowing teams to ship code faster than ever before. However, we still have to go through the process of building, packaging, and deploying those containers. That's why we use container pipelines.
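Whatever tool ends up running it, the core job is the same. The sketch below approximates the build-and-publish steps with plain Docker CLI calls driven from Python; the registry and image name are placeholders, and a real pipeline would express these steps in the CI system's own configuration format.

```python
# Rough sketch of what a container pipeline automates: build the image, then publish it.
# REGISTRY/IMAGE are placeholders; requires a local Docker daemon and registry credentials.
import subprocess

REGISTRY = "registry.example.com/myteam"   # hypothetical registry
IMAGE = f"{REGISTRY}/myapp:latest"         # hypothetical image name

def run(cmd: list[str]) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)        # abort on failure, like a failing CI stage

run(["docker", "build", "-t", IMAGE, "."])  # build and package the code into an image
run(["docker", "push", IMAGE])              # publish the image for downstream deployment
```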

There are many different choices when it comes to container pipelines, though. How do we know which one to use? In this article, we'll compare six options and cover the configuration, benefits, limitations, and pricing of each.

Performance of Pipeline Architecture: The Impact of the Number of Workers

With the advancement of technology, the rate at which data is produced has increased. In numerous application domains, it is critical to process such data in real time rather than with a store-and-process approach. For real-time processing, many applications adopt the pipeline architecture to process data in a streaming fashion. The pipeline architecture is a parallelization approach that decomposes a program into a series of stages, where each stage consists of a queue and a worker. Each stage takes the output of the previous stage as its input, processes it, and passes the result on as the input of the next stage. One key factor that affects the performance of a pipeline is the number of stages. In this article, we will first investigate the impact of the number of stages on performance and show that the number of stages that yields the best performance depends on the workload characteristics.

Background

The pipeline architecture is commonly used when implementing applications in multithreaded environments. We can think of it as a collection of connected components (or stages), where each stage consists of a queue (buffer) and a worker.
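As a minimal illustration (my own sketch, not code from the article's benchmark), the following wires two stages together, each with its own queue and worker thread, so that the output of one stage becomes the input of the next.

```python
# Two-stage pipeline sketch: each stage = a queue (buffer) plus a worker thread.
# The stage functions and item values are illustrative only.
import queue
import threading

SENTINEL = object()  # marks the end of the stream

def worker(in_q: queue.Queue, out_q: queue.Queue, fn) -> None:
    while True:
        item = in_q.get()
        if item is SENTINEL:
            out_q.put(SENTINEL)   # propagate shutdown to the next stage
            break
        out_q.put(fn(item))       # process, then hand off to the next stage's queue

stage1_q, stage2_q, results_q = queue.Queue(), queue.Queue(), queue.Queue()

# Stage 1 parses raw strings, stage 2 squares the numbers; each runs in its own worker.
threading.Thread(target=worker, args=(stage1_q, stage2_q, int), daemon=True).start()
threading.Thread(target=worker, args=(stage2_q, results_q, lambda x: x * x), daemon=True).start()

for raw in ["1", "2", "3"]:
    stage1_q.put(raw)             # feed raw items into the first stage's queue
stage1_q.put(SENTINEL)

while (item := results_q.get()) is not SENTINEL:
    print(item)                   # prints 1, 4, 9
```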