Fantastic ML Pipelines and Tips for Building Them

A machine learning (ML) pipeline is an automated workflow that transforms data, feeds it through a model, and evaluates the outcome. To do this, an ML pipeline consists of several steps, such as model training, model evaluation, and visualization after post-processing. Each step is crucial to the success of the whole pipeline, both in the short term and in the long run. To keep a pipeline sustainable over time, ML engineers and organizations need to account for several ML-specific risk factors in the system design. Researchers at Google pinpoint risk factors such as boundary erosion, entanglement, hidden feedback loops, undeclared consumers, data dependencies, configuration issues, changes in the external world, and a variety of system-level anti-patterns [1]. In this article, we dive into the root causes of some of these risk factors.

Figure 1: Automated pipeline (source: 123.rf)
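The transform, model, and evaluate flow described in the introduction can be sketched in a few lines of plain Python. This is only a toy illustration: the function names (transform, predict, evaluate, run_pipeline), the mean-centering step, the sign-based "model", and the data are all assumptions made for the example, not part of any particular library or of the pipeline discussed in [1].

```python
from statistics import mean


def transform(rows):
    """Center each feature column on zero (a stand-in preprocessing step)."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def predict(rows):
    """Toy 'model' (hypothetical): predict 1 when the feature sum is positive."""
    return [1 if sum(row) > 0 else 0 for row in rows]


def evaluate(predictions, labels):
    """Accuracy: the fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def run_pipeline(rows, labels):
    """Chain the steps: each one consumes the output of the previous one."""
    return evaluate(predict(transform(rows)), labels)


# Made-up data, purely for illustration.
rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
labels = [0, 0, 1, 1]
accuracy = run_pipeline(rows, labels)
```

Because every step depends on the exact shape of the previous step's output, a change in one place can quietly ripple through the rest, which is where the risk factors below come from.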

1. Boundary Erosion

Suppose you are handed an ML pipeline and your data team proposes a change to the input features, such as increasing or reducing their dimensionality. Could you guarantee that the change won't affect the rest of the pipeline? In most cases, the answer is no.
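One way to make such a change fail loudly rather than silently is to check the feature count at the boundary of the pipeline. The sketch below is a minimal illustration in plain Python, not a recommended production design: EXPECTED_NUM_FEATURES, WEIGHTS, FeatureContractError, check_features, score, and safe_score are all hypothetical names invented for the example.

```python
EXPECTED_NUM_FEATURES = 3  # hypothetical contract agreed with the data team
WEIGHTS = [0.5, -1.0, 2.0]  # hypothetical weights of a linear scoring model


class FeatureContractError(ValueError):
    """Raised when an input row does not match the expected feature count."""


def check_features(rows, expected=EXPECTED_NUM_FEATURES):
    """Validate the feature count of every row before it enters the model."""
    for i, row in enumerate(rows):
        if len(row) != expected:
            raise FeatureContractError(
                f"row {i} has {len(row)} features, expected {expected}"
            )
    return rows


def score(rows):
    """Unchecked scoring: zip() stops at the shorter sequence, so a row with
    fewer features still yields a score and the mismatch goes unnoticed."""
    return [sum(w * x for w, x in zip(WEIGHTS, row)) for row in rows]


def safe_score(rows):
    """Checked scoring: the boundary check runs before the model sees data."""
    return score(check_features(rows))


# A row that lost one feature still yields a number from score() ...
silent_result = score([[1.0, 2.0]])

# ... whereas safe_score() rejects it at the boundary.
try:
    safe_score([[1.0, 2.0]])
    rejected = False
except FeatureContractError:
    rejected = True
```

A check like this does not remove the coupling between the data and the model, but it turns a silent change in behavior into an explicit error that points at the step where the contract was broken.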

Intro to Machine Learning for Developers

Welcome to the world of machine learning with scikit-learn. Machine learning can be overwhelming at times, partly because of the large number of tools available on the market. This post narrows the tool selection down to one: scikit-learn.

In this series, you will learn how to construct an end-to-end machine learning pipeline using some of the most popular algorithms in industry and in professional competitions such as Kaggle.