Model Experiments, Tracking, and Registration Using MLflow on Databricks and StreamSets

Learn how StreamSets, a modern data integration platform for DataOps, can help expedite operations at some of the most crucial stages of the machine learning lifecycle and MLOps.
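To make the "experiments, tracking, and registration" part of the title concrete, here is a minimal MLflow sketch. The experiment path, hyperparameter values, and registered model name are placeholders, and a scikit-learn model stands in for whatever model your pipeline actually produces; treat it as an illustration, not the workflow described in the original posts.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Hypothetical experiment path on a Databricks workspace; adjust to your setup.
mlflow.set_experiment("/Shared/streamsets-mlops-demo")

# Synthetic data standing in for the prepared dataset.
X, y = make_regression(n_samples=500, n_features=4, noise=0.1, random_state=42)

with mlflow.start_run() as run:
    params = {"n_estimators": 50, "max_depth": 5}
    model = RandomForestRegressor(**params, random_state=42).fit(X, y)

    mlflow.log_params(params)                          # track hyperparameters
    mlflow.log_metric("train_r2", model.score(X, y))   # track an evaluation metric
    mlflow.sklearn.log_model(model, "model")           # log the model artifact

# Register the logged model in the MLflow Model Registry (hypothetical name).
mlflow.register_model(f"runs:/{run.info.run_id}/model", "streamsets_demo_model")
```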

Data Acquisition and Preparation

Machine learning models are only as good as the quality of the data and the size of the datasets used to train them. Surveys have shown that data scientists spend around 80% of their time preparing and managing data for analysis, and that 57% of data scientists regard cleaning and organizing data as the least enjoyable part of their work. This further validates the idea of MLOps and the need for collaboration between data scientists and data engineers.

StreamSets Transformer Extensibility: Spark and Machine Learning Part One

Apache Spark has been on the rise for the past few years, and it continues to dominate the landscape when it comes to in-memory and distributed computing, real-time analysis, and machine learning use cases. And with the recent release of StreamSets Transformer, a powerful tool for creating highly instrumented Apache Spark applications for modern ETL, you can quickly start leveraging all the benefits and power of Apache Spark with minimal operational and configuration overhead.

In this blog, you will learn how to extend StreamSets Transformer to train a Spark ML RandomForestRegressor model.
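For a flavor of the kind of model such a pipeline trains, here is a minimal, standalone PySpark sketch of fitting a RandomForestRegressor. The input path, column names, and hyperparameters are hypothetical and not taken from the blog; inside Transformer this logic would live in an extensibility stage rather than a standalone script.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.appName("rf-regressor-example").getOrCreate()

# Hypothetical dataset with numeric feature columns and a "label" column.
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Assemble feature columns into a single vector column for Spark ML.
assembler = VectorAssembler(inputCols=["feature1", "feature2"], outputCol="features")
train_df, test_df = assembler.transform(df).randomSplit([0.8, 0.2], seed=42)

# Train the random forest regression model.
rf = RandomForestRegressor(featuresCol="features", labelCol="label", numTrees=50)
model = rf.fit(train_df)

# Evaluate on the held-out split.
predictions = model.transform(test_df)
rmse = RegressionEvaluator(labelCol="label", predictionCol="prediction",
                           metricName="rmse").evaluate(predictions)
print(f"RMSE: {rmse}")
```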

Preview and Snapshot Features in StreamSets Data Collector

Hello from your newly appointed community champion and technical evangelist here at StreamSets! My name is Dash Desai, and you will find me writing blog posts and cruising the community forums, answering questions about StreamSets Data Collector as well as learning from community members. I will also be presenting at meetups and conferences, so if you happen to be attending, please stop by and say hi. My first post for StreamSets, explaining the powerful Preview and Snapshot features in Data Collector, was inspired by one of the community members (thank you, Edward).

Introduction

When creating data pipelines for big data projects and working with a diverse set of structured, semi-structured, and unstructured data sources, it is imperative that you get a true sense of the data transformations at every stage, not just to ensure data integrity and data quality but also for debugging and audit trail purposes. Phrases like "garbage in, garbage out," "fail fast, fail often," and "agile and iterative development" apply just as much to building dataflow pipelines.