Imputing Missing Data Using Sklearn SimpleImputer

In this post, you will learn how to use Python's Sklearn SimpleImputer for imputing/replacing numerical and categorical missing data using different strategies. In a related article posted some time back, the usage of the fillna method of a Pandas DataFrame was discussed. Here is the link: Replace missing values with mean, median and mode. Handling missing values is a key part of data preprocessing, so it is of utmost importance for data scientists and machine learning engineers to learn different techniques for imputing/replacing numerical or categorical missing values with appropriate values based on appropriate strategies.
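As a quick illustration of the idea, here is a minimal sketch using SimpleImputer on a small hypothetical DataFrame (the column names and values are made up for the example): the "mean" strategy fills numerical gaps, while "most_frequent" works for categorical columns.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with missing numerical and categorical values
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "city": ["NY", "SF", np.nan, "NY"],
})

# Numerical column: replace NaN with the column mean (25 + 40 + 31) / 3 = 32
num_imputer = SimpleImputer(strategy="mean")
df["age"] = num_imputer.fit_transform(df[["age"]]).ravel()

# Categorical column: replace NaN with the most frequent value ("NY")
cat_imputer = SimpleImputer(strategy="most_frequent")
df["city"] = cat_imputer.fit_transform(df[["city"]]).ravel()

print(df)
```

Other strategies such as "median" and "constant" (with a fill_value argument) follow the same fit/transform pattern.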

The following topics will be covered in this post:

Book Review: Machine Learning With Python for Everyone by Mark E. Fenner

Machine learning, one of the hottest tech topics of today, is being used more and more. Sometimes it's the best tool for the job; other times it's a buzzword used mainly to make a product look cooler. However, without knowing what ML is and how it works behind the scenes, it's very easy to get lost. This book does a great job of guiding you all the way from very simple math concepts to some sophisticated machine learning techniques.

Today, in the Python ecosystem, we have a plethora of powerful data science and machine learning packages available, like Numpy, Pandas, Scikit-learn, and many others, which help to simplify a lot of the field's inherent complexity. In case you are wondering, in terms of Python packages, the great hero in this book is Scikit-learn, often abbreviated as sklearn. Of course, data wrangling is much easier and much faster with Numpy and Pandas, so these two packages are always covering sklearn's back. Seaborn and Matplotlib, two of the most standard data visualization packages for Python, are also used here. In chapter 10, patsy makes a brief appearance, and in chapter 15, pymc3 is used in the context of probabilistic graphical models.

Exporting Decision Trees in Textual Format With sklearn

In the past, we have covered Decision Trees, showing how interpretable these models can be (see the tutorials here). In those tutorials, we exported the rules of the models using the function export_graphviz from sklearn and visualized its output graphically with an external tool that can be difficult to install in some cases. Luckily, since version 0.21.2, scikit-learn offers the possibility to export Decision Trees in a textual format (I implemented this feature personally!), and in this post we will see an example of how to use this new feature.

Let's train a tree with two layers on the famous iris dataset using all the data and print the resulting rules using the brand new function export_text:
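A minimal sketch of what that might look like (fitting a depth-2 DecisionTreeClassifier on iris and printing the rules; the random_state value is an arbitrary choice for reproducibility):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Load the iris dataset and fit a tree limited to two levels
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# export_text renders the learned rules as an indented plain-text tree
rules = export_text(clf, feature_names=iris.feature_names)
print(rules)
```

The output is an ASCII tree of if/else splits on the feature names, with the predicted class at each leaf, and needs no external visualization tool.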