The 10 Commandments for Performing a Data Science Project

In designing a data science project, establishing what we, or the users we are building models for, want to achieve is vital, but this understanding only provides a blueprint for success. To truly deliver against a well-established brief, data science teams must follow best practices in executing the project. To help establish what that might mean, I have come up with ten points to provide a framework that can be applied to any data science project.

1. Understand the Problem 

The most fundamental part of solving any problem is knowing exactly what problem you are solving. Make sure you understand what you are trying to predict, what the constraints are, and what the ultimate purpose of the project is. Ask questions early on and validate your understanding with peers, domain experts, and end users. If the answers align with your understanding, you know you are on the right path. 

Predicting Housing Prices Using Google AutoML Tables

Overview of Problem

Tabular data is common across business and engineering problems, and machine learning can be used to predict a particular column of a table using the other columns as input features. As an example, we will use historical house sales data to predict the sale prices of houses that come on the market in the future. The house prices dataset from Kaggle contains such data for Ames, Iowa, with a total of 79 predictive features such as lot area, neighborhood name, building type, house style, overall condition, year built, and year sold. Some of these features are categorical while others are numerical, and our goal is to predict SalePrice (a numeric column) from them. A sample of the data is shown below.

| Id | LotArea | Neighborhood | BldgType | Style  | Cond | YrBuilt | 1stFlrSF | 2ndFlrSF | Fireplaces | YrSold | SalePrice |
|----|---------|--------------|----------|--------|------|---------|----------|----------|------------|--------|-----------|
| 1  | 8450    | CollgCr      | 1Fam     | 2Story | 5    | 2003    | 856      | 854      | 0          | 2008   | 208500    |
| 2  | 9600    | Veenker      | 1Fam     | 1Story | 8    | 1976    | 1262     | 0        | 1          | 2007   | 181500    |
| 3  | 11250   | CollgCr      | 1Fam     | 2Story | 5    | 2001    | 920      | 866      | 1          | 2008   | 223500    |
| 4  | 9550    | Crawfor      | 1Fam     | 2Story | 5    | 1915    | 961      | 756      | 1          | 2006   | 140000    |
| 5  | 14260   | NoRidge      | 1Fam     | 2Story | 5    | 2000    | 1145     | 1053     | 1          | 2008   | 250000    |
| 6  | 14115   | Mitchel      | 1Fam     | 1.5Fin | 5    | 1993    | 796      | 566      | 0          | 2009   | 143000    |
| 7  | 10084   | Somerst      | 1Fam     | 1Story | 5    | 2004    | 1694     | 0        | 1          | 2007   | 307000    |
| 8  | 10382   | NWAmes       | 1Fam     | 2Story | 6    | 1973    | 1107     | 983      | 2          | 2009   | 200000    |
| 9  | 6120    | OldTown      | 1Fam     | 1.5Fin | 5    | 1931    | 1022     | 752      | 2          | 2008   | 129900    |
| 10 | 7420    | BrkSide      | 2fmCon   | 1.5Unf | 6    | 1939    | 1077     | 0        | 2          | 2008   | 118000    |
| 11 | 11200   | Sawyer       | 1Fam     | 1Story | 5    | 1965    | 1040     | 0        | 0          | 2008   | 129500    |
| 12 | 11924   | NridgHt      | 1Fam     | 2Story | 5    | 2005    | 1182     | 1142     | 2          | 2006   | 345000    |
| 13 | 12968   | Sawyer       | 1Fam     | 1Story | 6    | 1962    | 912      | 0        | 0          | 2008   | 144000    |
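
Before handing the data to AutoML, it is worth loading it locally to confirm the column types and the target. Below is a minimal sketch with pandas, assuming the Kaggle training file has been downloaded to the working directory as train.csv:

```python
import pandas as pd

# Kaggle's "House Prices: Advanced Regression Techniques" training file,
# assumed to be downloaded locally as train.csv.
df = pd.read_csv("train.csv")

print(df.shape)  # typically (1460, 81): Id + 79 predictive features + SalePrice

# Separate numeric and categorical predictors; AutoML Tables infers these
# types automatically, but it pays to sanity-check them yourself.
features = df.drop(columns=["Id", "SalePrice"])
numeric = features.select_dtypes(include="number").columns
categorical = features.select_dtypes(exclude="number").columns
print(len(numeric), "numeric features;", len(categorical), "categorical features")

# The regression target.
print(df["SalePrice"].describe())
```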

Overview of Google AutoML Tables

Google AutoML Tables enables quick, high-accuracy training and subsequent hosting of ML models for this kind of problem. Users can import and visualize the data, train a model, evaluate it on a test set, iterate to improve model accuracy, and then host the best model for online or offline predictions. All of this functionality is available as a service, without any ML expertise, hardware, or software installation required of users. AutoML Tables can train both regression and classification models, depending on the type of column we are trying to predict.
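
To make that workflow concrete, here is a minimal sketch using the legacy `automl_v1beta1.TablesClient` from the `google-cloud-automl` Python package. The project ID, bucket path, and training budget are placeholders, and the Tables API has since been folded into Vertex AI, so treat the exact method names as indicative rather than canonical.

```python
from google.cloud import automl_v1beta1 as automl

# Placeholder project/region; substitute your own GCP values.
client = automl.TablesClient(project="my-gcp-project", region="us-central1")

# 1. Create a dataset and import the Kaggle CSV from Cloud Storage.
dataset = client.create_dataset(dataset_display_name="ames_housing")
client.import_data(
    dataset=dataset,
    gcs_input_uris="gs://my-bucket/ames/train.csv",  # hypothetical path
).result()  # block until the import finishes

# 2. Point AutoML at the column to predict; a numeric target
#    makes this a regression problem.
client.set_target_column(dataset=dataset, column_spec_display_name="SalePrice")

# 3. Train with a fixed budget (1000 milli node hours = 1 node hour).
model = client.create_model(
    model_display_name="ames_price_model",
    dataset=dataset,
    train_budget_milli_node_hours=1000,
).result()

# 4. Online prediction for a single house, keyed by column name
#    (only a few of the 79 features shown for brevity).
response = client.predict(
    model=model,
    inputs={"LotArea": 9600, "Neighborhood": "Veenker", "YrBuilt": 1976},
)
print(response.payload[0].tables.value.number_value)  # predicted SalePrice
```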

Make Crucial Predictions as Data Comes In

Flink: As Fast as a Squirrel

Walk down the hottest IT streets these days and you will likely hear about streaming machine learning, i.e., moving AI toward streaming scenarios and exploiting real-time capabilities together with new artificial intelligence techniques. You will also notice the lack of research on this topic, despite the growing interest in it.

If we investigate a little deeper, we realize that a step is missing: today's well-known streaming applications still do not handle the concept of model serving properly, and industry still leans on the lambda architecture to achieve this goal. Suppose a bank has a frequently updated, batch-trained machine learning model (e.g., an optimized gradient-descent model trained on past buffer-overflow attack attempts) and wants to deploy that model directly into its own canary environment.
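
To illustrate what serving a model inside the stream (rather than behind a lambda architecture) might look like, here is a minimal PyFlink sketch: a map operator loads a scorer once in open() and then scores every event as it arrives. The linear weights and threshold below are hypothetical stand-ins for the bank's batch-trained model, which in practice would be deserialized from a PMML or pickle artifact.

```python
from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.functions import MapFunction, RuntimeContext


class ScoreEvents(MapFunction):
    """Scores each incoming event with a model loaded once at startup."""

    def open(self, runtime_context: RuntimeContext):
        # Hypothetical stand-in for the bank's batch-trained model;
        # in practice, deserialize the trained artifact here so the
        # model is loaded once per task, not once per event.
        self.weights = [0.8, 0.2]

    def map(self, event):
        payload_len, repeat_ratio = event
        # Toy linear score standing in for the real classifier.
        score = self.weights[0] * payload_len + self.weights[1] * repeat_ratio
        return (payload_len, repeat_ratio, score > 0.5)


env = StreamExecutionEnvironment.get_execution_environment()

# A bounded toy source; in production this would be e.g. a Kafka topic
# of normalized request features.
events = env.from_collection(
    [(0.9, 0.7), (0.1, 0.2), (0.6, 0.9)],
    type_info=Types.TUPLE([Types.FLOAT(), Types.FLOAT()]),
)

events.map(
    ScoreEvents(),
    output_type=Types.TUPLE([Types.FLOAT(), Types.FLOAT(), Types.BOOLEAN()]),
).print()

env.execute("streaming_model_serving_sketch")
```

Because the model lives inside the operator, predictions happen as data comes in, with no round trip to an external serving layer; updating the model then becomes a matter of redeploying or hot-swapping the artifact the operator loads.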