The 10 Commandments for Performing a Data Science Project

In designing a data science project, establishing what we, or the users we are building models for, want to achieve is vital, but this understanding only provides a blueprint for success. To truly deliver against a well-established brief, data science teams must follow best practices in executing the project. To help establish what that might mean, I have come up with ten points to provide a framework that can be applied to any data science project.

1. Understand the Problem 

The most fundamental part of solving any problem is knowing exactly what problem you are solving. Make sure you understand what you are trying to predict, what the constraints are, and what the ultimate purpose of the project is. Ask questions early on and validate your understanding with peers, domain experts, and end users. If the answers align with your understanding, you know you are on the right path. 

Predicting Housing Prices Using Google AutoML Tables

Overview of Problem

Tabular data is common across business and engineering problems, and machine learning can be used to predict a particular column of a table using the other columns as input features. As an example, we will use historical house sales data to predict the sale prices of houses that come on the market in the future. The house prices dataset from Kaggle contains such data for Ames, Iowa, with a total of 79 predictive features such as lot area, neighborhood name, building type, house style, overall condition, year built, and year sold. Some of these features are categorical while others are numerical, and our goal is to predict SalePrice (a numeric column) from them. A sample of the data is shown below.

| Id | LotArea | Neighborhood | BldgType | Style  | Cond | YrBuilt | 1stFlrSF | 2ndFlrSF | Fireplaces | YrSold | SalePrice |
|----|---------|--------------|----------|--------|------|---------|----------|----------|------------|--------|-----------|
| 1  | 8450    | CollgCr      | 1Fam     | 2Story | 5    | 2003    | 856      | 854      | 0          | 2008   | 208500    |
| 2  | 9600    | Veenker      | 1Fam     | 1Story | 8    | 1976    | 1262     | 0        | 1          | 2007   | 181500    |
| 3  | 11250   | CollgCr      | 1Fam     | 2Story | 5    | 2001    | 920      | 866      | 1          | 2008   | 223500    |
| 4  | 9550    | Crawfor      | 1Fam     | 2Story | 5    | 1915    | 961      | 756      | 1          | 2006   | 140000    |
| 5  | 14260   | NoRidge      | 1Fam     | 2Story | 5    | 2000    | 1145     | 1053     | 1          | 2008   | 250000    |
| 6  | 14115   | Mitchel      | 1Fam     | 1.5Fin | 5    | 1993    | 796      | 566      | 0          | 2009   | 143000    |
| 7  | 10084   | Somerst      | 1Fam     | 1Story | 5    | 2004    | 1694     | 0        | 1          | 2007   | 307000    |
| 8  | 10382   | NWAmes       | 1Fam     | 2Story | 6    | 1973    | 1107     | 983      | 2          | 2009   | 200000    |
| 9  | 6120    | OldTown      | 1Fam     | 1.5Fin | 5    | 1931    | 1022     | 752      | 2          | 2008   | 129900    |
| 10 | 7420    | BrkSide      | 2fmCon   | 1.5Unf | 6    | 1939    | 1077     | 0        | 2          | 2008   | 118000    |
| 11 | 11200   | Sawyer       | 1Fam     | 1Story | 5    | 1965    | 1040     | 0        | 0          | 2008   | 129500    |
| 12 | 11924   | NridgHt      | 1Fam     | 2Story | 5    | 2005    | 1182     | 1142     | 2          | 2006   | 345000    |
| 13 | 12968   | Sawyer       | 1Fam     | 1Story | 6    | 1962    | 912      | 0        | 0          | 2008   | 144000    |
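
Before handing the data to AutoML, it is worth loading it locally to confirm the column types and the target. Below is a minimal sketch with pandas, assuming the Kaggle training file has been downloaded to the working directory as train.csv:

```python
import pandas as pd

# Kaggle's "House Prices: Advanced Regression Techniques" training file,
# assumed to be downloaded locally as train.csv.
df = pd.read_csv("train.csv")

print(df.shape)  # typically (1460, 81): Id + 79 predictive features + SalePrice

# Separate numeric and categorical predictors; AutoML Tables infers these
# types automatically, but it pays to sanity-check them yourself.
features = df.drop(columns=["Id", "SalePrice"])
numeric = features.select_dtypes(include="number").columns
categorical = features.select_dtypes(exclude="number").columns
print(len(numeric), "numeric features;", len(categorical), "categorical features")

# The regression target.
print(df["SalePrice"].describe())
```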

Overview of Google AutoML Tables

Google AutoML Tables enables quick, high-accuracy training and subsequent hosting of ML models for this kind of problem. Users can import and visualize the data, train a model, evaluate it on a test set, iterate to improve model accuracy, and then host the best model for online or offline predictions. All of this functionality is available as a service, without any ML expertise, hardware, or software installation required of users. AutoML Tables can train both regression and classification models, depending on the type of column we are trying to predict.
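
To make that workflow concrete, here is a minimal sketch using the legacy `automl_v1beta1.TablesClient` from the `google-cloud-automl` Python package. The project ID, bucket path, and training budget are placeholders, and the Tables API has since been folded into Vertex AI, so treat the exact method names as indicative rather than canonical.

```python
from google.cloud import automl_v1beta1 as automl

# Placeholder project/region; substitute your own GCP values.
client = automl.TablesClient(project="my-gcp-project", region="us-central1")

# 1. Create a dataset and import the Kaggle CSV from Cloud Storage.
dataset = client.create_dataset(dataset_display_name="ames_housing")
client.import_data(
    dataset=dataset,
    gcs_input_uris="gs://my-bucket/ames/train.csv",  # hypothetical path
).result()  # block until the import finishes

# 2. Point AutoML at the column to predict; a numeric target
#    makes this a regression problem.
client.set_target_column(dataset=dataset, column_spec_display_name="SalePrice")

# 3. Train with a fixed budget (1000 milli node hours = 1 node hour).
model = client.create_model(
    model_display_name="ames_price_model",
    dataset=dataset,
    train_budget_milli_node_hours=1000,
).result()

# 4. Online prediction for a single house, keyed by column name
#    (only a few of the 79 features shown for brevity).
response = client.predict(
    model=model,
    inputs={"LotArea": 9600, "Neighborhood": "Veenker", "YrBuilt": 1976},
)
print(response.payload[0].tables.value.number_value)  # predicted SalePrice
```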

Make Crucial Predictions as Data Comes In

Flink: As Fast as a Squirrel

Walk down the hottest IT streets these days and you will likely hear about streaming machine learning, i.e., moving AI toward streaming scenarios and exploiting real-time capabilities together with new artificial intelligence techniques. You will also notice the lack of research on this topic, despite the growing interest in it.

If we investigate a little deeper, we realize that a step is missing: today's well-known streaming applications still do not handle the concept of model serving properly, and industry still leans on the lambda architecture to achieve this goal. Suppose a bank has a frequently updated, batch-trained machine learning model (e.g., an optimized gradient-descent model trained on past buffer-overflow attack attempts) and wants to deploy that model directly into its own canary environment.
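
To illustrate what serving a model inside the stream (rather than behind a lambda architecture) might look like, here is a minimal PyFlink sketch: a map operator loads a scorer once in open() and then scores every event as it arrives. The linear weights and threshold below are hypothetical stand-ins for the bank's batch-trained model, which in practice would be deserialized from a PMML or pickle artifact.

```python
from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.functions import MapFunction, RuntimeContext


class ScoreEvents(MapFunction):
    """Scores each incoming event with a model loaded once at startup."""

    def open(self, runtime_context: RuntimeContext):
        # Hypothetical stand-in for the bank's batch-trained model;
        # in practice, deserialize the trained artifact here so the
        # model is loaded once per task, not once per event.
        self.weights = [0.8, 0.2]

    def map(self, event):
        payload_len, repeat_ratio = event
        # Toy linear score standing in for the real classifier.
        score = self.weights[0] * payload_len + self.weights[1] * repeat_ratio
        return (payload_len, repeat_ratio, score > 0.5)


env = StreamExecutionEnvironment.get_execution_environment()

# A bounded toy source; in production this would be e.g. a Kafka topic
# of normalized request features.
events = env.from_collection(
    [(0.9, 0.7), (0.1, 0.2), (0.6, 0.9)],
    type_info=Types.TUPLE([Types.FLOAT(), Types.FLOAT()]),
)

events.map(
    ScoreEvents(),
    output_type=Types.TUPLE([Types.FLOAT(), Types.FLOAT(), Types.BOOLEAN()]),
).print()

env.execute("streaming_model_serving_sketch")
```

Because the model lives inside the operator, predictions happen as data comes in, with no round trip to an external serving layer; updating the model then becomes a matter of redeploying or hot-swapping the artifact the operator loads.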