Foundations of Machine Learning: Part 5

This post is the ninth (and probably last) one of our series on the history and foundations of econometric and machine learning models. The first four posts were on econometric techniques. Part 8 is online here.

Optimization and Algorithmic Aspects

In econometrics, (numerical) optimization became omnipresent as soon as we left the Gaussian model. We briefly mentioned it in the section on the exponential family, with the use of Fisher's score (a gradient-based algorithm) to solve the first-order condition ∂log L(θ)/∂θ = 0 of maximum likelihood.
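To make this concrete, here is a minimal sketch (in Python, purely illustrative and not taken from the post) of Fisher scoring applied to a Poisson regression: the algorithm iterates Newton-type updates of the coefficients until the score, i.e. the first-order condition, is numerically zero. The function name and the simulated data are my own assumptions.

```python
# Minimal sketch of Fisher scoring for a Poisson regression with log link,
# i.e. solving the first-order condition X'(y - exp(X beta)) = 0 iteratively.
import numpy as np

def fisher_scoring_poisson(X, y, n_iter=25, tol=1e-8):
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ beta)                    # fitted means
        score = X.T @ (y - mu)                   # gradient of the log-likelihood
        fisher_info = X.T @ (mu[:, None] * X)    # expected information X' W X
        step = np.linalg.solve(fisher_info, score)
        beta = beta + step                       # Newton / Fisher-scoring update
        if np.max(np.abs(step)) < tol:
            break
    return beta

# illustration on simulated data: estimates should be close to (0.5, 0.3)
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(1000), rng.normal(size=1000)])
y = rng.poisson(np.exp(X @ np.array([0.5, 0.3])))
print(fisher_scoring_poisson(X, y))
```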

Foundations of Machine Learning: Part 1

This post is the fifth one of our series on the history and foundations of econometric and machine learning models. The first four were on econometric techniques. Part 4 is online here.

In parallel with these tools developed by and for economists, a whole literature has developed around similar issues, centered on the problems of prediction and forecasting. For Breiman (2001a), the first difference comes from the fact that statistics has developed around the principle of inference (that is, explaining the relationship linking y to the variables x), while another culture is primarily interested in prediction. In a discussion that follows the article, David Cox states very clearly that in statistics (and econometrics) "predictive success ... is not the primary basis for model choice". We will return here to the roots of machine learning techniques. The important point, as we will see, is that the main concern of machine learning lies in the generalization properties of a model, i.e. its performance - according to a criterion chosen a priori - on new data, and therefore on out-of-sample tests.
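As a minimal illustration of that out-of-sample criterion (my own sketch, assuming scikit-learn is available; the data and model are arbitrary), the model is judged by its error on data held out of the estimation sample, not by its in-sample fit:

```python
# Sketch: split the data, fit on the training sample, evaluate a criterion
# chosen a priori (here the MSE) on observations the model has never seen.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print("in-sample MSE    :", mean_squared_error(y_train, model.predict(X_train)))
print("out-of-sample MSE:", mean_squared_error(y_test, model.predict(X_test)))
```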

Probabilistic Foundations of Econometrics: Part 1

In a series of posts, I wanted to get into details of the history and foundations of econometric and machine learning models. It will be some sort of online version of our joint paper with Emmanuel Flachaire and Antoine Ly, Econometrics and Machine Learning (initially written in French), that will actually appear soon in the journal Economics and Statistics. This is the first one...

The importance of probabilistic models in economics is rooted in Working's (1927) questions and the attempts to answer them in Tinbergen's two volumes (1939). The latter has subsequently generated a great deal of work, as recalled by Duo (1993) in his book on the foundations of econometrics and more particularly in the first chapter "The Probability Foundations of Econometrics."

The Variance of the Slope in a Regression Model

In my "applied linear models" exam, there was a tricky question (it was a multiple choice, so no details were asked). I was simply asking if the following statement was valid, or not

Consider a linear regression with a single covariate, y = β0 + β1x1 + ε, and the least-squares estimates. The variance of the slope estimate is Var[β̂1]. Do we decrease this variance if we add one more variable and consider y = β0 + β1x1 + β2x2 + ε?
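To see what is at stake, here is a small simulation sketch (not part of the exam; the parameter values are arbitrary) for one illustrative scenario: the added covariate x2 is correlated with x1 but has a zero true coefficient, and we compare the sampling variance of the slope on x1 across the two specifications.

```python
# In this scenario (x2 correlated with x1, true coefficient of x2 equal to 0),
# the sampling variance of the estimated slope on x1 increases when x2 is added.
import numpy as np

rng = np.random.default_rng(123)
n, n_sim, rho = 50, 5000, 0.8
b1_simple, b1_multiple = [], []

for _ in range(n_sim):
    x1 = rng.normal(size=n)
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)   # correlated covariate
    y = 1.0 + 2.0 * x1 + rng.normal(size=n)                    # x2 plays no role
    X_simple = np.column_stack([np.ones(n), x1])
    X_multiple = np.column_stack([np.ones(n), x1, x2])
    b1_simple.append(np.linalg.lstsq(X_simple, y, rcond=None)[0][1])
    b1_multiple.append(np.linalg.lstsq(X_multiple, y, rcond=None)[0][1])

print("Var[beta1], one covariate :", np.var(b1_simple))
print("Var[beta1], two covariates:", np.var(b1_multiple))      # larger, not smaller
```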

On the Poor Performance of Classifiers

Each time we have a case study in my actuarial courses (with real data), students are surprised to have a hard time getting a “good” model, and they are always surprised to get a low AUC when trying to model the probability of claiming a loss, of dying, of fraud, etc. And each time I keep saying, “yes, I know, and that’s what we expect, because there’s a lot of ‘randomness’ in insurance.” To be more specific, I decided to run some simulations and to compute AUCs to see what’s going on. And because I don’t want to waste time fitting models, we will assume each time that we have a perfect model. I want to show that the upper bound of the AUC is actually quite low! So it’s not a modeling issue, it is a fundamental issue in insurance!

By ‘perfect model’ I mean the following: Ω denotes the heterogeneity factor, because people are different. We would love to get P[Y=1∣Ω]. Unfortunately, Ω is unobservable! So we use covariates (like the age of the driver of the car in motor insurance, or of the policyholder in life insurance, etc.). Thus, we have data (yi, xi)‘s and we use them to train a model, in order to approximate P[Y=1∣X]. And then we check if our model is good (or not) using the ROC curve obtained from confusion matrices, comparing the yi‘s and the ŷi‘s, where ŷi = 1 when P[Yi=1∣xi] exceeds a given threshold. Here, I will not try to construct models. I will predict ŷi = 1 each time the true underlying probability P[Yi=1∣ωi] exceeds a threshold! The point is that it’s possible to claim a loss (y=1) even if the probability is 3% (and most of the time ŷ = 0), and to not claim one (y=0) even if the probability is 97% (and most of the time ŷ = 1). That’s the idea with randomness, right?
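Here is a minimal sketch of that simulation (my own reconstruction, assuming scikit-learn for the AUC computation; the Beta distribution of the individual probabilities and the sample size are arbitrary choices): outcomes are drawn as Bernoulli variables from heterogeneous true probabilities, and the ‘perfect model’ scores each observation with its true probability, which amounts to sweeping all possible thresholds when building the ROC curve.

```python
# Even the perfect model, which knows each individual's true probability,
# only reaches a modest AUC because the outcomes themselves are random.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
n = 100_000

p = rng.beta(a=2, b=18, size=n)    # heterogeneous true probabilities, low on average
y = rng.binomial(1, p)             # observed claims: pure randomness given p

print("AUC of the perfect model:", roc_auc_score(y, p))
```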