How to Collect Big Data Sets From Twitter

In this post, you’ll learn how to collect data from Twitter, one of the biggest sources of big data sets.

You’ll also need to set up a Hadoop cluster with HDFS to store the multi-format data you gather from Twitter. Though we’ll focus on a single platform here, you can achieve greater accuracy if you collect data from other channels as well.
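As a concrete starting point, here is a minimal sketch of one way to pull tweets into R and stage them for HDFS. The rtweet package, the search keyword, the column names, and the HDFS path are all illustrative assumptions (the post itself does not prescribe a tool), and you would need your own Twitter API credentials configured first.

```r
# A minimal sketch, assuming the rtweet package and Twitter API
# credentials already set up (both are assumptions for illustration).
library(rtweet)

# Collect up to 1,000 recent tweets matching an example keyword
tweets <- search_tweets("big data", n = 1000, include_rts = FALSE)

# Keep a few atomic columns (names may differ across rtweet versions)
# and save locally so the file can be pushed into HDFS
write.csv(tweets[, c("created_at", "screen_name", "text")],
          "tweets.csv", row.names = FALSE)

# Load the file into HDFS (assumes a running Hadoop cluster on this host)
system("hdfs dfs -put tweets.csv /data/twitter/tweets.csv")
```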

The Variance of the Slope in a Regression Model

In my "applied linear models" exam, there was a tricky question (it was a multiple choice, so no details were asked). I was simply asking if the following statement was valid, or not

Consider a linear regression with a single covariate, y = β0 + β1x1 + ε, and the least-squares estimates. The variance of the slope is Var[β̂1]. Do we decrease this variance if we add one more variable and consider y = β0 + β1x1 + β2x2 + ε?
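The post only poses the question, but a quick simulation sketch (my own illustration, with made-up coefficients) shows why it is tricky: the answer depends on whether the added variable genuinely belongs in the model.

```r
# Compare Var[beta1-hat] with and without a second, correlated covariate,
# in two scenarios: x2 relevant vs. x2 pure noise.
set.seed(1)
n <- 100; nsim <- 5000
b_small <- b_big <- b_small0 <- b_big0 <- numeric(nsim)
for (s in 1:nsim) {
  x1 <- rnorm(n)
  x2 <- 0.7 * x1 + rnorm(n)                # x2 correlated with x1
  y_rel <- 1 + 2 * x1 + 3 * x2 + rnorm(n)  # x2 genuinely matters
  y_irr <- 1 + 2 * x1 + rnorm(n)           # x2 is irrelevant here
  b_small[s]  <- coef(lm(y_rel ~ x1))[2]
  b_big[s]    <- coef(lm(y_rel ~ x1 + x2))[2]
  b_small0[s] <- coef(lm(y_irr ~ x1))[2]
  b_big0[s]   <- coef(lm(y_irr ~ x1 + x2))[2]
}
c(var(b_small), var(b_big))    # variance drops: x2 soaks up residual noise
c(var(b_small0), var(b_big0))  # variance rises by roughly 1/(1 - r^2)
```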

Doing Residual Analysis Post Regression in R

Residuals are essentially the gaps left over when a given model, in this case linear regression, does not fit the observations completely.

A close analogy for residual analysis is found in medical pathology: what remains post-metabolism usually becomes an indicator of what was processed and absorbed versus what was not.
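To make that concrete, here is a minimal sketch of a basic residual analysis in base R, using the built-in mtcars data as a stand-in example (the post's own data set is not shown here).

```r
# Fit a simple linear regression and inspect its residuals
fit <- lm(mpg ~ wt, data = mtcars)

res <- residuals(fit)
summary(res)                      # residuals should be centred near zero

par(mfrow = c(1, 2))
plot(fitted(fit), res,            # look for patterns: ideally none appear
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
qqnorm(res); qqline(res)          # check approximate normality
```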

On the Poor Performance of Classifiers

Each time we have a case study with real data in my actuarial courses, students are surprised by how hard it is to get a “good” model, and they are always surprised by the low AUC when trying to model the probability of claiming a loss, of dying, of committing fraud, etc. And each time I keep saying, “yes, I know, and that’s what we expect, because there’s a lot of ‘randomness’ in insurance.” To be more specific, I decided to run some simulations and compute AUCs to see what’s going on. And because I don’t want to waste time fitting models, we will assume each time that we have a perfect model. I want to show that the upper bound of the AUC is actually quite low! So it’s not a modeling issue; it is a fundamental issue in insurance!

By ‘perfect model’ I mean the following: Ω denotes the heterogeneity factor (because people are different), and we would love to know P[Y=1∣Ω]. Unfortunately, Ω is unobservable! So we use covariates instead (like the age of the driver in motor insurance, or of the policyholder in life insurance, etc.). Thus, we have data (yi, xi)’s, and we use them to train a model that approximates P[Y=1∣X]. We then check whether the model is good (or not) using the ROC curve obtained from confusion matrices, comparing the yi’s and the ŷi’s, where ŷi = 1 when P[Yi = 1∣xi] exceeds a given threshold. Here, I will not try to construct models: I will predict ŷi = 1 each time the true underlying probability P[Yi = 1∣ωi] exceeds a threshold! The point is that it’s possible to claim a loss (y = 1) even if the probability is 3% (and most of the time ŷ = 0), and to not claim one (y = 0) even if the probability is 97% (and most of the time ŷ = 1). That’s the idea with randomness, right?
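Here is a small simulation sketch in that spirit (the Beta distribution for the individual probabilities is my own assumption): even when the score is the true probability P[Yi = 1∣ωi], the AUC lands well below 1.

```r
# The "model" scores each observation with its TRUE probability, yet the
# AUC stays far from 1 because the outcomes themselves are random.
set.seed(1)
n <- 1e5
p <- rbeta(n, 2, 20)   # true claim probabilities, mean around 9%
y <- rbinom(n, 1, p)   # actual outcomes, drawn from those probabilities

# AUC of the perfect score s = p, via the rank (Wilcoxon) formula
auc <- function(score, y) {
  r  <- rank(score)
  n1 <- sum(y == 1); n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
auc(p, y)   # typically around 0.7, nowhere near 1
```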

How to Use R for Conjoint Analysis

Conjoint analysis is a frequently used (and much needed) technique in market research.

To gauge the interest in, consumption of, and continued use of any given product or service, a market researcher must study the utility perceived by potential or current target consumers.
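As a minimal illustration of the mechanics, here is a sketch of rating-based conjoint analysis using only base R, where part-worth utilities are recovered by regressing profile ratings on attribute factors; the attributes, levels, and ratings below are made up for the example.

```r
# Build a full-factorial set of hypothetical product profiles
profiles <- expand.grid(
  price  = factor(c("low", "medium", "high"),
                  levels = c("low", "medium", "high")),
  brand  = factor(c("A", "B")),
  design = factor(c("classic", "modern"))
)

set.seed(1)
# Fake respondent ratings: price matters most, plus some noise
utility <- c(low = 2, medium = 1, high = 0)[as.character(profiles$price)] +
           (profiles$brand == "A") * 0.5 +
           (profiles$design == "modern") * 1
profiles$rating <- utility + rnorm(nrow(profiles), sd = 0.3)

# Part-worths are the coefficients of the attribute levels,
# relative to each attribute's baseline level
fit <- lm(rating ~ price + brand + design, data = profiles)
coef(fit)
```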