What | When |
---|---|
Linear regression for data science | Week 4 |
Classification | Week 5 |
Interactive visualizations with R Shiny | Week 6 |
Tree-based methods | Week 7 |
Introduction to text mining | Week 8 |
Supervised learning: regression and classification
Image from https://intellipaat.com/
Many supervised learning problems concern categorical outcomes:
Classification: predict to which category an observation belongs (qualitative outcomes)
Generalization: How well does a learned model generalize from the data it was trained on to a new test set?
[Figure: kNN decision boundaries for \(K\) = 3 and \(K\) = 7]
Given the memorized training data and a new data point (test observation): find the \(K\) training observations nearest to the test observation, and predict the majority class among those \(K\) neighbours.
Apply kNN methods with k = 1, 3 and 5 to the data points below and find the category of the test observation represented by (?) for each classifier.
Iris is a famous dataset containing three species of flowers, along with various measurements of each flower such as sepal length and sepal width.
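As a minimal sketch, kNN can be applied to the iris data with the `class` package (the package choice and the train/test split below are our assumptions, not part of the slides):

```r
# kNN on iris using the `class` package
library(class)

set.seed(45)
train_idx <- sample(nrow(iris), 100)     # 100 random training rows
train_x   <- iris[train_idx, 1:4]        # the four numeric features
test_x    <- iris[-train_idx, 1:4]

# Classify each test flower by majority vote of its 3 nearest neighbours
pred <- knn(train = train_x, test = test_x,
            cl = iris$Species[train_idx], k = 3)
table(predicted = pred, observed = iris$Species[-train_idx])
```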
(Example by Andrew Ng)
Classification: \(y = 0\) or \(y = 1\)
Linear regression: predictions can be \(< 0\) or \(> 1\)
Logistic regression: predictions are always between 0 and 1
Solution: use the logistic function
Why can linear regression not be used on this type of data?
This results in the following logistic function: \(Pr(Y = 1|X) = \frac{e^{\beta_0 + \beta_1X_1 + ...}}{1 + e^{\beta_0 + \beta_1X_1 + ...}}\)
Rearranging gives the log-odds (logit) form: \(\log\left(\frac{Pr(Y = 1|X)}{1 - Pr(Y = 1|X)}\right) = \beta_0 + \beta_1X_1 + ...\)
Hence, when using logistic regression, we are modelling the log of the odds. Odds are a way of quantifying the probability of an event \(E\): \(odds(E) = \frac{Pr(E)}{1 - Pr(E)}\). For example, if \(Pr(E) = 0.75\), the odds are \(0.75 / 0.25 = 3\) (three to one).
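A quick numerical illustration of the logistic function in R (the helper name `logistic` is ours):

```r
# The logistic (inverse-logit) function maps any real number into (0, 1)
logistic <- function(x) exp(x) / (1 + exp(x))

logistic(c(-5, 0, 5))
## [1] 0.006692851 0.500000000 0.993307149
```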
Another example: the game Lingo has 44 balls: 36 blue, 6 red and 2 green. The odds of drawing a blue ball are \(\frac{36}{8} = 4.5\), while the probability is \(\frac{36}{44} \approx 0.82\).
##                                            Name PClass   Age    Sex Survived
## 1                  Allen, Miss Elisabeth Walton    1st 29.00 female        1
## 2                   Allison, Miss Helen Loraine    1st  2.00 female        0
## 3           Allison, Mr Hudson Joshua Creighton    1st 30.00   male        0
## 4 Allison, Mrs Hudson JC (Bessie Waldo Daniels)    1st 25.00 female        0
## 5                 Allison, Master Hudson Trevor    1st  0.92   male        1
## 6                            Anderson, Mr Harry    1st 47.00   male        1
log_mod_titanic <- glm(Survived ~ PClass + Sex + Age, data = titanic, family="binomial")
 | Estimate | Std. Error | z value | Pr(>\|z\|) |
---|---|---|---|---|
(Intercept) | 3.760 | 0.398 | 9.457 | 0 |
PClass2nd | -1.292 | 0.260 | -4.968 | 0 |
PClass3rd | -2.521 | 0.277 | -9.114 | 0 |
Sexmale | -2.631 | 0.202 | -13.058 | 0 |
Age | -0.039 | 0.008 | -5.144 | 0 |
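Since the estimates are on the log-odds scale, exponentiating them gives odds ratios. A minimal sketch, assuming the `log_mod_titanic` model fitted above:

```r
# Coefficients are log-odds; exponentiate to obtain odds ratios
exp(coef(log_mod_titanic))

# e.g. exp(-2.631) ≈ 0.072: holding class and age constant, the odds of
# survival for a male are about 7% of the odds for a female
```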
The survival probability for any passenger profile can be computed from the fitted model (with `predict()` in R). The probability for a 30 year old female from 1st class to survive is:
\(Pr(Survival = yes | 1^{st} class, female, 30 years) = \frac{e^{3.760 - 0.039 * 30}}{1 + e^{3.760 - 0.039 * 30}} = 0.93\)
The probability for a 45 year old male from 3rd class to survive is only:
\(Pr(Survival = yes | 3^{rd} class, male, 45 years) =\) \(\frac{e^{3.760 - 2.521 * 1 - 2.631 * 1 - 0.039 * 45}}{1 + e^{3.760 - 2.521 * 1 - 2.631 * 1 - 0.039 * 45}} = 0.04\)
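These hand calculations can be reproduced with `predict()`. A sketch, assuming the `titanic` columns shown earlier; `new_passengers` is a hypothetical data frame we construct here:

```r
# Survival probabilities for two hypothetical passengers
new_passengers <- data.frame(PClass = c("1st", "3rd"),
                             Sex    = c("female", "male"),
                             Age    = c(30, 45))
predict(log_mod_titanic, newdata = new_passengers, type = "response")
## should give roughly 0.93 and 0.04, matching the calculations above

# Predicted probabilities for all passengers (used in the tables below)
p_ped <- predict(log_mod_titanic, type = "response")
```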
When applying classifiers, we have new options to evaluate how well a classifier is doing besides model fit:
You have trained a model on your training data and you now want to check the performance of the model on the validation set.
In the case of a binary outcome (e.g., survived yes or no), we either classify correctly or make one of two kinds of mistakes:
Survived (predicted) | Not survived (observed) | Survived (observed) |
---|---|---|
No | 372 (TN) | 91 (FN) |
Yes | 71 (FP) | 222 (TP) |
Survived (predicted) | Not survived (observed) | Survived (observed) |
---|---|---|
No | 0.84 (Specificity) | 0.29 (1 - Sensitivity) |
Yes | 0.16 (1 - Specificity) | 0.71 (Sensitivity) |
Accuracy (ACC) measures the percentage of overall correct predictions: \(\frac{TP + TN}{TP + FP + TN + FN} \approx 0.79\); the error rate is 1 - accuracy \(\approx 0.21\).
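These rates follow directly from the cell counts. A minimal R sketch using the counts from the confusion matrix above:

```r
# Cell counts from the confusion matrix above
TN <- 372; FN <- 91; FP <- 71; TP <- 222

(TP + TN) / (TP + TN + FP + FN)  # accuracy    ≈ 0.79
TP / (TP + FN)                   # sensitivity ≈ 0.71
TN / (TN + FP)                   # specificity ≈ 0.84
```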
Conditioning on the predicted class instead gives the negative and positive predictive values (NPV, PPV):

Survived (predicted) | Not survived (observed) | Survived (observed) |
---|---|---|
No | 0.80 (NPV) | 0.20 (1 - NPV) |
Yes | 0.24 (1 - PPV) | 0.76 (PPV) |
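A minimal check of these values from the counts defined earlier:

```r
TP / (TP + FP)  # PPV ≈ 0.76: predicted survivors who actually survived
TN / (TN + FN)  # NPV ≈ 0.80: predicted non-survivors who actually died
```

The 0/1 predictions depend on the probability cutoff; changing it trades sensitivity against specificity, as the tables below show.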
with(titanic, table(p_ped > 0.4, Survived))
##        Survived
##           0   1
##   FALSE 346  63
##   TRUE   97 250
with(titanic, table(p_ped > 0.6, Survived))
##        Survived
##           0   1
##   FALSE 401 114
##   TRUE   42 199
At an extreme cutoff, every passenger is predicted to survive:

Survived (predicted) | Not survived (observed) | Survived (observed) |
---|---|---|
No | 0 | 0 |
Yes | 443 | 313 |

Sensitivity is then perfect (\(313/313 = 1\)) but specificity is 0, and accuracy drops to \(313/756 \approx 0.41\): no single measure tells the whole story.
Classification: predict to which category an observation belongs (qualitative outcomes). When predicting categorical outcomes (= classification), performance can be evaluated with a confusion matrix and measures such as accuracy, sensitivity, specificity, PPV and NPV.
Lab session on Thursday.
Next week: Interactive visualizations with R Shiny
Have a nice day!
Generative classifiers try to model the data. Discriminative classifiers try to predict the label.