What | When |
---|---|
Linear regression for data science | Week 4 |
Classification | Week 5 |
Interactive visualizations with R Shiny | Week 6 |
Tree-based methods | Week 7 |
Introduction to text mining | Week 8 |
Supervised learning: regression and classification
Image from https://intellipaat.com/
Many supervised learning problems concern categorical outcomes:
Classification: predict to which category an observation belongs (qualitative outcomes)
Generalization: How well does a learned model generalize from the data it was trained on to a new test set?
[Figure: kNN decision boundaries for \(K\) = 3 and \(K\) = 7]
Given the memorized training data and a new data point (test observation): find the \(K\) training observations nearest to the test observation, and predict the majority class among those \(K\) neighbours.
Apply kNN methods with k = 1, 3 and 5 to the data points below and find the category of the test observation represented by (?) for each classifier.
Iris is a famous dataset containing three species of flowers, along with various measurements of each flower such as sepal length and sepal width.
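As a minimal sketch, kNN can be applied to the iris data with the `class` package (the package choice and the train/test split below are our assumptions, not part of the slides):

```r
# kNN on iris using the `class` package
library(class)

set.seed(45)
train_idx <- sample(nrow(iris), 100)     # 100 random training rows
train_x   <- iris[train_idx, 1:4]        # the four numeric features
test_x    <- iris[-train_idx, 1:4]

# Classify each test flower by majority vote of its 3 nearest neighbours
pred <- knn(train = train_x, test = test_x,
            cl = iris$Species[train_idx], k = 3)
table(predicted = pred, observed = iris$Species[-train_idx])
```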
(Example by Andrew Ng)
Classification: \(y = 0\) or \(y = 1\)
Linear regression: predictions can be \(< 0\) or \(> 1\)
Logistic regression: predictions are always between 0 and 1
Solution: use the logistic function
Why can linear regression not be used on this type of data?
This results in the following logistic function: \(Pr(Y = 1|X) = \frac{e^{\beta_0 + \beta_1X_1 + ...}}{1 + e^{\beta_0 + \beta_1X_1 + ...}}\)
Rearranging gives the log-odds (logit) form: \(\log\left(\frac{Pr(Y = 1|X)}{1 - Pr(Y = 1|X)}\right) = \beta_0 + \beta_1X_1 + ...\)
Hence, when using logistic regression, we are modelling the log of the odds. Odds are a way of quantifying the probability of an event \(E\): \(odds(E) = \frac{Pr(E)}{1 - Pr(E)}\). For example, if \(Pr(E) = 0.75\), the odds are \(0.75 / 0.25 = 3\) (three to one).
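A quick numerical illustration of the logistic function in R (the helper name `logistic` is ours):

```r
# The logistic (inverse-logit) function maps any real number into (0, 1)
logistic <- function(x) exp(x) / (1 + exp(x))

logistic(c(-5, 0, 5))
## [1] 0.006692851 0.500000000 0.993307149
```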
Another example: the game Lingo has 44 balls: 36 blue, 6 red and 2 green. The odds of drawing a blue ball are \(\frac{36}{8} = 4.5\), while the probability is \(\frac{36}{44} \approx 0.82\).
##                                            Name PClass   Age    Sex Survived
## 1                  Allen, Miss Elisabeth Walton    1st 29.00 female        1
## 2                   Allison, Miss Helen Loraine    1st  2.00 female        0
## 3           Allison, Mr Hudson Joshua Creighton    1st 30.00   male        0
## 4 Allison, Mrs Hudson JC (Bessie Waldo Daniels)    1st 25.00 female        0
## 5                 Allison, Master Hudson Trevor    1st  0.92   male        1
## 6                            Anderson, Mr Harry    1st 47.00   male        1
log_mod_titanic <- glm(Survived ~ PClass + Sex + Age, data = titanic, family="binomial")
 | Estimate | Std. Error | z value | Pr(>\|z\|) |
---|---|---|---|---|
(Intercept) | 3.760 | 0.398 | 9.457 | 0 |
PClass2nd | -1.292 | 0.260 | -4.968 | 0 |
PClass3rd | -2.521 | 0.277 | -9.114 | 0 |
Sexmale | -2.631 | 0.202 | -13.058 | 0 |
Age | -0.039 | 0.008 | -5.144 | 0 |
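Since the estimates are on the log-odds scale, exponentiating them gives odds ratios. A minimal sketch, assuming the `log_mod_titanic` model fitted above:

```r
# Coefficients are log-odds; exponentiate to obtain odds ratios
exp(coef(log_mod_titanic))

# e.g. exp(-2.631) ≈ 0.072: holding class and age constant, the odds of
# survival for a male are about 7% of the odds for a female
```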
The survival probability for any passenger profile can be computed from the fitted model (with `predict()` in R). The probability for a 30 year old female from 1st class to survive is:
\(Pr(Survival = yes | 1^{st} class, female, 30 years) = \frac{e^{3.760 - 0.039 * 30}}{1 + e^{3.760 - 0.039 * 30}} = 0.93\)
The probability for a 45 year old male from 3rd class to survive is only:
\(Pr(Survival = yes | 3^{rd} class, male, 45 years) =\) \(\frac{e^{3.760 - 2.521 * 1 - 2.631 * 1 - 0.039 * 45}}{1 + e^{3.760 - 2.521 * 1 - 2.631 * 1 - 0.039 * 45}} = 0.04\)
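These hand calculations can be reproduced with `predict()`. A sketch, assuming the `titanic` columns shown earlier; `new_passengers` is a hypothetical data frame we construct here:

```r
# Survival probabilities for two hypothetical passengers
new_passengers <- data.frame(PClass = c("1st", "3rd"),
                             Sex    = c("female", "male"),
                             Age    = c(30, 45))
predict(log_mod_titanic, newdata = new_passengers, type = "response")
## should give roughly 0.93 and 0.04, matching the calculations above

# Predicted probabilities for all passengers (used in the tables below)
p_ped <- predict(log_mod_titanic, type = "response")
```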
When applying classifiers, we have new options to evaluate how well a classifier is doing besides model fit:
You have trained a model on your training data and you now want to check the performance of the model on the validation set.
In the case of a binary outcome (e.g., survived yes or no), we either classify correctly or make one of two kinds of mistakes:
Survived (predicted) | Not survived (observed) | Survived (observed) |
---|---|---|
No | 372 (TN) | 91 (FN) |
Yes | 71 (FP) | 222 (TP) |
Survived (predicted) | Not survived (observed) | Survived (observed) |
---|---|---|
No | 0.84 (Specificity) | 0.29 (1 - Sensitivity) |
Yes | 0.16 (1 - Specificity) | 0.71 (Sensitivity) |
Accuracy (ACC) measures the percentage of overall correct predictions: \(\frac{TP + TN}{TP + FP + TN + FN} \approx 0.79\); the error rate is 1 - accuracy \(\approx 0.21\).
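These rates follow directly from the cell counts. A minimal R sketch using the counts from the confusion matrix above:

```r
# Cell counts from the confusion matrix above
TN <- 372; FN <- 91; FP <- 71; TP <- 222

(TP + TN) / (TP + TN + FP + FN)  # accuracy    ≈ 0.79
TP / (TP + FN)                   # sensitivity ≈ 0.71
TN / (TN + FP)                   # specificity ≈ 0.84
```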
Conditioning on the predicted class instead gives the negative and positive predictive values (NPV, PPV):

Survived (predicted) | Not survived (observed) | Survived (observed) |
---|---|---|
No | 0.80 (NPV) | 0.20 (1 - NPV) |
Yes | 0.24 (1 - PPV) | 0.76 (PPV) |
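A minimal check of these values from the counts defined earlier:

```r
TP / (TP + FP)  # PPV ≈ 0.76: predicted survivors who actually survived
TN / (TN + FN)  # NPV ≈ 0.80: predicted non-survivors who actually died
```

The 0/1 predictions depend on the probability cutoff; changing it trades sensitivity against specificity, as the tables below show.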
with(titanic, table(p_ped > 0.4, Survived))
##        Survived
##           0   1
##   FALSE 346  63
##   TRUE   97 250
with(titanic, table(p_ped > 0.6, Survived))
##        Survived
##           0   1
##   FALSE 401 114
##   TRUE   42 199
At an extreme cutoff, every passenger is predicted to survive:

Survived (predicted) | Not survived (observed) | Survived (observed) |
---|---|---|
No | 0 | 0 |
Yes | 443 | 313 |

Sensitivity is then perfect (\(313/313 = 1\)) but specificity is 0, and accuracy drops to \(313/756 \approx 0.41\): no single measure tells the whole story.
Classification: predict to which category an observation belongs (qualitative outcomes). When predicting categorical outcomes (= classification), performance can be evaluated with a confusion matrix and measures such as accuracy, sensitivity, specificity, PPV and NPV.
Lab session on Thursday.
Next week: Interactive visualizations with R Shiny
Have a nice day!
Generative classifiers try to model the data. Discriminative classifiers try to predict the label.