Introduction
The Receiver Operating Characteristic (ROC) curve is a widely used way to measure the performance of a binary classifier. A binary classifier assigns each observation to one of two classes. For example, we might classify a text review as positive or negative, rather than rating it on a scale of 1 to 5 stars. To understand the graph, we first need to understand the standard Bayesian binary classifier. We will then explore the True Positive Rate and the False Positive Rate, and finally we will be able to understand the ROC curve and what it represents.
Bayesian Binary Classifier
Assume that our target variable Y takes only two values (the binary case). Without loss of generality, we can assume Y is either 0 or 1. Then, for any observation x, the two class probabilities sum to 1:
P[Y=0 | x] + P[Y=1 | x] = 1
Now let's assume that, given an observation x, we have P[Y=1 | x] = 1/2. This means we have an equal chance of being correct whether we classify the observation as class 0 or class 1.
But let's say we now have P[Y=1 | x] = 5/7. Then of course we will classify the observation as class 1. The higher the probability that an observation belongs to a class according to our classifier, the more inclined we are to assign it to that class. In fact, it can be proven that the threshold that minimizes the probability of misclassification is 1/2. In other words, we classify an observation as class 1 if P[Y=1 | x] > 1/2.
This is in fact how a logistic regression classifier typically makes its predictions: it estimates P[Y=1 | x] and compares it with 1/2.
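The decision rule above can be sketched in a few lines of Python. This is only an illustration: the function names `bayes_classify` and `expected_error` are made up for this post, and the probability 5/7 is the one from the example above.

```python
def bayes_classify(p_y1_given_x, threshold=0.5):
    """Return class 1 if P[Y=1 | x] is strictly above the threshold, else class 0."""
    return 1 if p_y1_given_x > threshold else 0


def expected_error(p_y1_given_x, predicted_class):
    """Probability of being wrong when we predict `predicted_class` for x."""
    # Predicting 1 is wrong with probability P[Y=0 | x] = 1 - P[Y=1 | x].
    # Predicting 0 is wrong with probability P[Y=1 | x].
    return 1 - p_y1_given_x if predicted_class == 1 else p_y1_given_x


p = 5 / 7
label = bayes_classify(p)
print("predicted class:", label)
print("error if we predict 1:", expected_error(p, 1))
print("error if we predict 0:", expected_error(p, 0))
```

With P[Y=1 | x] = 5/7, predicting class 1 is wrong with probability 2/7, while predicting class 0 is wrong with probability 5/7, which is why the rule picks class 1.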
TPR vs FPR
Now we look at the basic idea of the True Positive Rate and the False Positive Rate. Imagine that we want to classify an email as either an attack or harmless.
The fact is that most emails are not attacks! Suppose that in our dataset only 0.1% of the entries are actual attacks.
How would we measure the performance of any classifier we build from this dataset?
If we ONLY measure the number of attacks we classify correctly, then the best classifier is the one that always classifies an email as an attack, even though it raises a false alarm on every harmless email. This is where the True Positive Rate (TPR) and the False Positive Rate (FPR) come in handy.
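To make the problem concrete, here is a small sketch of the "always attack" classifier. The dataset size of 10,000 emails is an assumption chosen for illustration, which gives 10 attacks at a rate of 0.1%.

```python
n_emails = 10_000   # assumed dataset size, for illustration only
n_attacks = 10      # 0.1% of the emails are attacks

# 1 = attack, 0 = harmless
y_true = [1] * n_attacks + [0] * (n_emails - n_attacks)

# A "classifier" that flags every single email as an attack.
y_pred = [1] * n_emails

caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
false_alarms = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)

print("attacks caught:", caught)        # 10
print("false alarms:", false_alarms)    # 9990
```

Counting only the attacks caught, this classifier looks perfect, yet it wrongly flags all 9,990 harmless emails.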
Assume we have the following confusion matrix for the results of our attack classifier:
| Actual\Predicted | Attack | Harmless | TOTAL |
|---|---|---|---|
| Attack | 7 (TP) | 3 (FN) | 10 |
| Harmless | 456 (FP) | 9534 (TN) | 9990 |
| TOTAL | 463 | 9537 | 10000 |
- TP = True Positive
- FN = False Negative
- FP = False Positive
- TN = True Negative
The TPR is defined as follows:
TPR = TP / (TP + FN)
In our case, it is 7 / (7 + 3) = 70%.
Now let's look at the FPR:
FPR = FP / (FP + TN)
In our case, it is 456 / (456 + 9534) ≈ 4.6%.
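The same arithmetic can be checked with a short Python sketch. The helper name `rates` is hypothetical; the counts are the ones from the confusion matrix above.

```python
def rates(tp, fn, fp, tn):
    """Return (TPR, FPR) computed from the four confusion-matrix counts."""
    tpr = tp / (tp + fn)   # share of actual attacks that we caught
    fpr = fp / (fp + tn)   # share of harmless emails that we flagged
    return tpr, fpr


tpr, fpr = rates(tp=7, fn=3, fp=456, tn=9534)
print(f"TPR = {tpr:.1%}, FPR = {fpr:.1%}")   # TPR = 70.0%, FPR = 4.6%
```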
The ROC
Now that we better understand the FPR, the TPR, and the Bayesian classifier, we can put it all together in a nice curve.
Let's first create a plot with the FPR on the horizontal axis and the TPR on the vertical axis, and add the previous example as a single point.
One thing to keep in mind is that, from our Bayesian explanation, the underlying method relies on a certain threshold. Now let's vary this threshold over a range of values from 0 to 1, computing the FPR and TPR at each one.
Looks familiar, no? This time let's draw a line through the scatter plot, and there we have the famous ROC curve!
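The threshold sweep can be sketched as follows. The labels and scores here are made-up toy values, not the email dataset from this post, and `roc_points` is a hypothetical helper name. Each score stands in for the classifier's estimate of P[Y=1 | x].

```python
def roc_points(y_true, scores, thresholds):
    """Return one (FPR, TPR) pair per threshold."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = []
    for t in thresholds:
        # Predict class 1 when the score is strictly above the threshold.
        tp = sum(1 for y, s in zip(y_true, scores) if s > t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s > t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Toy data: 1 = positive class, 0 = negative class.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]

thresholds = [i / 10 for i in range(11)]   # 0.0, 0.1, ..., 1.0
points = roc_points(y_true, scores, thresholds)

for t, (fpr, tpr) in zip(thresholds, points):
    print(f"threshold={t:.1f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

At a threshold of 0 everything is flagged as positive, so the point is (1, 1); at a threshold of 1 nothing is flagged, so the point is (0, 0). The ROC curve connects the points in between.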
AUC
Now that we understand how the ROC curve is created, it is important to understand how to compare one ROC curve with another, since this amounts to comparing two models and deciding which one works better. The notion to keep in mind is that we want a model with the lowest possible FPR and the highest possible TPR, that is, a curve that stays close to the top-left corner of the plot. This corresponds to wanting the area under the ROC curve to be as close as possible to 1. This area under the curve is known as the AUC.
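One simple way to approximate the area is the trapezoidal rule applied to the (FPR, TPR) points. The sketch below is an illustration under that assumption; the helper name `auc_trapezoid` and the three sets of points are made up for this example.

```python
def auc_trapezoid(points):
    """Approximate the area under a curve given as (FPR, TPR) points."""
    pts = sorted(points)   # order by FPR, then by TPR
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        # Area of the trapezoid between two neighbouring points.
        area += (x1 - x0) * (y0 + y1) / 2
    return area


random_guess = [(0.0, 0.0), (1.0, 1.0)]              # the diagonal
perfect = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]       # hugs the top-left corner
example = [(0.0, 0.0), (0.25, 0.5), (0.5, 0.75), (1.0, 1.0)]

print(auc_trapezoid(random_guess))   # 0.5
print(auc_trapezoid(perfect))        # 1.0
print(auc_trapezoid(example))        # 0.65625
```

A classifier that guesses at random sits on the diagonal with an AUC of 0.5, while a perfect classifier reaches an AUC of 1.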
Conclusion
Main points:
- The ROC curve is a way to measure the performance of a binary classifier
- The ROC curve plots the TPR against the FPR at different probability thresholds
- The area under the curve (AUC) is a good way to compare models (the closer to 1, the better)
Let me know if you have ideas about what I should write next!