In a previous blog post, I shared some ideas on why it is meaningless to aim for 100% accuracy on a classification task, and how one has to establish a baseline and a ceiling and tune a classifier to work as well as it can within those boundaries. To recap, a classification task involves deciding which of a set of categories or labels should be assigned to some data, according to some properties of the data. For example, the spam filter your email service provides assigns a spam or no spam status to every email. If there are only 2 possible labels (like spam or no spam), we are talking about binary classification. To make things easier, we will refer to the two labels as the positive label and the negative label.
We talked about the idea of accuracy before, but did not actually define what we mean by it. Intuitively it is easy, of course: we mean the proportion of correct results that a classifier achieved. If a classifier correctly guessed the label of half of the examples in a data set, then we say its accuracy was 50%. It seems obvious that the better the accuracy, the better and more useful a classifier is. But is it so?
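Let's delve into the possible classification cases. Either the classifier got a positive example labeled as positive, or it made a mistake and marked it as negative. Conversely, a negative example may have been (mis)labeled as positive, or correctly guessed negative. So we define the following metrics:
- True Positives (TP): positive examples correctly labeled as positive.
- False Positives (FP): negative examples incorrectly labeled as positive.
- True Negatives (TN): negative examples correctly labeled as negative.
- False Negatives (FN): positive examples incorrectly labeled as negative.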
Let's look at an example:
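| # | Correct label | Classifier's label |
|---|---|---|
| 1 | positive | positive |
| 2 | positive | negative |
| 3 | negative | positive |
| 4 | positive | positive |
| 5 | negative | negative |
| 6 | positive | negative |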
In this case, TP = 2 (#1 and #4), FP = 1 (#3), TN = 1 (#5) and FN = 2 (#2 and #6).
With this in mind, we can define accuracy as follows:
accuracy = (TP + TN)/(TP + TN + FP + FN)
So in our classification example above, accuracy is (2 + 1)/(2 + 1 + 1 + 2) = 0.5 which is what we expected, since we got 3 right out of 6.
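As a minimal sketch, the counting above can be reproduced in a few lines of Python. The list encodes the six examples exactly as the text describes them (#1 and #4 are TP, #3 is FP, #5 is TN, #2 and #6 are FN); the helper names `confusion_counts` and `accuracy` are made up for this illustration.

```python
# Each pair is (correct label, classifier's label) for examples #1..#6,
# encoded from the counts stated in the text (True = positive).
examples = [
    (True, True),    # 1: TP
    (True, False),   # 2: FN
    (False, True),   # 3: FP
    (True, True),    # 4: TP
    (False, False),  # 5: TN
    (True, False),   # 6: FN
]


def confusion_counts(pairs):
    """Return (TP, FP, TN, FN) for a list of (actual, predicted) booleans."""
    tp = sum(1 for actual, predicted in pairs if actual and predicted)
    fp = sum(1 for actual, predicted in pairs if not actual and predicted)
    tn = sum(1 for actual, predicted in pairs if not actual and not predicted)
    fn = sum(1 for actual, predicted in pairs if actual and not predicted)
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Proportion of examples whose label was guessed correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


tp, fp, tn, fn = confusion_counts(examples)
print(tp, fp, tn, fn)            # 2 1 1 2
print(accuracy(tp, fp, tn, fn))  # 0.5
```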
Let's now look at another example. Say we have a classifier trained to do spam filtering, and we got the following results:
| | Classified positive | Classified negative |
|---|---|---|
| Positive class | 10 (TP) | 15 (FN) |
| Negative class | 25 (FP) | 100 (TN) |
In this case, accuracy = (10 + 100)/(10 + 100 + 25 + 15) = 73.3%. We may be tempted to think our classifier is pretty decent, since it labeled about 73% of all the messages correctly. However, look what happens when we switch it for a dumb classifier that always says "no spam":
| | Classified positive | Classified negative |
|---|---|---|
| Positive class | 0 (TP) | 25 (FN) |
| Negative class | 0 (FP) | 125 (TN) |
We get accuracy = (0 + 125)/(0 + 125 + 0 + 25) = 83.3%. This looks crazy. We changed our model to a completely useless one, with exactly zero predictive power, and yet, we got an increase in accuracy.
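The comparison is easy to check numerically. The sketch below plugs the counts from the two tables into the accuracy formula; the variable names `trained` and `always_negative` are just labels chosen for this example.

```python
def accuracy(tp, fp, tn, fn):
    """Proportion of examples whose label was guessed correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Counts taken from the two tables in the text.
trained = dict(tp=10, fn=15, fp=25, tn=100)
always_negative = dict(tp=0, fn=25, fp=0, tn=125)

print(f"{accuracy(**trained):.1%}")          # 73.3%
print(f"{accuracy(**always_negative):.1%}")  # 83.3%
```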
This is called the accuracy paradox. When TP < FP, accuracy will always increase when we change the classification rule to always output the "negative" category: every false positive becomes a true negative, and we only give up the TP examples we had right, so we gain more correct answers than we lose. Conversely, when TN < FN, the same will happen when we change our rule to always output "positive".
So what can we do, so that we are not tricked into thinking one classifier model is better than another when it really isn't? We don't use accuracy. Or we use it with caution, together with other, less misleading measures. Meet precision and recall.
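To see why the TP < FP condition matters, here is a small sketch that collapses any set of counts into the "always negative" rule and measures the change in accuracy. The function names are hypothetical helpers written for this post, and the second set of counts is an invented example where TP > FP.

```python
def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def collapse_to_negative(tp, fp, tn, fn):
    """Counts we would get if every example were labeled negative instead.

    True positives turn into false negatives, and false positives turn
    into true negatives. Returned in (tp, fp, tn, fn) order.
    """
    return 0, 0, tn + fp, fn + tp


def accuracy_gain_if_always_negative(tp, fp, tn, fn):
    before = accuracy(tp, fp, tn, fn)
    after = accuracy(*collapse_to_negative(tp, fp, tn, fn))
    return after - before


# Spam example from the text: TP (10) < FP (25), so accuracy goes up.
print(accuracy_gain_if_always_negative(tp=10, fp=25, tn=100, fn=15) > 0)  # True

# Invented counts with TP (20) > FP (5): accuracy goes down.
print(accuracy_gain_if_always_negative(tp=20, fp=5, tn=100, fn=5) > 0)    # False
```

The gain works out to (FP - TP) divided by the total number of examples, which is positive exactly when TP < FP.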
precision = TP/(TP + FP)
recall = TP/(TP + FN)
If you think about it for a moment, precision answers the following question: out of all the examples the classifier labeled as positive, what fraction were correct? On the other hand, recall answers: out of all the positive examples there were, what fraction did the classifier pick up?
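Applied to the spam example, the two definitions can be sketched as follows. Returning `None` when the denominator is zero is a choice made for this illustration (precision is undefined when nothing was labeled positive), not a universal convention.

```python
def precision(tp, fp):
    """Fraction of examples labeled positive that really were positive."""
    # Undefined when nothing was labeled positive; None marks that case here.
    return tp / (tp + fp) if (tp + fp) else None


def recall(tp, fn):
    """Fraction of the actual positive examples that were picked up."""
    return tp / (tp + fn) if (tp + fn) else None


# Trained spam classifier from the text.
print(round(precision(tp=10, fp=25), 3))  # 0.286
print(recall(tp=10, fn=15))               # 0.4

# The "always no spam" classifier from the text.
print(precision(tp=0, fp=0))  # None
print(recall(tp=0, fn=25))    # 0.0
```

Seen this way, the dumb classifier no longer looks better: it catches none of the spam, while the trained one catches 40% of it.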
If the classifier does not make mistakes, then precision = recall = 1.0. But in real-world tasks this is practically impossible to achieve. It is trivial, however, to get perfect recall (simply make the classifier label all the examples as positive), but this will in turn give the classifier horrible precision and thus make it nearly useless. It is also easy to increase precision (only label as positive those examples that the classifier is most certain about), but this will come with horrible recall.
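The trade-off is easiest to see with a classifier that outputs a confidence score and a threshold that decides what counts as positive. The sketch below is only an illustration: the scores and labels are invented toy data, and `precision_recall_at` is a hypothetical helper.

```python
# Invented toy data: (classifier's confidence that the example is positive,
# whether the example really is positive).
scored = [
    (0.95, True), (0.90, True), (0.80, False), (0.70, True), (0.60, False),
    (0.40, True), (0.30, False), (0.20, False), (0.10, True), (0.05, False),
]


def precision_recall_at(threshold, scored):
    """Label as positive every example scoring at least `threshold`."""
    tp = sum(1 for score, actual in scored if score >= threshold and actual)
    fp = sum(1 for score, actual in scored if score >= threshold and not actual)
    fn = sum(1 for score, actual in scored if score < threshold and actual)
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return precision, recall


print(precision_recall_at(0.0, scored))   # (0.5, 1.0)  everything is positive
print(precision_recall_at(0.5, scored))   # (0.6, 0.6)
print(precision_recall_at(0.85, scored))  # (1.0, 0.4)  only the surest cases
```

Lowering the threshold to zero gives perfect recall with mediocre precision, and raising it gives perfect precision while missing most of the positives.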
The conclusion is that tweaking a classifier is a matter of balancing what is more important for us: precision or recall. It is possible to get both up: one may choose to optimize a measure that combines precision and recall into a single value, such as the F-measure. But we eventually reach a point at which we can't go any further, and our decisions have to be influenced by other factors.
Think about business importance. If we are developing a system that detects fraud in bank transactions, it is desirable to have a very high recall, i.e. most of the fraudulent transactions are identified, probably at the cost of precision, since it is very important that all fraud is identified or at least that suspicions are raised. In turn, if we have a source of data like Twitter and we are interested in finding out when a tweet expresses a negative sentiment about a certain politician, we can probably raise precision (to gain certainty) at the expense of recall, since we don't lose much in this case and the source of data is so massive anyway.
There are of course many other metrics for evaluating binary classification systems, and plots are very helpful too. The point is that you should not look at any of them in isolation: there is no single best way to evaluate a system, but different metrics give us different (and valuable) insights into how a classification model performs.
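For reference, here is a minimal sketch of the F-measure using its standard definition. With `beta=1` it is the harmonic mean of precision and recall (often called F1); values of `beta` above 1 weigh recall more heavily and values below 1 favor precision. The function name `f_measure` is made up for this example, and the inputs are the precision and recall of the spam classifier from the tables above.

```python
def f_measure(precision, recall, beta=1.0):
    """Standard F-beta score: a weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Spam classifier from the text: precision = 10/35, recall = 10/25.
p, r = 10 / 35, 10 / 25

print(round(f_measure(p, r), 3))          # 0.333
print(round(f_measure(p, r, beta=2), 3))  # 0.37  (recall counts more)
```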
Update (06/01/2017): fixed example. Thanks to the people who reported it!