Very often we get people contacting us for projects in which they envision the usage of some Machine Learning (ML) techniques to solve a specific problem they have.
Sometimes these people do not have any automated system and are solving whatever problem they currently have with human labor. In these cases, the mere fact of being able to produce anything that works reasonably well can make a huge difference for them. Imagine, for example, that your company wants you to review 100,000 tweets that make reference to its brand name, because they want to know which customers are not satisfied with products they are selling and why. Wouldn’t it be cool for you if I gave you a tool that can tell you that out of the 100,000 messages, 95,000 are not even worth looking at because they don’t even talk about products or don’t express negative feelings? It would be a huge time saver.
But there are other times in which people already have working solutions in place. If the problem is very complex, they might not be happy with how their solution performs. Sometimes even a solution that performs well might need a replacement. Imagine you have come up with a rule-based engine to process some type of data, and then more data starts coming in that is not adequately handled by the rules you have. This can mean that most of the rules that once were good now need to be rewritten, and probably also that new rules need to be created. The complexity of the system will keep increasing. This is another case in which an ML solution might come in handy: let’s make an algorithm figure out the rules for us. Let’s hope it performs better than any non-ML solution we have come up with so far.
A particular case
Not long ago we were offered an ML project that consisted of a classification task for some kind of data. A classification task means identifying or predicting which of a set of categories/labels should be assigned to some data. For example, the spam filter your email service provides assigns a spam or no spam status to every email. The classifier demo on this very same webpage is able to identify a hierarchy of topics and subtopics any text or webpage can belong to. A wide range of day-to-day problems can be turned into classification tasks. Using ML we can achieve classification by having someone deliver an already classified dataset, from which we can learn (i.e. infer patterns). This is called supervised classification. There are other kinds of classification (namely unsupervised and semi-supervised), but I am not going to talk about them now. What was special about this particular proposal was a requirement in the contract: the client wanted us to commit to our system achieving 95% accuracy in the task by the end of the project. I immediately asked myself two questions:
- Why 95%?
- Why accuracy?
In what follows, we are going to analyze the first question. The next post of this series is going to deal with measures of performance and why accuracy in particular is deceiving.
Machine Learning and metrics
It is clear that when developing a classification system, it’s going to be very useful to have an objective metric by which we can know how well it performs. But saying 95% because it sounds good is arbitrary unless you have previously done some work to make sure this is indeed achievable. Besides being arbitrary, being right 95% of the time may indicate great performance for some tasks and terrible performance for others. For example, imagine a dataset in which 99% of the data belongs to class A, and the remaining 1% belongs to class B. Datasets like this are called imbalanced datasets (guess you weren’t expecting that, huh?). A classifier that always says “class A” will be right 99% of the time, while never finding a single example of class B. I guess no customer would be happy if a contract-compliant solution turned out to be a one-liner like
return True or such :-)
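To make the arithmetic concrete, here is a minimal sketch in Python. The dataset is synthetic, and the names `always_a` and `accuracy` are made up for illustration; nothing here comes from the actual project.

```python
# Synthetic, imbalanced dataset: 990 examples of class "A", 10 of class "B".
labels = ["A"] * 990 + ["B"] * 10


def always_a(example):
    """A 'classifier' that ignores its input and always answers "A"."""
    return "A"


def accuracy(predictions, gold):
    """Fraction of predictions that match the gold labels."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


predictions = [always_a(x) for x in labels]
majority_accuracy = accuracy(predictions, labels)

# Fraction of the class "B" examples that the classifier actually finds.
found_b = sum(p == "B" and g == "B" for p, g in zip(predictions, labels))
b_found_rate = found_b / labels.count("B")

print(f"accuracy: {majority_accuracy:.2%}")
print(f"class B examples found: {b_found_rate:.2%}")
```

The accuracy looks excellent, yet the rate for class B shows that the classifier is useless for the one class we probably care about.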
The first thing we do in an ML task is establish a baseline. A baseline can be the measure of performance of a very simple system. For example, for a spam detection task a baseline may very well be the output of a system which classifies an email as spam if it contains words such as viagra, pharmacy, etc. Ideally it should be a bit more complex than that. A baseline could also be the performance of a previous, existing system which we wish to improve upon. After establishing what the baseline is, we know an ML solution has to perform at least as well as it does. If it doesn’t achieve that, then we are doing something wrong or need a larger tagged dataset for training (learning), unless the baseline was a really complex system in itself (then maybe it wasn’t so much of a base-line after all).
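A keyword baseline like the spam one described above could be sketched as follows. This is only an illustration: the word list, the five tagged emails and the function name `baseline_is_spam` are all hypothetical, and a real baseline would be measured on a much larger tagged dataset.

```python
import re

# Hypothetical trigger words; a real list would come from inspecting the data.
SPAM_WORDS = {"viagra", "pharmacy"}


def baseline_is_spam(email_text):
    """Flag an email as spam if it contains any of the trigger words."""
    words = set(re.findall(r"[a-z]+", email_text.lower()))
    return bool(words & SPAM_WORDS)


# Tiny made-up tagged sample: (text, is_spam).
tagged_emails = [
    ("Cheap VIAGRA, no prescription needed", True),
    ("Online pharmacy: best prices", True),
    ("You have won a prize, click here", True),
    ("Meeting moved to 3pm tomorrow", False),
    ("Here are the slides from today's talk", False),
]

# Count how often the baseline agrees with the human tags.
hits = sum(baseline_is_spam(text) == is_spam for text, is_spam in tagged_emails)
baseline_accuracy = hits / len(tagged_emails)

print(f"baseline accuracy: {baseline_accuracy:.0%}")
```

On this toy sample the baseline misses the spam email that contains none of the trigger words. Whatever number it reaches on real data is the figure an ML solution has to beat.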
But even knowing the baseline, and assuming it is less than the 95% our client wanted us to achieve, one cannot promise a certain level of accuracy above it before the ML algorithm is actually built (and ‘tweaked’ and tested).
How good could it possibly get? For many tasks, achieving 100% accuracy is never possible. And I don’t mean just really hard, I actually mean impossible. This is not always the fault of the algorithms: there are algorithms and implementations which can make computers achieve human-level performance for certain tasks. The problem is this: when two humans classify the same data independently of each other, they might not agree 100% of the time. Say they agree 97% of the time. This is called the ceiling, and it is the maximum performance we can meaningfully expect an ML system to achieve: beyond that point, even the humans who produce the “correct” answers disagree about what is correct. Asking for more than that is meaningless. Knowing this, the performance of any ML solution will lie between the baseline and a ceiling (unknown but probably less than 100%).
Getting to know the actual value of the ceiling takes some more effort: you would have to assign at least two people (with deep knowledge of the data they are going to tag) to tag examples, each one independently of the other, and then calculate their so-called agreement rate. There are several metrics used for this, but I will not go into the details. If one were focusing on research this would be a must, but it’s definitely possible to build a great system (i.e. one that saves money by reducing human labour) without knowing the value of the ceiling.
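For the curious, here is a rough sketch of how an agreement rate could be computed from the tags of two annotators. The post does not say which metric to use, so take this as an assumption: the sketch shows the raw agreement rate plus Cohen’s kappa, one common metric that corrects for the agreement two people would reach by chance. The tags and the function names are made up.

```python
from collections import Counter


def agreement_rate(tags_a, tags_b):
    """Fraction of examples on which both annotators chose the same label."""
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("need two non-empty tag lists of equal length")
    return sum(a == b for a, b in zip(tags_a, tags_b)) / len(tags_a)


def cohen_kappa(tags_a, tags_b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(tags_a)
    observed = agreement_rate(tags_a, tags_b)
    count_a, count_b = Counter(tags_a), Counter(tags_b)
    # Chance agreement: probability that both pick the same label at random,
    # given how often each annotator uses each label.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1 - expected)


# Made-up tags from two annotators for the same ten emails.
annotator_1 = ["spam", "spam", "ok", "ok", "spam", "ok", "ok", "ok", "spam", "ok"]
annotator_2 = ["spam", "ok", "ok", "ok", "spam", "ok", "ok", "spam", "spam", "ok"]

raw = agreement_rate(annotator_1, annotator_2)
kappa = cohen_kappa(annotator_1, annotator_2)

print(f"raw agreement: {raw:.0%}")
print(f"Cohen's kappa: {kappa:.2f}")
```

The raw agreement is the number that plays the role of the ceiling in the discussion above; kappa is lower because part of that agreement would happen by chance anyway.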
- one can never say something like: the baseline is 90% so we are going to get to 95% by the end of this project.
- one can say something like: the baseline is 90%, so we are going to commit to improving on that by as much as we can; until we actually do it, we cannot say by how much.