Some years ago I was loitering at a trade expo where I was working. I found myself mesmerized in front of a big machine able to remove the impurities from a stream of wheat grains.
As the grains fell vertically, this machine used an artificial vision system and a puff of air to discard small pebbles, oddly coloured grains or any other impurity at a truly astonishing speed.
I was approached by a salesman from the company that built this machine, and he started to extol its extraordinary features. When he told me that this was the latest model, and that it had 98% Precision, I asked about the Recall of this amazing machine.
He looked at me quite baffled, and vanished as if he were a member of the Star Trek crew being beamed up, telling me that he would send me a more expert colleague.
I saw the same raised-eyebrow expression when I asked his more ‘erudite’ colleague about Recall, and he replied hesitantly: “What do you mean by ‘Recall’?”
I knew that they had no idea of what they were talking about, so I told him to stop thinking about Arnold Schwarzenegger’s ‘Total Recall’ and to think instead about the statistical definitions of ‘Precision’ and ‘Recall’.
To apply this concept to the story: I was told that the system could detect 98% of improper grains, and they called this metric ‘Precision’. In reality they misnamed it, probably because they didn’t know what statistical Precision was.
What matters for a machine like that is to prevent impurities from passing through and, only secondarily, not to discard too many good grains by mistake.
The first statistical property (the share of impurities that actually get discarded) is commonly called ‘Recall’, and the second one (the share of discarded items that really were impurities) is defined as ‘Precision’.
This is very typical of some Expert Systems, such as those used to fight terrorism: better to bother a multitude of travellers than to let a single dangerous person slip through.
Anyway, that day I realized how much confusion there is around two metrics I had assumed to be elementary, and my concern stems from how important these two metrics are.
As always, let’s use an example to clarify the whole concept.
Some friends invite you for lunch, and you decide to buy ice-cream for everyone.
What would be the best method to satisfy everyone’s tastes?
Easy: buy all the flavours in the refrigerator aisle.
Right at that moment, in front of the fridge, you recall that your friend Joe, who happens to be invited to the lunch, is terribly allergic to something, and the last time he inadvertently ate it, he turned purple and swollen like an aubergine.
You have just figured out that there are flavours that could kill Joe (and let’s assume you are not a Jim Jones emulator), and immediately your idea of buying every possible flavour turns out to be, in at least one case, unwise and potentially lethal.
In other words, your description of the accuracy of your system (the purpose is to satisfy all your friends) needs to take into account the scenario where every friend would be happy except Joe (especially because dying of ice cream looks like a very stupid death).
This is why you need specific metrics.
Our formal problem here is to determine whether an element belongs to a class or not. Basically we are talking about a classification problem.
In our tasty example, the problem is to understand if the flavours we are picking will satisfy our friends.
But satisfying our friends also means not killing Joe, so, breaking up our ice cream selection, we can observe that it is actually composed of 2 sub-problems:
- Pick the satisfying flavours
- Exclude the deadly flavours
Our selection (i.e. our model) will introduce 4 possible ‘classes’:
- True Positives: the choices thought to be good that turned out to be good (flavours that make everyone give you high fives)
- False Positives: the choices thought to be good that turned out to be bad (“Farewell Joe”)
- True Negatives: the choices thought to be bad that turned out to be bad (flavours no one liked, and rightly discarded)
- False Negatives: the choices thought to be bad that turned out to be good (“No one likes fruit flavours!” But your friends Jack, Jill and John have become fruitarian)
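To make the four outcomes concrete, here is a minimal Python sketch that tallies them for a shopping trip. The flavour names and the picked/good values are invented purely for illustration (the story never says what Joe is allergic to), and `count_outcomes` is a hypothetical helper, not part of any library.

```python
# Hypothetical shopping trip: every flavour and value below is made up.
# Each entry is (flavour, picked, actually_good):
#   picked        -> we thought it was a good choice and bought it
#   actually_good -> it really was a good choice for the lunch
choices = [
    ("chocolate", True, True),     # bought and appreciated
    ("vanilla", True, True),       # bought and appreciated
    ("peanut", True, False),       # bought, but assumed to be Joe's allergen
    ("strawberry", False, True),   # left on the shelf, but friends wanted it
    ("liquorice", False, False),   # left on the shelf, nobody misses it
]


def count_outcomes(choices):
    """Tally true/false positives and negatives for a list of choices."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for _flavour, picked, good in choices:
        if picked and good:
            counts["TP"] += 1   # thought good, was good
        elif picked and not good:
            counts["FP"] += 1   # thought good, was bad
        elif not picked and not good:
            counts["TN"] += 1   # thought bad, was bad
        else:
            counts["FN"] += 1   # thought bad, was good
    return counts


counts = count_outcomes(choices)
print(counts)  # {'TP': 2, 'FP': 1, 'TN': 1, 'FN': 1}
```

Note that a single False Positive is enough to ruin the lunch here, which is exactly why a plain count of "right" and "wrong" choices is not sufficient.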
As always, an image explains more than 1000 words:
Error can be explained as the proportion of data points classified incorrectly. Most of the time it is calculated as the number of incorrect classifications divided by the total number of classifications.
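As a rough sketch of that calculation, assuming the incorrect classifications are the False Positives plus the False Negatives, the error could be computed as below. The function name `error_rate` and the sample counts are made up for the example.

```python
def error_rate(tp, fp, tn, fn):
    """Share of classifications that were wrong: (FP + FN) / total."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("no classifications to evaluate")
    return (fp + fn) / total


# Hypothetical counts: 2 true positives, 1 false positive,
# 1 true negative, 1 false negative.
rate = error_rate(tp=2, fp=1, tn=1, fn=1)
print(rate)  # 0.4
```

The single number hides which kind of mistake was made: a forgotten favourite flavour and a trip to the hospital for Joe weigh exactly the same.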
If we apply the error metric described above to the ice-cream example, we now have to clarify what exactly counts as an incorrect classification.
Sorting this out is essential to avoid a potentially endless play on words. A good way to put things in order is to visualize them, so we are going to use a so-called Confusion Matrix, which is a grid like this:
In such a grid, Precision corresponds to the ratio contained in the ‘True Positive’ column:
Precision = True Positives / (True Positives + False Positives)
Prosaically, evaluating Precision is like asking: “What percentage of the positive predictions were actually positive?”
Recall corresponds to the ratio contained in the ‘True Positive’ row:
Recall = True Positives / (True Positives + False Negatives)
Prosaically, evaluating Recall is like asking: “What percentage of the actual positives was the model able to correctly identify?”
The two questions might sound similar, but we want to know both “How many of the positive data points was the model able to ‘recall’?” and “How ‘precise’ were the actual predictions?”, which are slightly but significantly different, especially in Expert Systems as mentioned before.
The key point about the difference between these two coefficients is that, if we want to understand how good our classification is, we need to consider Precision and Recall together, like the two sides of the same coin.
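A small Python sketch may help to see the two coefficients side by side. It uses the standard definitions, Precision = TP / (TP + FP) and Recall = TP / (TP + FN); the counts are invented, and returning 0.0 when a ratio is undefined is just a convention assumed here.

```python
def precision(tp, fp):
    """Of everything predicted positive, how much really was positive?"""
    predicted_positive = tp + fp
    # Assumed convention: report 0.0 when nothing was predicted positive.
    return tp / predicted_positive if predicted_positive else 0.0


def recall(tp, fn):
    """Of everything really positive, how much did the model find?"""
    actual_positive = tp + fn
    # Assumed convention: report 0.0 when there are no actual positives.
    return tp / actual_positive if actual_positive else 0.0


# Hypothetical counts: 8 true positives, 2 false positives, 8 false negatives.
p = precision(tp=8, fp=2)
r = recall(tp=8, fn=8)
print(p, r)  # 0.8 0.5
```

With these made-up numbers the predictions are fairly precise (8 out of 10 are right), yet half of the actual positives are missed, which is why neither value tells the whole story alone.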
The next logical question is:
Is there a way to condense both Precision and Recall into a single number?
The answer is a metric called the F score (a.k.a. the F1 score or, jokingly, the Formula 1 score).
One easy definition is:
F1 = 2 x (Precision x Recall) / (Precision + Recall)
In other words, it’s a value between 0 and 1 (if we have perfect precision = 1 and recall = 1, the calculation becomes 2 x [1 x 1 / (1 + 1)] = 1) that gives precision and recall equal importance.
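It is easy to figure out that if one value is extremely high and the other is close to zero, the overall F score is also close to zero, and not around 0.5 as a simple average would suggest: being a harmonic mean, the F score is dragged down by the weaker of the two values. For example, with precision = 1 and recall = 0.01 we get 2 x [1 x 0.01 / (1 + 0.01)], which is roughly 0.02. So no glass half full here (not even of Joe’s ice cream, of course).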
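To check how the F score behaves with balanced and unbalanced values, here is a minimal sketch of the formula 2 x (Precision x Recall) / (Precision + Recall). The function name `f1_score` and the sample values are chosen for illustration, and returning 0.0 when both inputs are zero is an assumed convention.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention: avoid dividing by zero
    return 2 * precision * recall / (precision + recall)


perfect = f1_score(1.0, 1.0)    # both perfect
balanced = f1_score(0.5, 0.5)   # both mediocre
lopsided = f1_score(1.0, 0.01)  # one excellent, one terrible

print(perfect)             # 1.0
print(balanced)            # 0.5
print(round(lopsided, 4))  # 0.0198
```

The lopsided case is the interesting one: an arithmetic average of 1.0 and 0.01 would be about 0.5, while the harmonic mean stays close to the weaker value.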