Defining a classification problem is pretty easy: there are scattered points and someone wants to create distinct subsets of these points (called, guess how? – classes) to reach a more synthetic definition of the variety which characterizes the reality he’s observing.
What am I babbling about?
Ok, let’s make an example. Let’s think about human races.
In many countries when you want to apply for a job, you are asked to choose your ethnicity from a drop-down menu.
The choice might be between 3, 4 up to 10 different description of races. Are there really less than 10 different human races in the world?
If you want to synthesize, yes.
Actually, there are only three major groups: Caucasian, Mongolian and Negroid.
Am I really asserting that a Mongol and a Maori belong to the same race?
Yes and no. Yes: they’re both Mongolian. And no: there are more than 30 sub-groups, which discriminate the evident internal differences between the 3 super-groups.
To make it all more complicated, there are anthropologists affirming that actually there’s only one race: the human one.
All this introduction was to say that, very often, the number of classes that you want to find in your classification problem, is not a quantitative problem.
This might seem unexpected for you, but there’s a mathematical reason for this.
The central subject around a classification problem is the concept of distance from a point which represent the whole class, called centroid.
Being part of a class means in effect to be closer to a centroid rather than to another.
And ‘closer‘ is not always about Euclidean distance.
Think about the Bishop in the Chess game. Think about its possible moves.
You can be just on the opposite corner of the chessboard and still be ‘closer‘ than a piece positioned one square distant on the left of the Bishop.
So, what is the minimum distance that a point may have from a centroid? Easy answer: it’s zero. When a point corresponds with its centroid, the error (i.e. the distance with its centroid) rate is zero.
Which implies that if you have 2000 points and you choose each of them as a centroid, the error rate is zero.
Vice-versa, if you pick zero centroids, your error rate will be 1 (or 100%).
This exaggeration is to say that the ‘centroid’ concept is an extreme approximation, and the number of classes (or centroids) is something really connected with the reality (or the perception of the reality) of the phenomenon that you want to classify: it might be classes of elements, behaviours or peculiarities.
In other words, the first thing you need to investigate to find the optimal number of classes in your classification problem is not an algorithm, but the reality you want to describe, and we will see how.
Anyway, this does not mean that it’s impossible to put it down also in a quantitative manner.
Let’s see it with an example.
Let’s say that you are asked to classify how much prone is a company to default only looking at its yearly profit.
We have to create classes of risk; then for every class of risk, a team will be created, spending the next year to crunch zillions of data to help the bank system not to bet on the losing horse (translation: to avoid giving credit to possible defaulting companies).
This is a sample of companies’ distribution:
On the x-axis we have yearly profit, on a the y-axis we have a coefficient of probability to default (where 0=extreme stability and 1=Enron Corporation).
The most idiotic subdivision in classes would be in 2: risky / non risky.
Ok, seems sensible.
A doubt: what if we decided to split our customers in 3 classes: low risk / medium risk / high risk ?
Not bad. No, wait: isn’t the ‘medium risk’ class too much populated?
Yes, this is a typical classification issue working with social data: people want to conform. Incidentally, when you are facing social data, and you split you data in a typical ‘high / medium / low’ distinction, you will see the majority of your population to dwell in the ‘medium’ class, following a standard Gaussian distribution; using more prosaic words, it’ll create a central ‘mare magnum’ group – collecting every Tom, Dick and Harry – and two other classes basically only for freaks (‘outliers’, using mathematical terms).
Ah, ok: so it would be better to use 4 classes. Mmh, and what about 5? Or 6?
Here we are: is there a (automatic) way to find the optimal number of classes?
There is one.
It is called the Elbow Curve method.
Intuitively, as the number of classes (and original centroids) increases, the error rate should decrease. As we said, if we choose k centroids, with k = number of examples in our population, error rate should be zero.
So, let’s plot the number of classes we choose versus the error rate we observe.
We can see that every new class produces a reduction in terms of error rate; but after the first four-five centroids, every further addition is always less significant.
This means that using 4-5 classes will be the best effective option (not the absolute best) for our classification problem.
But, as you know, life is not a bed of roses.
Sometimes when plotting a ‘number of classes vs centroids’ comparison, we face a situation like this:
Ok, maybe this one is a little bit too extreme; but it’s good for the dramatization.
It seems that, as before, we have a constant reduction in terms of precision for every new class added. In other words, we’re plotting an elbow curve, but there’s really no elbow.
So how to choose a satisfying number of classes?
We said that it is very important to reason around the reality we are considering. We stated that
for every class of risk, a team will be created
This means that every point in that plot is not only a coloured dot, but also a team – a team that has a cost.
Let’s assume that this table summarizes the cost to create new teams:
Let’s apply these costs to the improvement of precision that every new class guarantees (cost for every new team added / improvement in precision):
Now we can see how much it will cost to improve 1% in terms of precision.
If we take our ‘linearly improved’ (improbable) example, and apply this ratio,
we will start to make some assumptions:
- Adding a fourth team is quite expensive
- After adding a fourth team, arriving to six teams is very expensive (it will cost almost a 30% extra compared to the average cost for 2-3 teams).
- Only a fool would add a seventh team
Does this give us the Columbus’ egg about how many classes to use in a classification problem?
Of course not.
If you were expecting a super complicated algorithm based on dimensionality reduction in centroids’ distance, you might be disappointed receiving a junior high school suggestion.
But the real lesson to learn associated with this article is that you have to remember that you are not considering simply points on a Cartesian system: that one is the representation (the synthesis) of the problem, in a graphical model.
But beyond those dots and those euclidean distances there are dozens of nuances that may play a central role in a decisionally critical moment.
Because sometimes we forget that, as Norbert Wiener once said:
“the best model of a cat is a cat”.