‘Class will out’ – in a classification problem


Defining a classification problem is pretty easy: there are scattered points, and someone wants to group them into distinct subsets (called – guess how? – classes) to reach a more synthetic description of the variety that characterizes the reality they’re observing.

What am I babbling about?

Ok, let’s take an example. Let’s think about human races.

In many countries when you want to apply for a job, you are asked to choose your ethnicity from a drop-down menu.

The choice might be between 3, 4, up to 10 different descriptions of race. Are there really fewer than 10 different human races in the world?

If you want to synthesize, yes.

[Image: United Colors of Class-a-thon]

Actually, there are only three major groups: Caucasian, Mongolian and Negroid.

Am I really asserting that a Mongol and a Maori belong to the same race?

Yes and no. Yes: they’re both Mongolian. And no: there are more than 30 sub-groups, which capture the evident internal differences within the 3 super-groups.

To make it all more complicated, there are anthropologists who affirm that there’s actually only one race: the human one.

 

All this introduction was to say that, very often, the number of classes you want to find in your classification problem is not a quantitative question.

This might seem unexpected, but there’s a mathematical reason for it.

The central concept in a classification problem is the distance from a point that represents the whole class, called the centroid.

Being part of a class effectively means being closer to one centroid than to another.

And ‘closer’ is not always about Euclidean distance.

[Image: the Bishop’s possible moves]

Think about the Bishop in chess. Think about its possible moves.

You can be on the opposite corner of the chessboard and still be ‘closer’ than a piece positioned one square to the left of the Bishop.
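To make the idea concrete, here is a minimal Python sketch (the coordinates and function names are my own, hypothetical choices) contrasting Euclidean distance with a ‘bishop distance’, i.e. the number of moves the Bishop needs to reach a square:

```python
import math

def euclidean(a, b):
    """Straight-line distance between two squares, given as (file, rank)."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def bishop_distance(a, b):
    """Number of moves a bishop needs from a to b on an empty board:
    0 for the same square, 1 along a diagonal, 2 for any other square
    of the same colour, unreachable (infinity) otherwise."""
    if a == b:
        return 0
    if (a[0] + a[1]) % 2 != (b[0] + b[1]) % 2:
        return math.inf                       # wrong colour: never reachable
    if abs(a[0] - b[0]) == abs(a[1] - b[1]):
        return 1                              # same diagonal: one move
    return 2                                  # same colour: two moves

bishop = (2, 0)       # c1
far_corner = (7, 7)   # h8, the opposite corner, same colour as c1
next_door = (1, 0)    # b1, one square to the left of the Bishop

print(euclidean(bishop, far_corner), bishop_distance(bishop, far_corner))
# 8.60..., 2  -> far away in Euclidean terms, two bishop moves
print(euclidean(bishop, next_door), bishop_distance(bishop, next_door))
# 1.0, inf    -> right next door, but the Bishop can never get there
```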

So, what is the minimum distance that a point may have from a centroid? Easy answer: it’s zero. When a point coincides with its centroid, the error (i.e. its distance from the centroid) is zero.

Which implies that if you have 2000 points and you choose each of them as a centroid, the error rate is zero.

Vice versa, if you pick zero centroids, your error rate will be 1 (or 100%).

This exaggeration is to say that the ‘centroid’ concept is an extreme approximation, and that the number of classes (or centroids) is deeply connected with the reality (or the perception of the reality) of the phenomenon you want to classify: it might be classes of elements, behaviours or peculiarities.

In other words, the first thing you need to investigate to find the optimal number of classes in your classification problem is not an algorithm, but the reality you want to describe – and we will see how.

[Image: Centroids vs Reality]

Anyway, this does not mean that it’s impossible to pin it down in a quantitative manner too.

 

Let’s see it with an example.

Let’s say that you are asked to classify how prone a company is to default, looking only at its yearly profit.

We have to create classes of risk; then, for every class of risk, a team will be created, which will spend the next year crunching zillions of data points to help the banking system avoid betting on the losing horse (translation: to avoid giving credit to companies likely to default).

This is a sample of companies’ distribution:

[Image: the companies’ distribution]

On the x-axis we have yearly profit; on the y-axis we have a probability-of-default coefficient (where 0 = extreme stability and 1 = Enron Corporation).

The most idiotic subdivision would be into 2 classes: risky / non-risky.

[Image: the two-class split: risky / non-risky]

Ok, seems sensible.

A doubt: what if we decided to split our customers into 3 classes: low risk / medium risk / high risk?

[Image: the three-class split: low / medium / high risk]

Not bad. No, wait: isn’t the ‘medium risk’ class too populated?

Yes, this is a typical classification issue when working with social data: people want to conform. Incidentally, when you face social data and split it with the typical ‘high / medium / low’ distinction, you will see the majority of your population dwell in the ‘medium’ class, roughly following a Gaussian distribution; in more prosaic words, you’ll create a central ‘mare magnum’ group – collecting every Tom, Dick and Harry – and two other classes basically reserved for the freaks (‘outliers’, in mathematical terms).
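A quick sketch of this ‘everyone is medium’ effect, assuming the risk scores are roughly Gaussian (the thresholds below are arbitrary placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical risk scores, roughly Gaussian around 0.5
risk = rng.normal(loc=0.5, scale=0.15, size=10_000).clip(0, 1)

low = (risk < 0.35).mean()
medium = ((risk >= 0.35) & (risk <= 0.65)).mean()
high = (risk > 0.65).mean()
print(f"low: {low:.0%}  medium: {medium:.0%}  high: {high:.0%}")
# the 'medium' class swallows roughly 70% of the population
```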

Ah, ok: so it would be better to use 4 classes. Mmh, and what about 5? Or 6?

Here we are: is there an (automatic) way to find the optimal number of classes?

There is one.

It is called the Elbow Curve method.

 

Intuitively, as the number of classes (and centroids) increases, the error rate should decrease. As we said, if we choose k centroids, with k equal to the number of examples in our population, the error rate should be zero.

So, let’s plot the number of classes we choose versus the error rate we observe.
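Here is a minimal sketch of how such a plot can be produced, assuming scikit-learn and matplotlib are available (the synthetic blobs stand in for our companies, since we don’t have the original data):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the (profit, default-risk) points
X, _ = make_blobs(n_samples=2000, centers=5, random_state=42)

ks = range(1, 11)
errors = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    errors.append(km.inertia_)  # total squared distance of points from their centroids

plt.plot(list(ks), errors, marker="o")
plt.xlabel("number of classes (k)")
plt.ylabel("error (within-cluster sum of squares)")
plt.title("Elbow curve")
plt.show()
```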

[Image: the elbow curve – number of classes vs error rate]

 

We can see that every new class produces a reduction in error rate; but after the first four or five centroids, every further addition is less and less significant.

This means that using 4-5 classes will be the most effective option (not the absolute best) for our classification problem.

But, as you know, life is not a bed of roses.

Sometimes, when plotting the ‘number of classes vs error rate’ comparison, we face a situation like this:

[Image: an elbow curve with no elbow]

Ok, maybe this one is a little too extreme; but it’s good for dramatic effect.

It seems that, as before, we have a constant reduction in error rate for every new class added. In other words, we’re plotting an elbow curve, but there’s really no elbow.

So how do we choose a satisfying number of classes?

We said that it is very important to reason about the reality we are considering. We stated that

for every class of risk, a team will be created

This means that every point in that plot is not only a coloured dot, but also a team – a team that has a cost.

Let’s assume that this table summarizes the cost to create new teams:

[Table: cost of creating each new team]

Let’s apply these costs to the improvement in precision that every new class guarantees (cost of every new team added / improvement in precision):

[Table: cost of each new team per improvement in precision]

Now we can see how much it costs to gain 1% of precision.
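In code the ratio is trivial to compute; the numbers below are made-up placeholders, since the original tables are only images:

```python
# Hypothetical cost of each new team and precision gained by each new class
team_costs     = [100, 110, 125, 160, 210, 280, 370]  # cost units (made up)
precision_gain = [40, 20, 12, 8, 6, 5, 4]             # % precision gained (made up)

for k, (cost, gain) in enumerate(zip(team_costs, precision_gain), start=1):
    print(f"team {k}: {cost / gain:5.1f} cost units per 1% of precision")
```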

If we take our ‘linearly improving’ (improbable) example and apply this ratio,

[Table: cost-per-precision ratios for the linearly improving example]

 

we can draw some conclusions:

  1. Adding a fourth team is quite expensive.
  2. After adding a fourth team, getting to six teams is very expensive (it will cost almost 30% extra compared to the average cost of teams 2-3).
  3. Only a fool would add a seventh team.

 

Does this give us the egg of Columbus for how many classes to use in a classification problem?

Of course not.

If you were expecting a super-complicated algorithm based on dimensionality reduction in centroid distances, you might be disappointed to receive a junior-high-school suggestion.

But the real lesson of this article is to remember that you are not simply considering points on a Cartesian plane: that is just the representation (the synthesis) of the problem in a graphical model.

But beyond those dots and those Euclidean distances there are dozens of nuances that may play a central role when a critical decision has to be made.

Because sometimes we forget that, as Norbert Wiener once said:

“the best model of a cat is a cat”.
