The Metrics Reloaded – Total (Precision and) Recall

Some years ago I was wandering around a trade expo for work, and I found myself mesmerized in front of a big machine that removed impurities from a flow of wheat grains.

As the grains fell vertically, the machine used an artificial vision system and a puff of air to discard small stones, oddly coloured grains and any other impurity, at an astonishing speed.

I was approached by a salesman from the company that built this machine, and he started to eulogize its extraordinary features. When he told me that this was the latest model, boasting a 98% Precision, I asked about the Recall of this amazing machine.

He looked at me quite baffled, and vanished like a teleported member of the Star Trek crew, telling me that he would send me a more expert colleague.

I saw the same raised-eyebrow expression when I asked his more ‘erudite’ colleague about Recall, and he replied with some hesitation: “What do you mean by ‘Recall’?”


 

I realized they had no idea what they were talking about, so I told him to stop thinking about Arnold Schwarzenegger’s ‘Total Recall‘ and to start thinking about the statistical definitions of ‘Precision‘ and ‘Recall‘.

 

To apply these concepts to the story: I was told that the system could detect 98% of the improper grains, and they called this metric ‘Precision’. In reality they misnamed it, because they probably didn’t even know what statistical Precision was.

What really matters for a machine like that is preventing impurities from getting through, and only secondarily avoiding the discarding of too many good grains by mistake.

The first statistical property I mentioned is commonly called ‘Recall’, and the second one is defined as ‘Precision’.

This trade-off is typical of some Expert Systems, such as those used to fight terrorism: better to bother a multitude of travellers than to let a single dangerous person slip through.

Anyway, that day I figured out how much confusion there is around two metrics I had supposed to be elementary, and that worries me precisely because these two metrics are so important.

 

As always, let’s make an example to clarify the whole concept.

Some friends invite you for lunch, and you decide to buy ice-cream for everyone.

What would be the best method to satisfy everyone’s tastes ?
Easy: to buy all the flavours in the refrigerator aisle.


Ehr….

 

Right at that moment, in front of the fridge, you recall that your friend Joe, who happens to be invited to the lunch, is terribly allergic to something; the last time he inadvertently ate it, he turned purple and swelled up like an aubergine.

You have just figured out that there are flavours that could kill Joe (and let’s assume you are not a Jim Jones emulator), so your idea of buying every possible flavour immediately turns out to be – in at least one case – unwise (not to say crappy) and potentially lethal.

In other words, your description of the accuracy of your system (whose purpose is to satisfy all your friends) needs to take into account the scenario where every friend is happy except Joe (especially because dying of ice cream looks like a very stupid death).

This is why you need specific metrics.

Our formal problem here is to determine whether an element belongs to a class or not. Basically we are talking about a classification problem.

In our tasty example, the problem is to understand if the flavours we are picking will satisfy our friends.

But satisfying our friends also means avoiding killing Joe, so, breaking up our ice-cream selection, we can observe that it is actually composed of 2 sub-problems:

  • Pick the satisfying flavours
  • Exclude the mortal flavours

 

Our selection (i.e. our model) will produce 4 possible kinds of outcome:

  • True Positives: the choices thought to be good and revealed to be good (flavours that earn you high fives from everyone)
  • False Positives: the choices thought to be good but revealed to be bad (“Farewell Joe”)
  • True Negatives: the choices thought to be bad and revealed to be bad (flavours no one liked, correctly discarded)
  • False Negatives: the choices thought to be bad but revealed to be good (“No one likes fruit flavours!” But your friends Jack, Jill and John have become fruitarian)


 

Error can be described as the proportion of data points classified incorrectly. Most of the time it is calculated as the number of incorrect classifications divided by the total number of classifications.

If we apply this error metric to the ice-cream example, we now have to clarify what exactly counts as an incorrect classification.

Sorting this out is essential, otherwise we end up in a potentially infinite play on words. A good approach to put things in order is to visualize them, so we are going to use a so-called Confusion Matrix, which is a grid like this:

                      Predicted Positive      Predicted Negative
  Actual Positive     True Positives          False Negatives
  Actual Negative     False Positives         True Negatives

In such a grid, Precision corresponds to the ratio computed along the column containing the True Positives (the ‘predicted positive’ column):

Precision = True Positives / (True Positives + False Positives)

Prosaically, evaluating Precision is like asking: “Of everything the model predicted as positive, what percentage was actually positive?”

Recall corresponds to the ratio computed along the row containing the True Positives (the ‘actual positive’ row):

Recall = True Positives / (True Positives + False Negatives)

Prosaically, evaluating Recall is like asking: “Of all the actual positives, what percentage was the model able to correctly identify?”

The two questions might sound similar, but one asks “how many of the positive data points was the model able to ‘recall’?”, while the other asks “how ‘precise’ were its positive predictions?” – a slight but significant difference, especially in Expert Systems like the ones mentioned before.

The key point is that if we want to understand how good our classification is, we need to consider Precision and Recall together, like two sides of the same coin.

The next logical question is:

Is there a way to condense both Precision and Recall into a single number?

The answer is a metric called the F score (a.k.a. the F1 score, a.k.a. the Formula 1 score).

One easy definition is:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

In other words, it’s a value between 0 and 1 (if we have perfect precision = 1 and recall = 1, the calculation becomes 2 x [1 x 1 / (1 + 1)] = 1), giving precision and recall equal importance.

Note, though, that this is a harmonic mean, not a simple average: if one of the two values is excellent and the other is close to zero, the F score collapses towards the lower one instead of settling around 0.5. A model cannot hide a terrible Recall behind a stellar Precision (or vice versa) – no glass-half-full consolation here, not even a glass half full of Joe’s ice cream.
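To make all of this concrete, here is a minimal Python sketch (mine, not part of the original machine-vendor story) that computes Precision, Recall, error rate and the F score from made-up counts for the ice-cream afternoon:

```python
# Hypothetical counts for the ice-cream afternoon (purely illustrative).
tp = 8   # flavours predicted good that everyone actually liked
fp = 1   # the flavour that almost killed Joe
tn = 5   # flavours correctly left in the freezer
fn = 3   # fruit flavours wrongly discarded (Jack, Jill and John disagree)

precision = tp / (tp + fp)          # "of what I bought, how much was right?"
recall = tp / (tp + fn)             # "of the right flavours, how many did I buy?"
error_rate = (fp + fn) / (tp + fp + tn + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"error={error_rate:.2f} F1={f1:.2f}")

# Note how the harmonic mean punishes imbalance:
p, r = 1.0, 0.01
print(2 * p * r / (p + r))  # ~0.02, nowhere near 0.5
```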


PCA – Principal Component… Astronomy?

 

Recently, for some reason, I happened to talk frequently about PCA (Principal Component Analysis), so I decided to write an article about it, because it seems a complicated subject but it is not.

I don’t think any subject is impossible to understand; when it seems so, it usually depends on two factors:

  1. Bad teachers
  2. Uninterested pupils

Since I can do nothing about the second factor (apart from instilling interest in what I’m expounding), I will try to explain the subject as best I can.

The reason why people have trouble understanding PCA is that there are two ways to explain what PCA is: one is complicated, the other one is pretty straightforward.

Let’s do it step by step.

If I had to write a book about Machine Learning, I would put PCA under the section ‘Dimensionality Reduction‘.

Most of the time we have to address the computational side of a project: we may need to speed up an algorithm, or reduce the space dedicated to the data.

But Dimensionality Reduction truly comes to our aid when we want to visualize data or to find an initial pattern in the data: in other words, to turn an unsupervised problem into a slightly more supervised one.

So here’s the easy way to describe how PCA works.

Let’s assume you are looking at a cluster of stars, gases and celestial bodies in a portion of the universe – one of those deep-space photographs where everything is a tangled clump.

Groovy…

Let’s assume you can even navigate through it using 3D images; something like this:

Navigate through the Nebula

Now you have to describe the arrangement of the relevant elements inside this cluster (i.e. the stars). It’s pretty hard to compare this clump to anything recognizable.

In other words, what we are trying to do is to find a pattern inside these data.

What if we fix a viewpoint and look at this nebula from one single, locked position?

This is exactly what happens when we look at the constellations from the Earth.

It is like creating an imaginary perpendicular plane between the observer (us) and the nebula; all the major stars are projected onto this plane and, inevitably, flattened.

To visualize what I’m saying, think of the Orion constellation: a cloud of stars sitting at wildly different depths, squashed onto the flat dome of the night sky.

So, instead of that indiscernible mass of gases and stars, we can finally say:

“Hey, that’s a moka pot… I mean, the mythological giant Orion!”

As you can guess, this is very useful for finding a pattern (in this case, a mythological huntsman with a fancy for belts), but on condition that we agree to lose part of our data.

Betelgeuse and Bellatrix, two of the most recognizable stars of the constellation, are not close at all in a 3D space, but in a 2D plane they can compose a pattern that is easier to identify.

In other words, PCA is about perspective.

We gave up some data (in this case, the depth: here’s the ‘Dimensionality Reduction‘) in exchange for a better line of sight, one that helps us synthesize a coherent perspective on the subject.

But, keep in mind that this is not the subject, it’s a perspective of the subject.

Do you remember that Norbert Wiener quote?

“The best model of a cat is a cat”

PCA is a perfect application of Wiener’s remark, and it demonstrates that what may seem a banal observation is not.

Is PCA just this ?

Yes! But usually when you want to implement a PCA the problem is more complicated.

When we talk about constellations, the plane we use to dimensionally reduce our cluster of stars is forced upon us, because our viewpoint is fixed: we are on planet Earth, and the perspective (especially in the past, without the Hubble telescope) is stuck.

But if we are talking about data, the question becomes:

  • What’s the best plane (i.e. ‘perspective’) for finding an insightful pattern in these data?

 

Now it’s time to introduce some math in our speculative description of PCA.

Between the actual position of a point and its projection on the aforementioned plane there is a distance. This distance is sometimes called the Projection Error.

The projection relies on a set of n−1 vectors that define an (n−1)-dimensional plane (e.g. 3D –> 2D, otherwise what kind of ‘dimensionality reduction‘ would it be?) to project your data onto.

PCA simply finds the plane that minimizes this Projection Error – calculated as the sum of its squares (more formally, the Mean Squared Error); or, invoking geometry again, PCA minimizes the length of the orthogonal distances between the points and the plane.

Every point represents a feature, a trait, of our phenomenon; so which statistical tool could help us capture this notion of spread around a ‘centre’?

The answer is Variance.

So the PCA algorithm sorts the candidate directions by the Variance associated with each of them and, obviously, gets rid of the last ones (i.e. the least ‘spread’ –> i.e. the least significant).
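As a minimal sketch of this idea (my own illustration, with synthetic data standing in for the nebula), here is how you could ‘flatten the stars’ with NumPy: center the 3D points, take their SVD, and keep the two directions that carry the most variance.

```python
import numpy as np

rng = np.random.default_rng(0)
# A fake 'nebula': 200 stars in 3D, more spread out in two directions than in the third.
stars = rng.normal(size=(200, 3)) * np.array([10.0, 6.0, 1.0])

X = stars - stars.mean(axis=0)        # center the data
U, S, Vt = np.linalg.svd(X, full_matrices=False)

variance = S**2 / (len(X) - 1)        # variance captured by each principal direction
print("variance per component:", variance)

flat = X @ Vt[:2].T                   # project onto the top-2 components: the 'constellation' view
print("original shape:", X.shape, "-> projected shape:", flat.shape)
```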

 


 

Ok – we got the geometrical intuition behind what PCA actually is.

Now we would like to implement an algorithm to calculate it – that is, to define it in such a way that it can be computed by a heartless CPU.

Wait a minute. I said that to define our dimensionality-reducing plane we used a set of maximum-variance directions (which, not by chance, are called Principal Components).

(SPOILER: Remember we said there are 2 ways to describe how PCA works? We’re about to take the harder path)

A set of vectors is also a matrix.

In Linear Algebra there’s a method called Singular Value Decomposition, by which any n-by-d matrix X can be written (essentially uniquely) as a product of 3 matrices with special properties, like this:

X = U Σ Vᵀ

U is an n-by-r matrix and V is a d-by-r matrix (with r = d < n); both have orthonormal columns (i.e. they act as rotations), whilst Σ is an r-by-r diagonal matrix (i.e. a scaling).

The columns of U and V are known as the left and right singular vectors, whilst the values on the diagonal of Σ are the corresponding singular values. Keep in mind this dichotomy between the vectors and their corresponding values.

Now, let’s consider the n-by-r matrix W = XV, and notice that XV = UΣVᵀV = UΣ by the orthogonality of V.

Let’s explain this: we can construct W from X by means of the transformation V: it’s the reformulation of X in terms of its principal components.

Practically, the newly constructed features are found in the columns of W: the first column holds the coordinates of the data along the first principal component, the second column along the second principal component, and so on.

This is the algebraic version of what we said geometrically above: the principal components are the directions in which X has the largest, second-largest, third-largest, … Variance.

If we use the SVD to rewrite the scatter matrix S = XᵀX in a standard form, we have:

S = XᵀX = (UΣVᵀ)ᵀ(UΣVᵀ) = VΣUᵀUΣVᵀ = VΣ²Vᵀ

This is known as the Eigendecomposition of the matrix S: the columns of V are called the Eigenvectors of S, and the elements on the diagonal of Σ² are called the Eigenvalues.
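A quick numerical sanity check of this relationship (again a sketch, with random data): the eigenvectors of the scatter matrix XᵀX match the right singular vectors V, and its eigenvalues match the squared singular values.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
X -= X.mean(axis=0)                   # center, so X.T @ X is the scatter matrix

U, S, Vt = np.linalg.svd(X, full_matrices=False)
scatter = X.T @ X

eigvals, eigvecs = np.linalg.eigh(scatter)          # eigh: symmetric matrix, eigenvalues ascending
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # sort descending to match the SVD order

print(np.allclose(eigvals, S**2))     # True: eigenvalues are the squared singular values
# Eigenvectors match V up to a sign flip per column.
print(np.allclose(np.abs(eigvecs), np.abs(Vt.T)))
```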

Don’t be scared by the names.

Eigenvectors and Eigenvalues are just an esoteric name for the dichotomy we noticed before.


Another way to describe what they represent is through Physics:

A tuple of Eigenvectors and Eigenvalues is nothing but what in Physics is called a Tensor.

As a consequence, a set (i.e. a matrix) of Eigenvectors and Eigenvalues is what in Physics is called a Tensor Field.

If you’re still alive after this long and (I hope) interesting disquisition, you will be straining at the leash to find out when, where and why it’s useful to implement a PCA process.

Since at this point the reader’s brain has probably melted, we will discuss this topic in a separate post.

I’m off to take my telescope and watch my neighbour taking a shower… I mean, the infinite mysteries of the night sky!

Cheers!
Note: While I was writing this article I found another great article (probably clearer than mine) which explains PCA:
https://georgemdallas.wordpress.com/2013/10/30/principal-component-analysis-4-dummies-eigenvectors-eigenvalues-and-dimension-reduction/

 

‘Class will out’ – in a classification problem


Defining a classification problem is pretty easy: there are scattered points, and someone wants to create distinct subsets of these points (called – guess what? – classes) to reach a more synthetic description of the variety that characterizes the reality being observed.

What am I babbling about?

Ok, let’s take an example. Let’s think about human races.

In many countries, when you apply for a job, you are asked to choose your ethnicity from a drop-down menu.

The choice might be between 3, 4, up to 10 different descriptions of races. Are there really fewer than 10 different human races in the world?

If you want to synthesize, yes.


Actually, there are only three major groups: Caucasian, Mongolian and Negroid.

Am I really asserting that a Mongol and a Maori belong to the same race?

Yes and no. Yes: they’re both Mongolian. And no: there are more than 30 sub-groups, which account for the evident internal differences within the 3 super-groups.

To make it all more complicated, there are anthropologists affirming that actually there’s only one race: the human one.

 

All this introduction was to say that, very often, the number of classes you want to find in your classification problem is not a quantitative question.

This might seem unexpected to you, but there’s a mathematical reason for it.

The central concept in a classification problem is the distance from a point that represents the whole class, called the centroid.

Being part of a class effectively means being closer to one centroid than to another.

And ‘closer‘ is not always about Euclidean distance.

Think about the Bishop in chess. Think about its possible moves.

You can be on the opposite corner of the chessboard and still be ‘closer‘ to the Bishop than a piece sitting one square to its left.
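To see how much ‘closer’ depends on the metric you pick, here is a tiny sketch (my own illustration, with invented centroids): the very same point changes class when you swap the Euclidean distance for the Manhattan one.

```python
import numpy as np

def nearest_centroid(point, centroids, metric):
    """Return the label of the centroid closest to `point` under the given metric."""
    distances = {label: metric(point, c) for label, c in centroids.items()}
    return min(distances, key=distances.get), distances

euclidean = lambda p, q: float(np.linalg.norm(np.subtract(p, q)))
manhattan = lambda p, q: float(np.abs(np.subtract(p, q)).sum())

centroids = {"A": (0.0, 4.0), "B": (2.5, 2.5)}   # made-up centroids
point = (0.0, 0.0)

print(nearest_centroid(point, centroids, euclidean))  # B wins (~3.54 vs 4.0)
print(nearest_centroid(point, centroids, manhattan))  # A wins (4.0 vs 5.0)
```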

So, what is the minimum distance that a point may have from a centroid? Easy answer: it’s zero. When a point coincides with its centroid, the error (i.e. its distance from the centroid) is zero.

Which implies that if you have 2000 points and you choose each of them as a centroid, the error rate is zero.

Vice-versa, if you pick zero centroids, your error rate will be 1 (or 100%).

This exaggeration is to say that the ‘centroid’ concept is an extreme approximation, and the number of classes (or centroids) is deeply connected with the reality (or the perceived reality) of the phenomenon you want to classify: it might be classes of elements, behaviours or peculiarities.

In other words, the first thing you need to investigate to find the optimal number of classes in your classification problem is not an algorithm, but the reality you want to describe, and we will see how.


Anyway, this does not mean it’s impossible to also tackle the question in a quantitative manner.

 

Let’s see it with an example.

Let’s say that you are asked to classify how prone a company is to default, looking only at its yearly profit.

We have to create classes of risk; then, for every class of risk, a team will be created that will spend the next year crunching zillions of data points to help the banking system avoid betting on the losing horse (translation: to avoid granting credit to companies likely to default).

This is a sample of companies’ distribution:


On the x-axis we have yearly profit, on the y-axis we have a coefficient of probability of default (where 0 = extreme stability and 1 = Enron Corporation).

The most simplistic (not to say idiotic) subdivision would be into 2 classes: risky / non-risky.


Ok, seems sensible.

A doubt: what if we decided to split our customers into 3 classes instead: low risk / medium risk / high risk?


Not bad. No, wait: isn’t the ‘medium risk’ class too heavily populated?

Yes, this is a typical classification issue when working with social data: people tend to conform. When you split social data into the classic ‘high / medium / low’ distinction, you will see the majority of your population dwell in the ‘medium’ class, roughly following a Gaussian distribution; in more prosaic words, you create a central ‘mare magnum’ group – collecting every Tom, Dick and Harry – and two other classes basically reserved for the freaks (‘outliers’, in mathematical terms).

Ah, ok: so it would be better to use 4 classes. Mmh, and what about 5? Or 6?

Here we are: is there an (automatic) way to find the optimal number of classes?

There is one.

It is called the Elbow Curve method.

 

Intuitively, as the number of classes (and corresponding centroids) increases, the error rate should decrease. As we said, if we choose k centroids, with k equal to the number of examples in our population, the error rate drops to zero.

So, let’s plot the number of classes we choose versus the error rate we observe.

(chart: the error rate drops as the number of classes grows, then flattens out after the first few classes)
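As a sketch of how such a plot can be produced (assuming scikit-learn is available and using synthetic data, not the companies above), you can run K-Means for a growing number of classes and record the total within-cluster error:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 'companies': 4 natural groups hidden in the data.
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.2, random_state=0)

errors = []
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    errors.append(model.inertia_)     # sum of squared distances to the nearest centroid

for k, e in zip(range(1, 11), errors):
    print(k, round(e, 1))
# The drop flattens around k = 4: that bend is the 'elbow'.
```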

 

We can see that every new class produces a reduction in the error rate; but after the first four or five centroids, every further addition is less and less significant.

This means that using 4-5 classes will be the most cost-effective option (not the absolute best) for our classification problem.

But, as you know, life is not a bed of roses.

Sometimes, when plotting the ‘number of classes vs error rate’ comparison, we face a situation like this:

(a nearly straight line: the error rate decreases by roughly the same amount for every class we add – no elbow in sight)

Ok, maybe this one is a little bit too extreme; but it’s good for the dramatization.

It seems that, as before, we get a constant improvement in precision (i.e. a constant reduction in error) for every new class added. In other words, we’re plotting an elbow curve, but there’s really no elbow.

So how do we choose a satisfactory number of classes?

We said that it is very important to reason about the reality we are considering. We stated that

for every class of risk, a team will be created

This means that every point in that plot is not only a coloured dot, but also a team – a team that has a cost.

Let’s assume that this table summarizes the cost to create new teams:

(table: the cost of creating each additional team)

Let’s apply these costs to the improvement in precision that every new class guarantees (cost of every new team added / improvement in precision):

(table: cost of each new team divided by the precision improvement it buys)

Now we can see how much it will cost to improve 1% in terms of precision.

If we take our ‘linearly improved’ (improbable) example and apply this ratio,

(table: cost per 1% of precision for the no-elbow example)

 

we can start to make some observations:

  1. Adding a fourth team is quite expensive
  2. After adding a fourth team, getting to six teams is very expensive (it costs almost 30% extra compared to the average cost of teams 2-3).
  3. Only a fool would add a seventh team
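The ratio itself is junior-high-school arithmetic; a tiny sketch with invented figures (not the ones in the original table) shows the kind of reasoning:

```python
# Hypothetical figures: cost of each additional team and the precision it buys.
team_costs = [100, 110, 125, 160, 190, 260, 400]      # thousands of euros, invented
precision_gain = [20.0, 9.0, 6.0, 3.0, 2.0, 1.5, 0.5] # percentage points, invented

for team, (cost, gain) in enumerate(zip(team_costs, precision_gain), start=1):
    print(f"team {team}: {cost / gain:.1f} k euros per percentage point of precision")
# When the cost per point explodes, adding yet another team/class stops making sense.
```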

 

Does this give us the Columbus’ egg answer to how many classes to use in a classification problem?

Of course not.

If you were expecting a super-complicated algorithm based on dimensionality reduction of centroid distances, you might be disappointed to receive a junior-high-school suggestion instead.

But the real lesson of this article is that you are not dealing simply with points on a Cartesian plane: those are just the representation (the synthesis) of the problem in a graphical model.

Beyond those dots and those Euclidean distances there are dozens of nuances that may play a central role in a critical decision-making moment.

Because sometimes we forget that, as Norbert Wiener once said:

“the best model of a cat is a cat”.

The Clim-Hate Change

The English expression “How many angels can dance on the head of a pin?” originates from a myth about the fall of the city of Constantinople in 1453 CE.

While the army of Mehmed II was about to conquer the city and put an end to the Eastern Roman Empire, the Byzantine theologians were busy arguing over formal minutiae, completely unaware of the gravity of their situation.

Mehmed II’s army, instead, was all about business.

During the first 10 days of December 2015, 195 countries gathered in Paris to discuss how to reduce the greenhouse gas emissions driving global warming.

The question is: is it necessary ‘to discuss’? Is global warming a fact or a hypothesis?

J. Scott Armstrong is a professor of Marketing at the University of Pennsylvania, and he is the author of ‘Principles of Forecasting‘, a cornerstone for anybody studying, ehr, forecasting methods.

This preamble aims to explain that Armstrong is not the ultimate schmuck in the field of forecasting.

In 2007 Armstrong challenged Al Gore to a bet about global warming: Al Gore is on the front line (together with the IPCC, the United Nations’ Intergovernmental Panel on Climate Change) in supporting the idea that the temperature is rising steadily, and that the rise is perceivable within a few decades.

Armstrong, on the contrary, asserts that temperatures will remain at their early-2000s level.

September 2015: what’s the result?

The climatebet.com site states:

The 93 months of the 120 month (10-year) Climate Bet so far has witnessed 45 months in which the global average temperature anomaly increased from the previous month, and 46 months in which the global temperature fell. This pattern, or lack of it, is of course consistent with the Green, Armstrong, and Soon (2009) evidence-based no-change forecast that is the basis of Professor Armstrong’s notional bet with Al Gore.


What’s the point of Armstrong’s bet?

Nate Silver, in his ‘The Signal and the Noise‘, lists 3 critiques that Armstrong and Green level at the IPCC forecasts:

First, Armstrong contends that agreement among forecasters is not related to accuracy – and may reflect bias as much as anything else. “You don’t vote“, Armstrong told me. “That’s not the way science progresses“.

Remember this last sentence. We will come back to it later.

Next, they say the complexity of the global warming problem makes forecasting a fool’s errand. “There’s been no case in history where we’ve had a complex thing with lots of variables and lots of uncertainty, where people have been able to make econometric models or any complex models work”, Armstrong told me. “The more complex you make the model the worse the forecast gets“.

Finally, Armstrong and Green write that the forecasts do not adequately account for the uncertainty intrinsic to the global warming problem. In other words, they are potentially overconfident.

Ok, so Armstrong seems to propose a certain skepticism about global warming.

But I believe his words need to be weighed carefully. Armstrong is too much of an old fox to claim that global warming is not a fact.

And he doesn’t, indeed.

Armstrong bets that during the 2000s (and only 15 years have passed) the temperature is not rising, but he has never stated that the temperature hasn’t risen over the last centuries, or at least over the 20th century.

Take a look at the temperature chart in the IPCC report, and focus on the black line on the left, which shows the 20th-century temperature record.

(If you want to go deeper into the meaning of the coloured lines and bars for the 21st century, see the original IPCC report the chart comes from.)

Only Ray Charles or a very biased mind would deny that during the second half of the 20th century the global temperature rose by about 1°C.

Well, what could have caused this perceivable increase in global temperature?


All the models seem to agree that a peak has been reached in sulfur emissions caused by human industrial activity in the late 20th century.

The chain of causation that has led to the global warming phenomenon is something that can be taught even in primary school:

20th century’s Industrial Activity increases –> Sulfur Emissions increase –> Greenhouse Effect increases –> Global Temperature increases

So, has Armstrong completely gone nuts, or does he seriously believe that global warming is not a fact?

I believe Armstrong’s bet is actually a provocation.

The problem is that in the past the time horizon of climate change was geological eras, or hundreds of years. Now we are witnessing a change on the scale of years.

What Armstrong has really done is called interpolation.

He has never talked about projections or models.

The difference between an interpolative curve and a predictive curve is that the interpolative one will show almost zero error on the training data and very high variance, because it bends to fit every training point as accurately as it can.

The interpolative curve does not give a damn about new data; it is all about the training set.

In other words, Armstrong’s bet is overfitted.
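Here is a sketch of the difference (synthetic numbers, nothing to do with the real temperature series): fit a short noisy window with an interpolating high-degree polynomial and with a plain trend line, then look at how each behaves on the years it has never seen.

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.arange(20, dtype=float)                       # 20 'years'
y = 0.02 * x + rng.normal(scale=0.15, size=x.size)   # weak trend + noise

train_x, train_y = x[:10], y[:10]                    # only the first decade is observed

interp = np.poly1d(np.polyfit(train_x, train_y, deg=9))  # interpolating curve
trend = np.poly1d(np.polyfit(train_x, train_y, deg=1))   # simple trend line

def mse(model, xs, ys):
    return float(np.mean((model(xs) - ys) ** 2))

print("train error  - interpolation:", mse(interp, train_x, train_y))  # ~0
print("train error  - trend line   :", mse(trend, train_x, train_y))
print("future error - interpolation:", mse(interp, x[10:], y[10:]))    # explodes
print("future error - trend line   :", mse(trend, x[10:], y[10:]))
```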

Why does he seem to be right?

Because only a decade has passed. The more data about global temperature we gather, the more Armstrong will lose his bet – unless we seriously start to address the causes of global warming.

But to me, Armstrong’s bet is also very witty.

His bet teaches us that as long as we give space to something so misleading, we legitimize everyone’s opinion as equally valid.

Even the people claiming that global warming is caused by cows’ flatulence.


The Phantom Menace

The debate about global warming is between two factions: the roughly 98% of scientists stating that global warming is a fact, and the roughly 2% stating the contrary.

We are under siege by climate change, and we are wasting time listening to the 2% of lunatics in the scientific population.

Does that seem sensible, or do you detect a slightly Byzantine aftertaste in it?

Remember Armstrong’s words: Science does not progress democratically. There’s no room for everyone to tell his opinion, because not every opinion has the same weight.

Many scientists and climatologists agree that stopping climate change is more about politics than about policies.

As long as we have to mediate between countries’ desiderata, afraid of disappointing someone, while the future of human society is relegated to the background, I don’t think it is possible to keep the matter on a scientific level.

If there’s only one chance to fight them, sadly it’s with their own weapon: profit.

Governments by nature will never listen to you because something is right (or because they’re basically shooting themselves in the foot); they will listen only if An Inconvenient Truth becomes convenient.

And, believe it or not, our future might depend on insurance companies.

Munich Re‘s core business has been risk management since 1880. They started to show an interest in global warming in 1974, when they recorded a strong increase in natural disasters all around the world. So they decided to found the Corporate Climate Center (CCI).

The current director of the CCI is a meteorologist, Peter Hoeppe, who, together with his team, has come to the conclusion that in a few decades it will be impossible for them to predict severe weather conditions.

The final result for Munich Re is that it will become pointless to underwrite insurance policies.

Now, Munich Re has got two possibilities:

  • Devote themselves to other business activities
  • Do their best to prevent climate change from taking place

I know that many people think that multinational companies are the devil and are first in line among the guilty parties that caused global warming in the first place, but do you remember the old adage “The enemy of my enemy is my friend”?

The most important point to remember is that COP21 ought to conclude with an active engagement between governments, policymakers and the scientific community.

Otherwise, in the future there will be no angels on the head of a pin to blame for climate disasters, only our own indolence on their eve.

Neural Networks vs. SVM: Where, When and -above all- Why


Many years ago, in a galaxy far, far away, I was summoned by my former team leader, who was clearly worried about a difficult situation. His team had developed a cool (in every sense) project for predicting alarms in refrigerator aisles. It was implemented in 2 flavours: one using a Neural Network, one using a Support Vector Machine.

– “So what?” I asked with my renowned prettiness.

– “We can’t understand which performs best.”

– “Okey… Why did you decide to implement it with a Neural Network AND with a SVM?”

– “Because we couldn’t understand which would perform best.”

Yes. That was the point. You had no idea which one to choose and, like a little kid who cannot decide between spaghetti and ice cream, you went for a dish of tomato spaghetti with Häagen-Dazs. The final result could be surprisingly good (I wouldn’t bet on it) or a terrible mistake (more likely).

Ice cream spaghetti was ACTUALLY invented in Germany. By Heinrich Himmler, I suppose.

 

The first step in understanding what differentiates NNs and SVMs is probably to understand what they have in common.

From an architectural perspective, I’d say nothing.

To understand what the two algorithms have in common, we need to go to the root of the class of problems they solve.

Mainly they help to solve multi-class classification problems.


Multi-class classification problems can be supervised or unsupervised, but think of them as asking questions like “Who are you? A car, a tree or a cat?“. A typical implementation is the captcha mechanism used to verify that you are not a robot:

(a captcha grid asking you to select all the squares containing a cat)

N.B: Neural Networks seem to have a fancy for cats.

As you may figure out from the captcha example above (if you weren’t distracted by the cats), a multi-class classification problem is nothing but a layered stack of binary classification problems.

Follow me: the previous “who are you?” problem can be represented as a cascade of binary questions – “is it a car? if not, is it a tree? if not, is it a cat?” – or, with vectors, as one-hot targets such as [1, 0, 0] for ‘car’, [0, 1, 0] for ‘tree’ and [0, 0, 1] for ‘cat’.
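As a minimal sketch of that cascade (my own illustration: the three binary classifiers are toy stand-ins, not real models):

```python
from typing import Callable, Dict

# Build a multi-class classifier out of per-class binary classifiers.
def build_cascade(binary_classifiers: Dict[str, Callable[[object], bool]]):
    def classify(x):
        for label, is_that_thing in binary_classifiers.items():
            if is_that_thing(x):          # first binary question that says 'yes' wins
                return label
        return "unknown"
    return classify

# Toy stand-ins for real models, just to make the cascade runnable.
classify = build_cascade({
    "car":  lambda x: x == "vroom",
    "tree": lambda x: x == "leafy",
    "cat":  lambda x: x == "meow",
})

print(classify("meow"))   # -> 'cat'
print(classify("woof"))   # -> 'unknown'
```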

In other words, the smallest classification problem you can address is a binary one: the most common example is a spam/non-spam filter. In such a problem an item either belongs to a set (for instance, being spam) or not (for instance, being ham), with no other options.

What is the simplest machine learning method used for binary classification problems?

The answer is: Logistic Regression.

Think about it: both the SVM and the NN cost functions include the concept of an activator; an activator is nothing but a sigmoid function, also called the logistic function: a function that squashes continuous values into the interval between 0 and 1, which can then be thresholded into a discrete decision (0 = off, 1 = on).
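The logistic function itself fits in a few lines (a sketch; the 0.5 threshold is just the usual convention):

```python
import math

def sigmoid(z: float) -> float:
    """Squash any real value into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

for z in (-4, -1, 0, 1, 4):
    p = sigmoid(z)
    print(z, round(p, 3), "on" if p >= 0.5 else "off")  # threshold turns it into a discrete decision
```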

Now prepare yourself for the big Matrix revelation: the simplest versions of the SVM (an SVM with a linear kernel) and of the NN (a single-layer NN with a single output unit) ARE ESSENTIALLY LOGISTIC REGRESSION ALGORITHMS (the linear SVM optimizes a slightly different loss, but the model is the same linear decision function).

Let’s go further: non-trivial SVMs and NNs (I mean: with non-linear kernels, or with more than one inner layer) can be thought of as A PIPELINE OF LOGISTIC-REGRESSION-LIKE COMPUTATIONS.


Still alive?

So we’ve found the connection between the two algorithms. Ah, this revelation also implies that when you have to build a very simple binary classifier in a 2-dimensional space, the choice is not a 2-player game between SVM and NN: you ought to include the third wheel as well, Logistic Regression.

Better still: in such a problem, Logistic Regression SHOULD be your first choice, following the old advice to “keep it simpler (but not dumber)“.

Now let’s get back to the core of this article: when and why to select an SVM rather than an NN (or vice versa).

Basically an SVM is an improved nearest-neighbour classifier; this means that you should be able to reduce your abstract problem to one you can visualize, consisting of clumps of points belonging to the same class – something like the multi-class scatter plot we used earlier.

Can you reduce your problem to something like that, with the classes more or less visible to your own eyes?

If so, this is a good starting point for selecting an SVM.

This is because an SVM requires you to select a kernel, and to pick an appropriate kernel you must have some underlying knowledge of how the classes are distributed.

An SVM is a maximum margin classifier, which means that rather than learning a dense set of parameters it concentrates on the few training examples that actually help define the classification boundary and discards the rest: that’s exactly the purpose of the support vectors.


Let’s jump to the other side of the table.

An NN similarly aims to find a separating hyper-plane that divides the observations of 2 classes, but unfortunately it has nothing to do with the concept of ‘being supportive‘.

The purpose of all the inner layers of an NN is purely to find this separating surface; once that goal is achieved, the NN literally stops learning. It doesn’t matter if this surface is 1 millimetre away from (most of) ‘class 1’ and 1 metre away from (most of) ‘class 2’: as long as the surface separates the classes, a single NN does not care where the boundary sits – i.e., more prosaically, it makes no effort to move it to the middle ground between the 2 classes.

The reason is that Neural Networks are not maximum margin classifiers.

Sometimes an image is better than 1000 words:
(figure: the same two classes separated by an SVM boundary running through the middle of the gap, and by an NN boundary hugging one of the classes)

So, is an NN forever doomed to generalize poorly?

For a single feedforward NN the answer is yes; but consider that you can train an ensemble of many NNets.

Returning to the initial spaghetti-ice cream example: if you chose a single feedforward NN, ask yourself why you did not choose a simpler architecture (like Logistic Regression) for such an (apparently) trivial classification problem.

As I said before, sometimes an image is better than 1000 words:

(figure: a simple two-class problem where the Logistic Regression boundary and the NN boundary look almost identical)

So, is the difference in terms of accuracy so significant? Or are you just over-complicating things?

I am saying this because training multiple NNets certainly reduces the danger of overfitting and helps your solution generalize, but it pays the penalty of being far more complicated and far more computationally onerous (it’s easy to see that training several nets requires more time and resources than training a single SVM).

So, an SVM may be faster for many data sets, but remember that this speed comes from its monolithic nature: NNets have a modular structure that can be grown or shrunk, an SVM cannot.

This matters for another factor: does your data set fit entirely into memory in a single representation?

If it doesn’t, NNets are preferable, since it’s easier to adapt their structure to learn from the data in chunks.

Ok, maybe you have read this rigmarole and want desperately to ask:

“You bored me to death, very good. Just one question: which one is easier to implement: SVM or NNets?”

Well, from an architectural viewpoint, the SVM tends to be easier (especially – as we said above – if you are somehow able to visualize the classes in your data set): basically you just need to select a type of kernel – and the list is not infinite: Gaussian (RBF), Linear, Polynomial, String Kernel, Chi-squared Kernel, Histogram Intersection Kernel; the reason is that kernels have to satisfy Mercer’s Theorem – and to calibrate a few other parameters, like C (the cost) and gamma.
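With scikit-learn, for instance, those choices boil down to a handful of arguments (a sketch on synthetic data, not the refrigerator project):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# kernel, C (cost) and gamma are essentially the whole 'architecture' of the model.
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```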

NNets need to be built up from scratch, and this includes all the structural choices like how many (hidden) layers, how many neurons in each layer, what type of neurons in each layer and, finally, the way you connect the neurons.

Continuing the dessert parallel, the SVM is like cannoli: you buy the ready-made cylindrical wafer, buy the chocolate chips, buy the chopped pistachios, and all you have to do yourself is the sweetened ricotta filling; then you just assemble all these ready-made ingredients with grace, et voilà:


SVM for everyone!

NNets are more like the French îles flottantes dessert: at the beginning you just have whipped egg whites, milk, sugar and vanilla. Then it’s all up to you to know how to create the different creams, until you reach the desired taste and texture and so on.

This freedom gives more room to the adaptability to different tasks and scenarios, but requires a higher level of awareness about what you are doing at every step.

Another point to consider is that testing and refining the hidden (internal) layers of an NN is usually quite complicated, often resulting in a typical scenario of impasse where something is wrong (usually an overfitted model) but you can’t make head nor tail of how to fix it.

So, to summarize: there is no absolute answer, as you might expect. Speaking of desserts, remember the ‘No Free Lunch Theorem‘: there is no ‘bestest’ learning algorithm; it depends a lot on the model, which is, after all, a representation of reality.

In the end, NNets and SVMs can even collaborate on a common task: ever heard of Model Ensembles?

For instance, tasks like ImageNet classification don’t start with a hand-designed good representation, and need to learn one. A solution used in the past was to let Deep Learning (i.e. NNet models) come up with a good representation, and then train an SVM on top of the resulting features. The NNets got rid of the useless features, the SVM got rid of the useless training examples. Inspiring, don’t you think?

So folks, that’s all (for the moment).

I think I’m going to have my spaghetti-ice cream.


Cheers!

In Midtjylland Stat Virtus

An old quip states that “Everybody said something was impossible, until someone came along who didn’t know that, and just did it”.

It was 2003 when the book ‘Moneyball – The Art of Winning an Unfair Game‘ by Michael Lewis came out. It chronicled Oakland Athletics general manager Billy Beane‘s revolutionary approach to baseball.

In the late ’90s and early 2000s, in a world of hot-dog-eating scouts who trusted only their own eyes’ judging ability, Beane pushed a strongly analytical (today we would use the term ‘data-driven‘) viewpoint into the club’s strategic decisions. This approach took the name of Sabermetrics (from another milestone book called ‘The Bill James Baseball Abstract‘ and SABR, the ‘Society for American Baseball Research’) and raised many eyebrows among scouts.

Moneyball charted a course for another generation of ‘statheads’.

First in line was Nate Silver with PECOTA, his system for predicting players’ development. Its strong point was the ability to forecast the aging curve, which heavily affects a player’s skill level.

Nate Silver frequently mentions one of his most emblematic successes: Dustin Pedroia. Whereas PECOTA in 2006 ranked Pedroia as the 4th best prospect, the scouting magazine ‘Baseball America‘ classified him just 77th.

The result? In November 2007 Dustin Pedroia was named Rookie of the Year, and the next season he took no less than the MVP award as the American League’s best all-around performer.

PECOTA proved able to see farther than the scouts could.

The aging curve is a refined piece of information derived from a huge amount of data, and it is, above all, something a scout will never be able to perceive, because it requires synthesizing dozens of features into a pattern – something far beyond human capabilities.

Let’s jump forward 7 years.

Matthew Benham is the owner of the English Championship club Brentford, and he’s also a millionaire who made a fortune betting on football.

In 2013, in a world of beans-and-sausages-eating scouts who trusted only their own eyes’ judging ability, Benham made an announcement that raised eyebrows so high they risked disappearing into scalps: he bought a Danish club named Midtjylland and proclaimed that the club would be run purely on a statistical approach.

Rasmus Ankersen, chairman at Midtjylland and Benham’s right-hand man (as well as a successful writer on analytical methodology applied to football), is convinced that the data-driven approach is still a no man’s land in football, and he is essentially turning his club into a laboratory for a radical and, he hopes, fruitful experiment.

My father used to say that there’s a reason if you see a no man’s land in a business: either you’re the first visionary to foresee its potential, or you’re just another imbecile who cannot see its obstacles.

And football surely presents its hurdles: a 2010 New York Times article dubbed the game “the least statistical of all major sports”.

Furthermore, football managers are generally averse to giving up control to the numbers, for a psychological reason: they basically don’t trust them. This means that a total harmony of intent is necessary from the head to the tail of the club. But this is not a problem for Midtjylland, because there the boss IS an analyst.

For example, Ankersen stated that the league table alone is an inaccurate measure of a team’s value, because “there’s just way more randomness than people understand”. For this reason, he claimed that the table position will never determine the firing of a coach; their one and only judge will always be their mathematical models.

With such an objective approach to decision-making, we expect results to be the only element that tips the balance of power.

So the question is:

Has this numerical approach demonstrated significantly positive outcomes?

Midtjylland won the Danish Superligaen by 8 points over FC Copenhagen, fielding by far the best attack (64 goals scored, with the best goal difference – 30 goals – between goals scored and conceded).

They also scored almost one goal per match from ‘set pieces‘, quite an impressive number, which testifies to the effectiveness of their ideas.


On the other hand, Nate Silver’s PECOTA system selected a list of players that generated 546 wins for their major league teams through 2011. But the players picked by ‘Baseball America‘ (as we said, a magazine largely based on scouts’ rankings) did better, producing 630 wins. That 15% better performance was worth an overall $336 million in value, not exactly a walk in the park.

In conclusion, obtaining effective results is not only a matter of being a visionary or an imbecile (even if it may help). It is about mastering the three great talents of a good data scientist:

1) To separate the signal from the noise

2) To account for the context in statistics

3) To separate out skill from mere luck

Orient (DB) Express

April, 2015

In recent years there has been growing attention around the so-called ‘NoSQL’ databases.

This name identifies a generation of databases built on premises alternative to the usual relational (ER) databases: they can be graph databases, document databases or key-value databases.

Originally they took a strongly anti-SQL stance; this inflexibility was one of the reasons for the interest in them, but also their limit: the world was still largely using SQL, and an environment that excluded SQL was a cost rather than an investment.

The ones that evolved and decided to adopt SQL, and other points of convergence with the outside world, are still alive and fighting alongside us. A side effect is that the term ‘NoSQL’ no longer fits, and ‘Not Only SQL’ seems more appropriate.

Old-school relational databases dealt with quite a rigid data architecture: basically entities with attributes, connected to other entities. To use a specific term, it was a world that was easy to normalize.

Modern-day databases live in a jungle: unstructured data (for instance, documents to be semantically parsed) in large amounts (often in a swiftly changing environment, a.k.a. ‘high costs for join operations‘) that require great fluidity in their representation. It is no longer about normalizing data; we need something polymorphic, something that, rather than descending from canonical database structures, belongs to Object-Oriented modelling.

Since we are still talking about database architecture, we also need something that preserves the ACID integrity of data – or at least most of it.

One of these ‘Not Only SQL‘ databases is OrientDB, an Italian project, written entirely in Java and implementing interesting NoSQL features, especially for document-oriented environments.

OrientDB’s main quality is its versatility:

  • It’s a document db but at the same time a graph db (implementing the Tinkerpop Blueprints framework), a key-value db and also an object db
  • It supports ACID transactions
  • It supports SQL statements (remember that speech about ‘Not Only SQL’ db’s? – see the sketch right after this list)
  • It natively supports HTTP, RESTful and JSON requests
  • It’s portable: the engine runs 100% on a JVM
  • It’s self-sufficient: it has an Apache 2.0 License (no fees or royalties required) and no dependencies on other software
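To give a taste of that versatility, here is a small Python sketch of the kind of SQL-plus-graph statements OrientDB accepts; `execute` is a hypothetical helper standing in for whatever driver or REST call you would actually use, and the statements should be double-checked against the official documentation.

```python
# Hypothetical helper: in a real project this would go through an OrientDB
# driver or its REST interface; here it only collects the statements.
def execute(statement: str) -> None:
    print("would run:", statement)

# Document + graph flavour in the same (SQL-like) dialect.
execute("CREATE CLASS Person EXTENDS V")
execute("CREATE CLASS FriendOf EXTENDS E")
execute("CREATE VERTEX Person SET name = 'Ada', city = 'Torino'")
execute("CREATE VERTEX Person SET name = 'Bruno'")
execute("CREATE EDGE FriendOf FROM (SELECT FROM Person WHERE name = 'Ada') "
        "TO (SELECT FROM Person WHERE name = 'Bruno')")

# A plain SQL-looking query and a graph traversal over the same data.
execute("SELECT FROM Person WHERE city = 'Torino'")
execute("SELECT expand(out('FriendOf')) FROM Person WHERE name = 'Ada'")
```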

Unfortunately, it’s not all a bed of roses.

Even though OrientDB’s peculiar data structure (called MVRB-Tree, a specialization of the classical red-black tree) is claimed to be quite efficient for ‘read‘ operations, it shows computational limits when performing ‘update‘ operations or, worse, a long succession of ‘update‘ operations – something not exactly negligible when we are talking about a database.

The Open Journal of Databases (OJDB), in a 2014 article by Abramova, Bernardino and Furtado (Coimbra University, Portugal), compared different document databases against many different workloads.

In the final summary chart – excluding Redis, an in-memory key-value store that loads data entirely into volatile memory and for this reason plays on a different pitch – our friend OrientDB pays a heavy penalty for all its versatility: it turns out to show the worst performance.

Using their words:

“[Orient DB] requires more system capacities compared with the capacities provided by the environment used in the evaluation. Therefore, their performances were limited by the memory management, the operating system and the use of virtual machine environment”

So, as was easily predictable, every rose has its thorns.

For a more complete and pleasant (never a despised quality) series of articles about OrientDB and life side by side with it, I recommend this blog:

http://odino.org/blog/categories/orientdb-101/