The importance of being Twitter

March, 2015

For years researchers around the world have striven to find tangible proof of black holes influencing time. But an observable black hole that bears witness to this behaviour has always been right under their eyes: Facebook.

Picture the situation: you open your FB page in all innocence, you take a look at what your friends have been up to in the last few days, and – BOOM – you fall into an Einstein-Rosen bridge and when you look up it’s already sunset.

Yes, Facebook is a brilliant time-wasting machine, with very few useful applications in everyday work – except, of course, for people working in the social sciences or in advertising (which could probably be regarded as the social science of the 2000s).

I believe the real problem of our times is categorization coupled with inattentiveness. This gives birth to ignorance, and to the laziness to contain it. Social media are guilty of promoting this dreadful combination. Everyone gets their Warholian 15 minutes of fame, everyone can hold forth on everything, from technology to politics, from religion to science, with the cultural depth of a sports bar.

This mental impoverishment seems to have repercussions on technology itself. In a 2012 essay entitled ‘Is U.S. Economic Growth Over?’, Robert Gordon compares the impact of computer science with the effects of the Second Industrial Revolution (late 1800s), which introduced power plants, the light bulb, the internal combustion engine, the telephone, radio, recorded music and cinema.

We got along with those inventions until the 1970s, the eve of the computer revolution. Since then, our economy has kept a steady 2% growth rate. In the meantime mainframes started to crank out bank reports and bills, ATMs replaced bank tellers and barcode scanners replaced checkout clerks.

Then, around the year 2000, the iPod replaced the Walkman, smartphones replaced cell phones, and cars started shipping with advanced devices like ABS, ESP and six airbags – yet they still move thanks to a nineteenth-century combustion engine.

These innovations were welcomed with enthusiasm, but they are mostly further incentives to consumerism and time wasting. They don’t contribute to productivity: on the contrary, they probably reduce it and steal time from higher-level interests.

An electric bulb changed the world; Tumblr is probably just a platform for sharing pictures of a cat that resembles Orson Welles.

How do we get out of this vicious circle of mediocrity?

‘Cum grano salis’ – with a grain of salt, and a little bit of wisdom.

Social media are not the devil incarnate. They are just a tool. It’s up to us to get the best out of them.

For instance, let’s speak about Twitter.

Twitter is a magnet for people with a huge ego who want to pollute the world with their fluff and fuss – as if the world cared what they’re about to eat tonight.

But Twitter can also be a goldmine for keeping in touch with some of the most exciting projects and best minds around the world.

Some examples?

Looking for people?

I personally follow Andrew Ng (@AndrewYNg), Joel Spolsky (@spolsky), Chris Olah (@ch402), Elon Musk (@elonmusk), Gael Varoquaux (@GaelVaroquaux).

Want inspiring companies?

IBM Watson (@IBMWatson), Numenta (@Numenta), Lumiata (@lumiata), Medidata (@Medidata)

Tools and Systems for Data Science?

FastML Extra (@fastml_extra), Yhat (@YhatHQ), Startup.ML (@startupml)

Social networks are a time-wasting machine. Do you want to break the circle? Learn to master your network and milk the best out of it.

Select your interests and make your time profitable: every morning, right at your fingertips, go to your Home feed and – taaaac! – Twitter will provide you with an updated newsletter on what’s new in your main interests.

Google got a Flu (but some Analysts are Sick too)

May, 2014

Google FluTrends is a project that aims to predict influenza trends worldwide (though mainly in the US) with a lead time of nine weeks.

FluTrends mines web searches, gathering information about the frequency of flu-related terms like ‘influenza’, ‘sickness’, ‘cough’, ‘fever’ and so on. These terms are “good indicators of flu activity”, as Google’s analysts put it.

The most noteworthy thing, though, isn’t this typically cool Google project itself, but how poor its results are. David Lazer (nomen omen), professor at Northeastern University, in an article entitled The Parable of Google Flu: Traps in Big Data Analysis, sniped at Big G’s flu models, basically because they overestimated the actual number of flu-related doctor visits by roughly a factor of two.

Lazer compared FluTrends’ yearly predictions with the official figures from the Centers for Disease Control and Prevention, and pointed out that Google’s predictions were quite inaccurate; worse yet, he showed that a simple linear projection of the CDC’s own data was enough to beat FluTrends’ results.
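(Just to make the term concrete: by ‘linear projection’ Lazer means a baseline that simply extends the CDC’s own recent data forward. Here is a minimal sketch in Python of how one could compare such a baseline and FluTrends against the CDC figures – file names and column layouts are invented for illustration, this is not Lazer’s actual code.)

import pandas as pd

# Hypothetical weekly files: CDC ILI rates and FluTrends estimates
# (file and column names are made up for the sake of the example).
cdc = pd.read_csv("cdc_ili_weekly.csv", parse_dates=["week"])    # columns: week, ili_rate
flu = pd.read_csv("flutrends_weekly.csv", parse_dates=["week"])  # columns: week, estimate

df = cdc.merge(flu, on="week").sort_values("week")

# Naive baseline in the spirit of Lazer's critique: extend the last two
# CDC observations in a straight line to get next week's value.
df["linear_projection"] = df["ili_rate"].shift(1) + (df["ili_rate"].shift(1) - df["ili_rate"].shift(2))

valid = df.dropna()
mae_flu = (valid["estimate"] - valid["ili_rate"]).abs().mean()
mae_lin = (valid["linear_projection"] - valid["ili_rate"]).abs().mean()
print(f"FluTrends MAE: {mae_flu:.3f} | Linear projection MAE: {mae_lin:.3f}")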

I took some time to check Google’s predictions myself – and yes, even my aunt would figure out that they are quite off track.

But I wrote this article to bring into focus a couple of things.
First of all, Google’s results are presented badly. I don’t mean that they don’t look cool – I mean they are quite unintelligible.

Here’s a capture:

[Figure: GoogleFluChart]

What does the y-axis represent? Flu activity?

Low – Moderate – High? What do you mean by ‘Moderate’?

Where are the numbers? How can I compare this chart with one coming from a different source?

When you present a prediction, you should also show the actual data, to make the error between your model and reality clear.

Where are the actual data?

If I presented a chart like this, without any feedback about expected result vs. actual result, every reader would share a simple thought:

“You know you sucked and you don’t want to make it clear”.
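(For the record, showing predictions next to the ground truth takes a handful of lines. A minimal sketch with pandas and matplotlib – file and column names invented for illustration:)

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical weekly data holding both the estimate and the official figure.
df = pd.read_csv("flutrends_vs_cdc.csv", parse_dates=["week"])

plt.plot(df["week"], df["cdc_ili_rate"], color="black", label="CDC (actual)")
plt.plot(df["week"], df["flutrends_estimate"], linestyle="--", label="FluTrends (predicted)")
plt.ylabel("ILI rate (% of doctor visits)")  # real numbers, not 'Low / Moderate / High'
plt.legend()
plt.show()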

Second thought, dedicated to Prof. Lazer.

If you look at Google’s predictions, you will figure out that, as time goes by, FluTrends’ results improve.

[Figures: Comparison_10_11, Comparison_11_12, Comparison_13_14]

Specifically, in 2013-2014 FluTrends did a good job, correctly predicting the magnitude of the first spike (December 2013) as well as a second wave of influenza in March 2014.

This peculiar behaviour of the model (a sloppy start, like a diesel engine warming up) isn’t surprising, for at least two (interconnected) reasons.

1. This model belongs to a class of systems known as agent-based. Agent-based models rely not on real data but on a filtered image of reality. The starting point is a survey or, as for FluTrends, a social-network representation of reality. It is a distorted image, like looking through the bottom of a glass: people claim to wash their hands more often than they really do, and on social networks nobody is ever fired, unemployed or a loser.

So FluTrends started with a social bias: people search the internet when they feel the early symptoms of flu; but if it turns out not to be flu, it is very unlikely they will search for topics like “I thought I had flu but I just overate”.

In short, FluTrends draws a picture of all the people who think they have the flu, not of all the people who are actually ill.

2. What’s more, agent-based models cannot represent reality as a whole, only a part of it. The full picture is then painted by extrapolating from this partial knowledge.

And extrapolation, unfortunately, can lead to pratfalls.

In the early 1980s the number of AIDS cases was growing exponentially. Some US scholars tried to extrapolate a prediction from the first four years of data: the most optimistic forecast turned out to guess only about half of the actual number of AIDS cases in 1995 (560,000); the pessimists, on the other hand, painted a scenario that would essentially have led to the extinction of the human race.

[Figure: AIDS_trends]
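(To get a feeling for how treacherous this kind of extrapolation is, here is a toy sketch – the numbers are invented, not the historical AIDS counts – that fits an exponential to the first four ‘years’ and projects it forward:)

import numpy as np

# Toy case counts for the first four observed years (made-up values).
years = np.array([0, 1, 2, 3])
cases = np.array([300, 1000, 3200, 9000])

# Fit exponential growth: log(cases) ~ a + b * year.
b, a = np.polyfit(years, np.log(cases), 1)

for horizon in (5, 10, 15):
    print(f"year {horizon:2d}: ~{np.exp(a + b * horizon):,.0f} projected cases")

# A small change in the estimated growth rate b explodes over a long horizon,
# which is why early projections can range from 'half the real figure'
# to 'essentially everyone infected'.
for delta in (-0.1, 0.0, 0.1):
    print(f"b = {b + delta:.2f}: year 15 -> ~{np.exp(a + (b + delta) * 15):,.0f}")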

Continuous improvement based on new data eventually produced models that are nowadays quite accurate.

So, what’s the final verdict on Google’s FluTrends project?

Despite its clumsy data reporting (a big mistake, because it leaves room for insinuations), FluTrends might have some potential, if its analysts have the patience to let it mature.

Another key might be to interpret FluTrends’ predictions as some kind of worst-case-scenario model.

‘Show me your company and I’ll show you who you are’

March, 2014

When I was at university, I had a very amusing and offbeat researcher as a tutor for the “Probability and Statistics” class. One day, during his lesson, he taught us a life lesson about his subject that I have never forgotten.

The lesson started by stating that many studies support the plausible idea that there is a clear correlation between smoking cigarettes and lung cancer.

Starting from a sizeable dataset, the researcher illustrated one of these studies, showing that the percentage of people with lung cancer who had a history as smokers was high (I cannot remember the exact value, but something meaningful – more than 75%). It seemed like clear proof of direct causation between the two facts.

Then he explained that the now-common term ‘Bayesian’ was popularized by an eminent statistician named R.A. Fisher, and originally carried a denigrating connotation.

[Photo: R.A. Fisher]

R.A. Fisher, besides not believing in Bayesian probability, also didn’t believe that smoking could cause lung cancer, and in a masterful rebuttal he showed that, starting from the same dataset, the percentage of smokers who actually died of lung cancer was incredibly low.

Then, this researcher asked:

“Do you know why R.A. Fisher weighed in on this subject?

He was a smoker. And he was a consultant paid by the tobacco companies”.

The life lesson was:

“Every time you see a probabilistic value, ask yourself who stands behind it, and then ask yourself why”.

Now back to our days.

One of the best metrics for evaluating a forecast is probably calibration. Calibration compares the probability you assign to an event (a probabilistic value) with the frequency at which that event is actually observed (a statistical value).

Let’s take an example: you predict a 60% probability of rain over a certain time interval; if the actual observed frequency is 55%, your predictions are well calibrated. If it turns out to be 20%, maybe it’s time to review your model (or your work, or the damn weather in general).
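(Checking calibration is straightforward: group your forecasts by the probability you announced and compare each group with how often the event actually happened. A minimal sketch, on an invented forecast log:)

import pandas as pd

# Invented forecast log: one row per day, with the announced probability
# of rain and whether it actually rained (1) or not (0).
log = pd.DataFrame({
    "forecast_prob": [0.6, 0.6, 0.6, 0.6, 0.6, 0.9, 0.9, 0.9, 0.1, 0.1],
    "rained":        [1,   1,   0,   1,   0,   1,   1,   1,   0,   0],
})

# Announced probability vs. observed frequency: the two should roughly match.
calibration = log.groupby("forecast_prob")["rained"].agg(observed_freq="mean", days="count")
print(calibration)
# A 0.6 forecast with an observed frequency near 0.6 is well calibrated;
# 0.6 announced against ~0.2 observed means it's time to review the model.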

J.D. Eggleston lived with his son in Kansas City. Kansas City has peculiar weather – varied and extreme, with hellish summers, droughts and tornadoes (as every reader of ‘The Wizard of Oz’ knows).

One day Eggleston asked himself how good the local TV weather forecasts were. They turned out to be really poor (see the figure):

[Figure: Local TV forecasts]

The question is: why do they prefer to present such a miserable result (when they declared a 100% forecast probability, they ended up being right only 67% of the time) instead of simply using – for instance – the National Weather Service, which turns out to be far more reliable and, above all, free?

[Figure: National Weather Service forecasts]

The answer lies in this payoff matrix, which reflects the way TV weather forecasters usually perceive their job:

[Figure: payoff matrix]

In a payoff matrix, every choice in a decision-making process has a cost.

If your prediction about tomorrow turns out to be accurate (like the first and third rows of the table), that is good. Unfortunately, that is also what you are paid for, so it is positive but not impressive: let’s give it a score of +100.

If, instead, your prediction about tomorrow turns out to be wrong, we face two very different scenarios.

On the one hand, we have a false positive: it was supposed to rain but it doesn’t. Despite being a mistake, this situation can be perceived by viewers as positive: “I planned to spend my Sunday at home watching football, but it turned out to be a sunny day, so I’m going out with a smile on my face” (if this sounds incredible to you, please see the Wikipedia article on Serendipity, and take life a bit easier, man).

So this is positive anyway: let’s give it a score of +50.

On the other hand, we have a false negative: it was not supposed to rain but it actually is raining. “Damn weather forecasts! You ruined my weekend and I hate you! I will never trust you again!” This is by far the worst possible scenario, and we give it a score of -1000.

With such a payoff matrix, your main effort is simply to avoid ending up in this last scenario: any other outcome is positive, even the one where you are wrong with a false positive.
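(You can see the incentive directly by computing the expected score of each possible forecast under the +100 / +50 / -1000 payoffs above, for a few hypothetical values of the true probability of rain:)

# Payoffs as perceived by the TV forecaster (the scores from the text above).
CORRECT = 100           # forecast matches reality
FALSE_POSITIVE = 50     # announced rain, stayed dry: viewers pleasantly surprised
FALSE_NEGATIVE = -1000  # announced sun, it rained: viewers furious

def expected_score(announce_rain, p_rain):
    # Expected payoff of a forecast when the true probability of rain is p_rain.
    if announce_rain:
        return p_rain * CORRECT + (1 - p_rain) * FALSE_POSITIVE
    return (1 - p_rain) * CORRECT + p_rain * FALSE_NEGATIVE

for p in (0.02, 0.10, 0.30, 0.60):
    best = "rain" if expected_score(True, p) > expected_score(False, p) else "sun"
    print(f"true P(rain) = {p:.2f} -> announce {best}")
# With these payoffs, announcing rain wins unless P(rain) is below roughly
# 1 in 23 (about 4%), which is exactly why the local forecasters cried rain so often.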

Now, do you understand why the local meteorologists sucked so badly at their predictions?

They did it on purpose; it was a matter of calibration: even when there was a significant probability that it would not rain, they chose to be cautious and state that it might rain anyway. The threshold for declaring nice weather was intentionally pushed to high levels of certainty: only when they were extremely sure it would be a nice sunny day did they forecast a nice sunny day.

So, as my university tutor taught me, in probability and statistics some results are not just a potshot, and “every time you see a probabilistic value, ask yourself who stands behind it, and then ask yourself why”.

New blog (a.k.a: WordPress Is The New Black)

I decided to move my blog to WordPress, a more fashionable and professional platform (as far as I can claim to be fashionable..).

I’m taking the chance to also republish here some old articles I wrote elsewhere around the web; so I will make the original writing date explicit at the top, which may not match the WordPress publication date.

Ah, this will probably also be reflected in the content of my articles, which may wind up flitting from topic to topic like a drunk butterfly (indeed, it might just be proof that my mind works that way).