Google FluTrends is a project that aims to predict influenza trends worldwide (mainly in the US, anyway) nine weeks in advance.
FluTrends mines the web, gathering information about the frequency of flu-related search terms like ‘influenza’, ‘sickness’, ‘cough’, ‘fever’ and so on. These terms are “good indicators of flu activity” – as Google analysts state.
The most noteworthy fact, anyway, isn’t this typically cool Google project, but how poor its results are. David Lazer (nomen omen), professor at Northeastern University, in an article entitled ‘The Parable of Google Flu: Traps in Big Data Analysis’, sniped at Big G’s flu models, basically because they predicted roughly double the actual number of medical visits.
Lazer compared FluTrends’ yearly predictions to the official outcomes from the Centers for Disease Control and Prevention, and pointed out that Google’s predictions were quite inaccurate; worse yet, he demonstrated that a simple linear projection of the CDC’s own data was enough to improve on FluTrends’ results.
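Lazer’s baseline is simple enough to sketch. Here is a minimal illustration with made-up weekly numbers (the real comparison uses the CDC’s weekly influenza-like-illness data): predict the next value with a least-squares linear fit on lagged observations.

```python
# Sketch of a "linear projection" baseline: predict next week's flu
# activity from a linear fit on the CDC's own recent reports.
# The numbers below are hypothetical, purely for illustration.

def linear_projection(history, lag=1):
    """Fit y_t = a * y_{t-lag} + b by least squares and
    return the prediction for the next time step."""
    xs = history[:-lag]
    ys = history[lag:]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var if var else 0.0
    b = mean_y - a * mean_x
    return a * history[-lag] + b

# Hypothetical CDC-style weekly ILI percentages
cdc = [1.2, 1.5, 1.9, 2.4, 3.0, 3.8]
print(round(linear_projection(cdc), 2))
```

The point isn’t that this tiny model is good; it’s that a baseline this trivial, built only on the official data, set a bar that FluTrends failed to clear.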
I took some time to check Google’s predictions myself – and yes, even my aunt could tell that they are quite off track.
But I wrote this article to bring into focus a couple of things.
First of all, Google’s outcomes are presented badly. I don’t mean that this stuff doesn’t have a cool look – I mean it is almost unintelligible.
Here’s a capture:
What does the y-axis represent? Flu activity?
Low – Moderate – High? What do you mean by ‘Moderate’?
Where are the numbers? How can I compare this chart with another coming from a different source?
When you present a prediction, you should also show the actual data, to make the error between your model and reality clear.
Where are the actual data?
If I presented a chart like this, without any feedback about expected result vs. actual result, every reader would share a simple thought:
“You know you sucked and you don’t want to make it clear”.
Second thought, dedicated to Prof. Lazer.
If you look at Google’s predictions, you will notice that, as time goes by, FluTrends’ results improve.
Specifically, in 2013-2014 FluTrends did a good job, properly predicting the magnitude of the first spike (Dec 2013) as well as a second influenza wave in March 2014.
This model’s peculiar behaviour (a sloppy start followed by steady improvement – a ‘diesel’ attitude) isn’t surprising, for at least two (interconnected) reasons.
1. This model belongs to a class of systems called agent-based. Agent-based models rely not so much on real data as on a filtered image of reality. The starting point is surveys or, as with FluTrends, a social-network representation of reality. It is a distorted image, like looking through the bottom of a glass: people claim to wash their hands more often than they really do, and on social networks no one is ever fired, unemployed or a loser.
So, FluTrends started with a social bias: people search the internet when they feel the early symptoms of flu; but if it’s not really flu, they are very unlikely to search for topics like “I thought I had flu but I just overate”.
In a few words, FluTrends draws an image of all the people who think they have the flu, not of all the people who are actually ill.
2. What’s more, agent-based models are unable to represent reality as a whole, only a part of it. Reality is then painted by extrapolating from this partial knowledge.
And extrapolation, unluckily, can lead to pratfalls.
In the early ’80s the number of AIDS cases was growing exponentially. Some US scholars tried to extrapolate a prediction from the first four years’ trend: the most optimistic forecast turned out to guess just half of the actual number of AIDS cases in 1995 (560,000); on the other hand, the pessimists painted a scenario that would essentially have led to the extinction of the human race.
Continuous improvement based on new data eventually produced models that are nowadays quite accurate.
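The mechanism behind those early misses is easy to reproduce with synthetic numbers: fit an exponential to the first few points of an epidemic that actually follows a saturating (logistic) curve, then extrapolate far beyond the observed window. All curve parameters below are made up for illustration.

```python
import math

# Toy illustration (synthetic data): the "true" epidemic follows a
# logistic curve, but we only observe its first few years, when growth
# still looks exponential.

def logistic(t, cap=560.0, rate=0.9, t0=8.0):
    # "True" cumulative cases (thousands), saturating at `cap`
    return cap / (1.0 + math.exp(-rate * (t - t0)))

# Observe years 0..3 only, like the early-'80s forecasts
early = [(t, logistic(t)) for t in range(4)]

# Fit y = A * exp(b * t) by linear regression on log(y)
n = len(early)
ts = [t for t, _ in early]
logs = [math.log(y) for _, y in early]
mean_t = sum(ts) / n
mean_l = sum(logs) / n
b = sum((t - mean_t) * (l - mean_l) for t, l in zip(ts, logs)) / \
    sum((t - mean_t) ** 2 for t in ts)
A = math.exp(mean_l - b * mean_t)

t = 15  # extrapolate far beyond the observed window
print(f"exponential forecast: {A * math.exp(b * t):.0f}k")
print(f"actual (logistic):    {logistic(t):.0f}k")
```

The exponential fit matches the early points almost perfectly, yet its long-range forecast overshoots the saturating curve by orders of magnitude – the same trap as extrapolating from a partial image of reality.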
So, what is the final verdict on Google’s FluTrends project?
Despite its clumsy data reporting (and this is a big mistake, because it leaves room for insinuations), FluTrends might have some potential, if its analysts have the patience to let it mature.
Maybe another key is to interpret FluTrends’ predictions as a kind of worst-case-scenario model.