Saturday, March 15, 2014

Google Flu Trends as a lesson in big data prediction

A recent article in the science section of TIME magazine reports that prediction using "big data" techniques is not as easy as it is often portrayed. It analyzes the Google Flu Trends case, in which the assumption has been that there is a strong correlation between the spread of flu and the volume of searches for flu-related terms in Google. It seems that this assumption does not produce accurate results. The article claims that while big data methods are useful, they should be combined with traditional "small data" methods. There are various definitions of what small data is. For example, the one from the "small data group" reads: "Small data connects people with timely, meaningful insights (derived from big data and/or “local” sources), organized and packaged – often visually – to be accessible, understandable, and actionable for everyday tasks."

I think this also relates to the need to understand causality, and not only statistical correlation, a topic I have discussed before on this blog.
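
To make the correlation point concrete, here is a minimal sketch in Python. It is not how Google Flu Trends works and it is not taken from the TIME article. All the numbers are made up for illustration, and the scenario (search volume doubling in one week while actual cases stay flat) is a hypothetical assumption of mine. The sketch fits a straight line from search volume to flu cases on a few training weeks, then shows how the same line overestimates when searching behavior changes for reasons unrelated to illness.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Made-up training weeks: flu-related searches and reported flu cases.
train_searches = [100, 200, 300, 400, 500]
train_cases = [11, 19, 31, 39, 50]

slope, intercept = fit_line(train_searches, train_cases)
r = pearson(train_searches, train_cases)

# Hypothetical later week: search volume doubles (say, because of news
# coverage), while the actual number of cases stays where it was.
later_searches = 1000
actual_cases = 50
predicted_cases = slope * later_searches + intercept

print(f"correlation on training weeks: {r:.3f}")
print(f"predicted cases: {predicted_cases:.1f}, actual cases: {actual_cases}")
```

On the training weeks the correlation is very strong, yet the prediction for the later week is roughly twice the actual figure. The model captured a statistical association, not the cause of the searches, which is where "small data" and domain knowledge can help.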

Sunday, March 9, 2014

Big Data analytics by Robin Bloor

Today I had to give students in a seminar an introduction to big data analytics, and I chose a recent presentation by Robin Bloor (from slideshare). Bloor states that the term "data science" is a misnomer, since all science is empirical and involves analysis of data. This is true for many of the sciences; still, if my memory does not mislead me, Einstein did not use empirical analysis of data to come up with the theory of relativity. It also touches on the discussion of causality vs. correlation in science. In any event, Bloor asserts that data science is actually a multidisciplinary effort that involves software engineering, statistics and domain knowledge.

BI, according to this presentation, is partitioned into:
  • Hindsight: regular reporting
  • Oversight: dashboards, etc.
  • Insight: data mining & statistical analysis
  • Foresight: predictive analytics
He does not get as far as prescriptive analytics, and puts most of the weight on insight.
The second part of the presentation gives a quick introduction to machine learning. Overall, it offers an introductory-level view of deriving insights from big data, and is well presented as such.