Showing posts with label big data.

Tuesday, October 21, 2014

Michael Jordan on the delusions of big data

Michael Jordan (the one from Berkeley, not the basketball player) gave an interesting interview to IEEE Spectrum. It is worth reading his own words.

Some of the highlights of Jordan's opinions are:


  1. Using brain metaphors for computing is misleading: computing does not work like the brain. This also covers one of Jordan's areas of expertise - neural nets.
  2. He says that the advances in computer vision let us solve some kinds of useful problems, but we are very far from giving machines the vision capabilities of a human.
  3. "Big Data" is over-promising. One can prove many false hypotheses using big data methods. This is similar to building bridges without a theory of how to build bridges: some may survive, and some will collapse...
  4. If he had $1B to spend on research, he would invest in natural language processing...


I think that it adds to some other observations about the over-hype of "big data" (for example, see my posting on Noam Chomsky's opinion a couple of years ago, or Tim Harford's recent article).

Saturday, June 14, 2014

On data centric, decision centric, and situation centric - a response to Chris Taylor's "time and effort we waste on big data"

Sometimes there are scientific truths. Nicolaus Copernicus coined the "heliocentric hypothesis", which states that the Earth revolves around the sun, and not vice versa. His hypothesis was proven to be a scientific fact.

The question of which orientation should be at the center is often a matter of dispute. In a past post on this blog, I wrote about the dispute between Plato, who advocated a society-centric approach, and Aristotle, who advocated an individual-centric approach.

Chris Taylor recently wrote a post on the "Real-time & Complex Event Processing" site entitled "The time and effort we waste on big data". Chris used the analogy of the Tower of Babel and criticized the effort invested in accumulating data within large warehouses - the "data centric" approach - advocating instead a "decision centric" approach: architect "big data" around decisions, identify the required decisions first, and then manage data as part of the decision architecture, making it decision centric.

Let me add another viewpoint here.

If we look at the sources of big data in 2015, we'll see that most of the data will come from sensors, with social media as the second source, while enterprise data, the more familiar world, becomes the minority. If we look at the value of data in the "Internet of Things", one of its main values is the ability to detect situations and act upon them (in either a reactive or proactive way). Thus the center is neither data nor decisions; it is about situations. It becomes situation centric, and the architecture is organized around which situations we wish to identify, then what data we need for that, and sometimes also what decisions we need when the situation is detected (note that the decision can be trivial, since when a situation occurs there may be a single action associated with it, so it is not necessarily decision centric).

We have mentioned data-centric, decision-centric, and situation-centric.   Maybe one of the conclusions we can draw from Chris' analogy of "The Tower of Babel" is that there is no single viewpoint.  

Sometimes there is a need to accumulate data without a priori knowledge of what it will be used for. Medical data, for example, can be accumulated and lead to unexpected results, which will drive new types of decisions and/or new situations we'll wish to identify. In this case the data-centric approach is valid.

In an organized world of structured processes with well-defined decisions, the decision-centric approach makes sense. As an example, when the main process is credit approval, this is a well-defined decision that centers both processes and data around it.

In the new world based on the "Internet of Things", the situation-centric approach might become more dominant, and if we look at where big data really is, we'll see more and more of the situation-centric view in the universe.

Unlike the "heliocentric hypothesis" which is a scientific fact,  we don't have single scientific truth, but when anybody invests time and effort on big data, one has better to sort out what is the best value, instead of assuming that accumulating data is the value. 





Thursday, April 3, 2014

Big data - are we making a big mistake


I came across an interesting article by Tim Harford in FT Magazine. This article is in line with several posts I have made on this blog, which express some skepticism about the ability of merely looking at past statistical correlations to create "big insights". Harford brings some examples of that and concludes that there are some naive beliefs around the big data hype. I'll keep writing more insights about this topic.

Friday, March 28, 2014

More from the Big Data workshop -- crowd wisdom vs. expert wisdom

Yesterday I spent all day at the second day of the Technion Computer Engineering Center workshop on big data. There were a few interesting talks, and the organizers promised to put the slides of all talks on their website (eventually). I chose to write about an interesting talk given by Tova Milo, from Tel-Aviv University. Tova talked about her work on crowd wisdom, and also presented a video in which a contestant in a TV show who did not know an answer used the "ask the audience" option, followed the audience to the wrong answer, and was out. The talk discussed some means of knowledge acquisition and how to phrase questions. The examples she gave were: what to do when I have a headache, and looking for a children's attraction in NYC with a nearby child-friendly restaurant.

I asked her whether in the case of a constant headache it is not better to ask an expert physician, and her answer was that people trust crowd wisdom more than they trust their physicians; well, I think it is a function of who the person is, and who the physician is. When we planned our trip to New Zealand, we could have used crowd wisdom (there is a lot of material on the web, of course), but we chose to go to an expert travel advisor and ask for a trip plan (including all travel arrangements). It certainly saved us time, but if one has enough time, getting advice from the crowd is useful. I wonder if somebody has researched the trade-offs between expert wisdom and crowd wisdom, and classified the cases in which each should be used.

Thursday, March 27, 2014

My talk in the Technion Big Data workshop




Yesterday, I gave a talk in the Technion Computer Engineering Big Data days. The talk dealt with three topics: why the Internet of Things has not happened yet, a very brief introduction to "The Event Model", and a first introduction of the Technological Empowerment Institute. I'll write more about the institute soon.


Saturday, March 15, 2014

Google flu trends as a lesson in big data prediction

A recent article in the science section of TIME magazine reports that prediction using "big data" techniques is not as easy as portrayed. It analyzes the Google Flu Trends case, in which the assumption has been that there is a strong correlation between the spread of flu and searches for flu-related terms in Google. It seems that this does not produce accurate results. The article claims that while big data methods are useful, they should be combined with traditional "small data" methods. There are various definitions of what small data is; for example, the one from the "small data group": small data connects people with timely, meaningful insights (derived from big data and/or "local" sources), organized and packaged, often visually, to be accessible, understandable, and actionable for everyday tasks.

I guess that this also relates to the discussion about understanding causality in addition to statistical correlation that I've discussed before on this blog.

Sunday, March 9, 2014

Big Data analytics by Robin Bloor

Today I had to give students in a seminar an introduction to big data analytics, and I chose a recent presentation by Robin Bloor (from SlideShare). Bloor states that the term "data science" is a misnomer, since all science is empirical and involves analysis of data. This is true for many of the sciences; still, if my memory does not mislead me, Einstein did not use empirical analysis of data to come up with the theory of relativity. It also goes back to the discussion of causality vs. correlation in science. In any event, Bloor asserts that data science is actually a multidisciplinary effort involving software engineering, statistics, and domain knowledge.

BI, according to this presentation, is partitioned into:
  • Hindsight: regular reporting
  • Oversight: dashboards, etc.
  • Insight: data mining & statistical analysis
  • Foresight: predictive analytics
He does not go as far as prescriptive analytics, and puts the heaviest weight on insight.
The second part of the presentation gives a fast introduction to machine learning. Overall, it gives introductory-level insights on deriving insights from big data, and is well presented as such.

Thursday, August 15, 2013

On machine learning as means for decision velocity

Chris Taylor has written in the HBR Blog a piece that advocates the idea that machine learning should be used to handle the main issue of big data - decision velocity. I have written recently on decision latency; according to some opinions, real-time analytics will be the next generation of what big data is about.
Chris' thesis is that the amount of data is substantially increasing with the Internet of Things, and thus one cannot make decisions manually by viewing all relevant data; there will also not be enough data scientists to look at the data. Machine learning, which is goal-oriented rather than hypothesis-asserting, will take this role. I agree that machine learning will take a role in the solution, but here are some comments about the details:

Currently machine learning is an off-line technology, it is case sensitive, and it cannot be the sole source for decisions.


It is an off-line technology: systems have to be trained, and typically the learning looks at historical data in perspective and learns trends and patterns using statistical reasoning methods. There are cases of applying continuous learning, which again is done mostly off-line but is incrementally updated on-line. When a pattern is learned, it needs to be detected in real time on streaming data, and here a technology like event processing is quite useful, since what it does is indeed detect that predefined patterns occur on streaming data. These predefined patterns can be obtained by machine learning. The main challenge will be online learning: when the patterns need to change, how fast can this be done with learning techniques? There are some attempts at real-time machine learning (see the presentation about Tumra as an example), but it is not a mature technology yet.
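As a rough illustration of this division of labor (the event type, threshold, and window size below are purely hypothetical, not taken from any specific product): the off-line learning phase would output a pattern such as "at least N occurrences of event E within W seconds", and a small on-line component then checks the streaming events against it.

```python
from collections import deque

class ThresholdPatternDetector:
    """Detects a predefined pattern -- at least `count` events of `event_type`
    within a sliding window of `window_sec` seconds -- on a live event stream.
    The pattern parameters are assumed to come from an off-line learning phase."""

    def __init__(self, event_type, count, window_sec):
        self.event_type = event_type
        self.count = count
        self.window_sec = window_sec
        self.timestamps = deque()

    def on_event(self, event_type, timestamp):
        """Feed one streaming event; return True when the pattern is detected."""
        if event_type != self.event_type:
            return False
        self.timestamps.append(timestamp)
        # Drop events that have fallen out of the sliding window.
        while timestamp - self.timestamps[0] > self.window_sec:
            self.timestamps.popleft()
        return len(self.timestamps) >= self.count

# Hypothetical pattern learned off-line, detected on-line:
detector = ThresholdPatternDetector("login_failure", count=4, window_sec=3600)
stream = [("login_failure", 10), ("page_view", 15), ("login_failure", 500),
          ("login_failure", 1200), ("login_failure", 2500)]
for etype, ts in stream:
    if detector.on_event(etype, ts):
        print(f"situation detected at t={ts}")
```

The point is the split: statistical learning produces and updates the pattern parameters off-line, while a detector of this kind does the cheap real-time matching; online learning would then amount to changing the parameters while the detector keeps running.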

Case sensitive means that there is no one-size-fits-all solution for machine learning; for each case the models have to be established in a way that is specific to that case. Thus the shortage of data scientists will be replaced by a shortage of statisticians: there are not enough skilled people around to build all these systems, so the state of the art needs to improve to make the machine learning process itself more automated.

Last but not least: I have written before that making decisions based merely on history is like driving a car by looking only at the rear-view mirror. Conclusions from historical knowledge should be combined with human knowledge and experience, sometimes over incomplete or uncertain information. Thus, besides the patterns discovered by machine learning, a human expert may also insert additional patterns that should be considered, or modify the patterns introduced by machine learning.




Tuesday, June 25, 2013

On speed and accuracy in event processing

This scary picture is taken from Theo Priestley's post in "Business Intelligence". As a follow-up to his previous post about the two recent acquisitions in event processing, he talks about the focus in this world on speed. While speed can provide a relative advantage, it can also be a double-edged sword if it comes at the expense of accuracy, as the recent Twitter hoax indicates.
When talking about the four Vs of big data, one of them is velocity and another is veracity, which is defined as "data in doubt". Indeed, processing uncertain, inexact, or inaccurate events or data is a major part of what big data is all about; while there is some work in this area (for example, see my post from last year), it is still the least investigated of the four Vs.
Priestley is right: doing things fast but inaccurately can incur big damage; doing things slowly but accurately can also incur big damage. The wisdom is to balance and minimize the risk. Resolving the uncertainty issue is the key.

Friday, June 21, 2013

Event processing platforms - reboot?

Doug Henschen, the editor of InformationWeek, wrote a commentary entitled "Big data reboots real-time analysis". Henschen says that event processing was at the height of its hype in 2008, but the economic crisis stopped the growth of this area. He sees indications of a "reboot" in the recent acquisitions of Apama by Software AG and StreamBase by TIBCO, and attributes the reboot to the need for big data to evolve from its batch origins to detecting patterns on moving data.
As I have written before, the barriers to growth stem from some external factors (certainly the general financial situation), but also from the over-hype of request-response or batch-oriented analytics (see my post on Sethu Raman's keynote at DEBS 2012). Another reason, as observed by Roy Schulte last year, is that many enterprises developed in-house solutions. I assume that Henschen is right in the sense that big data gives additional opportunities to event processing technology, and that the recent acquisitions will create waves of interest in the market. As I have written before, the next frontier is not improving the technology, but making it accessible to business users and converting enterprises to think in an event-driven way. Jeff Adkins and I will discuss this issue in the coming DEBS'13 tutorial, on June 30. More later.

Friday, May 10, 2013

Event processing - small data vs. big data and the Sorites Paradox.

This picture is taken from a blog post in the "Big Data Journal" by Jim Kaskade entitled "Real-time Big Data or Small Data".

Kaskade attempts to define quantitative metrics for what is "small data" vs. what is "big data".
In terms of throughput, big data is defined as >> 1K events per second, while small data is << 1K events per second; I guess that around 1K events per second is defined as medium data...
On variety, big data is defined as at least 6 sources of structured events and at least 6 sources of unstructured events. There are other dimensions as well; for example, small data relates to one function in the organization, while big data relates to several lines of business.

The attempt to define where "big data" starts is interesting; the main issue is under what conditions the implementation of systems should become different, and here the borders are not that clear, since there are currently systems that can scale both up and down.

Interestingly -- "Big" and "Small" are fuzzy terms.  Which reminds me on one of the variations of the Sorites Paradox,  that I've came across during my Philosophy studies, many years ago, which goes roughly like this.

Claim: Every heap of stones is a small heap.
Proof by mathematical induction.
Base: A heap of 1 stone is a small heap.
Inductive step: Take a small heap of K stones and add 1 stone; surely it will remain a small heap.
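Written out formally, the argument is just an induction over a predicate with no sharp boundary (standard notation, only restating the prose above):

\[
\text{Small}(1), \qquad \forall k\,\bigl(\text{Small}(k) \rightarrow \text{Small}(k+1)\bigr) \;\;\vdash\;\; \forall n\,\text{Small}(n)
\]

The induction is formally valid, yet the conclusion is absurd for a heap of a million stones; the trouble is the inductive step, which quietly assumes that "small" has a precise cut-off point - exactly what a fuzzy term like "small data" lacks.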



Thursday, May 9, 2013

Causality vs. correlation - statistical reasoning is not enough - NY Times Interview with Dave Ferrucci


Dave Ferrucci, who was until several months ago an IBM Fellow and was known as the father of Watson, was interviewed by the NY Times at his new workplace, Bridgewater Associates.

In the interview Ferrucci somewhat continues Noam Chomsky's line of thought, saying that AI has concentrated on statistical reasoning based on correlations, but the drawback is that one cannot understand why a prediction made by statistical reasoning is correct. While Chomsky bluntly stated that statistical reasoning does not create a solid model of the universe, Ferrucci claims that a complementary approach is required: understanding causality. This is a rather old issue. In symbolic logic there is a distinction between "material implication", which states that IF A is true THEN B is true, meaning only that whenever A is true B is also true, which makes a sentence like "If the week has seven days then the capital city of France is Paris" a valid statement in logic. Entailment, on the other hand, says that "A ENTAILS B" only if the connection is necessary and relevant; in other words, there is causality between them. Thus, Ferrucci now concentrates on building causality models to model the world economy. I concur with the assertion that understanding causalities gives better abilities of reasoning and prediction. As David Luckham has already noted, causality among events is one of the major abstractions of event processing models. Here is a rather old discussion about causality of events.
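To spell the distinction out in standard notation (a textbook formulation, not something taken from the interview):

\[
A \rightarrow B \;\;\equiv\;\; \lnot A \lor B
\]

Material implication is purely truth-functional: the seven-days/Paris sentence comes out true simply because both sides happen to be true, with no connection between them. Entailment demands more: B must hold in every circumstance in which A holds (and, in relevance logics, A must actually be relevant to B), which is what brings it closer to a causal reading.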

Monday, April 22, 2013

Statistical reasoning and event processing tutorial


StreamBase has posted a video tutorial on the combination of their event processing product with statistical reasoning based on MATLAB. The idea of combining event processing and statistical reasoning is becoming part of big data offerings, and no wonder that statistical reasoning vendors are adding event processing to their portfolios, for example the introduction of event processing within SAS.
StreamBase takes it from the other side: an event processing vendor that combines statistical reasoning.
An interesting tutorial to watch.

Friday, January 18, 2013

Using event processing to make "big data" become "fast data", by Alex Alves

As part of the first issue of the online magazine "Real-time business insights", Alex Alves wrote an article on the use of event processing in big data. Recently Alex remarked on this article in his blog, saying that while the common big data platforms are batch-oriented, turning "big data" into "fast data" is done by combining event processing with big data technologies. Stay tuned for the second issue of the online magazine, now in preparation.

Saturday, November 24, 2012

The big data hype cycle 2012

I haven't written in the last few days; I have been at an EU project review (as a reviewer) in Brussels and also had some time to be a tourist. I climbed the Atomium, Brussels' well-known icon, and visited several museums in the city center, taking refuge from the rain, including the famous Magritte Museum. I have imported some Belgian chocolate (most of it has already been given away) and a Belgian virus, with which I have been struggling for the last couple of days.

I also came across Gartner's big data hype cycle for 2012 -- the first time Gartner chose to look at big data as an area.


You may notice that "complex event processing" is around the peak of the diagram.

It seems that this hype cycle made Irfan Khan, CTO of Sybase, quite furious; his firm reaction was:
"Gartner dead wrong about Big Data life-cycle". Khan claims that Big Data is not hype but reality, and that expectations are under-inflated rather than over-inflated, since it can do much more than what people assume.

I guess that there is growing adoption of technologies associated with Big Data, but I don't think that it has reached the plateau of productivity, as Khan claims, since the question is not whether there are mature products (by the vendors' conception), but whether industry actually utilizes them, and it is difficult to say that most organizations have exploited such technologies well. Furthermore, Khan's claim that Big Data is under-inflated actually suggests that the plateau of productivity has not been reached yet.

In any event, the event processing angle is interesting. Note that event processing originally appeared in the enterprise architecture hype cycle for several years. In 2012 event processing no longer appears there explicitly; Big Data appears as one block at the top. This shows that event processing has migrated (at least in Gartner's mind) from the middleware world into the analytics world, and this is also compatible with some of the current trends, but that should be the subject of another posting - coming soon.

Sunday, October 7, 2012

On big data, small things and events that matter

In a recent post on the Harvard Business Review Blog entitled "Big Data Doesn't Work if You Ignore the Small Things that Matter", Robert Plant argues that in some cases organizations invest a lot in "big data" projects trying to get insights about their strategy, while failing to notice the small things, like customers leaving due to bad service. Indeed, big data and analytics are now fashionable and somewhat over-hyped. There is also a belief, fueled by the buzz, that it solves all the problems of the universe, as argued by Sethu Raman in his DEBS'12 keynote address. Events play both in the big data game and in the small data game, trying to observe a current happening, such as a time-out on service or long queues, when it relates to service, and other phenomena in other domains. Sometimes the small things are the most critical.
I'll write more about big data and statistical reasoning in a subsequent post.

Saturday, February 11, 2012

Uncertainty in event processing

This cartoon, taken from Cartoonsbyjosh.com, indicates uncertainty about uncertainty.
And indeed, there has been a lot of work on uncertainty in data over the years in the research community, but very little of it got into products; the conception has been that while data may be noisy, a cleansing process is applied before the data is used. Now, with the "big data" trend, this assumption does not seem to hold at all times: the nature of the data (streaming data that needs to be processed online), the volume of the data, and its velocity imply that the data, in many cases, cannot be cleansed before processing, and that decisions may be based on noisy, sometimes incomplete or uncertain data. Veracity (data in doubt) was thus added as one of the four Vs of big data.
Uncertainty in events is not really different from uncertainty in data (which may represent either facts or events).
Some of the uncertainty types are:

  • Uncertainty about whether the event occurred (or is forecast to occur)
  • Uncertainty about when the event occurred (or is forecast to occur)
  • Uncertainty about where the event occurred (or is forecast to occur)
  • Uncertainty about the content of an event (attribute values)


There are more uncertainties related to the processing of events:

  • Aggregation of uncertain events (where some of them might be missing)
  • Uncertainty about whether a derived event matches the situation it needs to detect -- this is a crucial point, since a pattern indicates some situation that we wish to detect, but sometimes the situation is not well defined by a single pattern. Example: a threshold-oriented pattern such as "event E occurs at least 4 times during one hour". There are false positives and false negatives; also, if event E occurs 3 times during an hour, it does not necessarily indicate that the situation did not happen. A rough sketch of evaluating such an uncertain threshold pattern appears after this list.

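To make the threshold example concrete, here is a minimal sketch of evaluating an uncertain threshold pattern; the event attributes, the probabilities, and the independence assumption are all illustrative, not taken from any particular system:

```python
from dataclasses import dataclass

@dataclass
class UncertainEvent:
    event_type: str
    timestamp: float        # seconds since the start of the window
    occurrence_prob: float  # uncertainty about whether the event occurred at all

def prob_at_least(events, k):
    """Probability that at least k of the (assumed independent) uncertain events
    really occurred, computed over the Poisson-binomial distribution."""
    dist = [1.0]  # dist[j] = probability that exactly j events occurred so far
    for e in events:
        new = [0.0] * (len(dist) + 1)
        for j, p in enumerate(dist):
            new[j] += p * (1 - e.occurrence_prob)  # this event did not occur
            new[j + 1] += p * e.occurrence_prob    # this event did occur
        dist = new
    return sum(dist[k:])

# Hypothetical one-hour window: five reports of event E, each one in doubt.
window = [UncertainEvent("E", t, p) for t, p in
          [(120, 0.9), (900, 0.6), (1800, 0.8), (2400, 0.7), (3300, 0.5)]]

# Probability that the situation "E occurs at least 4 times in one hour" holds.
print(f"P(situation) = {prob_at_least(window, 4):.3f}")
```

Instead of a yes/no detection, the derived event then carries a probability, which downstream decisions can threshold or weigh; this also makes the false-positive/false-negative trade-off explicit.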

We are planning to submit a tutorial proposal for DEBS'12 to discuss uncertainty in events, and are now working on it. I'll write more on that during the next few months.

Sunday, December 4, 2011

Cloud Computing Journal on Big data meets Complex Event Processing


Cloud Computing Journal published some citations from a podcast in which the analyst Dana Gardner interviews Mahesh Kumar from AccelOps about correlating streaming events with transient data to get real-time analysis of data in the context of big data and cloud implementations. The applications that AccelOps targets are mostly availability and performance management, and security. The idea is not new; it seems that there are various approaches and realizations of the insight that data needs to be analyzed in real time, before being written to disk, in order to make decisions.

Thursday, November 24, 2011

More on big data and event processing



Philip Howard, one of the analysts who has followed the event processing area for many years, recently wrote about "CEP and big data", emphasizing the synergy of data mining techniques on big data as a basis for real-time scoring based on a predictive model created by data mining; his inspiration for writing this piece was reviewing the Red Lambda product. It is certainly true that creating event processing patterns off-line using mining techniques and then tracking these event patterns on-line using event processing is a valid combination, although the transfer from the data mining part to the event processing part typically requires some more work (in most cases it also involves some manual work). In general, getting a model built in one technology to be used by another technology is not smooth, and requires more work.
The synergy between big data and event processing has more patterns of use. As big data is in many cases manifested in streaming data that has to be analyzed in real time, Philip mentions InfoSphere Streams, which is the IBM platform for managing high-throughput streaming data. Data mining on transient data as a source for on-line event processing, and real-time processing of high-throughput streaming data, are orthogonal topics that relate to two different dimensions of big data; my posting about the four Vs summarizes those dimensions.

Friday, September 30, 2011

On the four Vs of big data


In my briefing to the EU guys about the "data challenge", I talked about IBM's view on "big data". Recently Arvind Krishna, the IBM General Manager of the Information Management division, talked in the Almaden centennial colloquium about the 4 Vs of big data. The first 3 Vs have been discussed before:



  • Volume
  • Velocity
  • Variety  
The 4th V has just been added recently - Veracity - defined as "data in doubt".


The regular slides talk about volume (for data at rest) and velocity (for data in motion), but I think that we sometimes also need velocity to process data at rest (e.g. Watson), and we sometimes also need to process high volumes of moving data; variety stands for poly-structured data (structured, semi-structured, unstructured).

Veracity deals with uncertain or imprecise data. In the past there was an assumption that this is not an issue, since it would be possible to cleanse the data before using it; however, this is not always the case. In some cases, due to the velocity required for moving data, it is not possible to get rid of the uncertainty, and there is a need to process data with uncertainty. This is of course true when talking about events; uncertainty in event processing is a major issue that still needs to be conquered. Indeed, among the four Vs, veracity is the one that has been least investigated so far. This is one of the areas we investigate, and I'll write more about it in later posts.