Saturday, October 1, 2011

On context and punctuation

This illustration is taken from the EPIA book and shows the notion of context as a first-class construct that groups together events that need to be processed together, where each context instance is processed as a single unit. A context may have temporal, segmentation, spatial, and state-oriented aspects, and an actual context instance may combine one or more of these aspects.

Recently I had a discussion with somebody from the data stream community who was of the opinion that the stream-oriented way to do it is simpler, since it is based on standard thinking (i.e. SQL).

In the stream research literature, the temporal context is divided between windows and punctuation; all the other aspects are somehow expressed by the group-by construct. I guess that windows and group-by constructs are familiar to many people, while punctuation is less familiar, so I'll explain the idea briefly.

A good source to learn about punctuation is Peter Tucker's site, from which the illustration below is copied:
Peter Tucker did his PhD (under the supervision of Dave Maier) on data stream punctuation. The idea of punctuation is to enable creating sub-streams, since a stream is by definition infinite. The notion of windows in classical data stream models consists of variations of sliding windows, by either event count or time; in reality, however, there is a need to end a stream explicitly, e.g. when the bidding ends. The definition of punctuation in the database encyclopedia (the entry was written by Maier and Tucker) says:

In other words,  this is a "dummy event" put into the stream to denote the end of a sub-stream.  
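To make the "dummy event" reading concrete, here is a minimal sketch in Python. It follows only the simplified description given in this post, not the full definition from the punctuation literature; the `PUNCTUATION` sentinel and the `split_on_punctuation` function are hypothetical names invented for illustration.

```python
from typing import Iterable, Iterator, List

# Hypothetical sentinel standing in for a punctuation marker in the stream.
PUNCTUATION = object()


def split_on_punctuation(stream: Iterable[object]) -> Iterator[List[object]]:
    """Yield one list per sub-stream, closing a sub-stream at each punctuation."""
    current: List[object] = []
    for item in stream:
        if item is PUNCTUATION:
            # The marker carries no data of its own; it only closes the sub-stream.
            yield current
            current = []
        else:
            current.append(item)
    # Items after the last punctuation belong to a sub-stream that is still
    # open, so they are deliberately not emitted.


bids = [10, 12, PUNCTUATION, 7, PUNCTUATION, 99]
closed = list(split_on_punctuation(bids))
```

Note that the marker is not a business event: it exists only to tell the consumer that a sub-stream has ended.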

Now the question is whether using windows + punctuation + group-by is indeed simple. The relational model's claim to fame is its simplicity. From the early days of the relational model it was realized that there are semantic anomalies, and the normalization rules came to resolve those anomalies, yet the overall model has remained simple. This is not really true for the extension of the relational model to streams: while the relational model has a single entity, the relation, the extension has multiple entities: relation, stream, and window. Punctuation is an attempt to add semantics as a kind of logical patch.

IMHO, the context model is semantically cleaner. It has a single semantic notion for all the constructs that determine how to group events together (in the stream terminology, create sub-streams). In the context model there are no dummy events; the notion of temporal context (window) can also support event intervals - time windows that start with an event and end with an event. The events that start and end the window are real events, and can be used for other purposes. Moreover, looking at the definition of punctuation above, it defines the end of a sub-stream, but assumes that the next bid starts when a new bid event arrives. However, if we look at a situation where bids have not only "end bid" events but also "start bid" events, then we would like to ignore bids that arrive when no bid is open. This has no natural representation and has to be implemented with tricks, so we also need the equivalent of a "start sub-stream punctuation". So we view this as a kind of context, and not as a kind of event within the stream with special semantics.
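The event-interval idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the EPIA model itself: the `Event` class, the event kinds `start_bid`, `bid` and `end_bid`, and the function `bids_by_interval` are all hypothetical, and a `start_bid` that arrives while a window is already open is simply ignored.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional


@dataclass(frozen=True)
class Event:
    kind: str            # "start_bid", "bid", or "end_bid" (hypothetical kinds)
    amount: float = 0.0


def bids_by_interval(events: Iterable[Event]) -> Iterator[List[Event]]:
    """Group bid events into windows opened and closed by real events."""
    window: Optional[List[Event]] = None  # None means no window is open
    for event in events:
        if event.kind == "start_bid":
            if window is None:
                window = []               # a real event opens the window
        elif event.kind == "end_bid":
            if window is not None:
                yield window              # a real event closes the window
                window = None
        elif event.kind == "bid" and window is not None:
            window.append(event)
        # A bid that arrives while no window is open is ignored.


events = [
    Event("bid", 5),       # ignored: nothing is open yet
    Event("start_bid"),
    Event("bid", 10),
    Event("bid", 12),
    Event("end_bid"),
    Event("bid", 7),       # ignored: the window has closed
]
windows = list(bids_by_interval(events))
```

Here the opening and closing events are ordinary events that other parts of the application could consume as well, and bids outside an open interval fall away without any extra marker.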

The model we presented in EPIA (which we keep evolving) is intended to be a semantic model that provides abstractions above the current implementations of both event and stream processing systems. Several product owners have already told us that they use concepts from the book as inspiration for the next versions of their products, and I guess that a full implementation that is natively based on this model is yet to come.

Friday, September 30, 2011

On the four Vs of big data

In my briefing to the EU guys about the "data challenge", I talked about IBM's view on "big data". Recently Arvind Krishna, the IBM General Manager of the Information Management division, talked at the Almaden centennial colloquium about the 4Vs of big data. The first 3 Vs have been discussed before:

  • Volume
  • Velocity
  • Variety  
The 4th V was added only recently: Veracity, defined as "data in doubt".

The regular slides talk about volume (for data at rest) and velocity (for data in motion), but I think that we sometimes need velocity to process data at rest as well (e.g. Watson), and we sometimes need to process a high volume of moving data; variety stands for poly-structured data (structured, semi-structured, unstructured).

Veracity deals with uncertain/imprecise data. In the past there was an assumption that this is not an issue, since it would be possible to cleanse the data before using it; however, this is not always the case. In some cases, due to the need for velocity in moving data, it is not possible to get rid of the uncertainty, and there is a need to process data with uncertainty. This is of course true when talking about events; uncertainty in event processing is a major issue that still needs to be conquered. Indeed, among the four Vs, veracity is the one that has been least investigated so far. This is one of the areas we investigate, and I'll write more about it in later posts.

Wednesday, September 28, 2011

CICS event processing improved version

IBM CICS is an example of a smart producer for event processing systems: it does not do event processing inline, but instruments CICS transactions to emit events, and works in a loosely coupled mode with any event processing engine that can read its emitted events. CICS TS 4.2, released recently, has several improvements in the CICS event-producing capabilities. Among these improvements are:

  • Including the event emission as part of the transaction, by doing the event emission as part of the commit process. Note that since it is loosely coupled with the event processing itself, this does not become an atomic unit with the event processing itself. I have recently written about the relationships between transactions and events, and identified this area as one that needs to be investigated more.
  • Change management inside the event instrumentation in CICS with appropriate tools
  • Inclusion of system events inside the CICS instrumentation (e.g. connection/disconnection to databases, transaction aborts, etc.).

The strength of a chain is typically equal to the strength of its weakest link. In many cases the producer is the weakest link, and the amount of work required to emit the right events at the right time is often much larger than the work required for the rest of the system. Smart event producers like CICS make this weakest link much stronger.

Sunday, September 25, 2011

On Actian and action applications

Ingres, one of the oldest DBMS companies, which produces an open source DBMS and was the first in the sequence of companies that Mike Stonebraker founded and sold, has recently changed its name to Actian and positioned itself as focused on "action applications in big data". The stated rationale for "action applications" is that current BI creates reports, and it is then left to the human to read the reports (or screens) and decide what to do; in "action applications", the application triggers actions automatically in response to data events and thresholds. It seems that people from the BI community are re-discovering/re-inventing the Event-Condition-Action model, so they'll probably get to more advanced event processing at some point.
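For readers who have not met the Event-Condition-Action model, here is a minimal sketch in Python. It has nothing to do with Actian's actual product; the `Rule` class, the `dispatch` function, and the stock-level example are hypothetical and only illustrate the event / condition / action split.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    event_type: str                      # Event: which kind of event the rule listens to
    condition: Callable[[dict], bool]    # Condition: predicate over the event content
    action: Callable[[dict], None]       # Action: what to do when the condition holds


def dispatch(event: dict, rules: List[Rule]) -> int:
    """Run every matching rule against the event; return how many rules fired."""
    fired = 0
    for rule in rules:
        if event["type"] == rule.event_type and rule.condition(event):
            rule.action(event)
            fired += 1
    return fired


actions_taken = []
rules = [
    Rule(
        event_type="stock_level",
        condition=lambda e: e["value"] < 10,   # threshold check
        action=lambda e: actions_taken.append(("reorder", e["item"])),
    )
]

low = dispatch({"type": "stock_level", "item": "widget", "value": 3}, rules)
ok = dispatch({"type": "stock_level", "item": "widget", "value": 50}, rules)
```

The threshold-triggered action replaces the human who would otherwise read the report and decide to reorder.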

It is interesting to note that the motivation they state on the Actian website (you'll have to click on "action apps" to see it) is: "BI is not working, more than $10B are spent every year on a pile of reports with no actions". I guess that this is somewhat consistent with my previous posting citing a study that indicates that human decision makers don't succeed in making fast decisions based on BI. Maybe BI is reaching the disillusionment phase of the hype cycle, and maybe people in this community, like the SAS CEO who said last year that event processing has limited appeal to BI (along with BI in the cloud), will have second thoughts.