Saturday, October 1, 2011

On context and punctuation

This illustration is taken from the EPIA book. It shows the notion of context as a first-class construct that groups together events that need to be processed together; each context instance is processed as a single unit. A context may have temporal, segmentation, spatial, and state-oriented aspects, and an actual context instance may combine one or more of these aspects.

Recently I had a discussion with somebody from the data stream community who was of the opinion that the stream-oriented way of doing this is simpler, since it is based on standard thinking (i.e. SQL).

In the stream research literature, the temporal context is divided between windows and punctuation; all the other aspects are somehow expressed by the group-by construct. I guess that windows and group-by constructs are familiar to many people, while punctuation is less familiar, so I'll explain the idea briefly.

A good source for learning about punctuation is Peter Tucker's site, from which the illustration below is copied:

Peter Tucker did his PhD (under the supervision of Dave Maier) on data stream punctuation. The idea of punctuation is to enable the creation of sub-streams, since a stream is by definition infinite. The notion of windows in classical data stream models consists of variations of sliding windows, defined either by event count or by time; in reality, however, there is a need to end a stream explicitly, e.g. when the bid ends. The definition of punctuation in the database encyclopedia (the entry was written by Maier and Tucker) says:

In other words,  this is a "dummy event" put into the stream to denote the end of a sub-stream.  
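To make the "dummy event" idea concrete, here is a minimal sketch in Python. It is not taken from Tucker's work or from any stream engine; the marker `END_OF_SUBSTREAM` and the function `split_on_punctuation` are hypothetical names, and the sketch assumes a punctuation simply closes the current sub-stream.

```python
from typing import Iterable, Iterator, List

# Hypothetical marker standing in for a punctuation ("dummy event").
END_OF_SUBSTREAM = object()


def split_on_punctuation(stream: Iterable[object]) -> Iterator[List[object]]:
    """Yield one list of events per sub-stream, closed by a punctuation marker."""
    current: List[object] = []
    for item in stream:
        if item is END_OF_SUBSTREAM:
            # The punctuation itself carries no data; it only closes the sub-stream.
            yield current
            current = []
        else:
            current.append(item)
    # Events after the last punctuation belong to a sub-stream that is still
    # open, so they are not emitted.


# Two closed sub-streams of bids, followed by one bid that is still open.
bids = [10, 12, END_OF_SUBSTREAM, 7, END_OF_SUBSTREAM, 99]
closed = list(split_on_punctuation(bids))
```

Note that in this sketch the next sub-stream starts implicitly with whatever event follows the marker.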

Now the question is whether using windows + punctuation + group-by is indeed simple. The relational model's claim to fame is its simplicity. From the early days of the relational model it was realized that there are semantic anomalies, and the normalization rules came to resolve them; yet the overall model remained simple. This is not really true, however, for the extension of the relational model to streams: while the relational model has a single entity (the relation), the extension has multiple entities: relation, stream, and window. Punctuation is an attempt to add semantics as a kind of logical patch.

IMHO, the context model is semantically cleaner. It has a single semantic notion for all the constructs that determine how to group events together (in the stream terminology, create sub-streams). In the context model there are no dummy events; the notion of temporal context (window) can also support event intervals - time windows that are started by an event and ended by an event. The events that start and end the window are real events, and can be used for other purposes. Moreover, looking at the definition of punctuation above, it defines the end of a sub-stream, but assumes that the next bid starts when a new bid event arrives. However, if we look at a situation where bids have not only "end bid" events but also "start bid" events, then we would like to ignore bids that arrive when no bid is open. This has no natural representation with punctuation and has to be implemented with tricks, so we would also need the equivalent of a "start sub-stream punctuation". So we view this as a kind of context based on , and not as a kind of event within the stream with special semantics.
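The event-interval idea can be sketched in the same spirit. This is an illustration only, not the EPIA definition: the `Event` class, the event kinds "start_bid", "bid" and "end_bid", and the function `event_interval_contexts` are hypothetical names, and the sketch assumes that a "start bid" arriving while an interval is already open is ignored.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional


@dataclass
class Event:
    kind: str        # hypothetical kinds: "start_bid", "bid", "end_bid"
    amount: int = 0


def event_interval_contexts(events: Iterable[Event]) -> Iterator[List[Event]]:
    """Yield one context instance per interval opened and closed by real events."""
    open_instance: Optional[List[Event]] = None
    for event in events:
        if event.kind == "start_bid":
            if open_instance is None:
                # The initiator is a real event and is kept in the instance.
                open_instance = [event]
        elif event.kind == "end_bid":
            if open_instance is not None:
                # The terminator is also a real event, not a dummy marker.
                open_instance.append(event)
                yield open_instance
                open_instance = None
        elif open_instance is not None:
            open_instance.append(event)
        # Otherwise no interval is open, so the event is ignored.


events = [
    Event("bid", 5),       # arrives before any "start bid": ignored
    Event("start_bid"),
    Event("bid", 10),
    Event("bid", 12),
    Event("end_bid"),
    Event("bid", 99),      # arrives after "end bid": ignored
]
instances = list(event_interval_contexts(events))
```

Here the bids that arrive while no interval is open fall outside every context instance, without any special marker being injected into the stream.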

The model we presented in EPIA (which we keep evolving) is intended to be a semantic model that provides abstractions above the current implementations of both event and stream processing systems. Several product owners have already told us that they use concepts from the book as inspiration for the next versions of their products, and I guess that a full implementation that is natively based on this model is yet to come.
