Showing posts with label data stream models. Show all posts

Monday, July 1, 2013

DEBS 2013 -- the tutorial day

Arlington, Texas.

DEBS 2013 started today with the tutorial day. In the morning I attended a tutorial on stream optimizations given by my IBM colleagues (past and present), members of the System-S team. The presentation has not been posted yet, so I'll write more when it is. The interesting and useful concept they introduced is a catalog of optimization techniques, along with the conditions under which each optimization is applicable. Some of the methods in the catalog are: operator reordering, fusion, operator separation, and fission (defined as replicating an operator for parallel processing). Very interesting talk. As an observation, the talk covered "black box optimizations", where the code of the operator itself is not touched. There are other efforts on "white box optimizations", like the one we presented in DEBS 2011.
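To make two of the catalog entries concrete, here is a minimal Python sketch of my own; it is not taken from the tutorial. It treats operators as black boxes (plain functions whose code is not modified). Fission follows the definition above. For fusion I assume the common reading of collapsing a chain of operators into a single operator. The operators `parse` and `scale` and the helpers `fuse` and `fission` are hypothetical names invented for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def parse(event):
    """Hypothetical operator 1: turn a raw string event into a number."""
    return int(event)


def scale(value):
    """Hypothetical operator 2: a stateless transformation."""
    return value * 10


def fuse(*operators):
    """Fusion (assumed reading): collapse a chain of operators into one,
    so no intermediate stream is materialized between them."""
    def fused(event):
        for op in operators:
            event = op(event)
        return event
    return fused


def fission(operator, events, replicas=2):
    """Fission: replicate a stateless operator and process events in parallel.
    Executor.map returns results in input order."""
    with ThreadPoolExecutor(max_workers=replicas) as pool:
        return list(pool.map(operator, events))


parse_and_scale = fuse(parse, scale)
results = fission(parse_and_scale, ["1", "2", "3"])
print(results)
```

Note that this kind of fission is only safe here because the operator is stateless, which is the sort of applicability condition the catalog attaches to each technique.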

In the second half of the day, Jeff Adkins and I presented our tutorial about event driven thinking.
I'll write about it after the tutorial slides become public.

Tomorrow --  the first paper and industry sessions.   

Saturday, October 1, 2011

On context and punctuation

This illustration is taken from the EPIA book. It shows the notion of context as a first-class construct that groups together events that need to be processed together, where each context instance is processed as a single unit. A context may have temporal, segmentation, spatial, and state-oriented aspects, and an actual context instance may combine one or more of these aspects.

Recently I had a discussion with somebody from the data stream community who was of the opinion that the stream-oriented way of doing this is simpler, since it is based on standard thinking (i.e. SQL).

In the stream research literature, the temporal context is divided between windows and punctuation, while all the other aspects are somehow expressed by the group-by construct. I guess that windows and group-by constructs are familiar to many people; punctuation is less familiar, so I'll explain the idea briefly.

A good source to learn about punctuation is Peter Tucker's site, from which the illustration below is copied:
Peter Tucker did his PhD (under the supervision of Dave Maier) on data stream punctuation; the idea of punctuation is to enable creating sub-streams, since a stream is by definition infinite. The notion of windows in classical data stream models consists of variations of sliding windows, by either event count or time; in reality, however, there is a need to explicitly end a stream, e.g. when the bid ends. The definition of punctuation in the database encyclopedia (the entry was written by Maier and Tucker) says:






In other words,  this is a "dummy event" put into the stream to denote the end of a sub-stream.  
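As a minimal sketch of this "dummy event" reading, the Python below cuts a stream into sub-streams at each punctuation mark. The `END` marker and the function `split_on_punctuation` are hypothetical names of my own, and the sketch deliberately simplifies: it only models "end of sub-stream", not whatever richer form the encyclopedia definition allows.

```python
END = object()  # hypothetical punctuation marker: a dummy event, not real data


def split_on_punctuation(stream):
    """Yield one list per sub-stream, closing it when a punctuation arrives."""
    current = []
    for item in stream:
        if item is END:
            yield current
            current = []
        else:
            current.append(item)
    # Items after the last punctuation are not emitted:
    # their sub-stream has not been closed yet.


bids = [5, 7, END, 3, END, 9]
substreams = list(split_on_punctuation(bids))
print(substreams)
```

In this example the trailing `9` stays pending, since no punctuation has closed its sub-stream.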


Now the question is whether using windows + punctuation + group-by is indeed simple. The relational model's claim to fame is its simplicity. From the early days of the relational model it was realized that there are semantic anomalies, and the normalization rules came to resolve those anomalies; yet the overall model has remained simple. This is not really true, however, for the extension of the relational model to streams: while the relational model has a single entity (relation), the extension has multiple entities (relation, stream, window). Punctuation is an attempt to add semantics as a kind of logical patch.


IMHO, the context model is semantically cleaner. It has a single semantic notion for all the constructs that determine how to group events together (in the stream terminology, create sub-streams). In the context model there are no dummy events; the notion of temporal context (window) can also support event intervals - time windows that start by an event and end by an event. The events that start and end the window are real events, and can be used for other purposes. Moreover, looking at the definition of punctuation above, it defines the end of a sub-stream, but assumes that the next bid starts when a new bid event arrives. However, if we look at a situation where bids have not only "end bid" events but also "start bid" events, then we would like to ignore bids that arrive when no bid is open. This has no natural representation and has to be implemented with tricks, so we would also need the equivalent of a "start sub-stream punctuation". So we view this as a kind of context, and not as a kind of event within the stream with special semantics.
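The event-interval idea can be sketched in a few lines of Python. This is my own illustration under stated assumptions, not the EPIA book's formal definition or any product's API: the class `EventIntervalContext`, its fields, and the event kinds `start_bid`, `bid`, and `end_bid` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class EventIntervalContext:
    """Temporal context opened by one real event and closed by another."""
    initiator: str                 # event kind that opens a context instance
    terminator: str                # event kind that closes it
    is_open: bool = False
    current: list = field(default_factory=list)
    closed_instances: list = field(default_factory=list)

    def on_event(self, kind, payload=None):
        if kind == self.initiator:
            if not self.is_open:
                self.is_open = True
                self.current = []
        elif kind == self.terminator:
            if self.is_open:
                self.closed_instances.append(self.current)
                self.is_open = False
                self.current = []
        elif self.is_open:
            self.current.append(payload)
        # Otherwise the event arrived while no instance is open: it is ignored.


ctx = EventIntervalContext(initiator="start_bid", terminator="end_bid")
events = [("bid", 10), ("start_bid", None), ("bid", 20),
          ("bid", 30), ("end_bid", None), ("bid", 40)]
for kind, payload in events:
    ctx.on_event(kind, payload)
print(ctx.closed_instances)
```

The bids of 10 and 40 arrive while no context instance is open and are dropped without any dummy event in the stream; the start and end events remain ordinary events that other processing could also consume.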


The model we presented in EPIA (which we keep evolving) is intended to be a semantic model that provides abstractions above the current implementations of both event and stream processing systems. Several product owners have already told us that they use concepts from the book as inspiration for the next versions of their products, and I guess that a full implementation that is natively based on this model is yet to come.