Sunday, September 14, 2008

On sporadic events

I have never been a student at Stamford High School, but Stamford, CT is my home away from home for the next seven days. Starting tomorrow, I'll provide some impressions from the Gartner meeting and the EPTS symposium, but I rely on other people in blog-land to have better coverage (e.g. Paul Vincent, with endnotes and references). I arrived earlier today and am resting before the busy week.

Looking at some of the discussions around the Stream-SQL standards brought one more observation to mind.

While it has been claimed that the difference between a "stream" and a "cloud" is that a stream is totally ordered and a cloud is only partially ordered, I think there are also some further distinctions. I'll discuss one of them --- sporadic events vs. known events. When dealing with "time series" type input, the timing of events is known: for each time unit (whatever it is) there is an event (or set of events) that is reported. This is true when the events are stock quotes provided periodically, or signals from sensors provided periodically. Other events do not naturally organize themselves in time-series fashion, for example: bids in an auction, complaints from customers, irregular deposits, a coffee machine failure, etc.

From the point of view of functionality, there is not much difference --- one can create a time series that reports a possibly empty set of events for each time unit, but if in most time units the reported set is empty, this will not be a very efficient way to handle it. On the other hand, a system that does not support time series as a primitive can view all events as sporadic events, but there may be some optimization merit in knowing that events are indeed expected at every time unit. This is just one dimension, but it leads me to reinforce my conclusion that there are various ways to provide event processing functionality, and the most efficient way is probably a hybrid of approaches, based on the semantics of the functions and the characteristics of the input. So this is the observation of today; it will be interesting to see what the main discussion topics in the coming conferences will be.
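The conversion mentioned above --- viewing sporadic events as a time series that reports a possibly empty set per time unit --- can be sketched in a few lines of Python. This is a minimal illustration, not any particular product's API; the event list and all names are hypothetical. It makes the inefficiency visible: most of the resulting batches are empty when events are sparse.

```python
from collections import defaultdict

# Hypothetical sporadic events: (timestamp in seconds, payload).
# The gaps between events are irregular.
sporadic_events = [(2, "bid"), (3, "bid"), (47, "complaint"), (48, "bid")]

def to_time_series(events, start, end, unit=1):
    """Bucket sporadic events into fixed time units.

    Returns one (possibly empty) batch per time unit in [start, end),
    i.e. a time-series view of a sporadic event source.
    """
    buckets = defaultdict(list)
    for ts, payload in events:
        buckets[(ts - start) // unit].append(payload)
    return [buckets.get(i, []) for i in range((end - start) // unit)]

series = to_time_series(sporadic_events, start=0, end=60)
non_empty = sum(1 for batch in series if batch)
print(non_empty, len(series))  # 4 non-empty batches out of 60 time units
```

With only 4 of 60 batches carrying any events, a time-series engine would spend most of its cycles on empty batches, which is the efficiency argument for treating such input as sporadic instead.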
More later - from Stamford Hilton.


Jeff Wootton said...


Just catching up on the blogs. Wanted to comment on the assertion that streams deliver data at a fixed time interval.

That may be true for some streams, but it is not true for real-time market data, which is one of the most often cited examples of a data stream. Market data feeds deliver prices on an event-driven basis, not at a fixed time interval. When market activity is high you might get thousands of messages per second, and when markets are slow you might only get hundreds. For a single stock, you can get several trades a second on a busy stock, but minutes or even hours might go by without a trade on a thinly traded security. Bottom line: there is no fixed time interval.

Opher Etzion said...

The term "stream" may be overloaded like any other term -- in talking about "time series" oriented processing, I referred to cases where events are handled as a set (a "batch") every fixed time unit; the size of the batch may vary from unit to unit.