Saturday, March 21, 2009

More on event-at-a-time processing

I have borrowed this picture from Brian Connell's postings a few months ago entitled: "one-by-one can still be CEP". I have not really returned to this topic, but it is a good time to do it now, following my last posting on "set-at-a-time" vs. "event-at-a-time". Many people are used to program in set-oriented languages like SQL, thus a simple match between two entities become first creating the Cartesian products of the sets they they belong to, and then selecting the element from this Cartesian product. This is not a natural way that people are thinking about processing events. Let's look at a nice penguins in the picture, we want to trace want happen in the penguin colony, by putting observing them. Let's say that we want to observe when a young penguin stays away more than 1KM from the home glacier f0r more than an hour, which may indicate that it can get lost. Furthermore, we wish to get the alert immediately when this happens, and not at the end of any time window. Some of the "set thinkers" view the "stream processing" paradigm as something organized in which the partition of events is well defined by the notions of streams and windows, and the processing already has the input set, and all it has to do is to apply some function on the input set, and "event-at-a-time" processing as ad-hoc programming in which events arrive to some event processor who then has to hard-code the entire semantics. But this is, of course, a misconception. Let's look again at the penguin example; assuming that we are looking at the following pattern: a young penguin may be lost if it stays over a 1 KM away from the glacier for over one hour. This can be expressed by set oriented processing, as looking at observations of the penguin (let's say we watch it once every minute), and determine at the end of an hour that all observations are more than 1KM away. But -- when do we start the one hour period? the answer is -- first time that the penguin crosses the 1KM bound, so we need a notion of an event that starts a time window, actually the term context is wider than time window -- it contains here : temporal aspect (within one hour, when the one hour can be initiated by a specific event), semantic aspect : an EPA is tracing a single specific penguin, so it is associated with some penguin id, and the pattern is spatial :
all events are more than 1KM away from the glacier. Now, what is the benefit of having event-at-a-time implementation, a simple benefit: if the young penguin starting to head back then we can close this context instance, and terminate the EPA, say after 3 minutes, and don't trace this penguin anymore, until the next time it swimming far, while in the set-at-a-time we'll determine only at the end of the time window that there is nothing to detect here. Of course, the set thinkers will immediately say that we can reduce the window, so reducing the window to units of single events exactly gives us the "event-at-a-time" notion. More than that, it is not only a question of efficiency, it is also question of expressing a fine-tuned semantics. Let's look at another penguin scenario: we are now tracing lazy penguin who return to the glacier after less than 2 minutes after jumping to the water. here we have a sequence of two events relate to the same penguin within a specific temporal context ("within 2 minutes"). This is not a set operation at all, it is looking of a sequence of two individual event. True- it can be expressed in set-oriented-programming we'll have to create two streams (or one heterogeneous stream) of "jumping to the water" and "returning to the glacier", and then join them, select the appropriate instance and thus determine which members of the set matched this pattern, but this is not a natural way to think about it. While in "event-at-a-time" this can be done by just opening a context-instance for every penguin that jumped into the water, if it does not return within 2 minutes, this context-instance is closed, if the penguin returns, then there is pattern match, and the context-instance is closed even earlier.

But let's move to an example about the tune-up of semantics. Assume that we are looking for the pattern saying that the average stay in the water of a penguin is less than 2 minutes, which may indicate some laziness plague, or any other plague that makes the penguins lazy. In a set-oriented programming we'll have to define the set -- let's say a time window of one hour, thus, when the set is all accumulated we can calculate the average and match it to the threshold; however, it becomes tricky when the average is actually a moving average, thus it may be possible that if we do this calculation after 30 minutes the pattern is matched, since the average in this 30 minutes of staying in the water is 1 minutes and 56 seconds, while if we consider the whole hour, the pattern is not matched, since the average is now 2 minutes and 9 seconds. Doing the calculation event-at-a-time enables us to get the average even to set of any size, even without committing ahead on the size of the set.

This is not done in ad-hoc processing, but supported with high-level programming primitives that are sometimes easier to express than their equivalent set-oriented notation.

There are of course cases in which the set-oriented calculation makes sense -- exactly when we are doing aggregations at the end of some fixed time intervals, and in some applications this may be the main function we need -- but, I assume that we'll see more and more hybrid applications.

Last but not least -- the distinction between "simple" and "complex" event processing is considered in the fact that simple deals with events being processed "one by one" while complex process multiple events, however "event-at-a-time" is not processing "one-by-one" since in the "one-by-one" processing, an event is processed without looking at other events, and in "event-at-a-time" each event is processed individually but within the state of a certain context-driven EPA. More - Later


Hans said...

I think I get what you're saying now. Clearly a hybrid approach is the only long term solution.

Interestingly, some SSQL languages do include event-at-a-time features within their set-based framework. For example, some SSQL languages have windowed operators (say, average) that can output incremental results on every incoming event. But this integration of techniques is not yet complete or holistic.

Arturs said...

I believe that if we refer to set-oriented programming (SQLish) in context of event processing, we already get somewhat a hybrid solution by having operations that allow to define windows (and thus have sets). If they have only 'fixed' time-based windows, then we are stuck with '< 2 min swim' use case, but in most cases, if we do extend to time-based windows, we probably will also have row(event)-based windows (Hans had mentioned that as well). But I don't want to get into language wars either; for me it's also not about which language extensions are used, but conceptual model underlying them. I believe that this post is also a somewhat follow-up to your If SQL extensions are the answer then what is the question ?,- set-oriented programming my not be the answer. So I agree to " is also question of expressing a fine-tuned semantics."
It is interesting to observe evolving CEP, as it goes towards standardized and well-understood subject of interest.

Opher Etzion said...

Hello Arturs, you are right, I believe that different languages styles have pros and cons, but my objective is not to advocate a certain programming styles, one can get from "set-at-a-time" langague and extend it to have some "event-at-a-time" functionality, or go the other way, and clearly we need both types of functionality for certain cases; what we need is a formal model that includes both as an underlying model.



Anonymous said...

Thought it over and over again - just don't get it: What do you mean with "Doing the calculation event-at-a-time enables us to get the average even to set of any size, even without committing ahead on the size of the set."
How's that different? How can I not predefine a moving average?

I'm fairly new to the field. Maybe I just got my wire's crossed. Any clarification most welcome.

Thanks, Chris

Opher Etzion said...

Hello Chris. When doing event-at-a-time computing then each event is processed individually, if this event -- in conjunction of the previous events that have been processed in the same context -- complete a pattern -- then a match is detected -- and some action comes out of it. In this case, let's say that we define some condition on the average, any event that gets in changes the average, then the pattern condition is evaluated and if matched then the action is been taken.