Monday, September 8, 2008

A footnote to the streamSQL paper

The comment that my good friend Claudi (AKA Pattern Storm) made in the complexevents forum made me curious enough to actually read this paper; reading it, I had the uncomfortable feeling that people insist on using a language style that implies a certain type of thinking about event processing, and that this creates semantic problems which they then try to solve with the same type of thinking, using more complicated constructs.


I'll use one simple example taken from the paper, in which the authors had to deal with semantic problems that were caused by the language semantics themselves.



The scenario (translated to my language - without the "streams") -- Events are reported about cars that move through some segment of the road; each event consists of

There are also simultaneous events, i.e., several events that happen in the same time unit (whatever the time granularity is). The inputs are events of this type; the output is: for each event, generate a derived event that includes the original attributes of the event and the average speed of the cars in the same time unit. If you want to see the types of problems that the SQL implementers see in this simple example, read the streamSQL paper. Instead of discussing SQL, I would like to show an alternative way to think about the same problem.

The slide below shows an alternative way to think about this problem - a very simple EPN (Event Processing Network) which has two functional agents, one producer (e.g. an event emitter that creates events from a video stream produced by a camera that watches the road) and one consumer (whoever wants to see the output events).




The two agents work under the same temporal context (it can be spatio-temporal if we also want to group by road segment) - in this case, a temporal context is opened at the beginning and closed at the end of every single time unit.

  • The raw event is called "car position event" and it goes to both agents.
  • The first agent is an aggregator which calculates the average (incrementally); since it is bound to the context, the average is over events from the same time unit. At the end of the time unit it produces a single event, "speed-average-event", with the structure

  • The second agent is a "pattern detector" which takes two input events - the "car position event" again, and the derived event "speed-average-event"; the pattern that needs to be identified is AND, and for that agent the "speed-average-event" has a consumption policy of "reuse" (which means that the event can be used for multiple pattern matches). For each AND match, the agent produces a derived "output-event" whose structure is:

This EPN does not involve "streams" - the thinking is "event oriented" and it attempts to provide natural thinking about event processing functionality.
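To make the event-oriented flow concrete, here is a minimal sketch of the EPN in Python. The event attribute names (`time_unit`, `speed`, `car`) are illustrative assumptions, not taken from the paper; the point is only to show the two agents bound to the same temporal context.

```python
# A minimal sketch of the EPN described above; attribute names are
# hypothetical illustrations, not from the streamSQL paper.
from collections import defaultdict

def process(car_position_events):
    """Each input event is a dict with at least 'time_unit' and 'speed'.
    Returns one derived "output-event" per input event, carrying the
    average speed of all cars in the same time unit."""
    # Temporal context: a context instance opens and closes per time unit.
    by_unit = defaultdict(list)
    for ev in car_position_events:
        by_unit[ev["time_unit"]].append(ev)

    outputs = []
    for unit, events in sorted(by_unit.items()):
        # Agent 1 (aggregator): compute the average speed incrementally,
        # emitting a single "speed-average-event" when the context closes.
        total = count = 0
        for ev in events:
            total += ev["speed"]
            count += 1
        speed_average_event = {"time_unit": unit, "avg_speed": total / count}

        # Agent 2 (pattern detector): AND pattern of each "car position
        # event" with the "speed-average-event"; the average event has a
        # "reuse" consumption policy, so it matches every car event.
        for ev in events:
            outputs.append({**ev, "avg_speed": speed_average_event["avg_speed"]})
    return outputs
```

For example, two events in the same time unit with speeds 50 and 70 would each yield an output event carrying `avg_speed` 60.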

Comments:

1. This is a rather simple example; it can also be solved by putting the average speed event into a global state (or event store/database) and then enriching it back - but the event-oriented way is closer to the spirit of the original example, which works on streams.

2. Aggregator and pattern detector are types of agents; there are some (not many) more types. Typically, an event processing network consists of multiple types of agents.

3. "Pattern Storm" claims that stream SQL ignores causality (he is using another scenario from the paper). One can view the relation between the input events and output events of the same agent as a causality relation, and this can be set while defining the EPN.

One general comment (not related to this posting) - to "anonymous": I'll gladly answer your question if you send it again and identify yourself. I don't publish anonymous comments.

I can post the solution to the rest of the examples in the stream SQL paper if anybody is interested...

6 comments:

Hans said...

I think that some of your post got cut out: "each event consists of", "whose structure is:". But this doesn't detract from the point.

I agree that it is worthwhile to try to reduce the differences between the semantics that we use in our reasoning and the semantics of the computer language. I think that this is really what David L is asking for when he talks about a language that is better than SQL.

And I also agree that your description here is semantically better than an SQL alternative. Although I am not sure what is the difference between your connectors and "streams". Even though the SQL approaches are coming to (or have arrived at) languages where one can express a query that produces pretty much the same algorithmic structure that you have described here, there is still the overhead of translating between the human semantics (a temporal context) and the SQL semantics (a bunch of operators that, by the nature of their windowing declarations, produce the result of a temporal context).

From experience, one test will be whether, now that you have made the easy stuff easy, the hard stuff will at least not be any more difficult. Or will you unintentionally introduce a different set of hacks and semantic twisting that is required to produce more tricky logic. I am sure you all have put plenty of thought into this, so I am optimistic.

It will also be interesting to see whether your new language will be translatable to SQL. This is not purely an academic question, since there is the issue of query plan optimization. Because the batch sizes may be relatively small in EP-SQL (compared to the number of results that come from a moderate query on a static table) I am not sure whether those optimizations will buy very much. However, if the optimization turns out to be worthwhile, then it may be interesting to translate your semantic language to SQL.

I think that, since you seem to want community feedback on your approach, offering comparable solutions to published problems is possibly the best way to get everyone to understand the thinking behind the approach.

Opher Etzion said...

Hi Hans. You are right, some of the text has been cut (probably by a careless movement of the mouse) - I'll correct it.

The difference between streams and what I call "event pipes" is that a stream is typically a "time series" and is processed in sets, while in pipes events arrive sporadically, one by one. Streams are actually a subset of pipes.

There is a semantic gap, but there is also difficulty in expressing some of the required semantics easily in SQL. I think that the benefit of semantic approaches is even greater in the difficult cases.

Regular SQL optimizations are not the same as the optimizations needed for event processing, and given this, they may not be a basis for implementation; a case where it may be worthwhile to translate to SQL is when doing retrospective processing on historical events that are persistent.
I'll write more on this at a later phase.

cheers,

Opher

Hans said...

I believe that many of the basic SQL optimizations apply to EP. Of course, not every EP scenario can use these optimizations, but quite a number of EP use cases require batches (as they are called by the paper). For example, your temporal context describes a batch. And when you combine many operations within the context of a batch, then in many cases, you get back to the old SQL scenarios of joining, nested loops, etc.

But anyway, that is beside the point. I'm personally looking forward to seeing more of your approach, thanks for taking the time to explain.

Opher Etzion said...

Hans.

The difference is that a "batch" arrives as a set, while "events" arrive one by one - even if the processing is required only at the end of the context, in many cases it can be optimized by doing the calculations incrementally. If everything is done in retrospect, then we are in good old SQL mode.
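The incremental point made here can be sketched briefly: the aggregator keeps only a running sum and count as each event arrives, instead of buffering a batch until the context closes. The class and method names below are illustrative assumptions, not part of any actual product.

```python
# A sketch of incremental aggregation: O(1) state and O(1) work per
# arriving event, with nothing buffered until the context closes.
# Names here are hypothetical illustrations.
class IncrementalAverage:
    def __init__(self):
        self.total = 0.0
        self.count = 0

    def on_event(self, speed):
        # Constant work per arriving event; no batch is materialized.
        self.total += speed
        self.count += 1

    def close_context(self):
        # Called when the temporal context ends: emit the value of the
        # single "speed-average-event" and reset for the next time unit.
        avg = self.total / self.count
        self.total, self.count = 0.0, 0
        return avg
```

Feeding events one by one and calling `close_context` at the end of each time unit yields the same averages a batch computation would, without holding the events.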

cheers,

Opher

Richard Tibbetts said...

Opher, I appreciate the response, and the attempt to dig into the issues of the paper. However, I think you are still missing the main point.

Your notation is fine. I think it is equivalent to various approaches to SQL notation. Really, notation doesn't matter for this topic. The important things are when processing occurs, and how composable the units of processing are.

In your example, you say "at the end of the time unit it produces a single event 'speed-average-event' with the structure". But how do you know what the end is?

Does the derived event come out just after the given time unit? If so, then how does it combine with other events in the time unit, since aren't they gone by the time it is emitted? And which time unit are they in? The same one that other events are in, or some interstitial time unit?

Does the derived event come out in the same time unit? Then how do you know you have all the input events? What if the output of a computation is fed back into the input? Doesn't it then become impossible to wait for all input?

Or does the derived event come out in the same time unit, but ... later? Does that still make for a consistent definition of time?

I won't claim to have the right answer to all these questions, but the important thing is that any real system, and any standard language, needs to have an answer. And, when it comes to standardization, to nail down all the edge cases. That's the work that Jennifer Widom started in 2004, and that we were continuing with this paper.

Opher Etzion said...

Hi Richard. The issue is indeed not the notation, it is the semantics - in this case the semantics of the temporal context, and uncertainty about completeness (e.g. have all relevant events arrived?), or how "late arrival" should be handled. These "edge issues" are indeed tough issues; while there are no perfect solutions, I'll provide some thoughts about them in this blog, but it will have to wait a bit.

cheers,

Opher