Monday, October 20, 2008

More on the semantics of synonyms

Still lazy days of the holiday week, I took advantage of the time to help my mother, who decided that she wants to leave home and move to seniors residence located 10 minutes walk from my house, this requires to deal with many details, so that is what I was doing in the last three days.... In the picture above (taken from the residence site) you can see the front entrance and the view seen from the residence tower, on the right hand side of the upper picture one can see part of the neighborhood we are living in (Ramat Begin) surrounded by pine trees all over.
Now, holiday eve again, and this is a good time to visit the Blog again. Last time I started the discussion in the semantics of synonyms by posing a simple example of conjunction over a bounded time interval (same pattern that Hans Glide referred to in his Blog), and slightly different from the "temporal sequence" pattern.
In the previous posting I have posed the following example:

Detect a pattern that consists of conjunction of two events (order is not important) - e1, e2.
e1 has two attributes = {N, A}; e2 has also two attributes = {N, B} ; the pattern matching is partitioned according to the value of N (on context partitions I'll write another time).

For each detection, create a derived event e3 which includes two attributes = {N, C}; E3 values are derived as: E3.N := E1.N ; E3. C = E1. A * E2. B.

Let's also assume that the relevant temporal context is time-stamps = [1, 5] - and the events of types E1 and E2 that arrived during this period are displayed in the table below:

The question is: how many instances of event E3 are going to be created, and what will be the values of their attributes?

Looking at this example, for N = 2, there is exactly one pair that matches the pattern
E1 that occurs in timestamp 5, and E2 that occurs in timestamp 4, so E3 will have the attributes {N = 2, C = 24}. However, for N = 1 things are more complicated. If we'll take the set oriented approach that looks at it as "join" (Cartesian product), since we have 3 instances of E1 and two instances of E2, we'll get 6 instances of E3 with all combinations. In some cases we may be interested in all combinations, but typically in event processing we are looking for match and not for join -- that is the difference between "event-at-a-time" type of patterns and "set-at-a-time" patterns that is being used by some of the stream processing semantics. So what is the correct answer ? -- there is no single correct answer, thus what is needed is to fine tune the semantics using policies. For those who are hard-coding event processing, or using imperative event processing languages, this entire issue seems a non-issue, since when they develop the code for a particular case they also build (implicitly) the semantics they require for a specific case, however policies are required when using higher level languages (descriptive, declarative, visual etc...), policies are needed to bridge between the fact that semantics is built-in inside higher level abstraction, and the need to fine-tune the semantics in several cases. In our case we can have several types of policies:

Policies based on order of events - example:

For E1 - select the first instance; for E2 - select the last instance.
For E1 - select the last instance; for E2 - select the last instance

Policies based on values - example:

For E1 - select the highest 2 instances for the value of A ; for E2 select the lowest instance for the value of B.

These are examples only -- it is also important to select a reasonable default which satisfies the "typical case", so if the semantics fits this default, no more action is needed.

These have been examples only, in one of the next postings I'll deal with identifying the set of policies required in order to make the semantics precise.


Hans said...

As you say, it's important to understand that this is not a join. Although several of the possible matching policies can be implemented using SQL (I have surely not tried them all), a few additional operators added to existing SQL languages would make it infinitely easier. I know that this is not the direction you are going, I'm just pointing it out.

In this post, you illustrate something that I like about this pattern: it is "tunable". By choosing a matching policy and other parameters, one could code in just the minimum knowledge of the underlying event generation, while maintaining as much flexibility as possible.

Opher Etzion said...

Hello Hans.

Personally, I don't think that SQL is the ideal starting point, however, this can certainly be implemented as an addition to SQL, somewhat differently than the additions done by vendors today. I'll write in one of the next posting on additional way to extend SQL.



Brian Connell said...

I'd view it pretty much the same way - depends on policies and the ability to tune the results. But I tend to think in terms of "event at a time" though, in which case I'd get 4 E3 events. (Numbering the events sequentially, event 5 matches 1,2,4, then event 7 with event 3, leaving event 6 still waiting)

Opher Etzion said...

Hi Brian.

Your interpretation matches the policy of:

ALL for E1
FIRST for E2.

This is a possible interpretation but not the only one. You could have ALL for both (and then get the cartesian product), there are also several other possiblities.



Hans said...

By the way, this is a moderately common data processing algorithm too. For example, for merging time series data or some other kinds of processing that SQL doesn't really support.

If someone is going to be processing events from their ESB or from some software components in the enterprise, I will not recommend any of these policies though. I would recommend that they go to every practical length to have the events match by some unique value (just a change to the partitioning of N). The point being that any of these policies that you describe here encode some knowledge ordering. A little code, network or bus configuration change could alter the order of the events and mess up the event processing. So I would recommend that, as far as is possible, people don't rely on ordering logic for enterprise data.

Opher Etzion said...

Hans. It is true that base on order of events arriving to the system ("detection-time semantics") may be non-deterministic. However, basing on the order that events happened in reality (according to the time-stamp they carry) assuming that we get it right -- may be used; also order is not the only predicate when setting policies, but it is the most common.