Thursday, June 16, 2011

IBM celebrates 100 years

Today, June 16, IBM celebrates its centennial. Reaching 100 years is an achievement for a person, and also for an organization. There is a famous book, Built to Last, that talks about companies that last for many years. IBM is on its list along with American Express, Boeing, Disney and some others. You can view the IBM centennial site.

On different roles of events in context

Continuing the series of posts on sliding windows,  following some questions and comments.

An event can have multiple roles: it can participate in determining the context boundaries (the start and end of a window), it can participate in an event processing function (e.g. aggregation or pattern matching), and it can also serve both roles.

Let's look at the following example (which I thought of while driving to work today - not a long thought, I live 3.7 km from the office).

Consider a sliding window of every 100 cars that pass through some  point in the road (assuming there is a sensor that creates for every passing car an event with the velocity of the car).

Assume also that there is a traffic light for pedestrians that is activated when a person presses a button, and that an event is created each time this traffic light turns green for the pedestrians.

Our application consists of two aggregation EPAs that derive the following events:

1. Create a derived event with the average velocity per window.
2. Create a derived event with the count of pedestrian crossings per window.

Now, recall that the window is a non-overlapping sliding window that counts 100 passing cars.

For derivation 1: the "car passing" event has three roles: an instance of this event initiates the window, an instance of this event terminates the window, and the aggregation consumes instances of the same event. The boundary decision determines whether the 1st and the 100th events are included in the aggregation function. Here intuition says that the answer is positive for both, so it makes sense to use the closed interval semantics.

Now to derivation 2. Here the event that participates in the derivation, the pedestrian crossing, is different from the events that determine the boundaries of the window.

Assume that the 1st instance of the "car passing" event happens at 1:00, the 100th instance at 3:45, and the 101st instance at 4:15 (not many cars are driving at that hour).

In this case, a pedestrian crossing that occurs at 3:42 is counted in this window, and a pedestrian crossing that occurs at 4:18 is counted in the next window. A pedestrian crossing that occurs at 4:01 is not counted in any window, since the semantics of this sliding window do not force the windows to cover all time points.

These semantics are OK if that is what we meant; if not, we have to use different semantics. More - later.
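
The two derivations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Window`, `car_count_windows`, `average_velocity` and `crossing_count` are hypothetical, times are minutes since midnight, and the car data is made up so that the 1st, 100th and 101st cars pass at 1:00, 3:45 and 4:15 as in the example.

```python
from dataclasses import dataclass


def hm(hours, minutes):
    """Convert a clock time to minutes since midnight."""
    return hours * 60 + minutes


@dataclass
class Window:
    start: float   # occurrence time of the first car in the window
    end: float     # occurrence time of the last car in the window
    speeds: list   # velocities reported by the cars in the window


def car_count_windows(car_events, size):
    """Group (time, velocity) car events into non-overlapping windows of `size` cars."""
    windows = []
    for i in range(0, len(car_events) - size + 1, size):
        chunk = car_events[i:i + size]
        windows.append(Window(chunk[0][0], chunk[-1][0], [v for _, v in chunk]))
    return windows


def average_velocity(window):
    # Derivation 1: closed semantics, so the 1st and 100th cars are both included.
    return sum(window.speeds) / len(window.speeds)


def crossing_count(window, crossing_times):
    # Derivation 2: a crossing counts only if it falls inside [start, end].
    return sum(1 for t in crossing_times if window.start <= t <= window.end)


# Made-up data: 100 cars between 1:00 and 3:45, then 100 more from 4:15 onward.
first = [(hm(1, 0) + 165 * i / 99, 50.0) for i in range(100)]
second = [(hm(4, 15) + i, 60.0) for i in range(100)]
windows = car_count_windows(first + second, size=100)

crossings = [hm(3, 42), hm(4, 1), hm(4, 18)]
counts = [crossing_count(w, crossings) for w in windows]
print(counts)  # the 4:01 crossing falls in the gap and is counted nowhere
```

Each window gets one crossing, and the crossing at 4:01 falls in the gap between the 100th and 101st cars.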

Tuesday, June 14, 2011

More on the two types of sliding windows

In a comment on my previous postings on window boundaries, I was asked why we need two types of interval semantics: half-closed for the time-oriented sliding window, and closed for the event-oriented sliding window (one of them counts time, the other counts the number of events of certain types).

The question is: why can't we use the half-closed interval semantics for the event-oriented sliding window as well? Let's say we have a sliding window that counts 5 events of one type; the 6th event, which serves as the starting point of the next window, would also terminate the previous window.

The answer:   this solution is not really equivalent to the one with closed on 5 events.   

Let's take an example:

Instance 1 occurs at 10:02
Instance 2 occurs at 10:03
Instance 3 occurs at 10:13
Instance 4 occurs at 10:14
Instance 5 occurs at 10:17
Instance 6 occurs at 11:01

According to the closed interval semantics, the interval is [10:02, 10:17]; according to the half-closed interval semantics ending on the 6th instance, the interval is [10:02, 11:01). This means that an event that occurs at 10:35 belongs to the window according to the second interpretation, and does not belong to the window according to the first interpretation.

Furthermore, if some derived events are emitted or an action is triggered at the end of the window, this will now occur at 11:01 and not at 10:17, which again may create other problems.
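
To make the difference concrete, here is a minimal Python sketch of the two interpretations applied to the six instances above. The function names are hypothetical and times are converted to minutes since midnight; it is meant as an illustration, not as how any particular product implements its windows.

```python
def minutes(hhmm):
    """Convert 'HH:MM' to minutes since midnight."""
    h, m = hhmm.split(":")
    return int(h) * 60 + int(m)


instances = [minutes(t) for t in
             ["10:02", "10:03", "10:13", "10:14", "10:17", "11:01"]]


def closed_window(times, count):
    """Closed semantics: [first, count-th]. Closes on its own last event."""
    if len(times) < count:
        return None  # not enough events yet
    return times[0], times[count - 1]


def half_closed_window(times, count):
    """Half-closed semantics: [first, (count+1)-th). Needs the next event to close."""
    if len(times) <= count:
        return None  # the terminating event has not arrived yet
    return times[0], times[count]


def in_closed(t, window):
    return window[0] <= t <= window[1]


def in_half_closed(t, window):
    return window[0] <= t < window[1]


closed = closed_window(instances, 5)
half_closed = half_closed_window(instances, 5)
probe = minutes("10:35")

print(in_closed(probe, closed))            # False: 10:35 is after 10:17
print(in_half_closed(probe, half_closed))  # True: 10:35 is before 11:01
print(half_closed_window(instances[:5], 5))  # None: no 6th instance, no close
```

The last call shows the other side effect: with only five instances available, the half-closed window cannot be closed at all, while the closed one ends at 10:17.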

In some applications the distance between events is very small, since the assumption is that events of the types that bound the windows are very dense, and the distinction between the two becomes marginal. However, this is not the general case: in many applications the distance between the 5th and 6th instances may be quite substantial.

This reminds me that in the course that I have taught, the students implemented projects using various products available on the market today. One of the teams (I will not disclose the product name) wrote in its report that the window is indeed closed only when the next event arrives, so when they debugged their system they added a dummy event; otherwise the window would never close.

More window related discussion - later

Monday, June 13, 2011

On the boundaries of windows

Today I heard rain knocking on the window. Rain in June is quite a rare event in Haifa, but it has happened before. When I lived in the USA it always seemed peculiar to me to watch heavy rains and thunderstorms in the summer.

Looking at windows: a window in event processing is an important concept, designating temporal context. We have discussed this issue at length in chapter 7 of the EPIA book. Windows can be isolated or sliding. Isolated windows can start by event or at a fixed time, end by event or at a fixed time, or expire after some time offset. Sliding windows can slide by time or by event count, and can be overlapping or non-overlapping.

In any of these variations, every single window is a time interval. The question is what type of interval it is: open or closed at each end.

In the book we mention two types of intervals:

For most types of window, we used the half-open interval: if the interval boundaries are denoted by Ts and Te (for start and end), then an event whose time-stamp is T belongs to the window if Ts ≤ T < Te. This says that events that occur at the interval's starting point are included, and those at the interval's ending point are not. This guarantees, for example, that in a non-overlapping sliding time window, an event belongs to exactly one window instance.
For the sliding event window we used the closed interval semantics, Ts ≤ T ≤ Te. The rationale is that if the sliding window has a count of five events, we typically mean that all five of those events belong to this window.
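
The "exactly one window instance" guarantee is easy to check with a small sketch. The helper `windows_containing` and the boundary values below are assumptions made for illustration; consecutive boundaries define adjacent, non-overlapping time windows.

```python
def windows_containing(t, boundaries, closed_end):
    """Return the indexes of the adjacent windows that contain time-stamp t.

    Window i spans boundaries[i]..boundaries[i + 1]. The start is always
    included; the end is included only when closed_end is True.
    """
    hits = []
    for i in range(len(boundaries) - 1):
        ts, te = boundaries[i], boundaries[i + 1]
        upper_ok = t <= te if closed_end else t < te
        if ts <= t and upper_ok:
            hits.append(i)
    return hits


boundaries = [0, 10, 20, 30]  # three adjacent windows of 10 time units

# An event exactly on a shared boundary:
print(windows_containing(10, boundaries, closed_end=False))  # [1]
print(windows_containing(10, boundaries, closed_end=True))   # [0, 1]
```

With Ts ≤ T < Te the boundary event lands in one window only; with closed ends on a time window it would be counted twice.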

Some comments here: I have heard the opinion that this is not an issue, since there are systems that create a total order of events, serializing them by having a single process that assigns time-stamps. This can be valid for some applications, but is not valid in the general case, for two reasons:

  1. The event that starts or ends the window can itself participate in some EPAs that are active in this window, thus a decision is needed whether it participates or not.
  2. In various applications the applicable time-stamp is the occurrence time of the event as reported by its source, and not the detection time assigned by the system, thus several events can occur at the same time-point. Furthermore, even with an assigned detection time, in distributed systems there may be multiple entry points, and ensuring total order may not be cost effective.

While in the book we mentioned two of the four possibilities for time intervals, one can think of cases in which the other two may be useful, which indicates that such semantics might need to be configurable.
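
As a rough idea of what "configurable" could look like, the sketch below treats the interval type as a parameter of the membership test. The `Bounds` enum and the `contains` function are hypothetical names chosen for this illustration; they are not taken from the book or from any product.

```python
from enum import Enum


class Bounds(Enum):
    """The four possible interval types; the value shows the bracket notation."""
    CLOSED = "[]"           # Ts <= T <= Te
    HALF_OPEN_RIGHT = "[)"  # Ts <= T <  Te
    HALF_OPEN_LEFT = "(]"   # Ts <  T <= Te
    OPEN = "()"             # Ts <  T <  Te


def contains(ts, te, t, bounds):
    """Decide whether time-stamp t belongs to the window (ts, te) under `bounds`."""
    lower_ok = ts <= t if bounds.value[0] == "[" else ts < t
    upper_ok = t <= te if bounds.value[1] == "]" else t < te
    return lower_ok and upper_ok


# An event on the start boundary and one on the end boundary of a window 0..10:
for bounds in Bounds:
    print(bounds.name, contains(0, 10, 0, bounds), contains(0, 10, 10, bounds))
```

Only the events that sit exactly on Ts or Te are affected by the choice, which is why the decision matters most when the boundary events themselves take part in the processing.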