Saturday, August 30, 2008

On the streaming SQL evolving standard

Kudos to our colleagues from Oracle and Streambase for their presentation in the industrial section of VLDB 2008 -
Towards a Streaming SQL Standard
Stan Zdonik (Streambase, Inc.), Namit Jain (Oracle), Shailendra Mishra (Oracle), Anand Srinivasan (Oracle), Johannes Gehrke (Cornell University, USA), Jennifer Widom (Stanford University), Hari Balakrishnan (Streambase, Inc.), Mitch Cherniack (Streambase, Inc.), Ugur Cetintemel (Streambase, Inc.), Richard Tibbetts (Streambase, Inc.).
Unlike last year, I did not participate in VLDB this year, though I would love to visit New Zealand when an opportunity arises. VLDB is certainly a respectable conference, and the list of authors includes some respectable members of the database research community. Mark Palmer also blogs about it, under the title: towards a CEP standard.
A few comments about it:
  • I think that this work is important; currently there are multiple variations of SQL extensions for various event processing purposes, and it will be easier if they are consolidated.
  • There is a mention of "event based" vs. "set based" views. Looking at the patterns being detected, some patterns are indeed best approached in the "event based" view, meaning that when each individual event arrives, there is an evaluation of whether a pattern has been completed; the "set based" view is more convenient when the pattern involves set operations -- for example, checking whether the average value of some attribute, over all events that belong to a certain context, exceeds some threshold. An example of an "event based" pattern is looking for a sequence of two events (customer-complained, delivery-arrived); an example of a "set based" pattern is: the average of all delivery-actual-times in a certain shift is more than 30 minutes, where delivery-summary is a derived event, derived from order-made and delivery-arrived.
  • Retrospective patterns - i.e., patterns over historical events - are "set oriented" by nature, but as shown, there are cases in which set-oriented thinking is also applicable to running events (this, of course, can be emulated by an "event based" pattern).
  • SQL extensions, of course, cover only part of the languages that exist in the event processing universe, and those who don't believe in the SQL region will probably not become believers if a streaming SQL standard is approved; I have written in the past about the Tower of Babel and have not changed my opinion since then -- I view SQL (with all of its extensions) as a natural way to express queries about "states", but not about a "collection of transitions", and I think that there is a more natural way to think about the latter. The EPDL work we are doing is a step in that direction; however, the idea is to use it (at least initially) as a meta-language, where streaming SQL may be one of its major implementations - I'll provide more information about the EPDL project later this year.
  • Another comment: while the language standard is certainly the most challenging, there are also other standards that need to be discussed, in areas such as interoperability, event formats, modeling and more. In the EPTS symposium coming next month, we'll dedicate some of the time to standards, starting with a keynote address by a standards expert about the impact of standards on industries, followed by a panel with various participants to discuss these issues.
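The "event based" vs. "set based" distinction from the bullets above can be sketched in code. This is a minimal illustration, not any product's semantics; the event type names (customer-complained, delivery-arrived) come from the examples above, while the function names and event representation are my own assumptions.

```python
# Illustrative sketch: the same data viewed two ways.
# Events are plain dicts; this is not any engine's real API.

def detect_sequence(events, first="customer-complained", second="delivery-arrived"):
    """Event-based view: as each individual event arrives, evaluate
    whether the sequence pattern (first, then second) has just completed."""
    seen_first = False
    matches = []
    for e in events:
        if e["type"] == first:
            seen_first = True
        elif e["type"] == second and seen_first:
            matches.append(e)
            seen_first = False  # restart detection after a completed match
    return matches

def shift_average_exceeds(delivery_minutes, threshold=30):
    """Set-based view: an aggregate over all events in a context
    (here, a shift) compared against a threshold."""
    if not delivery_minutes:
        return False
    return sum(delivery_minutes) / len(delivery_minutes) > threshold
```

The first function must be re-evaluated on every arriving event; the second only makes sense over the whole set belonging to the context.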

Friday, August 29, 2008

On research and practice in event processing

Triggered by a question from Hans Glide on a previous posting, today's topic is the relationship between research and practice in event processing. I'll not go into the ancient history of the event processing area and its ancestors, such as simulation, active databases etc., but start from the mid-to-late 1990s, when the idea of creating generic languages, tools and engines for event processing emerged. This area emerged in the research community. David Luckham and his team at Stanford did the Rapide project; Mani Chandy and his team at Caltech did the Infospheres project; John Bates was a faculty member at Cambridge University, and Apama was a continuation of his academic work; my own contribution was in establishing the AMIT project in IBM Haifa Research Lab, which is also part of the research community (kind of..). On the "stream processing" front there have been various academic projects - the Stream project at Stanford, the Aurora project at Brown/MIT/Brandeis - these are just samples, and there were more. However, the interesting observation is that the research projects were there before the commercial implementations; furthermore, many of the commercial implementations were descendants of academic projects: iSpheres was a descendant of Infospheres, Apama was a descendant of John Bates' work, Streambase was a descendant of Aurora, Coral8 was a descendant of the Stanford Stream project, and probably there are more. However, when commercial products are introduced, the world changes, and there is a danger of a disconnect between the research community and the commercial world, since products have lives of their own and are being developed in various directions, while people in the research community often continue, by inertia, to work on topics that may not be consistent with the problems that vendors and customers see.
While wild research is essential to breakthroughs, reality provides a lot of research topics that have not been anticipated in the lab, and there is a need for synchronization in order to obtain relevant research.

The Dagstuhl seminar in May 2007, where people from academia and industry met for five days and discussed this issue, was one step; my friend Rainer von Ammon organizes periodic meetings on these issues, and a European project may spin off from these meetings. We shall discuss this topic in the EPTS symposium; we have more than 20 members who are part of the research community, and many of them will also participate in the meeting.

Bottom line: the life cycle is --

1. Ideas start in the research community.

2. At some point the commercial world catches up.

3. Parallel directions - research continues, and commercial products evolve in their own way.

4. Synchronization, exchange of knowledge, and ideas flowing in both directions -- this needs guidance.

More - later.

Thursday, August 28, 2008

On the "Event Processing Thinking" Blog - after the first year

One of the ways to obtain events is through "calendar events"; this is useful for time-out management, periodic triggering etc. Today I saw in my calendar a reminder: this is the one year anniversary of the "event processing thinking" Blog - you should write something about it. Actually, yesterday I got a note from one of the analyst firms that researches the impact of Web 2.0 on companies, and was asked to participate in this study in my Blogger hat... This is not the first time that people have approached me for various purposes based on reading my Blog, and actually I can say that I under-estimated the power of Blogs and the amount of visibility they get. This is probably the most visible communication vehicle that exists today (how many people read papers?)

Looking at the Blogland I also realized that the visibility can be a double-edged sword, since people can easily expose their own ignorance, so I am trying to write only on stuff that I think
I know something about...

One thing that is interesting is the statistics (who reads the Blog) - it seems that the previous time I wrote about statistics was one of the most read postings (see below).

Looking at the Google Analytics statistics, it seems that since the start of measurement (I installed Google Analytics 2 weeks after the Blog started) more than 10,000 distinct persons (10,139 to be exact) have read this Blog. I don't have any illusion that there are 10,000 people who are interested in event processing - some got here due to the wonders of the almighty Google (e.g. looked for a picture of a unicorn) - so a better metric is that 1/3 of the readers returned more than once, and 1,432 readers returned more than 50 times, which is a more reasonable estimate of the number of people interested in the content. It seems that the number of people who read all, or at least 2/3, of the Blog postings is around 800, and this seems to be the size of the effective readership.

What else can I learn from the statistics? The most popular postings are:

(1). Agnon, the dog, playing and downplaying is still, by far, the most popular one; this is one of the postings in which I claim that "event processing" is a discipline that stands on its own feet, and is not a footnote to database technology or business rule technology.

(2). Revisiting the Blog **2 again which, like this posting, talks about statistics around this Blog; I wonder why this posting is so popular (or maybe people wanted to look at the map of Arkansas to plan their next holiday).

(3). On infant, professor and unicorn - despite the fact that this posting is much younger, it has had a lot of traction, partly because people are looking for pictures of unicorns, and partly because disputes always bring more ratings... However, ratings are not everything, and when I think that I've said all that I need to say about a particular topic, I move on.

As far as the geographical distribution of readers: there have been readers from 124 countries.
In terms of the number of entries - the big ones are:
(1). USA, (2). UK, (3). Israel, (4). Japan, (5). Germany, (6). Canada, (7). France and (8). India. As far as the number of individual readers - the big ones are:
(1). USA, (2). UK, (3). Germany, (4). India, (5). Australia, (6). Israel, (7). France and (8). Holland. So it seems that in Japan I have a relatively small (less than 100) but loyal set of readers - I am still looking for some opportunity to travel to Japan, as I have never been there (actually I have never been to India either).
In the USA there are now readers from all 50 states (+ DC), and the leading ones are: California, Massachusetts and New York. Putting up the Arkansas map helped - Arkansas is now in 16th place in the USA in visits.

The three big cities in terms of visits are still: (1). London, (2). New York City, (3). Bangalore.

I'll not survey the negative and positive reviews about this Blog - I'll let every reader judge; that is the essence of the entire Web 2.0 business! Well, that's all for today; I will return soon with a more professional posting.

Wednesday, August 27, 2008

On event processing as a discipline and some subsets

Every parent knows that discipline is very important from a very early age (e.g. making the child realize that "sleeping time" is not negotiable); the term "academic discipline" is also derived from a body of knowledge related to a certain issue that a student should learn, and indeed one of the interpretations of this word in the Webster dictionary is a field of study. At the DEBS 2008 conference, a discussion started about whether "event processing" is a discipline in its own right, or a subset of another discipline. My answer is that "event processing" is taking its first steps as a discipline, but has some way to go in order to become a full-fledged discipline. One of the interesting questions is: since event processing applications in one form or another have existed in the past, why did it not develop into a discipline until now - what has changed recently? My own answer is that earlier works did ad-hoc event processing for special purposes, while recently there is an attempt to handle "event processing" as an area systematically. To give a couple of examples:

(1). Processing of time series (see example) has existed for a long time, and was also used as inspiration by those in the database community who looked at data stream management. Time series processing assumes that events arrive at fixed intervals, and typically the processing consists of statistical operations - like aggregation, exponential smoothing, regression, trend analysis and other stuff. The people who dealt with this area were not interested in the more general picture of event processing (e.g. event processing when events arrive sporadically and not at fixed intervals, processing one event at a time and not a set, etc..).
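To make the flavor of such statistical operations concrete, here is a minimal sketch of simple exponential smoothing over a fixed-interval series; the function name and the default smoothing factor are my own choices for illustration.

```python
def exponential_smoothing(series, alpha=0.5):
    """Simple exponential smoothing over a fixed-interval time series:
    s[0] = x[0];  s[t] = alpha * x[t] + (1 - alpha) * s[t-1].
    alpha in (0, 1] weights recent observations more heavily."""
    if not series:
        return []
    smoothed = [series[0]]
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed
```

Note the fixed-interval assumption baked in: every list position is implicitly one time step, which is exactly what does not hold for sporadically arriving events.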

(2). Event correlation in network and system management - this has been around for the last 15-20 years (see the Computerworld article). Here again, there is a very specific sub-case, aimed at coping with an "event storm": a network or system administrator faces a lot of events which are symptoms of problems (e.g. a time-out of a device can be a symptom of the fact that a router is offline), and "correlates" the symptoms to their "root cause", the problem. There is a notion of looking at patterns - but typically very limited patterns (e.g. conjunction of events over a time interval). While this has been core to system and network management, people who worked in this area never investigated event processing in the larger sense (e.g. looking at additional patterns), and this area has also not spawned the event processing discipline. In a previous posting about the term event correlation I discussed the fact that while in network and system management this term is well-defined, people sometimes use it in different contexts in an ambiguous way, and thus it is one of the confusing terms (a common misconception is that event correlation is an alias for CEP). I prefer to leave it with its network and system management meaning, and use more precise terms for other interpretations.
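The "conjunction of events over a time interval" pattern mentioned above can be sketched as follows. This is a toy illustration of the idea, not any management product's algorithm; the symptom names, rule format, and window size are hypothetical.

```python
# Illustrative sketch of network-management-style event correlation:
# a root cause is reported when all symptoms in its rule are observed
# within a time window (conjunction over a time interval).

def correlate(events, rules, window=60):
    """events: list of (timestamp, symptom_type) tuples.
    rules: {root_cause: set of required symptom types}.
    Returns the root causes whose full symptom set occurred
    within `window` time units of each other."""
    causes = []
    for cause, required in rules.items():
        last_seen = {}  # latest timestamp per required symptom
        for ts, sym in sorted(events):
            if sym in required:
                last_seen[sym] = ts
                # conjunction complete and all symptoms inside the window?
                if set(last_seen) == required and \
                        ts - min(last_seen.values()) <= window:
                    causes.append(cause)
                    break
    return causes
```

The point of the sketch is how narrow this pattern is: only conjunction plus a window, with the "event storm" reduced to one root-cause report per rule.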

There are more subsets of event processing; these have been just examples. The difference in recent years is that several works, both in academia and industry, have started to look at the bigger picture of "event processing": what is it, what are all the possible utilizations, what are the required functions, what are the non-functional requirements, and what techniques from other disciplines (AI, databases, control theory, distributed computing, simulation...) can help solve this issue. This has not been done before; as said, there is a long way to go on all these topics, but this has been true for each discipline in its beginning, and I see positive momentum.

Last but not least -- I was recently asked whether the discipline is called "event processing" or "complex event processing", and whether they are aliases. My own preference is to use "event processing"; discipline names typically consist of two words (information retrieval, data management, artificial intelligence, image processing, computer vision...). As noted before, I accept the glossary definition of "Complex Event Processing", according to which it denotes a subset of the larger event processing picture. More later -- this issue will be discussed in the EPTS 4th event processing symposium, and it will be interesting to hear the various opinions on that topic.

Sunday, August 24, 2008

On Event Stores and Temporal Databases

I am an old-fashioned guy who carries handkerchiefs, like this one, wherever he goes; it is handy for multiple uses. Anyway, while in the past all department stores in Israel carried handkerchiefs and it was quite a popular product, for some reason it went out of fashion, and I have a hard time renewing my inventory of handkerchiefs; in this sense, I wish I could step into the past for a minute, buy two dozen handkerchiefs, and return. In the past, I was involved in work around temporal databases and even co-edited a book in this area. Temporal databases had two major goals:
(1). Keep historical data, and enable easy retrieval of this data
(2). Enable issuing queries "as of" any point in time, i.e. issuing a query that takes into account the information that was available at a certain point in time (not as seen from "now") - again, returning to the past.
One may wonder why I am writing about temporal databases today. Well, the issue of temporal databases comes back when thinking about "event stores". I know that some of my database colleagues don't like the terms "event store" or "event repository", since they do not explicitly include the word "database", but for me, using a DBMS is just one possible implementation, while others, such as a grid cache, are also possible - but this is a topic for another discussion.
Anyway - why do we need an "event store"? In some cases we need to maintain historical events and use them, in some cases even apply pattern detection to past events. For auditing purposes we may also want to issue "as of" queries. Note that temporal representation of events can be done according to multiple temporal dimensions (see the discussion about temporal dimensions of events). One of the characteristics of temporal databases is that they are "append only" databases, meaning: database records can be added, but not modified or deleted; modifications and deletions are logical operations that create new instances, keeping the old ones. This is linked to one of the properties of events - immutability - which is actually a controversial property that still needs discussion about the conditions in which it is needed. Temporal databases seem to be a proper way to represent historical events.
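The append-only idea and the "as of" query can be sketched together. This is a minimal illustration in the spirit of temporal databases, not a real event store design; the class name, the single transaction-time dimension, and the key/value payload are simplifying assumptions.

```python
# Illustrative sketch: an append-only log with "as of" queries.
# Records are never updated in place; a logical update appends a
# new version, keeping the old one (the immutability property).

class EventStore:
    def __init__(self):
        self._log = []  # (transaction_time, key, value); append-only

    def append(self, tx_time, key, value):
        self._log.append((tx_time, key, value))

    def as_of(self, tx_time):
        """Return the state as it was known at tx_time: for each key,
        the latest value recorded at or before that time."""
        state = {}
        for t, key, value in sorted(self._log):
            if t <= tx_time:
                state[key] = value
        return state
```

Because nothing is ever overwritten, every past state remains reconstructible, which is exactly what auditing-style "as of" queries need.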
Some concluding comments:
(1). Current DBMSs do not support temporal databases as a primitive, although temporal databases have been built as a second layer above them.
(2). Not all events need to be persisted for historical processing; this is a property of the event type and its retention policies. Different events need to be persisted for different purposes.
(3). The issue of what language should be used to process "event stores" is also a matter of opinion. Some believe that SQL is the answer (however, for some patterns it is an awkward language), and there is an attempt to extend the SQL language with pattern extensions; here I will quote a wise person, Paul Vincent, who wrote in a footnote to this posting: This will be especially good news for those who like their SQL statements to run to multiple pages… Another option is to use the pattern language that is used for on-line patterns, and translate it to SQL (or one of its variations).
There are several issues that still need deeper discussion - but enough for today.