
Thursday, July 1, 2010

On VLDB 2010 -- events and streams related papers

After missing VLDB for a few years, I plan to attend this year's conference in Singapore (also an opportunity to visit Singapore, where I have never been). I have had a tutorial accepted, entitled "Event Processing -- past, present, future". VLDB is one of the major research conferences of the database community (my original home community).

The list of accepted papers is now on the website. Looking at it, there are several papers whose titles include either events or streams:

Complex Event Detection at Wire Speed with FPGAs
High-Performance Dynamic Pattern Matching over Disordered Streams
Achieving High Output Quality under Limited Resources through Structure-based Spilling in XML Streams
SECRET: A Model for Analysis of the Execution Semantics of Stream Processing Systems
Recognizing Patterns in Streams with Imprecise Timestamps
On Dense Pattern Mining in Graph Streams
Database-support for Continuous Prediction Queries over Streaming Data
Conditioning and Aggregating Uncertain Data Streams: Going Beyond Expectations
From a Stream of Relational Queries to Distributed Stream Processing
iFlow: An Approach for Fast and Reliable Internet-Scale Stream Processing Utilizing Detouring and Replication

And some demos:
Active Complex Event Processing: Applications in RealTime Health Care
Efficient Event Processing through Reconfigurable Hardware for Algorithmic Trading
Geospatial Stream Query Processing using Microsoft SQL Server StreamInsight


As you can see, in the database community the term "streams" is more common than the term "events"; I'll return to the discussion of streams vs. events soon.

IBM will have a substantial presence in VLDB with 9 research papers, 5 industrial papers, 3 demos and 1 tutorial.

More - later.



Monday, November 9, 2009

On the Stream Data Processing book by Chakravarthy and Jiang


Another related book that arrived yesterday is entitled "Stream Data Processing: A Quality of Service Perspective -- modeling, scheduling, load shedding and complex event processing".

First, let's start with a lesson in economics. Looking at an Amazon query for "event processing books", one can see that the Amazon price for the book by Chandy and Schulte that I described yesterday is $32.97, the new EDA book by Taylor et al. costs $37.30, and the book I am discussing today has an Amazon price of $112.45 -- roughly the price of four books. So the economic question is: what makes it so expensive? My guess is that books of the type of the two former ones (and probably our upcoming book is in the same category) rely on people buying them out of their own pockets, while academic books, especially those in a Springer series (this one is part of the series "Advances in Database Systems"), have a captive audience of university libraries. I wonder how many people are willing to pay this price out of their own pocket for that book.

Now, from the business side to the book itself. Sharma is an old colleague from my active database days. The book takes a database approach: it starts by explaining why data streams are a paradigm shift relative to traditional databases, then explains the notion of data streams and introduces QoS metrics, moves on to data stream challenges, and presents CEP as a complementary technology whose support within a data stream management system is posed as a challenge. This is followed by a literature review, including a survey of commercial and open source stream and CEP systems, which seems to me to have both false positives and false negatives. Then comes the more academically oriented discussion of modeling continuous queries, with theorems and Greek letters, and next a discussion of engineering-oriented aspects of DSMS such as scheduling and load shedding.

After discussing all this, the authors move on to the integration of stream and complex event processing. They start with the differences, stating that it will be difficult to combine incompatible execution models; nevertheless, the authors are not afraid of difficulties, and a page later describe an integrated, layered architecture: stream processing is done first, followed by a phase of event generation as the second layer, event processing as the third layer, and rule processing as the fourth layer. I think that strict hierarchical architectures are somewhat simplistic for realistic scenarios (I'll need to write something about this at a later point). The authors then dedicate two chapters to describing their prototypes, and the book concludes with conclusions and future directions, though these seem to be extensions of the issues already discussed.
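To make the layered idea concrete, here is a minimal sketch of such a four-layer pipeline. All names and thresholds are my own illustration, not taken from the book; the point is only the strict layering, where each layer consumes the previous layer's output.

```python
# Hypothetical four-layer pipeline: stream processing -> event generation
# -> event processing -> rule processing. Names are illustrative only.

def stream_layer(readings):
    """Layer 1: stream processing -- a sliding-window max over raw values."""
    window = []
    for value in readings:
        window.append(value)
        if len(window) > 3:
            window.pop(0)          # keep the last 3 readings
        yield max(window)

def event_generation_layer(aggregates, threshold):
    """Layer 2: turn aggregates that cross a threshold into events."""
    for agg in aggregates:
        if agg > threshold:
            yield {"type": "HighReading", "value": agg}

def event_processing_layer(events):
    """Layer 3: detect a composite situation (two or more HighReading events)."""
    count = 0
    for ev in events:
        count += 1
        if count >= 2:
            yield {"type": "SustainedHigh", "value": ev["value"]}

def rule_layer(situations):
    """Layer 4: rule processing -- react to each detected situation."""
    return [f"ALERT: {s['type']} at {s['value']}" for s in situations]

readings = [1, 2, 9, 8, 1, 1]
alerts = rule_layer(event_processing_layer(
    event_generation_layer(stream_layer(readings), threshold=5)))
```

The rigidity the sketch exposes is exactly the concern above: information flows one way only, so a rule in layer 4 cannot, for instance, feed back and retune the window in layer 1.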

Bottom line: it reads like an academic journal paper scaled up (324 pages, including a long list of references (not lexicographically sorted) and an index). It may interest those who want to study the formal aspects of stream processing.

I also received in the package two books about causality models, but I need to read them before commenting on them.

Saturday, December 20, 2008

Some footnotes to Luckham's "short history of CEP- part 3"


David Luckham, who has taken it upon himself to survey the history of "complex event processing", has written a series of articles about the roots and developments of the CEP area. While CEP is relatively young as a discipline, it has roots in other disciplines, pursued by people who worked separately, from different perspectives, origin disciplines, and types of applications.

I recommend reading it. I'll make just a few small comments on the recent third article:

(1) One of the assertions in the article states: "There were research groups within these companies engaged on similar projects to those in universities. Some of the groups ended up producing prototype CEP products. In these cases, the decisions as to whether to let these prototypes go forward to productization were haphazard, probably because they were made by a management that didn't understand the technology or its potential."

Well, one must realize that corporate executives have an infinite amount of wisdom, otherwise they would not be corporate executives; if a decision looks haphazard, it is only due to the mental limitations of us simple mortals.

(2) Another assertion is: "But the largest number came out of an active database background. This is the reason why many of the commercial languages for event processing are extensions of SQL."

More accurately, some of the current products are indeed descendants of "active databases", which were based on ECA (event-condition-action) rules; among the products that apply this approach are RuleCore and Aptsoft (now WebSphere Business Events). The products which extended SQL came from a different paradigm in the database community, "data stream management", in which the queries stand still and the data flows, as opposed to databases, in which the data stands still and queries flow. This was a later development that started with Jennifer Widom's STREAM project. The two paradigms, though both from the database community, should not be mixed up (although certain individuals have been involved in both).

Sunday, September 14, 2008

On sporadic events


I have never been a student at Stamford High School, but Stamford, CT is my home away from home for the next seven days. Starting tomorrow, I'll provide some impressions from the Gartner meeting and the EPTS symposium, though I rely on other people in blog-land to provide better coverage (e.g. Paul Vincent, with endnotes and references). I arrived earlier today and am resting before the busy week.


One of the thoughts that came to mind while looking at some of the discussions around the Stream-SQL standards is one more observation.


Beyond the claim that the difference between a "stream" and a "cloud" is that a stream is totally ordered while a cloud is only partially ordered, I think there are some more distinctions. I'll discuss one of them: sporadic events vs. known events. When dealing with "time series" input, the timing of events is known: for each time unit (whatever it is) there is an event (or set of events) that is reported. This is true when the events are stock quotes provided periodically, or signals from sensors provided periodically. Other events do not naturally organize themselves in a time-series fashion, for example: bids in an auction, complaints from customers, an irregular deposit, a coffee machine failure, etc.

From the point of view of functionality, there is not much difference: one can create a time series that reports a possibly empty set of events for each time unit, but if in most time units the reported set is empty, this is not a very efficient way to handle it. On the other hand, a system that does not support time series as a primitive can view all events as sporadic, but there may be some optimization merit in knowing that events are indeed expected at every time unit. This is just one dimension, but it reinforces my conclusion that there are various ways to provide event processing functionality, and the most efficient way is probably a hybrid of approaches based on the semantics of the functions and the characteristics of the input. So this is the observation of today; it will be interesting to see what the main discussion topics will be in the coming conferences.
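The equivalence, and the inefficiency, can be seen in a tiny sketch of the two representations (my own illustration, with made-up event types): the dense form allocates a slot for every time unit, the sparse form keeps only the sporadic events.

```python
# Dense time-series representation vs. sparse sporadic-event representation.

def as_time_series(events, horizon):
    """Dense: one (possibly empty) slot per time unit."""
    series = [[] for _ in range(horizon)]
    for t, payload in events:
        series[t].append(payload)
    return series

def as_sporadic(series):
    """Sparse: keep only (time, payload) pairs for non-empty slots."""
    return [(t, p) for t, slot in enumerate(series) for p in slot]

events = [(2, "bid"), (7, "complaint")]
dense = as_time_series(events, horizon=10)   # 10 slots, 8 of them empty
sparse = as_sporadic(dense)                  # back to just the 2 events
```

With two events over ten time units, eight of the dense slots are empty -- which is exactly the inefficiency argued above; conversely, an engine that only sees the sparse form cannot exploit the guarantee that periodic inputs arrive at every tick.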
More later - from Stamford Hilton.