
Wednesday, July 18, 2012

DEBS 2012 - highlights of the first conference day

I'll start from the last session - the poster and demo session.  During this session I wore the glasses shown in the picture; it was a first glance towards the world of glasses that Google is taking us to.   It has potential, but getting used to it is not intuitive: one can see a menu, and then has to put a paper in place to select among the menu options.   The demo was of an EU project that does human attention detection and determines whether a person is interested in museum artifacts.  There were several other demos and posters, many of them from EU projects.  Other interesting ones included one that provides safety alerts for monitoring underground trains, and another that uses events to support collaboration among a team of software developers working together.

Getting back to the beginning -- the day started with the interesting keynote by Sethu Raman (the industrial keynote), which I already wrote about yesterday.    The next session was an industry session.  It is interesting to note that around a third of the participants here are from industry rather than academia,   and the industry track has become an integral part of the program (not all academic people like it, though!).

In the afternoon there were two scientific paper sessions -- one on pub/sub, the original topic of DEBS, now reduced to a single track per conference,  and the other on "complex and spatial events".   I'll write a few sentences about it - the slides can already be found on the conference's site.

  • The first paper, about parallel complex event processing, was presented by Martin Hirzel from the IBM System S team. Martin started with the old assertion that CEP is part of stream processing, since it does only pattern matching, while stream processing can also do aggregations.  I was never sure why this distinction is important; furthermore, as I have written before, CEP is used by different people to mean different things, thus I would say that "event processing" is the name of the functionality that does all of it.  In any event, the talk about parallel incremental computation was interesting.
  • The second paper was presented by Alex Artikis from NCSR and dealt with event recognition (pattern matching) based on the event calculus.   Formal approaches can be useful for that domain, and the event calculus is one of the first attempts to do it.  I have a terminology dispute with Alex, who equates the notion of a fluent from the event calculus with a composite event.  I think it actually refers to a state that can be initiated and terminated by events, and may serve as context or state (see the sketch after this list).
  • The third paper was presented by Michael Olson,  Mani Chandy's PhD student at Caltech.  He presented the latest on their ongoing project on geo-spatial events (related to seismic events).
  • The fourth paper was presented by Boris Koldehofe from the University of Stuttgart, who has also been active in the community for several years.  Boris talked about range queries in distributed event processing systems, where range queries relate to spatial operators on events.   It seems that the spatial dimension, which was also discussed in the previous talk, is gaining more traction.
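
To make the terminology point concrete, here is a minimal sketch (hypothetical names in Python, not Alex's event calculus implementation) of a fluent as a state that is initiated and terminated by events and then serves as context for pattern matching, as opposed to a composite event detected at a point in time:

    # A fluent: a state that starts holding on an initiating event and stops
    # holding on a terminating one; rules can consult it as context.
    class Fluent:
        def __init__(self, initiated_by, terminated_by):
            self.initiated_by = initiated_by
            self.terminated_by = terminated_by
            self.holds = False

        def update(self, event_type):
            if event_type == self.initiated_by:
                self.holds = True
            elif event_type == self.terminated_by:
                self.holds = False

    # "trading_hours" is a fluent, not a composite event: it is the context in
    # which a (trivial) pattern is evaluated.
    trading_hours = Fluent(initiated_by="market_open", terminated_by="market_close")

    detected = []
    for ev in ["market_open", "large_trade", "market_close", "large_trade"]:
        trading_hours.update(ev)
        if ev == "large_trade" and trading_hours.holds:
            detected.append("large_trade_during_trading_hours")

    print(detected)  # ['large_trade_during_trading_hours']
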
I'll write about today's sessions at a later point.

Tuesday, April 13, 2010

On the virtualization of event processing functions

There is some discussion about scale-up and scale-out as measures of system scalability, as indicated by a recent blog post by Brenda Michelson; I would like to refer to the programming model aspects of it. Parallel computing is becoming more and more a means of scalability, due to hardware developments and barriers to the scalability of a single processor that stem from energy consumption issues. In event processing, both parallel and distributed computing will play an important role, as we see large, geographically distributed event processing networks.
The main issue in terms of programming model is that manually programming a combination of parallel programming and distributed programming is very difficult, since many considerations come into play here. The solution relies on the notion of virtualization. Event processing applications should be programmed at a conceptual level, providing the application logic and flow, but also policies that define non-functional requirements, since different applications may care about different metrics. Then, given a certain distributed configuration that may also consist of multi-core machines, the conceptual model should be directly compiled into an efficient implementation based on the objectives set by the policies. This is not easy, but it has already been done in limited domains. The challenge is to make it work for multiple platforms. This is part of the grand challenge of "event processing anywhere" that I'll describe at more length in subsequent posts. Achieving both scale-up and scale-out in event processing requires intelligence in the automatic creation of the implementation, and the ability to fully virtualize all functional and non-functional requirements. More - later.
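
As a rough illustration of what programming at the conceptual level with policies could look like, here is a hypothetical sketch (the names, structure and the toy planner are my own illustration, not any product's API): the application is described as a set of agents plus non-functional policies, and a separate step maps that description onto a concrete deployment.

    # Hypothetical sketch: a conceptual description of an event processing network
    # plus non-functional policies; a compiler/optimizer (only stubbed here) would
    # map it onto a concrete scale-up/scale-out deployment.
    application = {
        "agents": [
            {"name": "filter_relevant", "type": "filter",  "follows": []},
            {"name": "enrich_customer", "type": "enrich",  "follows": ["filter_relevant"]},
            {"name": "detect_fraud",    "type": "pattern", "follows": ["enrich_customer"]},
        ],
        "policies": {
            "latency_ms_p99": 50,          # what matters most for this application
            "throughput_events_sec": 100_000,
            "ordering": "per_customer",    # correctness constraint the optimizer must respect
        },
    }

    def plan_deployment(app, nodes):
        # Toy placeholder for the hard part: choosing placement and parallelism
        # per agent so that the stated policies are met.
        return {agent["name"]: nodes[i % len(nodes)] for i, agent in enumerate(app["agents"])}

    print(plan_deployment(application, ["node-A", "node-B"]))
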

Saturday, May 30, 2009

On Parallel processing and Event Processing - MapReduce


The "In Action" series of Manning Publications which hosts the "Event Processing in Action" book that Peter Niblett and myself are writing, also hosting another book "Hadoop in Action" that deals with the Hadoop, the open source MapReduce framework. In the framework of my series of postings about parallel processing and event processing, I would like today to drill down into MapReduce and make some observations about its possible relations with event processing.

The "Hadoop in Action" book describes MapReduce in the following way:

MapReduce programs are executed in two main phases, called mapping and reducing. Each phase is defined by a data processing function, and these functions are called mapper and reducer, respectively. In the mapping phase, MapReduce takes the input data and feeds each data element to the mapper. Afterward, in the reducing phase, the reducer processes all the outputs from the mapper and arrives at a final result. In general, the mapper is meant to filter and transform the input into something that the reducer can aggregate over.

In the MapReduce framework you write applications by specifying the mapper and reducer. It’s instructive to understand the complete data flow:

  1. The input to your application must be structured as a list of (key, value) pairs, list(<k1, v1>). This input format may seem open-ended but is often quite simple in practice. The input format for processing many files is usually list(<String filename, String file_content>). The input format for processing one big file, such as a log file, is something like list(<Integer line_number, String log_line>).
  2. The list of (key, value) pairs is broken up and each individual (key, value) pair, <k1, v1>, is processed by calling the map function of the mapper. In practice, the key k1 is often ignored by the mapper. The mapper transforms each <k1, v1> pair into a list of <k2, v2> pairs. The details of this transformation largely determine what the MapReduce program does. Note that the (key, value) pairs can be processed in arbitrary order, and the transformation must be self contained, in that its output depends only on one single (key, value) pair.
  3. The output of all the mappers is (conceptually) aggregated into one giant list of <k2, v2> pairs. All pairs sharing the same k2 are grouped together into a new (key, value) pair, <k2, list(v2)>. The framework then asks the reducer to process each one of these “aggregated” (key, value) pairs individually.
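
To make this data flow concrete, here is a small single-process sketch in plain Python (not the Hadoop API) of the canonical word-count example, following the <k1, v1> → list of <k2, v2> → <k2, list(v2)> flow described above:

    from collections import defaultdict

    # Single-process sketch of the MapReduce data flow (not the Hadoop API).
    def mapper(k1, v1):
        # k1 (e.g. a line offset) is often ignored; emit (word, 1) pairs
        return [(word, 1) for word in v1.split()]

    def reducer(k2, values):
        # aggregate all values that share the same key
        return (k2, sum(values))

    input_pairs = [(0, "to be or not to be"), (1, "to see or not to see")]

    # map phase: each (k1, v1) pair is processed independently (hence parallelizable)
    intermediate = [kv for k1, v1 in input_pairs for kv in mapper(k1, v1)]

    # shuffle/group phase: all pairs sharing the same k2 are grouped together
    grouped = defaultdict(list)
    for k2, v2 in intermediate:
        grouped[k2].append(v2)

    # reduce phase
    result = dict(reducer(k2, vals) for k2, vals in grouped.items())
    print(result)  # {'to': 4, 'be': 2, 'or': 2, 'not': 2, 'see': 2}
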
All Lisp fans, like myself, can trace the origins of the Map and Reduce functions. As written in the description, the main objective of MapReduce is to partition a large collection of data in order to compute aggregations in each individual partition.

MapReduce as a model for parallel processing has many fans, but also some non-believers, most notably two prominent database figures - David DeWitt and Mike Stonebraker - who attacked MapReduce on the basis of its programming model, performance issues, and general hyping of old ideas; see posting 1 and posting 2 (both from early 2008), and their new SIGMOD paper with a benchmark comparing MapReduce to parallel databases.

Anyway, let everybody read both sides. It seems that MapReduce is becoming a religion, and as with professional religions of the past, I remain agnostic and don't take anything as a religion; this issue is not black and white...

Getting back to my favorite topic, event processing. The two questions that arise are:

  1. Is the functionality that MapReduce is intended to support part of, or useful in any way for, event processing?
  2. Are the requirements of event processing for parallel processing best satisfied by MapReduce?
The answer to the first question is definitely positive -- transforming, filtering and aggregation are part of the event processing story ("mediated event processing"). Moreover, as I mentioned before, it may be useful for pre-processing events, so that the more sophisticated processing is done over aggregations and not over raw events, but it can also be used in any phase, on derived events.

The answer to the second question is probably negative. MapReduce-style processing can be useful for "set at a time" operations, since it is data oriented, getting its input from files and providing its output on files, and probably not so much for "event at a time" processing, which makes its calculations incrementally. Furthermore, the "key" based partition may be too simplistic for a context-oriented partition that may be based on a predicate (e.g. customers whose age is > 40); of course, this can be bridged by putting the segments inside the event payload, but this has its price in flexibility (see the sketch below). Also, the partition in an event processing network is typically not to take a problem, divide it, and then merge the results, but to create parallel branches of computation that are independent, which is a slightly different thing.
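
To illustrate that point, here is a hypothetical sketch (my own names and toy data) contrasting key-based partitioning with a predicate-based, context-oriented partition, and the "bridging" of computing the segment into the payload up front:

    # Hypothetical sketch contrasting key-based partitioning (MapReduce style)
    # with predicate-based, context-oriented partitioning.
    events = [
        {"customer": "c1", "age": 35, "amount": 120},
        {"customer": "c2", "age": 52, "amount": 900},
        {"customer": "c1", "age": 35, "amount": 40},
    ]

    # Key-based partition: one group per exact key value.
    def partition_by_key(evts, key):
        groups = {}
        for e in evts:
            groups.setdefault(e[key], []).append(e)
        return groups

    # Predicate-based (context-oriented) partition: groups defined by conditions.
    contexts = {
        "age_over_40":     lambda e: e["age"] > 40,
        "age_40_or_under": lambda e: e["age"] <= 40,
    }

    def partition_by_context(evts, ctxs):
        return {name: [e for e in evts if pred(e)] for name, pred in ctxs.items()}

    # "Bridging": pre-compute the segment as an attribute, then key-partition on it,
    # at the cost of fixing the segmentation inside the event payload.
    for e in events:
        e["segment"] = "age_over_40" if e["age"] > 40 else "age_40_or_under"

    print(partition_by_key(events, "customer"))
    print(partition_by_context(events, contexts))
    print(partition_by_key(events, "segment"))
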

Bottom line -- MapReduce can help with some, but not all, of the parallel processing requirements for event processing; for other things we need something else. I guess that the question of cost-effectiveness depends on the specific application's loads and focus.

More about parallel processing and event processing - later.

Monday, May 25, 2009

On Parallel processing and Event Processing - Take I


My travels in the universe have taken me to Brussels (somehow, I had never been to Belgium before) for meetings regarding an EU project we intend to propose for the next call; I'll blog about this proposal when it is more mature. As we have been working for a while on parallel processing related to event processing, I am looking at some of the related work. I came across an interesting article on Gregor Hohpe's blog, stating some of the assumptions behind the parallel programming model of cloud computing. Gregor defines these assumptions as the "new ACID", relating to the old ACID properties of transaction correctness (Atomicity, Consistency, Isolation, Durability). Gregor provides four new ACID properties for cloud computing: Associative, Commutative, Idempotent, Distributed. Gregor explains the notions:
While the original ACID properties are all about correctness and precision, the new ones are about flexibility and robustness. Being
associative means that a chain of operations can be performed in any order, the classical example being (A + B) + C = A + (B + C) for the associative operator ‘+’. Likewise, we remember commutative from junior high math as A + B = B + A, meaning an operator can be applied to a pair of operands in either direction. Finally, a function is idempotent if its repeated application does not change the result, i.e. f(x) = f(f(x)).

Having these properties enables parallelizing the operations and then collecting the results, the principle behind Google's MapReduce. In order to take advantage of this kind of parallelism for event processing functions (or agents), we need to check whether these properties are satisfied. In event processing, the temporal dimension and the fact that some of the patterns are not associative, some are not commutative, and some are not idempotent may pose a challenge. This does not mean that the parallelism idea is not valid for event processing applications; it just means that the parallelism is influenced by the semantics of the event processing network. I'll drill down into this issue in future postings.
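
As a minimal illustration (toy operators, not any engine's implementation): a count aggregation is order-insensitive, while a temporal sequence pattern ("A before B") is not commutative, and duplicating events breaks idempotence for the count.

    # Illustrative sketch: checking the "new ACID" properties on two toy
    # event processing operations.
    def count_events(events):
        return len(events)

    def sequence_a_then_b(events):
        # Detect the pattern 'A occurs and B occurs afterwards' (order matters).
        seen_a = False
        for e in events:
            if e == "A":
                seen_a = True
            elif e == "B" and seen_a:
                return True
        return False

    stream     = ["A", "B"]
    reordered  = ["B", "A"]
    duplicated = ["A", "A", "B"]

    print(count_events(stream) == count_events(reordered))            # True: commutative
    print(sequence_a_then_b(stream) == sequence_a_then_b(reordered))  # False: not commutative
    print(count_events(stream) == count_events(duplicated))           # False: not idempotent
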

Saturday, May 2, 2009


Packing for a one-week (net) trip to the USA. We have recently been informed that the paper entitled "A stratified approach for supporting high throughput event processing applications" has been accepted for presentation at the DEBS 2009 conference that will take place in early July in Nashville. The paper, written by Yuri Rabinovich, Geetika Lakshmanan and myself, describes results obtained last year in our scalability project; the project is still going on, and its results will be flowing into IBM products.

Here is the abstract of this paper:

The quantity of events that a single application needs to process is constantly increasing; RFID-related events have doubled within the past year and reached 4 trillion events per day, financial applications in large banks are processing 400 million events per day, and Massively Multiplayer Online (MMO) games are monitoring, at peak, 1 million events per second. It is evident that scalability in event throughput is a major requirement for these types of applications. While the first generation of event processing systems has been centralized, we see various solutions that attempt to use both scale-up and scale-out techniques. Alas, partitioning the processing manually is difficult due to the semantic dependencies among the various event processing agents. It is also difficult to tune the partition dynamically in a manual way. Manual partitioning is typically vertical, i.e. there is a single partition set with centralized routing. This paper proposes a horizontal partition that is automatically created by analyzing the semantic dependencies among agents using a stratification principle. Each stratum contains a collection of independent agents, and events are always routed to subsequent strata. We also implement a profiling-based technique for assigning agents to nodes in each stratum with the goal of maximizing throughput. A complementary step is to distribute the load among the different execution nodes dynamically, based on performance characteristics of nodes and agents and the event traffic model. Experimental results show significant improvement in the ability to process high throughput of events relative to both centralized solutions and vertical partitions. We find this to be a promising approach to achieving high scalability without requiring difficult manual tuning, especially when the traffic model and the topology of the event processing network change often.
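
Very loosely (this is an illustration of the stratification principle, not the paper's algorithm or code), the strata can be thought of as a topological layering of the agent dependency graph, where agents in the same stratum are mutually independent and can therefore run in parallel, and events always flow to subsequent strata:

    # Loose illustration: layer event processing agents by their dependencies.
    def stratify(dependencies):
        # dependencies: agent -> set of agents it consumes derived events from
        strata, placed = [], set()
        remaining = dict(dependencies)
        while remaining:
            # agents whose dependencies are all already placed form the next stratum
            stratum = {a for a, deps in remaining.items() if deps <= placed}
            if not stratum:
                raise ValueError("cyclic dependencies")
            strata.append(sorted(stratum))
            placed |= stratum
            for a in stratum:
                remaining.pop(a)
        return strata

    epn = {
        "filter1": set(), "filter2": set(),
        "enrich": {"filter1"},
        "pattern": {"filter2", "enrich"},
    }
    print(stratify(epn))  # [['filter1', 'filter2'], ['enrich'], ['pattern']]
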


More about event processing distribution and parallelization will be discussed in subsequent postings.

DEBS has also recently issued a call for fast abstracts, posters and demos, an opportunity to share with the community work that is in a less mature phase, show interesting demos, and discuss ideas.

Saturday, January 17, 2009

On Distribution and parallelism in Event Processing

This picture, taken from the Nature Reviews site as part of an article about "parallel processing in the mammalian retina", illustrates that a structure like the human body distributes the functions it needs to perform, and performs many of them in parallel and by specialized systems.

Getting to event processing, the producers and consumers of event processing can be distributed, as events can come from many sources, and situations may be consumed by many sinks. The first generation of event processing was mostly centralized in its processing, and the centralization has been twofold: functional centralization, using a monolithic engine that performs all processing functions, and location centralization, where this engine runs on a single server.

Today I'll concentrate on the second aspect of centralization. There are various reasons to decentralize the processing; one is to do some of the activities closer to the producers or consumers. For example: if a producer produces events of which only 1% are relevant to the defined event processing, and the filtering is independent and does not depend on other events, then it is more efficient for the filtering to take place at or close to the producer site, thus eliminating the unnecessary network traffic.
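
As a toy illustration (hypothetical attribute names and numbers), stateless filtering at the producer side means that only the relevant fraction of events ever crosses the network:

    # Toy sketch: independent filtering at the producer, before the network hop.
    def is_relevant(event):
        # depends on this event alone, not on other events
        return event["severity"] >= 99

    def producer(events, send):
        for e in events:
            if is_relevant(e):       # filter before sending over the network
                send(e)

    sent = []
    events = [{"id": i, "severity": i % 100} for i in range(10_000)]
    producer(events, sent.append)
    print(len(events), len(sent))    # 10000 100 -> 99% of traffic never leaves the producer
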

Another reason to distribute the functionality is the scalability aspect, which is really the old idea of "divide and conquer". The challenge is how to do a "good" partition. First there is a need to define what a "good" partition is, i.e. looking at it as an optimization problem, what is the goal function; then solving it is a function of the topology, semantics and behavior of a particular application, which can be dynamic.

IBM has recently released the first version of WBEXS (WebSphere Business Events eXtreme Scale) and made a statement of direction for another product, InfoSphere Streams; both are aimed at handling scalability by distribution in different environments. While details about IBM products can be obtained from the appropriate people in IBM, we in the IBM Haifa Research Lab are working on related topics; we exposed initial results at DEBS 2008, in the fast abstract session introducing the stratification approach. The project has substantially advanced since that time, and I'll discuss it further in future blog posts (well - I need to go over the blog and list all the topics I promised to discuss later and have not done so yet...).

More - later