Saturday, May 30, 2009

On Parallel processing and Event Processing - MapReduce

The "In Action" series of Manning Publications which hosts the "Event Processing in Action" book that Peter Niblett and myself are writing, also hosting another book "Hadoop in Action" that deals with the Hadoop, the open source MapReduce framework. In the framework of my series of postings about parallel processing and event processing, I would like today to drill down into MapReduce and make some observations about its possible relations with event processing.

The "Hadoop in Action" book describes MapReduce in the following way:

MapReduce programs are executed in two main phases, called mapping and reducing. Each phase is defined by a data processing function, and these functions are called mapper and reducer, respectively. In the mapping phase, MapReduce takes the input data and feeds each data element to the mapper. Afterward, in the reducing phase, the reducer processes all the outputs from the mapper and arrives at a final result. In general, the mapper is meant to filter and transform the input into something that the reducer can aggregate over.

In the MapReduce framework you write applications by specifying the mapper and reducer. It’s instructive to understanding the complete data flow:

  1. The input to your application must be structured as a list of (key, value) pairs, list(). This input format may seem open-ended but is often quite simple in practice. The input format for processing many files is usually just list(). The input format for processing one big file, such as a log file, is something like list().
  2. The list of (key, value) pairs is broken up and each individual (key, value) pair, , is processed by calling the map function of the mapper. In practice, the key k1 is often ignored by mapper. The mapper transforms each pair into a list of pairs. The details of this transformation largely determines what the MapReduce program does. Note that the (key, value) pairs can be processed in arbitrary order, and the transformation must be self contained in that its output is dependent only on one single (key, value) pair..
  3. The output of all the mappers are (conceptually) aggregated into one giant list of pairs. All pairs sharing the same k2 are grouped together into a new (key, value) pair, . The framework then asks the reducer to process each one of these “aggregated” (key, value) pairs individually.
All Lisp fans, like myself, can trace the origins of the Map and Reduce functions, as written in the description the main objective of MapReduce is to partition large collection of data in order to provide aggregations in each individual partition.

MapReduce as a model for parallel processing has many fans, but also some non-believers, most notably two of the prominent database figures - David DeWitt and Mike Stonebraker, who attacked MapReduce on the basis of programming model, pefromenace issues, and general hyping of old ideas see posting 1, posting 2, (both of them from early 2008) and their new SIGMOD paper benchmark comparing MapReduce to parallel databases.

Anyway, let everybody read both sides. It seems that MapReduce is becoming a religion, and like any professional religions of the past, I keep being egnostic, don't take anything as a religion, this issue is not black and white...

Getting back to my favorite topic, event processing. The two questions that arise are:

  1. Are the functionality that MapReduce is intended to support, is part of, or useful, in any way for event processing?
  2. Are the requirements of event processing for parallel processing are best satisfied by MapReduce ?
The answer to the first question is definitely positive -- transforming, filtering and aggregation are part of the event processing story ("mediated event processing"). Moreover, as I mentioned before, it may be useful for pre-processing events, so that the more sophisticated processing is done over aggregations and not over raw events, but can also be used in any phase on derived events.

The answer to the second question is probably negative -
MapReduce kind of processing can be useful for "set at a time" operations, since it has data orientation, getting input from files, and providing output on files, and probably not so much for "event at a time" which makes the calculations incrementally. Furthermore, the "key data" partition may be too simplistic for context oriented partition that may be based on predicate (e.g. customers whose age is > 40), of course, this can be bridged, by putting the segments inside the event payload, but this has its price in flexibility. Also, the partition in event processing network is typically not to take up a problem and divide it, and then merge the results, but creating parallel branches of computation that are independent, which is slightly different thing.

Bottom line -- MapReduce can help in some but bot all of the parallel processing requirements for event processing, for other things, we need something else. I guess that the question of cost-effectiveness is based on the specific applications' loads and focus.

More about parallel processing and event processing - later.

No comments: