Showing posts with label MapReduce.

Saturday, March 10, 2012

Event processing as reducer in Map Reduce

A relatively recent posting from MSFT raises a question about the relationship between the batch-oriented MapReduce and the online-oriented event processing.   The answer, according to MSFT, is that event processing can be used as a reducer in MapReduce, with multiple copies of an event processing engine performing the reduce function.
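As a rough illustration of that idea (a hedged sketch, not MSFT's actual design; all names here are hypothetical), a reducer can behave like a tiny event processing engine: it consumes the values of its partition one at a time and maintains incremental state, rather than materializing the whole list first:

```python
def event_reducer(key, events):
    # Hypothetical sketch: the reducer acts as a small event processing
    # engine, consuming events one at a time and updating incremental
    # state (here, a running count and a running maximum).
    count, peak = 0, float("-inf")
    for event in events:          # the partition's values arrive as a stream
        count += 1
        peak = max(peak, event)
    return key, {"count": count, "peak": peak}

# One reduce partition per key; each partition could be handled by a
# separate copy of the event processing engine, as the posting suggests.
print(event_reducer("sensor-7", [3, 9, 4]))
# ('sensor-7', {'count': 3, 'peak': 9})
```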


I have written before about offline event processing, with the insight that event processing is useful not only online but also offline, since it provides both efficient implementation and high-level abstractions for certain functions (pattern matching, aggregation and more) that make it attractive to use in batch as well.


Of course, another synergy may be using event processing within real-time Hadoop, like Darkstar, as is frequently articulated by Colin Clark.

Tuesday, February 14, 2012

Killing elephants - is MapReduce dying?


My English teacher in the last grade of high school had an interesting taste in literature, and taught us the story "Shooting an Elephant" by Orwell.   I was not a very good student in English and forgot about it until reading Colin Clark's Blog posting entitled "It's time to kill the elephant".   From time to time various people claim that various things are dead or dying.   Some readers may still remember the discussion about whether SOA is dead.   Recently the Forbes Blog announced the death of ERP.  Colin's contribution to the hunt is the observation that MapReduce is dying (or should be dying) and that batch processing should be replaced by more real-time processing.  His evidence is that Google is dumping MapReduce and using Colossus for its search technology.   While this fact is certainly true, I think that there are still many types of analytic procedures that are done offline using batch processes, so while the use of real-time analytics will substantially increase given supporting infrastructure, I am not sure that batch will die soon (the same goes for SOA and ERP)...   Old soldiers never die - they just fade away (s-l-o-w-l-y).

Sunday, June 5, 2011

On the countrywide database seminar, Jeff Ullman and MapReduce


Today I took a long train ride to the southern part of Israel to attend the countrywide database seminar.  We used to have such seminars in the past, but they somehow disappeared; Ehud Gudes from Ben-Gurion University took the initiative and hosted us today.   The program included various local speakers, some of them graduate students who got 18 minutes each to give a talk.


While the database community has a significant intersection with the event processing community, many of the database folks are still quite unfamiliar with the basics.  Some of the questions I was asked were along the lines of: why do you have to re-invent the wheel, when everything can be done using triggers in databases?    My answer was that everything could also be done by programming Turing machines, but it is a question of cost-effectiveness; in fact, the database area itself is a set of abstractions over file systems, together with tools to implement them efficiently.   The same goes for event processing.     There were some other questions of the form "could you do it with X", where X was quite diversified.


The keynote speaker at this event was Jeff Ullman, who talked about MapReduce.




After explaining the principal idea of MapReduce, he spent half of the time talking about one of his favorite topics from the distant past, computing transitive closures with Datalog, and the way they can be computed with MapReduce.  His reasoning for going back to Datalog was the need to compute paths of links for search engines, Blog postings that respond to one another, and social networks in general.     I would not intuitively think of Datalog as a tool for that, but it was interesting.

Also interesting was Jeff's claim that the main benefit of MapReduce over other methods of parallel programming is that if one task fails, it is possible to restart that task only, and not the entire process, which he claims is a unique property of MapReduce.   I am not an expert in parallel programming, but it would be interesting to verify this claim.

Monday, May 2, 2011

Startup review: Hstreaming


Late last week I got a briefing on a new startup  - HSTREAMING.   The briefing was delivered by an ex-colleague from IBM Research, Volkmar Uhlig.   The idea behind HSTREAMING is to provide a Hadoop-based platform that enables running aggregations, filtering and some forms of event pattern matching in real-time.  The rationale is that since the use of Hadoop is growing, Hadoop-based applications, which are batch-oriented, will develop more and more extensions that require online processing along with the batch processing.  This is the Hadoop variation of using databases and stream processing together.    Certainly an interesting direction; I think that we are seeing variations of MapReduce coupled with event processing in various places.  I'll continue to follow this direction.

Monday, December 13, 2010

On Hadoop and event processing


The region I live in has not had much luck recently: first the big fire on the Carmel ridge, which lasted for three and a half days until it was brought under control, and now a major storm, with winds exceeding 100 km/h and a lot of rain.   These two pictures, taken from Israeli news Internet sites, were taken in Haifa yesterday.   The storm is now over and some nicer days are ahead of us.


Back to professional issues -- Alex Alves (who, among other things, represents Oracle on the EPTS Steering Committee) wrote a nice posting on his blog explaining the Hadoop programming model; if you are still not familiar with it, it provides a good explanation.


Hadoop is batch oriented and provides a kind of imperative programming model, but it can be wrapped and concealed by a higher-level language.     I am working now with a graduate student who is investigating the usability of the map-reduce model for some of the event processing functions (e.g. aggregation).   I am curious to see the analysis of this work.   More - later
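To give a flavor of what such an investigation looks at (a hedged sketch of my own, not the student's actual work; all names hypothetical), a typical event processing aggregation such as a per-source, per-time-window average maps naturally onto the map-reduce model: the mapper keys each event by its source and window, and the reducer aggregates each group:

```python
from collections import defaultdict

def window_mapper(_, event):
    # Hypothetical sketch: key each event by its source and the
    # one-minute window it falls into, so that the reducer computes a
    # per-window aggregate -- a common event processing function.
    window = event["ts"] // 60
    return [((event["source"], window), event["value"])]

def window_reducer(key, values):
    # Per-window average of the grouped values.
    return key, sum(values) / len(values)

events = [{"source": "s1", "ts": 10, "value": 4.0},
          {"source": "s1", "ts": 50, "value": 6.0},
          {"source": "s1", "ts": 70, "value": 9.0}]

# Shuffle phase: group intermediate values by key.
groups = defaultdict(list)
for e in events:
    for k, v in window_mapper(None, e):
        groups[k].append(v)
print(dict(window_reducer(k, vs) for k, vs in groups.items()))
# {('s1', 0): 5.0, ('s1', 1): 9.0}
```

The open question, of course, is whether doing this in batch is cost-effective compared with computing the same aggregate incrementally in an event processing engine.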

Saturday, May 30, 2009

On Parallel processing and Event Processing - MapReduce


The "In Action" series of Manning Publications, which hosts the "Event Processing in Action" book that Peter Niblett and I are writing, also hosts another book, "Hadoop in Action", which deals with Hadoop, the open source MapReduce framework. In the framework of my series of postings about parallel processing and event processing, I would like today to drill down into MapReduce and make some observations about its possible relations with event processing.

The "Hadoop in Action" book describes MapReduce in the following way:

MapReduce programs are executed in two main phases, called mapping and reducing. Each phase is defined by a data processing function, and these functions are called mapper and reducer, respectively. In the mapping phase, MapReduce takes the input data and feeds each data element to the mapper. Afterward, in the reducing phase, the reducer processes all the outputs from the mapper and arrives at a final result. In general, the mapper is meant to filter and transform the input into something that the reducer can aggregate over.

In the MapReduce framework you write applications by specifying the mapper and reducer. It's instructive to understand the complete data flow:

  1. The input to your application must be structured as a list of (key, value) pairs, list(&lt;k1, v1&gt;). This input format may seem open-ended but is often quite simple in practice. The input format for processing many files is usually just list(&lt;String filename, String file_content&gt;). The input format for processing one big file, such as a log file, is something like list(&lt;Integer line_number, String line&gt;).
  2. The list of (key, value) pairs is broken up and each individual (key, value) pair, &lt;k1, v1&gt;, is processed by calling the map function of the mapper. In practice, the key k1 is often ignored by the mapper. The mapper transforms each &lt;k1, v1&gt; pair into a list of &lt;k2, v2&gt; pairs. The details of this transformation largely determine what the MapReduce program does. Note that the (key, value) pairs can be processed in arbitrary order, and the transformation must be self-contained in that its output depends only on one single (key, value) pair.
  3. The output of all the mappers is (conceptually) aggregated into one giant list of &lt;k2, v2&gt; pairs. All pairs sharing the same k2 are grouped together into a new (key, value) pair, &lt;k2, list(v2)&gt;. The framework then asks the reducer to process each one of these "aggregated" (key, value) pairs individually.
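The three steps above can be illustrated with the canonical word-count example, written here as a plain-Python sketch of the mapper and reducer contracts (no Hadoop required; the shuffle step is simulated in-process):

```python
from collections import defaultdict

def mapper(key, value):
    # Step 2: key (e.g. a line number) is ignored; value is one line of
    # text. Emits a list of (k2, v2) pairs: one (word, 1) per occurrence.
    return [(word, 1) for word in value.split()]

def reducer(key, values):
    # Step 3: receives one "aggregated" pair (k2, list(v2)) and folds it
    # into a single result -- here, the total count for the word.
    return (key, sum(values))

def run_mapreduce(records, mapper, reducer):
    # Simulated shuffle: group all intermediate values by their key k2.
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)
    return dict(reducer(k2, v2s) for k2, v2s in groups.items())

# Step 1: input as a list of (key, value) pairs -- (line_number, line).
lines = list(enumerate(["the cat", "the dog", "the cat sat"]))
print(run_mapreduce(lines, mapper, reducer))
# {'the': 3, 'cat': 2, 'dog': 1, 'sat': 1}
```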
All Lisp fans, like myself, can trace the origins of the Map and Reduce functions. As the description above makes clear, the main objective of MapReduce is to partition a large collection of data in order to compute aggregations within each individual partition.

MapReduce as a model for parallel processing has many fans, but also some non-believers, most notably two of the prominent database figures - David DeWitt and Mike Stonebraker - who attacked MapReduce on the basis of its programming model, performance issues, and general hyping of old ideas; see posting 1, posting 2 (both from early 2008), and their new SIGMOD paper benchmarking MapReduce against parallel databases.

Anyway, let everybody read both sides. It seems that MapReduce is becoming a religion, and as with professional religions of the past, I remain agnostic - don't take anything as a religion; this issue is not black and white...

Getting back to my favorite topic, event processing. The two questions that arise are:

  1. Is the functionality that MapReduce is intended to support part of, or useful in any way for, event processing?
  2. Are the requirements of event processing for parallel processing best satisfied by MapReduce?
The answer to the first question is definitely positive -- transforming, filtering and aggregation are part of the event processing story ("mediated event processing"). Moreover, as I mentioned before, it may be useful for pre-processing events, so that the more sophisticated processing is done over aggregations and not over raw events; it can also be used in any phase on derived events.

The answer to the second question is probably negative.
MapReduce-style processing can be useful for "set at a time" operations, since it is data-oriented, getting input from files and providing output to files, and probably not so much for "event at a time" processing, which performs its calculations incrementally. Furthermore, the "key data" partition may be too simplistic for context-oriented partitioning that may be based on a predicate (e.g. customers whose age is > 40); of course, this can be bridged by putting the segments inside the event payload, but this has its price in flexibility. Also, partitioning in an event processing network typically does not take up a problem, divide it, and then merge the results, but rather creates parallel branches of computation that are independent, which is a slightly different thing.
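To make the bridging point concrete, here is a hedged sketch (all names hypothetical) of mapping a predicate-based context onto MapReduce's key partitioning: the mapper evaluates the predicate and emits the resulting segment as the key, which works but bakes the predicate into the map function -- the flexibility price mentioned above:

```python
def context_mapper(_, event):
    # Hypothetical sketch: the "context" is a predicate over the event
    # payload (age > 40), so the mapper computes the segment and uses it
    # as the partitioning key. Changing the predicate means changing and
    # redeploying the mapper -- the flexibility cost noted above.
    segment = "over-40" if event["age"] > 40 else "40-or-under"
    return [(segment, event)]

events = [{"id": 1, "age": 35}, {"id": 2, "age": 52}, {"id": 3, "age": 61}]
pairs = [p for e in events for p in context_mapper(None, e)]
print([k for k, _ in pairs])
# ['40-or-under', 'over-40', 'over-40']
```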

Bottom line -- MapReduce can help with some but not all of the parallel processing requirements of event processing; for other things, we need something else. I guess that the question of cost-effectiveness depends on the specific application's load and focus.

More about parallel processing and event processing - later.