The "In Action" series of Manning Publications, which hosts the "Event Processing in Action" book that Peter Niblett and I are writing, also hosts another book, "Hadoop in Action", that deals with Hadoop, the open source MapReduce framework. Within my series of postings about parallel processing and event processing, I would like today to drill down into MapReduce and make some observations about its possible relations with event processing.
The "Hadoop in Action" book describes MapReduce in the following way:
MapReduce programs are executed in two main phases, called mapping and reducing. Each phase is defined by a data processing function, and these functions are called mapper and reducer, respectively. In the mapping phase, MapReduce takes the input data and feeds each data element to the mapper. Afterward, in the reducing phase, the reducer processes all the outputs from the mapper and arrives at a final result. In general, the mapper is meant to filter and transform the input into something that the reducer can aggregate over.
In the MapReduce framework you write applications by specifying the mapper and the reducer. It is instructive to understand the complete data flow:
- The input to your application must be structured as a list of (key, value) pairs, list(&lt;k1, v1&gt;). This input format may seem open-ended but is often quite simple in practice. The input format for processing many files is usually list(&lt;String filename, String file_content&gt;). The input format for processing one big file, such as a log file, is something like list(&lt;Integer line_number, String line&gt;).
- The list of (key, value) pairs is broken up, and each individual (key, value) pair, (k1, v1), is processed by calling the map function of the mapper. In practice, the key k1 is often ignored by the mapper. The mapper transforms each (k1, v1) pair into a list of (k2, v2) pairs. The details of this transformation largely determine what the MapReduce program does. Note that the (key, value) pairs can be processed in arbitrary order, and the transformation must be self-contained, in that its output depends only on a single (key, value) pair.
- The output of all the mappers is (conceptually) aggregated into one giant list of (k2, v2) pairs. All pairs sharing the same k2 are grouped together into a new (key, value) pair, (k2, list(v2)). The framework then asks the reducer to process each of these "aggregated" (key, value) pairs individually.
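The data flow above can be sketched in a few lines of plain Python; this is only an illustration of the mapping / shuffle / reducing phases (the classic word count), not the actual Hadoop API:

```python
from collections import defaultdict

# Mapper: takes one (k1, v1) pair; k1 (e.g. a line number) is ignored,
# v1 is a line of text. Emits a list of (k2, v2) pairs.
def mapper(k1, v1):
    return [(word, 1) for word in v1.split()]

# Reducer: takes one aggregated (k2, list(v2)) pair and emits the result.
def reducer(k2, values):
    return (k2, sum(values))

def map_reduce(pairs):
    # Mapping phase: each input pair is processed independently.
    intermediate = []
    for k1, v1 in pairs:
        intermediate.extend(mapper(k1, v1))
    # Shuffle: group all (k2, v2) pairs sharing the same k2.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)
    # Reducing phase: each group is reduced independently.
    return dict(reducer(k2, vs) for k2, vs in groups.items())

result = map_reduce([(0, "to be or not"), (1, "to be")])
# result == {"to": 2, "be": 2, "or": 1, "not": 1}
```

Because both phases process their pairs independently, a real framework is free to spread the mapper and reducer calls over many machines.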
All Lisp fans, like myself, can trace the origins of the map and reduce functions. As the description above makes clear, the main objective of MapReduce is to partition a large collection of data and compute an aggregation over each individual partition.
MapReduce as a model for parallel processing has many fans, but also some non-believers, most notably two of the prominent database figures, David DeWitt and Mike Stonebraker, who attacked MapReduce over its programming model, performance issues, and general hyping of old ideas -- see posting 1 and posting 2 (both from early 2008), and their new SIGMOD paper, a benchmark comparing MapReduce to parallel databases.
Anyway, let everybody read both sides. It seems that MapReduce is becoming a religion, and as with the professional religions of the past, I remain agnostic; I don't take anything as a religion, and this issue is not black and white...
Getting back to my favorite topic, event processing. The two questions that arise are:
- Is the functionality that MapReduce is intended to support part of, or useful in any way for, event processing?
- Are the requirements of event processing for parallel processing best satisfied by MapReduce?
The answer to the first question is definitely positive -- transformation, filtering, and aggregation are part of the event processing story ("mediated event processing"). Moreover, as I mentioned before, it may be useful for pre-processing events, so that the more sophisticated processing is done over aggregations and not over raw events; it can also be used in any phase on derived events.
The answer to the second question is probably negative. MapReduce-style processing can be useful for "set at a time" operations, since it is data-oriented, taking its input from files and writing its output to files, but probably not so much for "event at a time" processing, which performs its calculations incrementally. Furthermore, partitioning by key may be too simplistic for context-oriented partitioning that is based on a predicate (e.g., customers whose age is > 40); of course, this can be bridged by putting the segments inside the event payload, but that has its price in flexibility. Also, partitioning in an event processing network typically does not take a problem, divide it, and then merge the results; rather, it creates parallel branches of computation that are independent, which is a slightly different thing.
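The "set at a time" versus "event at a time" distinction can be made concrete with a small sketch (a hypothetical running-average example, not tied to any particular engine): the batch version needs the whole input before it can answer, while the incremental version yields an up-to-date result after every event.

```python
# "Set at a time": the aggregate is computed over a complete batch,
# MapReduce style -- no answer is available until all data has arrived.
def batch_average(events):
    return sum(events) / len(events)

# "Event at a time": the aggregate is maintained incrementally as each
# event arrives, so a current result is available after every event.
class RunningAverage:
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def on_event(self, value):
        self.count += 1
        self.total += value
        return self.total / self.count  # aggregate as of this event

avg = RunningAverage()
intermediate = [avg.on_event(v) for v in [10, 20, 30]]
# intermediate == [10.0, 15.0, 20.0]; batch_average([10, 20, 30]) == 20.0
```

Both end with the same final value, but only the incremental form fits a setting where results must be emitted while the stream is still flowing.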
Bottom line -- MapReduce can help with some, but not all, of the parallel processing requirements of event processing; for the rest, we need something else. I guess that the question of cost-effectiveness depends on the specific application's load and focus.
More about parallel processing and event processing - later.
Back home from my business trip (two hours later than I planned; I've had worse...).
Today, catching up on some Blogs, I came across an interesting posting by Mark Palmer which discusses intrapreneuring, the name for entrepreneurial behavior within a big company. I have spent most of my working life in three big organizations: the Israeli Air Force, which has the mentality of the public sector; the Technion, which is an academic institute; and IBM, most of the time in the IBM Haifa Research Lab. Counter to intuition, the air force is the organization most open to new ideas, the academic institute is the least open, and IBM is somewhere in the middle.
I never became an entrepreneur, and when I considered it, 12 years ago, I decided to go to IBM and try intrapreneuring instead. This may or may not have been the biggest mistake of my life; I don't know what would have happened had I chosen the alternative of founding a startup. John Bates started Apama around the same time, also coming from a university, and sometimes I envy him (sometimes I don't...).
Anyway, I have experience trying to act like an intrapreneur in the three different organizations I mentioned, and I also read Pinchot's book, mentioned in Mark's posting, many years ago. I find Pinchot's ten commandments, cited by Mark, to be very valid. I am copying them again here:
- Come to work each day willing to be fired.
- Circumvent any orders aimed at stopping your mission.
- Do any job needed to make your project work, regardless of your job description.
- Find people to help you.
- Follow your intuition about the people you choose, and work only with the best.
- Work underground as long as you can – publicity triggers the corporate immune mechanism.
- Never bet on a race unless you are running in it.
- Remember it is easier to ask for forgiveness than for permission.
- Be true to your goals, but be realistic about the ways to achieve them.
- Honor your sponsors.
All of these are true, and many of them are somewhat inconsistent with the way big organizations think. I could write a paragraph about each of these ten bullets, but that would contradict the sixth commandment, so I'll leave it as is. However, in the end it is the result that is remembered, not the way taken to achieve it; go figure...
CDG Airport, Paris.
This is, more or less, how the A3 road in Paris looked today when I was on my way to CDG airport. When I asked the taxi driver how long it would take to the airport, he said no more than 40 minutes; well, it took 2 hours, and could have taken more, had he not at some point lost his patience, driven out through the next exit, and navigated with his GPS through some small streets to get to the airport from a different direction. Since the price of the taxi is a function not only of distance but also of time, the price also exceeded (by far) his estimate... This was not the only traffic-related event. My airline sent me an SMS to notify me that my flight home had been cancelled and that they had re-booked me on the next flight, 2 hours later (that's why I was not stressed by the long trip to the airport). They sent me the SMS in French while I was sitting on a train, and the passenger sitting opposite me was kind enough to translate it for me; 5 minutes later they re-sent the same SMS, this time in Hebrew.
I ended the day visiting ILOG, acquired by IBM; before that, in the morning, together with some of our partners, we had a meeting with the EU commission in Brussels about a proposed EU project. This is planned to be a relatively big project that will look at the future web services infrastructure, which will be event-driven and include event-enabled BPM. We got good and useful feedback, and have a lot of action items, like finalizing the consortium. I'll write more about it in the next few months.
My travels in the universe have taken me to Brussels (somehow, I had never been to Belgium before) for meetings regarding an EU project we intend to propose for the next call; I'll blog about this proposal when it is more mature. As we have been working for a while on parallel processing related to event processing, I am looking at some of the related work. I came across an interesting article in Gregor Hohpe's Blog, stating some of the assumptions behind the parallel programming model of cloud computing. Gregor calls these assumptions the "new ACID", playing on the old ACID properties of transaction correctness (Atomicity, Consistency, Isolation, Durability). Gregor's four new ACID properties for cloud computing are: Associative, Commutative, Idempotent, Distributed. Gregor explains the notions:
While the original ACID properties are all about correctness and precision, the new ones are about flexibility and robustness. Being associative means that a chain of operations can be performed in any order, the classical example being (A + B) + C = A + (B + C) for the associative operator '+'. Likewise, we remember commutative from junior high math as A + B = B + A, meaning an operator can be applied to a pair of operands in either direction. Finally, a function is idempotent if its repeated application does not change the result, i.e. f(x) = f(f(x)). Having these properties enables parallelizing the operations and collecting the results, the principle behind Google's MapReduce. In order to take advantage of this kind of parallelism for event processing functions (or agents), we need to check whether these properties are satisfied. In event processing, the temporal dimension and the fact that some of the patterns are not associative, some are not commutative, and some are not idempotent may pose a challenge. This does not mean that the parallelism idea is not valid for event processing applications; it just means that the parallelism is influenced by the semantics of the event processing network. I'll drill down into this issue in future postings.
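A small sketch makes the contrast concrete (the function names here are illustrative, not from any real engine): an aggregate like max satisfies the new ACID properties, so partitions can be reduced independently and merged, while a temporal pattern such as "first event above a threshold" depends on order and cannot be split this way naively.

```python
events = [3, 7, 2, 9, 5, 9]

# max is associative, commutative, and idempotent, so each partition can
# be reduced independently and the partial results merged in any order.
part1, part2 = events[:3], events[3:]
merged = max(max(part1), max(part2))
assert merged == max(events)  # same answer regardless of the split

# A temporal pattern such as "first event above a threshold" is not
# commutative: reordering the input changes the result, so the naive
# partition-and-merge scheme does not apply to it.
def first_above(seq, threshold):
    return next((e for e in seq if e > threshold), None)

assert first_above(events, 4) == 7
assert first_above(list(reversed(events)), 4) == 9  # order matters
```

This is exactly where the temporal dimension of event processing patterns gets in the way of MapReduce-style parallelism.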
The logo above (in Hebrew) is the logo of the academic college of Emek Yezreel, where I serve on the steering committee of the Information Systems department, a new department that started this year. As part of my service to the community, I am helping various institutes to establish academic programs that give an opportunity to people who otherwise would not be able to acquire an academic education. As Aliza Shenhar, the dominant president of this college, has said: many of the students are the first in their family ever to obtain an academic degree. We had an interesting discussion about the challenges they face in teaching a diversified population.
Anyway, today I would like to write a little bit about System S, which has recently been highlighted by IBM and covered in the NY Times, ComputerWorld, and some of the community Blogs by Paul Vincent and Marc Adler. The picture below, taken from the NY Times, shows Steve Mills, the head of IBM Software Group (on the left-hand side), and John Kelly, the head of my organization, IBM Research (on the right-hand side), both senior vice presidents in IBM, reporting directly to the CEO. So what is System S, and how does it relate to event processing? The title of the slide that Steve Mills points at is "Stream Computing", and indeed this system takes streams in the broad sense -- anything that sends continuous information of various types, such as video, audio, text, and multi-media. The points in this slide show a data flow, and indeed System S is a platform that can run a data flow of "processing elements" (in System S terminology), each of which runs on a stream of a certain type and provides some form of analytics -- filtering, aggregation, extracting features out of video, interpreting voice, and much more. The platform can take advantage of supercomputers to provide parallel processing and digest a high throughput of data. You can read more about it on the IBM Research website (I am not sure it is up to date). System S is a prelude to an IBM product already announced under the name "InfoSphere Streams".
Now, the question is: what is the relationship between System S and event processing? There are two different points. The first is that System S can take as input a large amount of streaming data, filter and aggregate it, and create a relatively small collection of events that can be further processed by some event processing engine. The other is that System S, as said, is a platform; the processing elements in this platform can, in principle, be event processing agents. In fact, while the semantics of the data flow is not identical to the semantics of an event processing network, it is possible to map an event processing network onto an implementation on the System S platform. The SPADE language provides some such capabilities, and may be extended in time to include more. IBM takes a portfolio approach to event processing (BTW, IBM does not use the "CEP" TLA; it uses its own TLA, "BEP", for Business Event Processing; I tend to use "event processing" without prefixes and suffixes, as I stated before), since it believes that "one size fits all" does not work, due to differences in functional and non-functional properties. System S is definitely aimed at the high end in terms of throughput requirements. More - later.