Monday, September 20, 2010

MapReduce

Almost a year ago I started looking into options for providing fast search over large sets of data in the enterprise. For example, you might want to search for a keyword in exceptions, such as those in the sysOut or error logs of enterprise applications.

I think MapReduce is an answer to this problem. Here's some info about MapReduce:

MapReduce is a parallel and distributed approach developed by Google for processing large datasets. MapReduce is used by Google and Yahoo to power their web search. MapReduce was first described in a research paper from Google.

MapReduce has two key components: Map and Reduce. Map is a function that is applied to a set of input values and produces a set of key/value pairs. Reduce is a function that takes these results and applies another function to the output of the map function. In other words: Map transforms a set of data into key/value pairs, and Reduce aggregates the values for each key into a smaller result, often a single value. A reducer receives all the data for an individual "key" from all the mappers.
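To make the two functions concrete, here is a minimal single-machine sketch in Python for the log-search case mentioned earlier. The function names (map_line, reduce_counts, run), the sample log lines, and the rule for spotting an exception name are all made up for illustration. This is not how Google's or Hadoop's implementation looks, just the shape of the idea.

```python
from collections import defaultdict

def map_line(line):
    """Map: emit an (exception_name, 1) pair for each exception found in a log line."""
    for token in line.split():
        word = token.strip(":,;()")
        # Illustrative rule only: treat tokens ending in Exception/Error as exception names.
        if word.endswith("Exception") or word.endswith("Error"):
            yield (word, 1)

def reduce_counts(key, values):
    """Reduce: collapse all values seen for one key into a single total."""
    return (key, sum(values))

def run(lines):
    # Group every value emitted by the mappers under its key,
    # so that each reduce call sees all the data for one key.
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_line(line):
            grouped[key].append(value)
    return dict(reduce_counts(key, values) for key, values in grouped.items())

# Hypothetical log lines, invented for this example.
sample_log = [
    "2010-09-20 ERROR java.lang.NullPointerException: at Foo.bar()",
    "2010-09-20 INFO request served",
    "2010-09-20 ERROR java.io.IOException: disk full",
    "2010-09-20 ERROR java.lang.NullPointerException: at Baz.qux()",
]
result = run(sample_log)
```

Here result maps each exception name to how often it appeared in the sample lines. In a real cluster the map calls would run on many machines at once, and the grouping step would be done by the framework.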

The approach assumes that there are no dependencies between the input data. This makes it easy to parallelize the problem. The number of parallel reduce tasks is limited by the number of distinct "key" values emitted by the map function.
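The limit on parallel reduce tasks can be seen in a small sketch. It assumes keys are assigned to reducers by hashing, which is a common choice but only an assumption here. The names partition and busy_reducers are hypothetical, and zlib.crc32 is used only because it gives the same value on every run.

```python
import zlib

def partition(key, num_reducers):
    """Pick a reducer index for a key; the same key always lands on the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def busy_reducers(keys, num_reducers):
    """Return the set of reducer indexes that receive at least one key."""
    return {partition(key, num_reducers) for key in keys}

# Three distinct keys spread over ten available reducers (made-up values).
keys = ["IOException", "NullPointerException", "TimeoutError"]
active = busy_reducers(keys, num_reducers=10)
```

With only three distinct keys, at most three of the ten reducers can ever receive work, no matter how many machines are available.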

MapReduce usually also comes with a framework. A master controls the whole MapReduce process. The MapReduce framework is responsible for load balancing, re-issuing tasks if a worker has failed or is too slow, etc. The master divides the input data into separate units, sends individual chunks of data to the mapper machines, and collects the information once a mapper is finished. Once the mappers are finished, the reducer machines are assigned work. All key/value pairs with the same key are sent to the same reducer.
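The master's job described above can be sketched on one machine, with a thread pool standing in for the worker machines. Everything here (split_input, map_chunk, reduce_key, master, the chunk size, the sample records) is hypothetical. Load balancing and re-issuing failed or slow tasks are left out to keep the sketch short.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def split_input(records, chunk_size):
    """Master step 1: divide the input into separate chunks."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]

def map_chunk(chunk):
    """Mapper: emit (log_level, 1) for every non-empty record in the chunk."""
    return [(record.split()[0], 1) for record in chunk if record.strip()]

def reduce_key(key, values):
    """Reducer: sum all the values collected for one key."""
    return key, sum(values)

def master(records, chunk_size=2, workers=4):
    chunks = split_input(records, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map phase: one task per chunk; list() waits until every mapper is finished.
        mapped = list(pool.map(map_chunk, chunks))
        # Shuffle: all pairs with the same key are collected for the same reducer.
        grouped = defaultdict(list)
        for pairs in mapped:
            for key, value in pairs:
                grouped[key].append(value)
        # Reduce phase: starts only after the mappers are done, one task per key.
        reduced = list(pool.map(lambda item: reduce_key(*item), grouped.items()))
    return dict(reduced)

# Made-up log records for illustration.
records = [
    "ERROR disk full",
    "INFO started",
    "ERROR timeout",
    "WARN slow response",
    "INFO stopped",
]
totals = master(records)
```

A real framework would also watch the workers and hand a chunk to another machine when one fails or falls behind, which is the part that makes it practical on large clusters.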

Here's a nice video about MapReduce:



Apache Hadoop, which started as one large open source project, is getting very popular in this space.

Here's a link if you want to know more about Hadoop:

Apache Hadoop



If you want a paid solution and don't have time to develop one, take a look at Splunk. It's a nice, easy, and very powerful tool, and you can feed any data into it. I have played with Splunk, and I believe Splunk uses Hadoop APIs.
