Principles of MapReduce
Programming model and Execution overview of MapReduce as described in the white paper by Jeffrey Dean and Sanjay Ghemawat [http://static.googleusercontent.com/media/research.google.com/en/us/archive/mapreduce-osdi04.pdf].
Computation in MapReduce is expressed in two functions: Map and Reduce
Takes an input pair and produces a set of intermediate key/value pairs.
map (k1,v1) → list(k2,v2)
Accepts an intermediate key I and a set of values for that key.
reduce (k2,list(v2)) → list(v2)
When an MR job is submitted:
- The input files are split into M partitions based on split size which can be configured using mapreduce.input.fileinputformat.split.minsize, mapreduce.input.fileinputformat.split.maxsize
- Job then forks multiple copies of the program on the cluster, one of which is Master which assigns the tasks(map/reduce) to workers.
- The worker assigned with the map task then reads input from the splits as key/value pairs and passes it to the user-defined Map function. The intermediate key/value pairs produced by the Map function are stored in memory.
- Periodically, these in-memory pairs are written to local disk partitioned into R pieces by partitioning function. The locations of these key/value pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.
- A reduce worker is notified of these locations by the master, then it does remote procedure calls(RPCs) to get data from map worker disks. When a reduce worker has read all data from mappers, it then sorts the pairs by key so that all occurrences of a key can be grouped together. Different keys are mapped to a reducer task
- The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user’s Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.
- When all map tasks and reduce tasks have been completed, the control will be passed back to the user program.