Principles of MapReduce

Programming model

Computation in MapReduce is expressed as two user-defined functions: Map and Reduce.

Map function

Takes an input pair and produces a set of intermediate key/value pairs.

map (k1,v1) → list(k2,v2)
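As a minimal sketch of this signature, here is a word-count Map function in Python; the document name and its contents stand in for k1 and v1 (the function name and arguments are illustrative, not part of any MapReduce API):

```python
def map_fn(doc_name, contents):
    """Emit an intermediate (word, 1) pair for each word in the document."""
    return [(word, 1) for word in contents.split()]

# map_fn("doc1", "to be or not to be")
# -> [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)]
```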

Reduce function

Accepts an intermediate key and the set of values for that key, and merges these values to form a possibly smaller set of values.

reduce (k2,list(v2)) → list(v2)
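Continuing the word-count sketch, a Reduce function matching this signature simply sums the counts emitted for one key (again, the naming is illustrative):

```python
def reduce_fn(word, counts):
    """Merge the list of values for one intermediate key into a single total."""
    return [sum(counts)]

# reduce_fn("be", [1, 1]) -> [2]
```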

Execution overview

  1. The MapReduce library splits the input into pieces and forks multiple copies of the program on the cluster. One of these copies is the master, which assigns map and reduce tasks to the worker copies.
  2. A worker assigned a map task reads key/value pairs from its input split and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by Map are buffered in memory.
  3. Periodically, these buffered pairs are written to local disk, partitioned into R pieces by the partitioning function (e.g., hash(key) mod R). The locations of the pairs on local disk are passed back to the master, which is responsible for forwarding these locations to the reduce workers.
  4. When the master notifies a reduce worker of these locations, the worker uses remote procedure calls (RPCs) to read the data from the map workers' local disks. After a reduce worker has read all of its intermediate data, it sorts the pairs by key so that all occurrences of the same key are grouped together; many different keys typically map to the same reduce task.
  5. The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user’s Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.
  6. When all map and reduce tasks have completed, control returns to the user program.
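The steps above can be sketched as a single-process simulation. This is not how a real cluster runs, just an illustration of the data flow, assuming the word-count Map and Reduce functions and using hash(key) mod R as the partitioning function from step 3:

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    # User-defined Map: emit (word, 1) per word.
    return [(w, 1) for w in contents.split()]

def reduce_fn(key, values):
    # User-defined Reduce: sum all counts for one key.
    return [sum(values)]

def run_job(splits, R=2):
    # Steps 2-3: each "map worker" applies Map to its split and
    # partitions the intermediate pairs into R buckets on "local disk".
    partitions = [defaultdict(list) for _ in range(R)]
    for name, text in splits.items():
        for k, v in map_fn(name, text):
            partitions[hash(k) % R][k].append(v)
    # Steps 4-5: each "reduce worker" sorts its keys so occurrences
    # group together, then applies Reduce per unique key; each dict
    # plays the role of one reduce partition's output file.
    outputs = []
    for r in range(R):
        out = {}
        for k in sorted(partitions[r]):
            out[k] = reduce_fn(k, partitions[r][k])
        outputs.append(out)
    return outputs
```

For example, `run_job({"d1": "a b a", "d2": "b c"})` produces R output partitions whose combined contents are the totals `a: 2, b: 2, c: 1`.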





chandra praneeth

Interested in distributed systems and big data. Enjoys reading tech blogs and playing badminton