MapReduce is a core component of the Apache Hadoop software framework.
Hadoop enables resilient, distributed processing of massive unstructured data sets across commodity computer clusters, in which each node of the cluster includes its own storage. MapReduce serves two essential functions: it filters and parcels out work to various nodes within the cluster or map, a function sometimes referred to as the mapper, and it organizes and reduces the results from each node into a cohesive answer to a query, referred to as the reducer.
How MapReduce works
The original version of MapReduce involved several component daemons, including:
- JobTracker -- the master node that manages all the jobs and resources in a cluster;
- TaskTrackers -- agents deployed to each machine in the cluster to run the map and reduce tasks; and
- JobHistory Server -- a component that tracks completed jobs and is typically deployed as a separate function or with JobTracker.
With the introduction of MapReduce and Hadoop version 2, previous JobTracker and TaskTracker daemons have been replaced with components of Yet Another Resource Negotiator (YARN), called ResourceManager and NodeManager.
- ResourceManager runs on a master node and handles the submission and scheduling of jobs on the cluster. It also monitors jobs and allocates resources.
- NodeManager runs on slave nodes and interoperates with Resource Manager to run tasks and track resource usage. NodeManager can employ other daemons to assist with task execution on the slave node.
To distribute input data and collate results, MapReduce operates in parallel across massive cluster sizes. Because cluster size doesn't affect a processing job's final results, jobs can be split across almost any number of servers. Therefore, MapReduce and the overall Hadoop framework simplify software development.
MapReduce is available in several languages, including C, C++, Java, Ruby, Perl and Python. Programmers can use MapReduce libraries to create tasks without dealing with communication or coordination between nodes.
MapReduce is also fault-tolerant, with each node periodically reporting its status to a master node. If a node doesn't respond as expected, the master node reassigns that piece of the job to other available nodes in the cluster. This creates resiliency and makes it practical for MapReduce to run on inexpensive commodity servers.
MapReduce examples and uses
The power of MapReduce is in its ability to tackle huge data sets by distributing processing across many nodes, and then combining or reducing the results of those nodes.
As a basic example, users could list and count the number of times every word appears in a novel as a single server application, but that is time-consuming. By contrast, users can split the task among 26 people, so each takes a page, writes a word on a separate sheet of paper and takes a new page when they're finished. This is the map aspect of MapReduce. And if a person leaves, another person takes his or her place. This exemplifies MapReduce's fault-tolerant element.
When all the pages are processed, users sort their single-word pages into 26 boxes, which represent the first letter of each word. Each user takes a box and sorts each word in the stack alphabetically. The number of pages with the same word is an example of the reduce aspect of MapReduce.
There is a broad range of real-world uses for MapReduce involving complex and seemingly unrelated data sets. For example, a social networking site could use MapReduce to determine users' potential friends, colleagues and other contacts based on site activity, names, locations, employers and many other data elements. A booking website could use MapReduce to examine the search criteria and historical behaviors of users, and can create customized offerings for each. An industrial facility could collect equipment data from different sensors across the installation and use MapReduce to tailor maintenance schedules or predict equipment failures to improve overall uptime and cost-savings.
MapReduce services and alternatives
One challenge with MapReduce is the infrastructure it requires to run. Many businesses that could benefit from big data tasks can't sustain the capital and overhead needed for such an infrastructure. As a result, some organizations rely on public cloud services for Hadoop and MapReduce, which offer enormous scalability with minimal capital costs or maintenance overhead.
For example, Amazon Web Services (AWS) provides Hadoop as a service through its Amazon Elastic MapReduce (EMR) offering. Microsoft Azure offers its HDInsight service, which enables users to provision Hadoop, Apache Spark and other clusters for data processing tasks. Google Cloud Platform provides its Cloud Dataproc service to run Spark and Hadoop clusters.
For organizations that prefer to build and maintain private, on-premises big data infrastructures, Hadoop and MapReduce represent only one option. Organizations can opt to deploy other platforms, such as Apache Spark, High-Performance Computing Cluster and Hydra. The big data framework an enterprise chooses will depend on the types of processing tasks required, supported programming languages, and performance and infrastructure demands.