Every workload is unique, so a Hadoop cluster configuration that works well for one workload might not be optimal for another. The key to running jobs efficiently is to match your configuration to the workloads you actually run.
One way to accomplish this is by creating a performance baseline. You can compare your configuration changes against that baseline to determine whether they had the desired effect.
To create a baseline, start with a default configuration and run a job that is representative of the types of jobs you plan to run once the cluster is in production. Run the job several times, check its history after each run to see how long it took, then calculate the average run time.
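The averaging step can be sketched in a few lines of Python. This is a minimal illustration, not part of Hadoop: the function name and the run times are hypothetical, and it assumes you have already copied each run's duration out of the job history. Recording the standard deviation alongside the mean helps you judge later whether a difference is larger than normal run-to-run variation.

```python
from statistics import mean, stdev


def summarize_baseline(durations_s):
    """Return (mean, sample standard deviation) of job run times in seconds."""
    if len(durations_s) < 2:
        raise ValueError("need at least two runs to estimate variation")
    return mean(durations_s), stdev(durations_s)


# Hypothetical wall-clock times, in seconds, from five runs of the same job
# on the default configuration.
baseline_runs = [412.0, 398.0, 405.0, 420.0, 401.0]

baseline_mean, baseline_sd = summarize_baseline(baseline_runs)
print(f"baseline: {baseline_mean:.1f} s (sd {baseline_sd:.1f} s)")
```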
When you launch a job, the system creates two files: a job configuration XML file and a job status file. The job status file is useful for tracking the effect of your configuration changes because it contains status information and runtime metrics. You can view job status data with the hadoop job command, for example hadoop job -status followed by the job ID. If you aren't sure of the job's ID, run hadoop job -list all to list every job.
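If you collect status data for many runs, it can help to wrap the command in a small script. The following is a sketch under stated assumptions: the helper names are hypothetical, the hadoop executable is assumed to be on the PATH, and the output is returned as raw text because its layout varies between releases. On newer releases, hadoop job is deprecated in favor of mapred job, which accepts the same -list and -status options.

```python
import subprocess


def job_command(job_id=None):
    """Build the argument list: list all jobs, or show one job's status."""
    if job_id is None:
        return ["hadoop", "job", "-list", "all"]
    return ["hadoop", "job", "-status", job_id]


def run_job_command(job_id=None):
    """Run the command and return its raw output (needs hadoop on the PATH)."""
    result = subprocess.run(
        job_command(job_id), capture_output=True, text=True, check=True
    )
    return result.stdout


# Print the commands instead of running them, so this works without a cluster.
# The job ID below is a made-up placeholder.
print(" ".join(job_command()))
print(" ".join(job_command("job_201901010000_0001")))
```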
Assuming you use the baseline comparison method to quantify your configuration changes, the most important thing to keep in mind is to make one change at a time. If you change several different parameters at the same time, then receive an unexpected result in your job status file, it can be difficult to determine which change -- or changes -- caused the unexpected behavior. Making one change at a time can be tedious, but it provides the most control over the cluster's performance.
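One way to enforce the one-change-at-a-time rule is to compare the property set you are about to test against the one used for the previous run. The sketch below assumes each configuration has already been loaded into a plain dictionary of property names and values; the function names and the sample values are illustrative only.

```python
def changed_properties(before, after):
    """Map each property that differs to its (old, new) pair."""
    keys = sorted(set(before) | set(after))
    return {
        key: (before.get(key), after.get(key))
        for key in keys
        if before.get(key) != after.get(key)
    }


def assert_single_change(before, after):
    """Raise ValueError unless exactly one property differs."""
    diff = changed_properties(before, after)
    if len(diff) != 1:
        raise ValueError(f"expected 1 changed property, found {len(diff)}: {sorted(diff)}")
    return diff


# Illustrative values only: the previous run's settings and the next trial's.
previous = {"io.file.buffer.size": "4096", "dfs.replication": "3"}
trial = {"io.file.buffer.size": "131072", "dfs.replication": "3"}

print(assert_single_change(previous, trial))
```

A property that is missing from one side shows up with None in its place, so adding or removing a setting also counts as a change.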
There are several files that you might want to examine as you fine-tune your Hadoop cluster configuration. They fall into two groups: site-specific configuration files and default configuration files.
Site-specific configuration files
The core-site.xml file is an example of a site-specific Hadoop configuration file. It enables you to control the size of the read/write buffers, as well as the amount of memory that is allocated to the file system. You can also use it to set a limit for data storage memory.
Another file you can use to adjust key Hadoop cluster configuration details is hdfs-site.xml. HDFS is the Hadoop Distributed File System; the hdfs-site.xml file is where you set the storage paths used by the NameNode and the DataNodes. You can also use this file to set the data replication value.
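Site files share one simple layout: a configuration element that contains property elements, each with a name and a value. The sketch below generates an hdfs-site.xml body in that layout. The property names dfs.namenode.name.dir, dfs.datanode.data.dir and dfs.replication are the Hadoop 2 and later names for the settings described above; the directory paths, the replication value and the render_site_xml helper are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def render_site_xml(properties):
    """Render a dict of settings in Hadoop's <configuration> XML layout."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    if hasattr(ET, "indent"):  # pretty-printing is available on Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# Hypothetical paths and replication factor; substitute your own values.
hdfs_site = render_site_xml({
    "dfs.namenode.name.dir": "/data/hadoop/namenode",
    "dfs.datanode.data.dir": "/data/hadoop/datanode",
    "dfs.replication": 2,
})
print(hdfs_site)
```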
Hadoop also enables you to configure a site's MapReduce settings through the mapred-site.xml file. This file allows you to permit or exclude DataNodes and TaskTrackers, and it configures the path for the MapReduce framework.
Default configuration files
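Default configuration files are the ones that the entire Hadoop cluster uses, and they are treated as read-only. Rather than editing them, you consult them to see which properties exist and what their default values are, then override a property in the matching site-specific file. One such file is core-default.xml, which defines defaults for things such as health monitoring parameters and the ZooKeeper session timeout value.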
Hdfs-default.xml holds the default HDFS settings, including the limit on the number of directory items, the maximum number of blocks per file and other file system-related parameters.
Mapred-default.xml provides the default settings for the MapReduce framework. It includes, for example, the number of virtual cores and the amount of memory to request from the scheduler for each map task.
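Because the default files are treated as read-only, a value set in a site-specific file takes precedence over the same property in the matching default file. The sketch below imitates that overlay so you can preview the effective value of a property before running a job. It is an illustration rather than Hadoop's own loader: the helper names are hypothetical and the XML snippets are cut-down examples. The property names mapreduce.map.memory.mb and mapreduce.map.cpu.vcores are the Hadoop 2 and later names for the per-map-task memory and virtual core requests; the values shown are illustrative.

```python
import xml.etree.ElementTree as ET


def parse_properties(xml_text):
    """Read name/value pairs from Hadoop-style configuration XML."""
    root = ET.fromstring(xml_text.strip())
    return {p.findtext("name"): p.findtext("value") for p in root.iter("property")}


def effective_config(default_xml, site_xml):
    """Overlay site-specific values on top of the defaults."""
    merged = parse_properties(default_xml)
    merged.update(parse_properties(site_xml))
    return merged


# Cut-down stand-ins for mapred-default.xml and mapred-site.xml.
MAPRED_DEFAULT = """
<configuration>
  <property><name>mapreduce.map.memory.mb</name><value>1024</value></property>
  <property><name>mapreduce.map.cpu.vcores</name><value>1</value></property>
</configuration>
"""
MAPRED_SITE = """
<configuration>
  <property><name>mapreduce.map.memory.mb</name><value>2048</value></property>
</configuration>
"""

config = effective_config(MAPRED_DEFAULT, MAPRED_SITE)
print(config["mapreduce.map.memory.mb"], config["mapreduce.map.cpu.vcores"])
```

Feeding two such merged dictionaries, one per trial, into a comparison of your own makes it easy to confirm that only the intended property changed between runs.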