Hadoop Performance Tuning Best Practices
This tutorial on Hadoop performance tuning best practices describes ways to improve your Hadoop cluster's performance and get the best results from your Hadoop programs. It covers memory tuning in Hadoop, map disk spill, tuning mapper tasks, speculative execution, and other related concepts.
2. Introduction to Hadoop Performance Tuning
Performance tuning helps you optimize your Hadoop cluster and get the best results from your Hadoop jobs. It is an iterative process: repeat the cycle below until the desired performance is achieved.
Run Job -> Identify Bottleneck -> Address Bottleneck
The first step is to run the Hadoop job, identify its bottlenecks, and address them using the methods below. Repeat this cycle until the required level of performance is reached.
3. Tuning Hadoop run-time parameters
Hadoop provides many options for tuning CPU, memory, disk, and network usage. Most Hadoop tasks are not CPU-bound, so the main concern is optimizing memory usage and disk spills.
a. Memory Tuning
The most general rule for memory tuning is: use as much memory as you can without triggering swapping. The parameter for task memory is mapred.child.java.opts, which can be set in your configuration file.
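As a rough sketch of the "no swapping" rule, the per-task heap can be derived from the node's RAM and the number of tasks that run at the same time. The node size, slot counts, and the memory reserved for the OS and Hadoop daemons are illustrative assumptions, not recommendations, and `child_heap_mb` is a hypothetical helper. Hadoop 2 and later also provide separate `mapreduce.map.java.opts` and `mapreduce.reduce.java.opts` settings.

```python
def child_heap_mb(node_ram_mb, reserved_mb, map_slots, reduce_slots):
    """Largest per-task heap (MB) that keeps all concurrent tasks within RAM."""
    slots = map_slots + reduce_slots
    if slots <= 0:
        raise ValueError("need at least one task slot")
    usable = node_ram_mb - reserved_mb
    if usable <= 0:
        raise ValueError("reserved memory exceeds node RAM")
    return usable // slots


# Assumed example node: 32 GB RAM, 8 GB reserved for the OS and daemons,
# 8 map slots and 4 reduce slots running concurrently.
heap = child_heap_mb(32 * 1024, 8 * 1024, map_slots=8, reduce_slots=4)

# The JVM uses some memory beyond -Xmx, so treat this value as an upper bound.
java_opts = f"-Xmx{heap}m"
print(f"mapred.child.java.opts={java_opts}")
```

With these assumed numbers each task gets a 2048 MB heap. On a real cluster, check the result against actual swap activity before settling on a value.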
You can also monitor memory usage on the servers using Ganglia, Cloudera Manager, or Nagios.
b. Minimize the Map Disk Spill
Disk IO is usually the performance bottleneck in Hadoop. There are many parameters you can tune to minimize spilling, for example:
- Compress the mapper output
- Use 70% of the mapper's heap memory for the spill buffer
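A back-of-the-envelope estimate shows how the sort buffer size drives the number of spills. The sketch assumes a map task spills each time its buffer reaches the spill threshold and ignores per-record metadata overhead, so it is only a rough model; the output and heap sizes are made up. In Hadoop 1 the relevant settings are `io.sort.mb` and `io.sort.spill.percent` (`mapreduce.task.io.sort.mb` and `mapreduce.map.sort.spill.percent` in Hadoop 2).

```python
import math


def estimated_spills(map_output_mb, sort_mb, spill_percent):
    """Rough number of spill files for one map task (simplified model)."""
    threshold_mb = sort_mb * spill_percent
    # A map task flushes its buffer at least once when it finishes.
    return max(1, math.ceil(map_output_mb / threshold_mb))


# Assumed map output of 300 MB per task.
small_buffer = estimated_spills(300, sort_mb=100, spill_percent=0.80)
large_buffer = estimated_spills(300, sort_mb=400, spill_percent=0.80)

# Buffer sized as 70% of an assumed 1024 MB task heap.
heap_based = estimated_spills(300, sort_mb=int(0.70 * 1024), spill_percent=0.80)

print(small_buffer, large_buffer, heap_based)
```

Under these assumptions a 100 MB buffer spills several times, while a buffer large enough to hold the whole map output spills only once.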
But do you think frequent spilling is a good idea?
It is strongly recommended not to spill more than once: when a map task spills more than once, all of the spilled data has to be re-read and re-written during the merge, which means 3x the IO.
c. Tuning mapper tasks
Unlike the number of reducer tasks, the number of mapper tasks is set implicitly. The most common way to tune mappers is to control the number of mappers and the size of the input each one processes. When dealing with large files, Hadoop splits each file into smaller chunks so that mappers can process them in parallel. However, initializing a new mapper task usually takes a few seconds, which is overhead to be minimized. Suggestions:
- Reuse the JVM across tasks.
- Aim for map tasks that run 1-3 minutes each. If the average mapper running time is less than one minute, increase mapred.min.split.size so that fewer mappers are allocated, which reduces the mapper initialization overhead.
- Use CombineFileInputFormat for a large number of small files.
4. Tuning application-specific performance
a. Minimize your mapper output
Minimizing the mapper output can improve overall performance considerably, because the shuffle phase is sensitive to disk IO, network IO, and memory usage.
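To illustrate the effect, the sketch below simulates a map function over a few log records in plain Python (not the Hadoop API) and compares the bytes emitted when the mapper passes everything through with the bytes emitted when it filters early and emits only what the reducer needs. The record layout and the filter on status `500` are hypothetical.

```python
records = [
    "2024-01-01T00:00:00 /index.html 200 1532 Mozilla/5.0",
    "2024-01-01T00:00:01 /login 500 87 curl/8.0",
    "2024-01-01T00:00:02 /index.html 200 1532 Mozilla/5.0",
    "2024-01-01T00:00:03 /api/items 500 45 python-requests/2.31",
]


def naive_map(line):
    """Emit every record, using the whole line as the value."""
    url = line.split()[1]
    yield url, line


def lean_map(line):
    """Emit only error records, as a minimal (url, 1) pair."""
    _, url, status, *_ = line.split()
    if status == "500":
        yield url, 1


def emitted_bytes(map_fn):
    """Total size of the tab-separated key/value pairs a mapper emits."""
    return sum(
        len(f"{key}\t{value}".encode())
        for line in records
        for key, value in map_fn(line)
    )


naive, lean = emitted_bytes(naive_map), emitted_bytes(lean_map)
print(f"naive: {naive} bytes, lean: {lean} bytes")
```

Every byte the mapper does not emit is a byte that never has to be sorted, spilled, or sent across the network during the shuffle.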
To minimize the mapper output:
- Filter records on the mapper side instead of the reducer side.
- Use minimal data to form the map output key and map output value.
- Compress the mapper output.
b. Balancing reducer’s loading
Unbalanced reducer tasks create another performance issue: some reducers receive most of the mapper output and run far longer than the others.
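The skew can be made concrete with a small simulation in plain Python (not the Hadoop `Partitioner` API). One hot key dominates the input, so a plain hash partitioner sends all of its records to a single reducer. The alternative shown here is key salting, which spreads the hot key across all reducers at the cost of a second pass to merge the partial results. The key names, record counts, and salting scheme are illustrative assumptions.

```python
import zlib
from collections import Counter

NUM_REDUCERS = 4


def hash_partition(key):
    """Default-style partitioning: stable hash of the key modulo reducers."""
    return zlib.crc32(key.encode()) % NUM_REDUCERS


def salted_partition(key, record_index, hot_keys):
    """Spread known hot keys round-robin; hash all other keys as usual."""
    if key in hot_keys:
        return record_index % NUM_REDUCERS
    return hash_partition(key)


# Assumed input: one key accounts for 90% of the records.
keys = ["hot"] * 9000 + [f"key{i}" for i in range(1000)]

plain = Counter(hash_partition(k) for k in keys)
salted = Counter(salted_partition(k, i, {"hot"}) for i, k in enumerate(keys))

print("plain :", sorted(plain.values()))
print("salted:", sorted(salted.values()))
```

With plain hashing one reducer handles at least 9000 of the 10000 records, while the salted version keeps every reducer's share far smaller.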
Ways to balance the reducer load:
- Implement a better hash function in the Partitioner class.
- Write a preprocessing job that separates the problematic keys using MultipleOutputs, then use another MapReduce job to process those special keys.
c. Reduce Intermediate data with Combiner in Hadoop
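A combiner runs a local reduce over each mapper's output before the shuffle. The word-count sketch below is plain Python rather than Hadoop code and only illustrates how many fewer records reach the shuffle; the input splits are made up. In a real job the combiner is registered with `Job.setCombinerClass`, and it must be safe to apply zero or more times (a sum, for example).

```python
from collections import Counter

# Two assumed input splits, one per mapper.
splits = [
    "to be or not to be",
    "to see or not to see",
]


def map_words(text):
    """Map step: emit (word, 1) for every word."""
    return [(word, 1) for word in text.split()]


def combine(pairs):
    """Combiner: sum counts per word within a single mapper's output."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return list(totals.items())


def reduce_counts(pairs):
    """Reduce step: sum counts per word across all mappers."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


without = [pair for s in splits for pair in map_words(s)]
with_combiner = [pair for s in splits for pair in combine(map_words(s))]

print(len(without), "records shuffled without a combiner")
print(len(with_combiner), "records shuffled with a combiner")

# The final result is the same either way.
assert reduce_counts(without) == reduce_counts(with_combiner)
```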
Implement a combiner to reduce the intermediate data before the shuffle, which enables faster data transfer.
d. Speculative Execution
MapReduce jobs are slowed down when individual tasks take a long time to finish. Speculative execution addresses this by running backup copies of slow tasks on other machines. To enable it, set the configuration parameters ‘mapred.map.tasks.speculative.execution’ and ‘mapred.reduce.tasks.speculative.execution’ (‘mapreduce.map.speculative’ and ‘mapreduce.reduce.speculative’ in Hadoop 2 and later) to true. This reduces job execution time when a task progresses slowly because of a problem on its node, such as memory unavailability.
5. Conclusion
There are several performance tuning tips and tricks for a Hadoop Cluster and we have highlighted some of the important ones. For more tricks to improve Hadoop cluster performance, check Job optimization techniques in Big data.