
Hadoop Performance Tuning Best Practices

1. Objective

This tutorial on Hadoop performance tuning best practices will provide you with ways to improve your Hadoop cluster performance and get the best results from your Hadoop programming. It covers concepts like memory tuning in Hadoop, map disk spill in Hadoop, tuning mapper tasks, speculative execution in big data Hadoop, and many other related concepts.


2. Introduction to Hadoop Performance Tuning

Performance tuning in Hadoop helps you optimize your Hadoop cluster performance so that it delivers the best results for Hadoop programming in big data companies. To do this, you need to repeat the process below until the desired output is achieved in an optimal way.

Run Job -> Identify Bottleneck -> Address Bottleneck.

The first step is to run the Hadoop job, identify the bottlenecks, and address them using the methods below to get the highest performance. Repeat this cycle until the desired level of performance is achieved.

3. Tuning Hadoop run-time parameters

Hadoop provides many options for performance tuning related to CPU, memory, disk, and network. Most Hadoop tasks are not CPU-bound; what matters most is optimizing memory usage and disk spills.

a. Memory Tuning

The most general and common rule for memory tuning is: use as much memory as you can without triggering swapping. The parameter for task memory is mapred.child.java.opts, which can be set in your configuration file.
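
As a minimal sketch of setting this parameter programmatically (the heap size here is purely illustrative; choose a value that fits your nodes without swapping), assuming a standard MapReduce driver:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class MemoryTuningExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Give each task JVM a 2 GB heap (illustrative value only);
            // keep the total of all task heaps below the node's physical RAM.
            conf.set("mapred.child.java.opts", "-Xmx2048m");
            Job job = Job.getInstance(conf, "memory-tuned job");
            // ... set mapper, reducer, input and output paths as usual ...
        }
    }

The same value can equally be placed in mapred-site.xml so that it applies cluster-wide.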

You can also monitor memory usage on the servers using Ganglia, Cloudera Manager, or Nagios for better memory performance.

b. Minimize the Map Disk Spill

Disk IO is usually the performance bottleneck in Hadoop. There are a lot of parameters you can tune to minimize spilling, such as:

- Compression of mapper output
- Usage of 70% of heap memory in the mapper for the spill buffer

But do you think frequent spilling is a good idea?

It is highly suggested not to spill more than once; each additional spill forces all the data to be re-read and re-written, roughly 3x the IO. A configuration sketch for these settings follows below.
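
A minimal sketch of both suggestions on a job configuration, using the current mapreduce.* property names (older releases use mapred.compress.map.output, io.sort.mb and io.sort.spill.percent); the buffer size and codec are illustrative choices:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;

    public class SpillTuningExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Compress intermediate map output to cut disk and network IO.
            conf.setBoolean("mapreduce.map.output.compress", true);
            conf.setClass("mapreduce.map.output.compress.codec",
                          SnappyCodec.class, CompressionCodec.class);
            // Make the in-memory sort buffer large enough that map output
            // spills at most once; let it fill to ~70% before spilling.
            conf.setInt("mapreduce.task.io.sort.mb", 256);      // illustrative size
            conf.setFloat("mapreduce.map.sort.spill.percent", 0.70f);
            Job job = Job.getInstance(conf, "spill-tuned job");
            // ... set mapper, reducer, input and output paths as usual ...
        }
    }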

c. Tuning mapper tasks

The number of mapper tasks is set implicitly, unlike the number of reducer tasks. The most common way to tune mappers is to control the number of mappers and the size of each job. When dealing with large files, Hadoop splits the file into smaller chunks so that mappers can process them in parallel. However, initializing a new mapper usually takes a few seconds, which is also an overhead to be minimized. Below are the suggestions for the same:

- Reuse the task JVM.
- Aim for map tasks that run for 1-3 minutes each. If the average mapper runs for less than one minute, increase mapred.min.split.size to allocate fewer mappers per slot and thus reduce the mapper initialization overhead (a configuration sketch follows this list).
- Use CombineFileInputFormat for a bunch of smaller files.
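
A minimal sketch of these three suggestions in a driver; the split sizes are illustrative, JVM reuse only exists in MRv1 (YARN-based releases replaced it with uber jobs), and newer releases spell the split property mapreduce.input.fileinputformat.split.minsize:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

    public class MapperTuningExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Larger splits mean fewer mappers, each running longer (illustrative: 256 MB).
            conf.setLong("mapred.min.split.size", 256L * 1024 * 1024);
            // MRv1 only: reuse one JVM for an unlimited number of map tasks.
            conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
            Job job = Job.getInstance(conf, "mapper-tuned job");
            // Pack many small files into each split instead of one mapper per file.
            job.setInputFormatClass(CombineTextInputFormat.class);
            CombineTextInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
            // ... set mapper, reducer, input and output paths as usual ...
        }
    }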

4. Tuning application-specific performance

a. Minimize your mapper output

Minimizing the mapper output can improve general performance a lot, because the shuffle phase is sensitive to disk IO, network IO, and memory usage.

To achieve this, below are the suggestions:

- Filter the records on the mapper side instead of the reducer side (see the sketch below).
- Use minimal data to form your map output key and map output value in MapReduce.
- Compress the mapper output.
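
As an illustration of the first two points, the mapper below filters records on the map side and emits only the compact fields the reducer actually needs; the record layout ("userId,country,bytes") and the filter condition are made up for this sketch:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical input records of the form "userId,country,bytes".
    // Records the job does not need are dropped in the mapper, and only a
    // small (userId, bytes) pair is emitted instead of the whole line.
    public class FilteringMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private final Text userId = new Text();
        private final IntWritable bytes = new IntWritable();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length == 3 && "US".equals(fields[1])) {   // map-side filter
                userId.set(fields[0]);
                bytes.set(Integer.parseInt(fields[2]));
                context.write(userId, bytes);
            }
        }
    }

The third point, compressing the mapper output, uses the same mapreduce.map.output.compress setting shown in the spill-tuning sketch above.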

b. Balancing reducer’s loading

Unbalanced reducer tasks create another performance issue. Some reducers receive most of the output from the mappers and run extremely long compared to the other reducers.

Below are the methods to do the same:

- Implement a better hash function in the Partitioner class (a sketch follows below).
- Write a preprocessing job to separate keys using MultipleOutputs, then use another MapReduce job to process the special keys that cause the problem.
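
A minimal sketch of the first suggestion, assuming Text keys and IntWritable values; the extra mixing step is only an example of spreading keys more evenly, not a fix for every skewed dataset:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class BalancedPartitioner extends Partitioner<Text, IntWritable> {

        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            // Mix the high bits into the low bits so that keys whose default
            // hash codes collide on the low bits still spread across reducers.
            int h = key.toString().hashCode();
            h ^= (h >>> 16);
            return (h & Integer.MAX_VALUE) % numPartitions;
        }
    }

It is wired into the job with job.setPartitionerClass(BalancedPartitioner.class).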

c. Reduce Intermediate Data with Combiner in Hadoop

Implement a combiner to reduce the amount of intermediate data, which enables faster data transfer.
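
A minimal sketch for a word-count-style job: the combiner is an ordinary Reducer that runs on each mapper's local output, so far fewer (key, value) pairs cross the network. This only works when the reduce operation is associative and commutative, such as sums or counts.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Collapses many (word, 1) pairs into a single (word, partialSum) pair
    // on the map side before the shuffle.
    public class IntSumCombiner
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable sum = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable v : values) {
                total += v.get();
            }
            sum.set(total);
            context.write(key, sum);
        }
    }

It is registered in the driver with job.setCombinerClass(IntSumCombiner.class).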

d. Speculative Execution

MapReduce jobs are impacted when tasks take a long time to finish execution. This problem is addressed by speculative execution, which backs up slow tasks on alternate machines. You need to set the configuration parameters ‘mapreduce.map.tasks.speculative.execution’ and ‘mapreduce.reduce.tasks.speculative.execution’ to true to enable speculative execution. This will reduce the job execution time if a task's progress is slow due to memory unavailability.
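
A minimal sketch using the property names quoted above; note that recent Hadoop releases expose the same switches as mapreduce.map.speculative and mapreduce.reduce.speculative, so check which names your version honors:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SpeculativeExecutionExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Launch backup attempts for straggling map and reduce tasks.
            conf.setBoolean("mapreduce.map.tasks.speculative.execution", true);
            conf.setBoolean("mapreduce.reduce.tasks.speculative.execution", true);
            Job job = Job.getInstance(conf, "speculative-execution example");
            // ... set mapper, reducer, input and output paths as usual ...
        }
    }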

5. Conclusion

There are several performance tuning tips and tricks for a Hadoop cluster, and we have highlighted some of the important ones. For more tricks to improve Hadoop cluster performance, check Job optimization techniques in Big Data.
