Hadoop configuration optimization


• Operating system (OS)
1. Disable Linux Logical Volume Manager (LVM) on the data disks.
2. Mount the data partitions with the noatime and nodiratime options to disable access-time updates on files and directories.
3. Adjust the Linux kernel parameters (/etc/sysctl.conf): set vm.swappiness=0 (see the sketch after this list).
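
A minimal sketch of the three OS-level settings, assuming an example data disk /dev/sdb1 mounted at /data/1 (both placeholders, not values from the original text):

    # /etc/fstab -- mount data partitions without access-time updates
    /dev/sdb1  /data/1  ext4  defaults,noatime,nodiratime  0 0

    # /etc/sysctl.conf -- discourage the kernel from swapping out Hadoop daemons
    vm.swappiness = 0

    # apply the sysctl change immediately, then remount the data partition
    sysctl -p
    mount -o remount /data/1
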
• HDFS parameter optimization
Unless otherwise noted, the following are hdfs-site.xml properties. A consolidated configuration sketch follows the list.
1. fs.default.name (core-site.xml)
Defines the URL of the default file system used by clients.
2. dfs.name.dir (important)
Defines a comma-separated (no spaces) list of local file system paths; the NameNode keeps a redundant copy of the HDFS metadata on each of these paths.
3. dfs.data.dir
Defines where the DataNode stores HDFS data blocks.
4. io.file.buffer.size (core-site.xml)
Sets the buffer size Hadoop uses for file I/O. Default: 4 KB; recommended: 64 KB (65536 bytes).
5. dfs.balance.bandwidthPerSec
Defines the maximum bandwidth, in bytes per second, that each DataNode may use for HDFS balancer operations.
6. dfs.block.size
Defines the default block size for newly created files. Default: 64 MB; recommended: 128 MB (134217728 bytes).
7. dfs.datanode.du.reserved
Defines how much disk space, in bytes, to keep reserved on each volume listed in dfs.data.dir. Default: 0; recommended: 10737418240 (10 GB).
8. dfs.namenode.handler.count
Defines the size of the NameNode worker-thread pool, which services remote procedure calls (RPCs) from clients and cluster daemons. The general rule is to set it to 20 × ln N, where N is the cluster size; for example, a 20-node cluster gives 20 × ln 20 ≈ 60 threads.
9. dfs.datanode.failed.volumes.tolerated
Defines how many disk failures a DataNode tolerates before declaring itself failed. Default: 0; recommended: 1.
10. fs.trash.interval (core-site.xml)
Defines how long files are kept in the .Trash directory before being permanently deleted. Default: 0 (files are not kept); recommended: 1440 (24 hours). Note: the trash feature applies only to deletions made from the command line; files deleted through the Java API are removed immediately.
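
Pulling these recommendations together, the core-site.xml and hdfs-site.xml entries might look like the sketch below. The host name, port, and storage paths are illustrative placeholders, and the handler count assumes the 20-node example above:

    <!-- core-site.xml -->
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://namenode-host:8020</value>  <!-- placeholder host/port -->
      </property>
      <property>
        <name>io.file.buffer.size</name>
        <value>65536</value>  <!-- 64 KB I/O buffer -->
      </property>
      <property>
        <name>fs.trash.interval</name>
        <value>1440</value>  <!-- keep trash for 24 hours -->
      </property>
    </configuration>

    <!-- hdfs-site.xml -->
    <configuration>
      <property>
        <name>dfs.name.dir</name>
        <value>/data/1/dfs/nn,/data/2/dfs/nn</value>  <!-- comma-separated, no spaces -->
      </property>
      <property>
        <name>dfs.data.dir</name>
        <value>/data/1/dfs/dn,/data/2/dfs/dn</value>
      </property>
      <property>
        <name>dfs.block.size</name>
        <value>134217728</value>  <!-- 128 MB blocks -->
      </property>
      <property>
        <name>dfs.datanode.du.reserved</name>
        <value>10737418240</value>  <!-- reserve 10 GB per volume -->
      </property>
      <property>
        <name>dfs.namenode.handler.count</name>
        <value>60</value>  <!-- 20 x ln(20) for an assumed 20-node cluster -->
      </property>
      <property>
        <name>dfs.datanode.failed.volumes.tolerated</name>
        <value>1</value>
      </property>
    </configuration>
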
• MapReduce parameter optimization
Unless otherwise noted, the following are mapred-site.xml properties. A consolidated configuration sketch follows the list.
1. mapred.job.tracker
Defines the host name and port on which the JobTracker listens for remote procedure calls (RPCs).
2. mapred.local.dir
Defines the paths where intermediate map output files are stored. If these paths are on dedicated disks, there is no need to configure the dfs.datanode.du.reserved property in hdfs-site.xml.
3. mapred.child.java.opts
The TaskTracker launches each task in a separate JVM, and this property passes options to that JVM at startup; the most common use is setting its memory. Default: -Xmx200m; recommended: -Xmx2g.
4. mapred.child.ulimit
Caps the virtual memory of the task JVM (i.e., the maximum virtual memory a task process may use before it is terminated). It can generally be set to 1.5 × the JVM maximum heap size.
5. mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum
The total MapReduce task concurrency on a node equals mapred.tasktracker.map.tasks.maximum plus mapred.tasktracker.reduce.tasks.maximum. Each task runs in its own JVM, whose memory is set by mapred.child.java.opts: if mapred.child.java.opts is set to 2 GB and mapred.tasktracker.map.tasks.maximum to 12, the map tasks alone can consume 24 GB of memory. The value is generally 1.5 tasks per physical CPU (CPUs × 1.5 = total number of concurrent MapReduce tasks), with roughly two thirds of the slots given to map tasks and one third to reduce tasks; on an 8-core node, for example, that is 12 slots split as 8 map and 4 reduce.
6. io.sort.mb
Map task output is written to a circular memory buffer whose size is set by this property. When the buffer reaches 80% utilization, a background thread begins spilling its contents to the paths specified by mapred.local.dir. Default: 100 MB. Note that this memory is part of the task JVM's heap space.
7. io.sort.factor
Defines how many files can be merged in one pass. Two situations trigger file merging in MapReduce: first, the spill files of a map task are merged once the task has completed; second, on the reduce side, the outputs fetched from all map tasks are merged before the user's reduce code is called. Opening more files per merge round reduces the time spent reading and writing the disk and the number of disk I/O operations, but more open files also require more memory.
8. mapred.compress.map.output
Enables compression of map output. Recommended: true.
9. mapred.map.output.compression.codec
Defines the codec used to compress map output in MapReduce tasks. If the value is empty, org.apache.hadoop.io.compress.DefaultCodec is used.
Recommended: org.apache.hadoop.io.compress.SnappyCodec
10. mapred.output.compression.type
If the output of a MapReduce job is written in SequenceFile format, this parameter determines the compression type. Three values are accepted: RECORD compresses each record individually; BLOCK compresses records together in fixed-size blocks; NONE applies no compression. Recommended: BLOCK.
11. mapred.job.tracker.handler.count
The JobTracker maintains a pool of worker threads to handle remote procedure calls (RPCs); this value controls the pool size. Recommended: 20 × ln N, where N is the cluster size.
12. mapred.reduce.parallel.copies
During the shuffle phase of a MapReduce job, each reducer must fetch intermediate data from the map tasks that ran on the various TaskTrackers. This parameter sets how many of those fetches a reducer performs concurrently; it can generally be set to 4 × ln N, where N is the cluster size.
13. mapred.reduce.tasks
Specifies the number of reduce tasks a job launches.

14. tasktracker.http.threads
Controls the number of worker threads each TaskTracker uses to serve parallel requests for map output; the HTTP thread count should scale in proportion to the number of reduce slots in the cluster, i.e. to how much data is fetched in parallel.
15. mapred.reduce.slowstart.completed.maps
Once some map tasks have produced intermediate results, the reducers can begin the shuffle early and copy that data as soon as possible, so that the reduce tasks can run as soon as the final map task finishes. This property defines the fraction of map tasks that must complete before the reduce phase may start. Recommended: 0.8 (80%).
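
A mapred-site.xml sketch that pulls the recommendations above together. The JobTracker host and port, the local paths, the 8-core slot split, and the 20-node cluster behind the thread and copy counts are all illustrative assumptions:

    <!-- mapred-site.xml (assumes 8-core workers and a 20-node cluster) -->
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>jobtracker-host:8021</value>  <!-- placeholder host/port -->
      </property>
      <property>
        <name>mapred.local.dir</name>
        <value>/data/1/mapred/local,/data/2/mapred/local</value>
      </property>
      <property>
        <name>mapred.child.java.opts</name>
        <value>-Xmx2g</value>
      </property>
      <property>
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>8</value>  <!-- 2/3 of the 12 slots (8 cores x 1.5) -->
      </property>
      <property>
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>4</value>  <!-- remaining 1/3 of the slots -->
      </property>
      <property>
        <name>mapred.compress.map.output</name>
        <value>true</value>
      </property>
      <property>
        <name>mapred.map.output.compression.codec</name>
        <value>org.apache.hadoop.io.compress.SnappyCodec</value>
      </property>
      <property>
        <name>mapred.job.tracker.handler.count</name>
        <value>60</value>  <!-- 20 x ln(20), assumed 20-node cluster -->
      </property>
      <property>
        <name>mapred.reduce.parallel.copies</name>
        <value>12</value>  <!-- 4 x ln(20), assumed 20-node cluster -->
      </property>
      <property>
        <name>mapred.reduce.slowstart.completed.maps</name>
        <value>0.8</value>
      </property>
    </configuration>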
