Hive table joins: causes of data skew and solutions

Time:09-17

1. Causes of skew:
Map output is distributed to the reducers according to the hash of the key. When the keys are unevenly distributed, whether because of the business characteristics of the data itself or because the table was designed without enough thought, some reducers receive far more data than others.
(1) Key distribution is not uniform;
(2) The business characteristics of the data itself;
(3) Table design that was not thought through;
(4) Some SQL statements are inherently prone to data skew.

How to avoid it: when the skew is caused by empty (NULL) keys, assign a random value to those keys.
2. Solutions
(1) Parameter adjustment:
hive.map.aggr=true
hive.groupby.skewindata=true
hive.map.aggr=true enables partial aggregation on the map side. Setting hive.groupby.skewindata to true enables load balancing when the data is skewed: the generated query plan then has two MR jobs.
In the first MR job, the map output is distributed randomly across the reducers. Each reducer performs a partial aggregation and outputs the result. Rows with the same GROUP BY key may therefore be sent to different reducers, which balances the load.
The second MR job distributes the pre-aggregated results to the reducers by GROUP BY key (this step guarantees that the same GROUP BY key is sent to the same reducer) and completes the final aggregation.
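As a rough illustration of the two settings, the sketch below turns them on for a session and then runs a GROUP BY. It also shows a hand-written query that imitates the same two-stage idea with a random salt. The table page_views, the column user_id, and the bucket count of 10 are hypothetical and are not from the original post.

```sql
-- Enable map-side partial aggregation.
SET hive.map.aggr=true;
-- Ask Hive to plan a skewed GROUP BY as two MR jobs.
SET hive.groupby.skewindata=true;

-- Hypothetical table and column, for illustration only.
SELECT user_id, COUNT(*) AS pv
FROM page_views
GROUP BY user_id;

-- A hand-written imitation of the two stages (sketch):
-- stage 1 aggregates per (key, random salt), stage 2 merges per key.
SELECT user_id, SUM(partial_pv) AS pv
FROM (
  SELECT user_id, salt, COUNT(*) AS partial_pv
  FROM (
    SELECT user_id,
           CAST(FLOOR(RAND() * 10) AS INT) AS salt  -- 10 buckets is an arbitrary choice
    FROM page_views
  ) s
  GROUP BY user_id, salt
) p
GROUP BY user_id;
```

The manual version is only worth the extra complexity when the setting cannot be used; both forms should return the same counts.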
(2) SQL statement adjustment:
1) Choose the table whose join key is most evenly distributed as the driver table. Prune columns and apply filters before the join, so that the amount of data is relatively small by the time the two tables are joined.
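The column pruning and filtering part of this advice might look like the sketch below. The tables users and orders, their columns, and the filter values are all hypothetical. Which side Hive actually treats as the driver depends on the version and the execution plan, so the sketch only shows reducing both inputs before the join.

```sql
SELECT u.user_id, u.city, o.amount
FROM (
  SELECT user_id, city            -- keep only the columns the query needs
  FROM users
  WHERE status = 'active'         -- filter rows before the join
) u
JOIN (
  SELECT user_id, amount
  FROM orders
  WHERE dt = '2024-01-01'         -- hypothetical date column; restrict rows early
) o
  ON u.user_id = o.user_id;
```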
2) Joining a small table to a large table:
Use a map join so that the small dimension table (around 1000 records or fewer) is loaded into memory first. The work that would otherwise be done in the reduce phase is then completed on the map side.
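A minimal sketch of a map join follows, assuming a hypothetical large fact table fact_orders and a small dimension table dim_region. It shows both the automatic conversion settings and the explicit hint. Depending on the Hive version and configuration, the hint may be ignored in favour of automatic conversion.

```sql
-- Let Hive convert joins against a small table into map joins automatically.
SET hive.auto.convert.join=true;
-- Size threshold in bytes below which a table counts as small
-- (25000000 is the documented default; tune for your cluster).
SET hive.mapjoin.smalltable.filesize=25000000;

-- Explicit hint form: ask for dim_region (alias d) to be held in memory.
SELECT /*+ MAPJOIN(d) */ f.order_id, d.region_name
FROM fact_orders f
JOIN dim_region d
  ON f.region_id = d.region_id;
```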
3) Joining a large table to a large table:
Turn each empty (NULL) key into a string plus a random number, so that the skewed rows are spread across different reducers. Because NULL values never match anything in the join, this does not affect the final result.
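One way to write this is sketched below. It assumes hypothetical tables logs and users, a string-typed user_id on both sides, and that no real user_id starts with the made-up prefix hive_null_.

```sql
SELECT a.*, b.user_name
FROM logs a
LEFT OUTER JOIN users b
  ON CASE
       WHEN a.user_id IS NULL
         -- Replace NULL with a random, non-matching key so these rows
         -- are hashed to different reducers instead of a single one.
         THEN CONCAT('hive_null_', CAST(RAND() AS STRING))
       ELSE a.user_id
     END = b.user_id;
```

The outer join keeps the rows with a NULL key in the output, and their user_name stays NULL, as it would without the rewrite.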
4) count distinct with many identical special values:
For count distinct, handle the rows with the empty value separately. If only the count distinct is needed, these rows need no processing: filter them out directly and add 1 to the final result. If other calculations that require a group by are also needed, first process the rows with the empty value separately, then union that result with the other results.
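Both cases can be sketched as follows, assuming a hypothetical table page_views whose skewed special value is the empty string in user_id. The +1 only applies if the special value should count as one distinct value and at least one such row exists. For true NULLs, COUNT(DISTINCT ...) already ignores them and no +1 is needed.

```sql
-- Case 1: only the distinct count is needed.
-- Filter out the special value, then add 1 for it.
SELECT COUNT(DISTINCT user_id) + 1 AS uv
FROM page_views
WHERE user_id <> '';

-- Case 2: other aggregates with GROUP BY are needed as well.
-- Aggregate the special value on its own, then union the results.
SELECT user_id, pv
FROM (
  SELECT user_id, COUNT(*) AS pv
  FROM page_views
  WHERE user_id <> ''
  GROUP BY user_id
  UNION ALL
  SELECT '' AS user_id, COUNT(*) AS pv   -- yields one row (pv = 0 if no such rows exist)
  FROM page_views
  WHERE user_id = ''
) t;
```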