Everyone, is there any way to solve this problem (too many small files produced when writing out from Spark)?
CodePudding user response:
repartition(1)
CodePudding user response:
200 is Spark SQL's default shuffle parallelism. If that parallelism is too high for your data volume and leaves you with too many small files, reduce it by adjusting the parameter spark.sql.shuffle.partitions.
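A minimal sketch of adjusting it, assuming a SparkSession named spark (the value 20 is only an illustration; choose it based on your data volume):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("fewer-shuffle-partitions").getOrCreate()
// Lower the default of 200 so each shuffle (and thus each write)
// produces fewer, larger output files.
spark.conf.set("spark.sql.shuffle.partitions", "20")
// Equivalently at submit time: spark-submit --conf spark.sql.shuffle.partitions=20 ...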
CodePudding user response:
You might as well try coalesce(num) before writing. I tried it, and at least for a small amount of data it works, like this:

specialDaysInMall.toDF("name", "age", "address").coalesce(3)
  .registerTempTable("inMall")
hqx.sql("insert overwrite table t1 select name, age, address from inMall")
This way, when you use SQL to insert into (or overwrite) the Hive table, the result should generate only three parquet files. Worth a try.
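A self-contained sketch of the same idea. Note that registerTempTable has been deprecated since Spark 2.0 in favor of createOrReplaceTempView; the sample rows and the target table t1 are assumptions (t1 must already exist):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("coalesce-before-write")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the specialDaysInMall data set above.
val specialDaysInMall = Seq(("alice", 30, "beijing"), ("bob", 25, "shanghai"))

specialDaysInMall.toDF("name", "age", "address")
  .coalesce(3)                       // cap the plan at 3 partitions -> at most 3 files
  .createOrReplaceTempView("inMall") // newer replacement for registerTempTable
spark.sql("insert overwrite table t1 select name, age, address from inMall")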
CodePudding user response:
If each output file is smaller than the HDFS block size, it is worth adjusting; otherwise there is no need to repartition right before the write. One reason is that repartition adds a shuffle; another is that, if the data set is sorted, the shuffle will disrupt the order.
So if you do need to adjust, the suggestion is to insert repartition(n) one step before the write.
For example, in
rdd.map(xxx).filter(xxx).sortBy(xxx).write(xxx)
put the repartition before the sortBy.
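A runnable sketch of that ordering; the transformations, the target of 3 partitions, and the output path are illustrative assumptions:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("repartition-before-sort"))

sc.parallelize(1 to 100000)
  .map(_ * 2)
  .filter(_ % 3 == 0)
  .repartition(3)   // placed before sortBy, not between sortBy and the write
  .sortBy(identity) // sortBy's own shuffle keeps 3 partitions (its default
                    // numPartitions is the current count), so order is preserved
  .saveAsTextFile("/tmp/repartition_demo") // 3 output files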
CodePudding user response:
Coalesce after the RDD computation is complete.
CodePudding user response:
Adopt coalesce(1).
CodePudding user response:
Parallelism determines how many files are produced. Either find a balance between the number of files and the degree of parallelism, or try adaptive execution.
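A sketch of the adaptive route, assuming Spark 3.x, where adaptive query execution can coalesce small shuffle partitions at runtime:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("aqe-demo").getOrCreate()
// AQE merges small shuffle partitions after each stage, which also cuts
// down the number of small files a subsequent write produces.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")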