Everyone, is there any way to solve this problem (too many small files produced when writing out from Spark)?
CodePudding user response:
repartition(1)
CodePudding user response:
200 is Spark SQL's default shuffle parallelism. If that parallelism is too high for your data volume and leaves you with too many small files, reduce it by adjusting the parameter spark.sql.shuffle.partitions.
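A minimal sketch of adjusting it, assuming a SparkSession named spark (the value 20 is only an illustration; choose it based on your data volume):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("fewer-shuffle-partitions").getOrCreate()
// Lower the default of 200 so each shuffle (and thus each write)
// produces fewer, larger output files.
spark.conf.set("spark.sql.shuffle.partitions", "20")
// Equivalently at submit time: spark-submit --conf spark.sql.shuffle.partitions=20 ...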
CodePudding user response:
You might as well try coalesce(num) before writing. I tried it, and at least for a small amount of data it works, like this:

specialDaysInMall.toDF("name", "age", "address").coalesce(3)
  .registerTempTable("inMall")
hqx.sql("insert overwrite table t1 select name, age, address from inMall")
This way, when you use SQL to insert into (or overwrite) the Hive table, the result should generate only three parquet files. Worth a try.
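A self-contained sketch of the same idea. Note that registerTempTable has been deprecated since Spark 2.0 in favor of createOrReplaceTempView; the sample rows and the target table t1 are assumptions (t1 must already exist):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("coalesce-before-write")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the specialDaysInMall data set above.
val specialDaysInMall = Seq(("alice", 30, "beijing"), ("bob", 25, "shanghai"))

specialDaysInMall.toDF("name", "age", "address")
  .coalesce(3)                       // cap the plan at 3 partitions -> at most 3 files
  .createOrReplaceTempView("inMall") // newer replacement for registerTempTable
spark.sql("insert overwrite table t1 select name, age, address from inMall")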
CodePudding user response:
If each output file is smaller than the HDFS block size, it is worth adjusting; otherwise there is no need to repartition right before the write. One reason is that repartition adds a shuffle; another is that, if the data set is sorted, the shuffle will disrupt the order.
So if you do need to adjust, the suggestion is to insert repartition(n) one step before the write.
For example, in
rdd.map(xxx).filter(xxx).sortBy(xxx).write(xxx)
put the repartition before the sortBy.
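A runnable sketch of that ordering; the transformations, the target of 3 partitions, and the output path are illustrative assumptions:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("repartition-before-sort"))

sc.parallelize(1 to 100000)
  .map(_ * 2)
  .filter(_ % 3 == 0)
  .repartition(3)   // placed before sortBy, not between sortBy and the write
  .sortBy(identity) // sortBy's own shuffle keeps 3 partitions (its default
                    // numPartitions is the current count), so order is preserved
  .saveAsTextFile("/tmp/repartition_demo") // 3 output files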
CodePudding user response:
Coalesce after the RDD computation is complete.
CodePudding user response:
Adopt coalesce(1).
CodePudding user response:
Parallelism determines how many files are produced. Either find a balance between the number of files and the degree of parallelism, or try adaptive execution.
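A sketch of the adaptive route, assuming Spark 3.x, where adaptive query execution can coalesce small shuffle partitions at runtime:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("aqe-demo").getOrCreate()
// AQE merges small shuffle partitions after each stage, which also cuts
// down the number of small files a subsequent write produces.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")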