Questions about when Spark partitions

Time:09-16

Questions about when to partition

Does Spark partition the data during the shuffle from map tasks to reduce tasks, or is it already partitioned when sc.textFile() reads it?

I ran a test: with sc.textFile's default partitioning, printing the contents of each partition showed they were not distributed by a hash algorithm; but after I applied a shuffle operator, printing the contents of each partition showed they followed the hash algorithm.

So I'm confused. If sc.textFile partitions at read time, then suppose there are three blocks and I specify 5 partitions in sc.textFile: the three blocks would be divided into five partitions, which would cost extra memory and network resources (each map task taking part of a block), which seems unreasonable; and then partitioning again after a shuffle operator would feel very slow.

Or does sc.textFile, while reading the HDFS data, divide it evenly across the partitions (for example: three blocks totaling 384 MB, so five partitions of about 76.8 MB each, with each map task reading 76.8 MB)? Then, when a shuffle operator runs, the data is partitioned by the hash algorithm and shuffle files are generated, and each reduce task fetches its partition's values from every node. That would mean five part-0000 files in the end, and 5 * 5 = 25 buckets during the shuffle.
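The 5 * 5 = 25 figure in the question can be sanity-checked with a one-line sketch: in the hash-based shuffle the question describes, every map task writes one bucket file per reduce partition.

```python
# Sketch: in a hash-based shuffle, each map task writes one bucket
# per reduce partition, so the file count is maps * reducers.
def shuffle_buckets(num_map_tasks, num_reduce_tasks):
    return num_map_tasks * num_reduce_tasks

buckets = shuffle_buckets(5, 5)  # 5 maps feeding 5 reducers -> 25
```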

CodePudding user response:

When Spark builds an RDD, the partition count is determined by the data source.
If the data source is an HDFS file, the partitions are determined by the InputSplits of the file's InputFormat; the partition count generally equals the number of blocks in the file being read, which is the same logic MapReduce uses for its MapTasks.
If the data source is an HBase table, one partition corresponds to one Region.
If the data source is Kafka, partitions follow the topic's partitions.
Other sources work similarly.

As for the shuffle, Spark has two default partitioners: one is hash, the other is range. Hash is the one you mentioned. Range uses a reservoir sampling algorithm: it samples the data set to estimate the data distribution, divides it into ranges according to the specified partition count, and each record then falls into the partition whose range it belongs to.
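The two partitioners described above can be sketched in a few lines. This is a minimal illustration, not Spark's actual code: HashPartitioner sends a key to hash(key) % n, while RangePartitioner derives n - 1 boundary values from a sample and sends a key to the range it falls into, so the output partitions are globally ordered.

```python
import bisect

def hash_partition(key, num_partitions):
    # Spark's HashPartitioner additionally forces the result
    # non-negative; Python's % already does that for a positive modulus.
    return hash(key) % num_partitions

def range_bounds(sample, num_partitions):
    # Pick num_partitions - 1 evenly spaced values from the sorted
    # sample as the partition boundaries.
    s = sorted(sample)
    return [s[len(s) * i // num_partitions] for i in range(1, num_partitions)]

def range_partition(key, bounds):
    # Index of the first boundary strictly greater than the key.
    return bisect.bisect_right(bounds, key)

keys = list(range(100))
bounds = range_bounds(keys[::5], 5)  # sample every 5th key
# hash scatters neighbouring keys across partitions;
# range keeps neighbouring keys together in the same partition.
```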

CodePudding user response:

Excuse me: suppose there are three blocks and I specify five partitions in sc.textFile. How does textFile read the file in that case?

CodePudding user response:

Quoting the floor-2 reply from don't use hidden dragon - Martin:
excuse me: suppose there are three blocks and I specify five partitions in sc.textFile. How does textFile read the file in that case?

The partition number you pass to sc.textFile is only a suggested minimum. Beyond that, the actual value is determined by how the file's InputFormat computes its InputSplits.
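For the three-blocks-with-five-partitions case, the split arithmetic can be sketched as below. This is a simplified assumption-level model of Hadoop's FileInputFormat: goalSize = totalSize / minPartitions, splitSize = max(minSize, min(goalSize, blockSize)), and the file is then cut into splitSize-sized chunks, with a 10% "slop" so a tiny tail is merged into the last split rather than becoming its own.

```python
MB = 1024 * 1024

def split_size(total_size, min_partitions, block_size, min_size=1):
    # goalSize is the bytes-per-split the hint asks for; the actual
    # split size is clamped between min_size and the block size.
    goal_size = total_size // min_partitions
    return max(min_size, min(goal_size, block_size))

def count_splits(total_size, size, slop=1.1):
    # Keep cutting full splits while the remainder exceeds
    # slop * size; any remainder then becomes the final split.
    n, remaining = 0, total_size
    while remaining / size > slop:
        remaining -= size
        n += 1
    return n + (1 if remaining > 0 else 0)

total, block = 384 * MB, 128 * MB  # three 128 MB HDFS blocks

# With a small minPartitions hint, goalSize >= blockSize, so splits
# are block-sized and the partition count equals the block count: 3.
default_splits = count_splits(total, split_size(total, 2, block))

# With minPartitions=5, goalSize is about 76.8 MB, smaller than a
# block, so the three blocks are re-cut into 5 smaller splits.
five_splits = count_splits(total, split_size(total, 5, block))
```

Under this model, the answer to the question is that each map task reads roughly 76.8 MB, crossing block boundaries where needed, rather than one map per block.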

CodePudding user response:

Thank you for your reply, but I still want to know what happens if the specified partition count is 5, and how the file is read.

CodePudding user response:

Quoting the floor-4 reply from don't use hidden dragon - Martin:
thank you for your reply, but I still want to know what happens if the specified partition count is 5, and how the file is read.

You can look at the code of the concrete implementation classes of org.apache.hadoop.mapred.InputFormat for each file format.