Performance of pyspark hive when a table has many partition columns


My question boils down to understanding the performance impact of the partitioning scheme when Spark is used to query a Hive table with many partition columns.

To give a simple example, consider two tables: Table 1 has 3 partition columns

- year 
- month 
- day

And data is stored in paths like

year=2021/month=01/day=01/...data...

Table 2 has 1 partition column

- date

where date = concat(year, month, day) and the paths are for example

date=20210101/...data..

Anecdotally I have found that queries on the second type of table are faster, but I don't know why. I'd like to understand this so I know how to design the partitioning of larger tables that could have more partitions.
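The two layouts can be reproduced outside Spark to see how many directory levels a reader has to list. This is a minimal pure-Python sketch; the table names and the January 2021 date range are made up for illustration:

```python
import os
import tempfile
from datetime import date, timedelta

root = tempfile.mkdtemp()

# Build 31 daily partitions for January 2021 under both layouts.
days = [date(2021, 1, 1) + timedelta(n) for n in range(31)]

for d in days:
    # Table 1: three nested partition columns -> three directory levels.
    os.makedirs(os.path.join(
        root, "table1",
        f"year={d:%Y}", f"month={d:%m}", f"day={d:%d}"), exist_ok=True)
    # Table 2: one partition column -> a single directory level.
    os.makedirs(os.path.join(root, "table2", f"date={d:%Y%m%d}"), exist_ok=True)

def count_dirs(path):
    """Count every directory a full recursive listing has to visit."""
    return sum(len(dirnames) for _, dirnames, _ in os.walk(path))

# Table 1 lists 1 year dir + 1 month dir + 31 day dirs = 33 directories;
# Table 2 lists just the 31 date directories.
print(count_dirs(os.path.join(root, "table1")))  # 33
print(count_dirs(os.path.join(root, "table2")))  # 31
```

The gap grows with more years and months, since every extra partition level multiplies the number of directories that must be enumerated.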

CodePudding user response:

Can you give more details, such as which file format and compression are used, and whether pruning is enabled for that format? PySpark with Hive generally performs well with ORC when pruning is enabled. The slowdown you describe usually happens because there are many files inside the final partitions, resulting in a large number of tasks being created. In such cases it is generally advised to check and coalesce the partitions. Tuning the shuffle partition count and shuffle spill settings when reading the data is also advised; the default of 200 shuffle partitions can be increased for better parallelism. Check the below link for options that can be used for optimization. spark configurations
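As a sketch of the tuning this answer mentions, assuming a local session and a hypothetical Hive table name (both are placeholders, not from the question):

```python
from pyspark.sql import SparkSession

# Hypothetical session; the app name and config values are illustrative.
spark = (SparkSession.builder
         .appName("partition-tuning-sketch")
         # Raise the default of 200 shuffle partitions for more parallelism.
         .config("spark.sql.shuffle.partitions", "400")
         .enableHiveSupport()
         .getOrCreate())

df = spark.table("my_db.my_table")  # hypothetical Hive table

# Collapse many small input partitions into fewer, larger ones
# so a flood of tiny tasks is not created downstream.
df = df.coalesce(64)
```

`coalesce` narrows the number of in-memory partitions without a full shuffle, which is why it is the usual suggestion when a table's directory layout produces far more splits than the cluster has cores.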

CodePudding user response:

Aside from all the other factors you did not mention (storage type, configuration, cluster capacity, the number of files in each case), your partitioning scheme does not correspond to the use case.

A partitioning scheme should be chosen based on how the data will be selected, how it will be written, or both. In your case, partitioning by year, month, and day separately is over-partitioning. Partitions in Hive are hierarchical folders, and all of them must be traversed (even if only metadata is used) to determine the data paths; with a single date partition, only one directory level is read. The two additional folder levels (year and month instead of a single date) do not help with partition pruning, because the columns are correlated and are always used together in the WHERE clause.

If most of your queries have one more column in the WHERE clause, say category, which does not correlate with date, and the data is big, then an additional partition on that column makes sense; you will benefit from partition pruning then.
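Partition pruning can be illustrated without a cluster: given the partition directory names, a reader keeps only the paths whose key=value components satisfy the predicate. A pure-Python sketch, using a hypothetical date + category layout:

```python
# Hypothetical directories for a table partitioned by date and category.
partitions = [
    f"date={d}/category={c}"
    for d in ("20210101", "20210102", "20210103")
    for c in ("books", "games")
]

def prune(paths, **predicate):
    """Keep only paths whose key=value components match the predicate."""
    kept = []
    for p in paths:
        kv = dict(part.split("=") for part in p.split("/"))
        if all(kv.get(k) == v for k, v in predicate.items()):
            kept.append(p)
    return kept

# WHERE date = '20210102' AND category = 'books' touches one directory.
print(prune(partitions, date="20210102", category="books"))
# WHERE category = 'books' alone still prunes half of the partitions,
# because category does not correlate with date.
print(len(prune(partitions, category="books")))  # 3 of 6
```

This is why an uncorrelated column earns its own partition level, while splitting a single date into year/month/day does not: the extra levels are only ever filtered together.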
