PySpark/Hive - Partition strategy for monthly job that deals with daily data


I have a monthly Spark job that processes data and saves it into Hive/Impala tables (the file storage format is Parquet). The granularity of the table is daily, but the source data for this job also arrives monthly.

I'm trying to work out how best to partition the table. I'm thinking of partitioning it by a month key, as in the data frame below. When a re-run is needed, for example a re-run of March's job would overwrite rows 3 and 4. Does anyone see any problems with this approach, or have other suggestions? Thanks.

+---+--------+-------------+
| id|date_key|partition_key|
+---+--------+-------------+
|100|20220105|       202201|
|200|20220204|       202202|
|300|20220301|       202203|
|400|20220302|       202203|
+---+--------+-------------+
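If it helps, here is a minimal sketch of that write pattern under a few assumptions (the session setup, the `staging.monthly_source` and `analytics.daily_facts` table names are hypothetical, and the target table is assumed to already exist partitioned by `partition_key`). With `spark.sql.sources.partitionOverwriteMode=dynamic`, an overwrite only replaces the partitions present in the incoming data, so a March re-run rewrites just the `202203` partition:

```python
from pyspark.sql import SparkSession, functions as F

# Hypothetical session setup; the config makes mode("overwrite") replace
# only the partitions present in the incoming DataFrame, not the whole table.
spark = (
    SparkSession.builder
    .appName("monthly-load")
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .enableHiveSupport()
    .getOrCreate()
)

# Assume df already holds one month of daily rows with id and date_key columns.
df = spark.table("staging.monthly_source")  # hypothetical source table

# Derive the month-level partition key from the daily date_key (yyyyMMdd -> yyyyMM).
df = df.withColumn("partition_key", F.substring(F.col("date_key").cast("string"), 1, 6))

# insertInto matches columns by position and expects the partition column last,
# so select them explicitly before writing.
(
    df.select("id", "date_key", "partition_key")
      .write
      .mode("overwrite")                    # only the re-run month's partition is rewritten
      .insertInto("analytics.daily_facts")  # hypothetical target table, partitioned by partition_key
)
```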

CodePudding user response:

Your approach is correct: Hive/Impala with Parquet as the file format is a good fit. As for partitioning, remember that the worst thing you can do is have too many partitions, because their metadata is stored in the Hive Metastore (a relational database), and with a very large number of partitions any metadata query can become heavy.

When choosing a partitioning key, check how much data will land in each partition. You need to choose a granularity that, on the one hand, lets you access data quickly and, on the other hand, does not leave you with a large number of small partitions (i.e. partitions with very little data).
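One quick way to sanity-check this is to count rows per candidate partition value before committing to a key. This is just a sketch, assuming `df` is the monthly DataFrame with the derived `partition_key` column from the question:

```python
from pyspark.sql import functions as F

# Row count per candidate partition value; a rough proxy for partition size.
(
    df.groupBy("partition_key")
      .agg(F.count("*").alias("row_count"))
      .orderBy("partition_key")
      .show(50, truncate=False)
)
```

If a month-level key produces partitions that are either tiny or very large, you can adjust the granularity (e.g. yearly or daily) based on these numbers.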
