I've created a Hive table with a partition column like this:
CREATE TABLE IF NOT EXISTS my_table
(uid INT, num INT) PARTITIONED BY (dt DATE)
Then, in PySpark, I have a DataFrame that I tried to write to the Hive table like this:
df.write.format('hive').mode('append').partitionBy('dt').saveAsTable('my_table')
Running this, I got an exception:
Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict
I then added this config:
hive.exec.dynamic.partition=true
hive.exec.dynamic.partition.mode=nonstrict
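In case it matters, I set these when creating the SparkSession, roughly like this (the app name is just a placeholder):

from pyspark.sql import SparkSession

# the Hive configs need to be in place before the write runs,
# so I set them at session creation
spark = (SparkSession.builder
         .appName('daily-load')
         .config('hive.exec.dynamic.partition', 'true')
         .config('hive.exec.dynamic.partition.mode', 'nonstrict')
         .enableHiveSupport()
         .getOrCreate())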
This time there was no exception, but the table wasn't populated either!
Then I removed the above config and added this:
hive.exec.dynamic.partition=false
I also altered the code to:
df.write.format('hive').mode('append').partitionBy(dt='2022-04-29').saveAsTable('my_table')
This time I got:
Dynamic partition is disabled. Either enable it by setting hive.exec.dynamic.partition=true or specify partition column values
The Spark job I want to run will receive daily data, so I guess what I want is a static partition, but how does that work?
CodePudding user response:
If you haven't predefined all the partitions, you will need to use:
hive.exec.dynamic.partition=true
hive.exec.dynamic.partition.mode=nonstrict
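In PySpark you can apply these per session, e.g. like this (a sketch, assuming a Hive-enabled SparkSession named spark; setting them at session creation works too):

# set the Hive properties for the current session, before writing
spark.sql('SET hive.exec.dynamic.partition=true')
spark.sql('SET hive.exec.dynamic.partition.mode=nonstrict')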
Remember that Hive is schema-on-read, and it won't automagically sort your data into partitions. You need to inform the metastore of the partitions, and you will need to do that manually with one of these two commands:
alter table <db_name>.<table_name> add partition(`date`='<date_value>') location '<hdfs_location_of_the_specific_partition>';
or
MSCK REPAIR TABLE <table_name>;
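With the table from the question, that would look roughly like this from PySpark (the HDFS location is a hypothetical example; point it at wherever the partition's files actually live):

# register one specific partition with the metastore
spark.sql("ALTER TABLE my_table ADD PARTITION (dt='2022-04-29') "
          "LOCATION '/warehouse/my_table/dt=2022-04-29'")
# or let Hive scan the table's directory and pick up all partitions
spark.sql('MSCK REPAIR TABLE my_table')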
CodePudding user response:
If the table is already created, and you are using append mode anyway, you can use insertInto instead of saveAsTable, and you don't even need .partitionBy('dt'):
df.write.format('hive').mode('append').insertInto('my_table')
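One caveat: insertInto matches columns by position, not by name, so the DataFrame's column order has to line up with the table's, with the partition column last. A safe pattern (a sketch, using the columns from the question) is to select them explicitly first:

# insertInto resolves columns positionally: uid, num, then the
# partition column dt last, matching the table definition
df.select('uid', 'num', 'dt').write.format('hive').mode('append').insertInto('my_table')

Note that this still writes the partitions dynamically, so the settings from the first answer should stay enabled.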