I've created a Hive table with a partition column like this:
CREATE TABLE IF NOT EXISTS my_table
(uid INT, num INT) PARTITIONED BY (dt DATE)
Then, in PySpark, I have a DataFrame that I tried to write to the Hive table like this:
df.write.format('hive').mode('append').partitionBy('dt').saveAsTable('my_table')
Running this, I got an exception:
Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict
I then added this config:
hive.exec.dynamic.partition=true
hive.exec.dynamic.partition.mode=nonstrict
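In case it matters, I set these when creating the SparkSession, roughly like this (the app name is just a placeholder):

from pyspark.sql import SparkSession

# the Hive configs need to be in place before the write runs,
# so I set them at session creation
spark = (SparkSession.builder
         .appName('daily-load')
         .config('hive.exec.dynamic.partition', 'true')
         .config('hive.exec.dynamic.partition.mode', 'nonstrict')
         .enableHiveSupport()
         .getOrCreate())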
This time there was no exception, but the table wasn't populated either!
Then I removed the above config and added this:
hive.exec.dynamic.partition=false
I also altered the code to:
df.write.format('hive').mode('append').partitionBy(dt='2022-04-29').saveAsTable('my_table')
This time I got:
Dynamic partition is disabled. Either enable it by setting hive.exec.dynamic.partition=true or specify partition column values
The Spark job I want to run will receive daily data, so I guess what I want is a static partition, but how does that work?
CodePudding user response:
If you haven't predefined all the partitions, you will need to use:
hive.exec.dynamic.partition=true
hive.exec.dynamic.partition.mode=nonstrict
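In PySpark you can apply these per session, e.g. like this (a sketch, assuming a Hive-enabled SparkSession named spark; setting them at session creation works too):

# set the Hive properties for the current session, before writing
spark.sql('SET hive.exec.dynamic.partition=true')
spark.sql('SET hive.exec.dynamic.partition.mode=nonstrict')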
Remember that Hive is schema-on-read, and it won't automagically sort your data into partitions. You need to inform the metastore of the partitions, and you will need to do that manually with one of these two commands:
alter table <db_name>.<table_name> add partition(`date`='<date_value>') location '<hdfs_location_of_the_specific_partition>';
or
MSCK REPAIR TABLE <table_name>;
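With the table from the question, that would look roughly like this from PySpark (the HDFS location is a hypothetical example; point it at wherever the partition's files actually live):

# register one specific partition with the metastore
spark.sql("ALTER TABLE my_table ADD PARTITION (dt='2022-04-29') "
          "LOCATION '/warehouse/my_table/dt=2022-04-29'")
# or let Hive scan the table's directory and pick up all partitions
spark.sql('MSCK REPAIR TABLE my_table')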
CodePudding user response:
If the table is already created, and you are using append mode anyway, you can use insertInto instead of saveAsTable, and you don't even need .partitionBy('dt'):
df.write.format('hive').mode('append').insertInto('my_table')
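One caveat: insertInto matches columns by position, not by name, so the DataFrame's column order has to line up with the table's, with the partition column last. A safe pattern (a sketch, using the columns from the question) is to select them explicitly first:

# insertInto resolves columns positionally: uid, num, then the
# partition column dt last, matching the table definition
df.select('uid', 'num', 'dt').write.format('hive').mode('append').insertInto('my_table')

Note that this still writes the partitions dynamically, so the settings from the first answer should stay enabled.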