I'm trying to write a PySpark dataframe out as Avro files to the HDFS path /my/path/, partitioned by the column 'partition', so that under /my/path/ there should be the following subdirectory structure:
partition=20230101
partition=20230102
....
Under these subdirectories there should be the Avro files. I'm trying to use
df1.select("partition","name","id").write.partitionBy("partition").format("avro").save("/my/path/")
It succeeded the first time, but when I tried to write another df with a new partition, it failed with the error: path /my/path/ already exists. How should I achieve this? Many thanks for your help. The df format is as below:
partition  name  id
20230101   aa    10   ---this row is the content in the first df
20230102   bb    20   ---this row is the content in the second df
CodePudding user response:
You should change the SaveMode. The default save mode is ErrorIfExists, which is why you are getting that error. Change it to append mode:
df1.select("partition","name","id") \
.write.mode("append").format("avro")\
.partitionBy("partition").save("/my/path/")