Write multiple Avro files from pyspark to the same directory


I'm trying to write a PySpark DataFrame out as Avro files to the path /my/path/ on HDFS, partitioned by the column 'partition', so that under /my/path/ there should be the following subdirectory structure:

partition=20230101
partition=20230102
....

Under these subdirectories there should be the Avro files. I'm trying to use

df1.select("partition","name","id").write.partitionBy("partition").format("avro").save("/my/path/")

It succeeded the first time, but when I tried to write another df with a new partition, it failed with the error: path /my/path/ already exists. How should I achieve this? Many thanks for your help. The df format is as below:

partition   name   id
20230101    aa     10    --- this row is the content of the first df
20230102    bb     20    --- this row is the content of the second df

CodePudding user response:

You should change the SaveMode. By default the save mode is ErrorIfExists, which is why you are getting the error. Change it to append mode.

df1.select("partition","name","id") \
  .write.mode("append").format("avro")\
  .partitionBy("partition").save("/my/path/")
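For illustration, here is a minimal sketch of the full flow, assuming a local SparkSession, the spark-avro package on the classpath, and sample rows matching the df format shown in the question. With append mode, the second write simply adds a new partition=... subdirectory under /my/path/ instead of failing:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("avro-partitioned-write").getOrCreate()

# Hypothetical sample data mirroring the two DataFrames from the question
df1 = spark.createDataFrame([("20230101", "aa", 10)], ["partition", "name", "id"])
df2 = spark.createDataFrame([("20230102", "bb", 20)], ["partition", "name", "id"])

# Each write appends its own partition directory under the same base path
for df in (df1, df2):
    df.select("partition", "name", "id") \
      .write.mode("append") \
      .format("avro") \
      .partitionBy("partition") \
      .save("/my/path/")

# Expected layout on HDFS after both writes:
# /my/path/partition=20230101/part-*.avro
# /my/path/partition=20230102/part-*.avro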