Overwrite into the same partition files after transformation based on the filename using Spark


Hi, I have files in an S3 bucket: MyBucket/object/file1.csv, file2.csv, file3.csv.

I have loaded this data into a single DataFrame and need to do some transformation based on the columns, then write the transformed column values back out. I want to overwrite the files in place as the same file1.csv, file2.csv, file3.csv, but when I use overwrite mode, Spark creates another file in the same folder and loads the values there.

How can I write a function or code to do this using Python and Spark, or Scala?

CodePudding user response:

Well, I'm not sure if my answer is the best, but I hope it is.

When writing output to a file, Spark respects the Hadoop config mapreduce.output.basename, which controls the base name of the output files. The default produces names like part-00000. You can adjust this config, but you can't make the output match your file name convention exactly, so you have to write the file and then rename it. The procedure is simple:

  1. Write the file to a path.
  2. Rename the output file to the original name (you may need to delete the old file first, then rename).
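Here is a minimal PySpark sketch of that write-then-rename procedure, using the Hadoop FileSystem API for the rename. The bucket, paths, and the toy DataFrame are placeholders; adapt them to your layout, and note that on S3 a "rename" is actually a copy followed by a delete.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder DataFrame standing in for your transformed data.
df = spark.createDataFrame([("a", 1), ("b", 2)], ["col1", "col2"])

# 1. Write to a temporary directory as a single part file.
tmp_path = "s3a://MyBucket/object/_tmp_file1"
df.coalesce(1).write.mode("overwrite").csv(tmp_path, header=True)

# 2. Rename the part-* file to the original name via the Hadoop FileSystem API.
hadoop = spark._jvm.org.apache.hadoop.fs
conf = spark.sparkContext._jsc.hadoopConfiguration()
fs = hadoop.FileSystem.get(hadoop.Path(tmp_path).toUri(), conf)

part_file = [f.getPath() for f in fs.listStatus(hadoop.Path(tmp_path))
             if f.getPath().getName().startswith("part-")][0]
target = hadoop.Path("s3a://MyBucket/object/file1.csv")

fs.delete(target, False)                 # remove the old file if it exists
fs.rename(part_file, target)             # move the part file to the original name
fs.delete(hadoop.Path(tmp_path), True)   # clean up the temporary directory

Repeat the same write-and-rename loop per output file if you need all three of file1.csv, file2.csv, and file3.csv replaced.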

CodePudding user response:

Whenever you save a file in Spark, it creates a directory, and the part files are created inside it.

You can limit the output from many part files to one using coalesce(1), but you can't prevent the directory creation.

# coalesce(1) collapses the output into a single partition, so only one part file is written
df2.coalesce(1).write.mode("overwrite").csv("/dir/dir2/Sample2.csv")

It will create a directory named Sample2.csv containing a single part file.
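One consequence worth noting: because Spark treats that directory as the dataset, you can read it back by pointing at the directory path. A small illustrative sketch (the header option is an assumption about the CSV layout):

# Spark reads the whole Sample2.csv directory transparently,
# so downstream jobs can treat it as if it were a single file.
df3 = spark.read.csv("/dir/dir2/Sample2.csv", header=True)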

I hope it cleared your doubt.
