My requirement is simple: I need to write my Spark DataFrame to S3 as a single CSV file with a specified name. Right now I am using .coalesce(1), which puts all the data into a single CSV, but it still creates a folder with some additional files, and the main CSV file's name is some generated id. [I'm using Java/Scala]
dataFrame.coalesce(1).write.mode(SaveMode.Overwrite).option("header", "true").csv("s3a://<mypath>")
This is how the data is being saved:
CodePudding user response:
I think you can just collect the records and save them through the driver; since you are coalescing to 1 partition, you need to transfer all records to one node in any case.
But before you collect to the driver, I think it's better to convert your DataFrame to a Dataset.
Just do something like:
dataframe.as[TypeToDefine].collect()
Then you get an Array of TypeToDefine, and you can write your CSV using any popular Java/Scala CSV library you like, with the name you want.
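For example, here is a minimal sketch, assuming a hypothetical case class Record that matches your schema and plain java.io for the write. It produces a local file; uploading it to S3 would be a separate step, and a real CSV library should handle quoting/escaping for you.
import java.io.PrintWriter

import spark.implicits._   // Dataset encoders

// Hypothetical case class matching your DataFrame's schema
case class Record(id: Long, name: String, amount: Double)

val records: Array[Record] = dataframe.as[Record].collect()

// Naive CSV write to a local file with the exact name you want;
// no quoting/escaping here, so prefer a CSV library for real data.
val writer = new PrintWriter("my-output.csv")
try {
  writer.println("id,name,amount")
  records.foreach(r => writer.println(s"${r.id},${r.name},${r.amount}"))
} finally {
  writer.close()
}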
CodePudding user response:
I think this is what you're asking.
import org.apache.hadoop.fs.Path

val s3Path: String = ???      // full S3 path & file name
val textToWrite: String = ??? // collect your dataframe and convert to a single String

val path = new Path(s3Path)
// Resolve the FileSystem for the path's scheme (s3a://) from Spark's Hadoop configuration
val fs = path.getFileSystem(spark.sparkContext.hadoopConfiguration)
// Create the file (overwrite = true) and write the CSV content as a single object
val out = fs.create(path, true)
out.write(textToWrite.getBytes("UTF-8"))
out.close()
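For the textToWrite part, one possible sketch for building the String on the driver, assuming the whole result fits in driver memory and that naive comma-joining (no quoting or escaping of fields) is acceptable:
val header = dataframe.columns.mkString(",")
val rows = dataframe.collect().map(_.toSeq.mkString(","))
val textToWrite = (header +: rows).mkString("\n")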