Extension of compressed parquet file in Spark

Time:12-28

In my Spark job, I write a compressed parquet file like this:

df
  .repartition(numberOutputFiles)
  .write
  .option("compression", "gzip")
  .mode(saveMode)
  .parquet(avroPath)

Then my files have this extension: file_name.gz.parquet

How can I get ".parquet.gz" instead?

CodePudding user response:

I don't believe you can. The file extension is hardcoded in ParquetWrite.scala as the concatenation of the codec's extension and ".parquet", in that order:

    override def getFileExtension(context: TaskAttemptContext): String = {
      CodecConfig.from(context).getCodec.getExtension + ".parquet"
    }

So, unless you want to change the source and compile your own Spark version, or open a JIRA request against Spark... ;))
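If the extension order really matters downstream, one pragmatic workaround is to rename the output files after the Spark job finishes. A minimal sketch, assuming a local filesystem output directory (the function name `renameGzParquet` is made up for illustration; for HDFS or S3 you would use the Hadoop FileSystem API instead of java.nio):

```scala
import java.nio.file.{Files, Paths}
import scala.jdk.CollectionConverters._

// Hypothetical post-write fixup: Spark offers no option to reorder the
// extension, so we rename part files from "*.gz.parquet" to "*.parquet.gz"
// after the write completes. Local paths only; adapt for HDFS/S3.
def renameGzParquet(outputDir: String): Unit = {
  val dir = Paths.get(outputDir)
  Files.list(dir).iterator().asScala
    .filter(_.getFileName.toString.endsWith(".gz.parquet"))
    .toList // materialize before renaming, so we don't mutate the dir mid-stream
    .foreach { p =>
      val newName =
        p.getFileName.toString.stripSuffix(".gz.parquet") + ".parquet.gz"
      Files.move(p, p.resolveSibling(newName))
    }
}
```

Note that tools reading the directory afterwards (including Spark itself) key codec detection off the ".gz" position in the name, so renamed files may no longer be auto-detected as gzip-compressed parquet.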
