Increase the number of partitions without repartition on Hadoop

I have a directory containing a bunch of deflate-compressed CSV files that are each around 500 MB. I would like to split them into smaller deflate-compressed CSV files. For example, I have three 500 MB files and I would like them to become fifteen 100 MB files after the write. I am currently doing something like this:

spark.read.csv("/input/path")
  .repartition(15)
  .write.option("compression", "deflate").csv("output/path")

But this causes a whole unnecessary shuffle. Is there a way to get it to write 15 files without going through all this trouble?

CodePudding user response:

In short, no. There is an open feature request for this. Spark creates partitions internally, and increasing their number is done via a shuffle. Because deflate is not a splittable codec, Spark reads each compressed file as a single partition, so repartition (and the shuffle it triggers) is how you end up with more output files. If you really want to split these files without shuffling, use some code that isn't Spark to do it, but it's really not worth the time.
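
If you did go that route, something along these lines could do the splitting outside of Spark. This is only a rough sketch, assuming each file is a single raw zlib/DEFLATE stream (as Hadoop's DeflateCodec writes) and small enough to decompress in memory; the paths, chunk size, and output naming are placeholders, and CSV headers are not handled.

import glob
import os
import zlib

CHUNK_BYTES = 100 * 1024 * 1024  # rough uncompressed target size per output file

def write_chunk(lines, out_dir, src_path, idx):
    # Re-compress one chunk of lines under a numbered variant of the source name.
    name = os.path.basename(src_path).replace(".deflate", "-%03d.deflate" % idx)
    with open(os.path.join(out_dir, name), "wb") as out:
        out.write(zlib.compress(b"".join(lines)))

def split_deflate_csv(src_path, out_dir):
    # Decompress the whole file, then cut it on line boundaries so no CSV
    # record is split across output files.
    with open(src_path, "rb") as f:
        data = zlib.decompress(f.read())
    part, size, idx = [], 0, 0
    for line in data.splitlines(keepends=True):
        part.append(line)
        size += len(line)
        if size >= CHUNK_BYTES:
            write_chunk(part, out_dir, src_path, idx)
            part, size, idx = [], 0, idx + 1
    if part:
        write_chunk(part, out_dir, src_path, idx)

for path in glob.glob("/input/path/*.deflate"):
    split_deflate_csv(path, "/output/path")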

Off topic but still valuable feedback: @OneCricketeer is correct, you should consider a different, more performant file format (Parquet/ORC). It will vastly improve performance as the size of the data increases and should be your first thought when it comes to file formats.
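
For reference, a PySpark sketch of the same job writing Parquet instead of deflate CSV could look like this; the paths, partition count, and header option are placeholder assumptions, not part of the original question.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Same read as in the question, but written out as Parquet (columnar,
# splittable, snappy-compressed by default), which parallelizes far better
# on later reads than deflate-compressed CSV.
(spark.read.csv("/input/path", header=True)
    .repartition(15)
    .write.mode("overwrite")
    .parquet("/output/path"))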
