Home > OS >  Sort data within one output directory created by partitionBy
Sort data within one output directory created by partitionBy

Time:09-27

I have a big geospatial dataset partitionBy quadkey's level 5. In each qk5 level directory, there are about 1-50 Gb of data, so it doesn't fit into one file. I want to benefit from pushdown filters when do my geospatial queries. So I want that files within one qk5 partition be sorted by higher qk resolution (let's say quadkey level 10). Question: Is there are a way to sort data within partitionBy batch? For example:

qk5=00001/
    part1.parquet
    part2.parquet
    part3.parquet
    part4.parquet
...

qk5=33333/
    part10000.parquet
    part20000.parquet
    part30000.parquet
    part40000.parquet

I want to have data from part1.parquet, part2.parquet, part3.parquet, part4.parquet to be sorted by column 'qk10'.

Here is the current code, but it only provides sorting within one particular partition (e.g. part1.parquet):

// Parquet save
preExportRdd.toDF
  .repartition(partitionsNumber, $"salt")
  .sortWithinPartitions($"qk10")
  .drop("salt")
  .write
  .partitionBy("qk")
  .format("parquet")
  .option("compression", "gzip")
  .mode(SaveMode.Append)
  .save(exportUrl)

CodePudding user response:

The problem is that you don't sort your Dataframe globally by qk field and it causes for the same qk values to be distributed in different spark partitions. During the write phase, due to partitionBy("qk"), the output written to a specific physical partition (folder) may arrive from different spark partitions, which causes your output data to be unsorted.

Try instead the following:

preExportRdd.toDF
  .repartitionByRange(partitionsNumber, $"qk", $"qk10", $"salt")
  .sortWithinPartitions($"qk10")
  .drop("salt")
  .write
  .partitionBy("qk")
  .format("parquet")
  .option("compression", "gzip")
  .mode(SaveMode.Append)
  .save(exportUrl)

The repartitionByRange will sort your Dataframe by the provided columns and split the sorted Dataframe to the desired number of partitions.

  • Related