Determine written object paths with PySpark 3.2.1 and Hadoop 3.3.2


When writing DataFrames to S3 using the s3a connector, there seems to be no official way of determining the object paths that were written in the process. What I am trying to achieve is simply to determine which objects have been written to S3 (using PySpark 3.2.1 with Hadoop 3.3.2 and the directory committer).

The reasons this might be useful:

  • partitionBy might add a dynamic number of new paths
  • Spark creates its own "part-..." Parquet files with cryptic names, and their number depends on how the data is partitioned when writing (see the sketch after this list)
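
To make the problem concrete, here is a minimal sketch of the kind of write in question; the bucket, prefix, and column names are made-up placeholders:

    # Illustrative only: a partitioned write whose output layout is not known up front.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partitioned-write").getOrCreate()

    df = spark.createDataFrame(
        [("2022-03-01", "de", 1), ("2022-03-01", "fr", 2), ("2022-03-02", "de", 3)],
        ["day", "country", "value"],
    )

    # partitionBy creates one directory per distinct (day, country) combination,
    # and each task writes its own part-<task>-<uuid> Parquet file, so the full
    # set of object keys is only known after the job commits.
    (df.write
        .mode("append")
        .partitionBy("day", "country")
        .parquet("s3a://my-bucket/output/events/"))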

With PySpark 3.1.2 and Hadoop 3.2.0 it used to be possible to use the not officially supported _SUCCESS file, written at the path above the first partition level on S3, which contained the paths of all written files. Now, however, the number of paths seems to be limited to 100, so this is no longer an option.
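
For context, this is roughly how that manifest used to be consumed; it assumes the S3A committers' JSON _SUCCESS format with its "filenames" list, and the bucket and key below are placeholders:

    # Sketch: read the JSON _SUCCESS manifest written by the S3A committers.
    # The classic FileOutputCommitter writes a zero-byte _SUCCESS marker instead,
    # so the content check below matters; bucket and key are placeholders.
    import json
    import boto3

    s3 = boto3.client("s3")
    body = s3.get_object(Bucket="my-bucket", Key="output/events/_SUCCESS")["Body"].read()

    if body:
        manifest = json.loads(body)
        # "filenames" holds the written paths; on Hadoop 3.3.x it is capped at 100 entries
        for path in manifest.get("filenames", []):
            print(path)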

Is there really no official, reasonable way of achieving this task?

CodePudding user response:

Now, however, the number of paths seems to be limited to 100, so this is no longer an option.

We had to cut that in HADOOP-16570... it was one of the scale problems which surfaced during terasorting at 10-100 TB: the time to write the _SUCCESS file started to slow down job commit times. It was only ever intended for testing. Sorry.

It is just a constant in the source tree. If you were to provide a patch to make it configurable, I'd be happy to review and merge it, provided you follow the "say which AWS endpoint you ran all the tests against, or we ignore your patch" policy.

I don't know where else this information is collected. The Spark driver is told the number of files and their total size in each task commit, but it isn't given the list of paths by the tasks, AFAIK.
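
One workaround, sketched below, is to list the destination prefix before and after the write and diff the two listings. It goes through the Hadoop FileSystem API via PySpark's private _jvm/_jsc handles, the output path is a placeholder, and it is racy if other jobs write to the same prefix concurrently:

    # Sketch of a fallback: diff listings of the destination taken before and
    # after the write. Uses the JVM Hadoop FileSystem API through py4j.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    def list_objects(path_str):
        """Return the set of file paths currently under path_str."""
        jvm = sc._jvm
        conf = sc._jsc.hadoopConfiguration()
        path = jvm.org.apache.hadoop.fs.Path(path_str)
        fs = path.getFileSystem(conf)
        found = set()
        if fs.exists(path):
            it = fs.listFiles(path, True)  # recursive listing
            while it.hasNext():
                found.add(it.next().getPath().toString())
        return found

    out = "s3a://my-bucket/output/events/"
    before = list_objects(out)
    # ... df.write.partitionBy(...).parquet(out) goes here ...
    after = list_objects(out)
    new_objects = after - before  # objects created by this write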

Spark creates its own "part-..." Parquet files with cryptic names, and their number depends on how the data is partitioned when writing

The part-0001- bit of the filename comes from the task ID; the bit after that is a UUID created to ensure every filename is unique (see SPARK-8406, "Adding UUID to output file name to avoid accidental overwriting"). You can probably turn that off.
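
For illustration, the default naming can be pulled apart with a regex; the exact pattern below is an assumption based on the usual part-<taskId>-<uuid>-c<N>.<codec>.parquet layout and may not match every data source or codec:

    # Illustrative regex for the default Parquet output file names,
    # e.g. part-00001-<uuid>-c000.snappy.parquet; treat the pattern as a guess.
    import re

    PART_FILE = re.compile(
        r"part-(?P<task>\d+)-(?P<uuid>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})"
        r"(?:-c(?P<count>\d+))?(?:\.(?P<codec>\w+))?\.parquet$"
    )

    m = PART_FILE.search("part-00001-6cbf7d29-5d1a-4a8e-9d0c-1b2c3d4e5f60-c000.snappy.parquet")
    if m:
        print(m.group("task"), m.group("uuid"))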
