I came across an article mentioning the Spark S3 Magic Committer.
Could someone explain what Spark S3 committers are and how the magic committer differs from the others? When should I use one rather than another?
CodePudding user response:
It's complex. Key point: you can't use rename to safely and rapidly commit the output of multiple task attempts to the aggregate job output, because on S3 a rename is a slow, non-atomic copy-then-delete.
Instead, special "S3 committers" use S3's multipart upload API: each task uploads all of its data but withholds the final POST that would materialize the files; at job commit, those pending POSTs are completed.
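To make that lifecycle concrete, here is a minimal sketch of a multipart upload using the AWS SDK for Java v1 from Scala. The bucket, key, and payload are hypothetical; a real committer persists the upload ID and part ETags in a manifest between task commit and job commit rather than completing everything in one process, as this sketch does:

```scala
import java.io.ByteArrayInputStream
import java.util.Collections
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.{
  CompleteMultipartUploadRequest, InitiateMultipartUploadRequest, UploadPartRequest}

object MultipartCommitSketch {
  def main(args: Array[String]): Unit = {
    val s3     = AmazonS3ClientBuilder.defaultClient()
    val bucket = "my-bucket"             // hypothetical
    val key    = "dest/dir/file.parquet" // hypothetical

    // Task attempt: start the upload and stream the data as parts.
    // Nothing is visible under the destination path yet.
    val uploadId = s3.initiateMultipartUpload(
      new InitiateMultipartUploadRequest(bucket, key)).getUploadId

    val bytes = "task output bytes".getBytes("UTF-8")
    val etag = s3.uploadPart(new UploadPartRequest()
      .withBucketName(bucket)
      .withKey(key)
      .withUploadId(uploadId)
      .withPartNumber(1)
      .withInputStream(new ByteArrayInputStream(bytes))
      .withPartSize(bytes.length.toLong)).getPartETag

    // A committer would stop here and save (uploadId, etag) for later.
    // Job commit: the deferred final POST that materializes the object.
    s3.completeMultipartUpload(new CompleteMultipartUploadRequest(
      bucket, key, uploadId, Collections.singletonList(etag)))
  }
}
```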
- S3A staging committer: tasks stage all output on the local filesystem before starting the uploads. Needs HDFS or another "real" filesystem to pass manifest files from the tasks to the job committer.
- S3A magic committer: uses/abuses "magic" path rewriting to work out the final destination of a file. The application creates a file at `s3a://bucket/dest/__magic/task1/__base/dir/file.parquet` and the final file goes to `s3a://bucket/dest/dir/file.parquet`. Does not need HDFS, but does more S3 IO. (A config sketch for enabling it follows this list.)
- EMR Spark committer: EMR's copy of the staging committer. Use this on EMR.
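For reference, a minimal sketch of switching a Spark session onto the magic committer. This assumes Hadoop 3.3.1+ binaries and the spark-hadoop-cloud module on the classpath; the property names are my reading of the Hadoop S3A committer docs:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("magic-committer-demo")
  // route s3a:// output through the S3A committer factory
  .config("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a",
          "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory")
  // pick the magic committer ("directory"/"partitioned" select staging variants)
  .config("spark.hadoop.fs.s3a.committer.name", "magic")
  .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
  // have Spark SQL / Parquet bind to the path-output commit protocol
  .config("spark.sql.sources.commitProtocolClass",
          "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
  .config("spark.sql.parquet.output.committer.class",
          "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
  .getOrCreate()
```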
Between staging and magic? I'd use Hadoop 3.3.1 binaries for either: staging if HDFS is present, magic if not.
I'd also look at Apache Iceberg as a long-term alternative: it commits by atomically swapping a metadata pointer rather than moving files, so it sidesteps the rename problem entirely.
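As a rough illustration of the Iceberg route, a sketch of writing a table through its Spark integration; the catalog name, warehouse path, and table identifier are hypothetical, and the iceberg-spark-runtime jar is assumed on the classpath:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("iceberg-demo")
  .config("spark.sql.extensions",
          "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
  // register an Iceberg catalog named "demo" backed by an S3 warehouse
  .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.demo.type", "hadoop")
  .config("spark.sql.catalog.demo.warehouse", "s3a://bucket/warehouse")
  .getOrCreate()

// Iceberg's commit is a metadata-pointer swap, not a file rename,
// so writing to S3 needs no special committer.
spark.range(10).writeTo("demo.db.numbers").createOrReplace()
```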
Everything you always wanted to know: https://github.com/steveloughran/zero-rename-committer/releases/tag/tag_release_2021-05-17