I came across an article mentioning the Spark S3 Magic Committer.
Could someone explain what Spark S3 committers are and how the magic committer differs from the others? When should I use one rather than another?
CodePudding user response:
It's complex. Key point: you can't use rename to safely and rapidly commit the output of multiple task attempts to the aggregate job output, because on S3 a rename is a slow, non-atomic copy-then-delete.
Instead, special "S3 committers" use S3's multipart upload API: each task uploads all of its data but withholds the final POST that would materialize the files; at job commit, those pending POSTs are completed.
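To make that lifecycle concrete, here is a minimal sketch of a multipart upload using the AWS SDK for Java v1 from Scala. The bucket, key, and payload are hypothetical; a real committer persists the upload ID and part ETags in a manifest between task commit and job commit rather than completing everything in one process, as this sketch does:

```scala
import java.io.ByteArrayInputStream
import java.util.Collections
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.{
  CompleteMultipartUploadRequest, InitiateMultipartUploadRequest, UploadPartRequest}

object MultipartCommitSketch {
  def main(args: Array[String]): Unit = {
    val s3     = AmazonS3ClientBuilder.defaultClient()
    val bucket = "my-bucket"             // hypothetical
    val key    = "dest/dir/file.parquet" // hypothetical

    // Task attempt: start the upload and stream the data as parts.
    // Nothing is visible under the destination path yet.
    val uploadId = s3.initiateMultipartUpload(
      new InitiateMultipartUploadRequest(bucket, key)).getUploadId

    val bytes = "task output bytes".getBytes("UTF-8")
    val etag = s3.uploadPart(new UploadPartRequest()
      .withBucketName(bucket)
      .withKey(key)
      .withUploadId(uploadId)
      .withPartNumber(1)
      .withInputStream(new ByteArrayInputStream(bytes))
      .withPartSize(bytes.length.toLong)).getPartETag

    // A committer would stop here and save (uploadId, etag) for later.
    // Job commit: the deferred final POST that materializes the object.
    s3.completeMultipartUpload(new CompleteMultipartUploadRequest(
      bucket, key, uploadId, Collections.singletonList(etag)))
  }
}
```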
- S3A staging committer: tasks stage all output on the local filesystem before starting the uploads. Needs HDFS or another "real" filesystem to pass manifest files from the tasks to the job committer.
- S3A magic committer: uses/abuses "magic" path rewriting to work out the final destination of a file. The application creates a file at `s3a://bucket/dest/__magic/task1/__base/dir/file.parquet` and the final file goes to `s3a://bucket/dest/dir/file.parquet`. Does not need HDFS, but does more S3 IO. (A config sketch for enabling it follows this list.)
- EMR Spark committer: EMR's copy of the staging committer. Use this on EMR.
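For reference, a minimal sketch of switching a Spark session onto the magic committer. This assumes Hadoop 3.3.1+ binaries and the spark-hadoop-cloud module on the classpath; the property names are my reading of the Hadoop S3A committer docs:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("magic-committer-demo")
  // route s3a:// output through the S3A committer factory
  .config("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a",
          "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory")
  // pick the magic committer ("directory"/"partitioned" select staging variants)
  .config("spark.hadoop.fs.s3a.committer.name", "magic")
  .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
  // have Spark SQL / Parquet bind to the path-output commit protocol
  .config("spark.sql.sources.commitProtocolClass",
          "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
  .config("spark.sql.parquet.output.committer.class",
          "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
  .getOrCreate()
```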
Between staging and magic? I'd use Hadoop 3.3.1 binaries for either: staging if HDFS is present, magic if not.
I'd also look at Apache Iceberg as a long-term alternative: it commits by atomically swapping a metadata pointer rather than moving files, so it sidesteps the rename problem entirely.
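As a rough illustration of the Iceberg route, a sketch of writing a table through its Spark integration; the catalog name, warehouse path, and table identifier are hypothetical, and the iceberg-spark-runtime jar is assumed on the classpath:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("iceberg-demo")
  .config("spark.sql.extensions",
          "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
  // register an Iceberg catalog named "demo" backed by an S3 warehouse
  .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.demo.type", "hadoop")
  .config("spark.sql.catalog.demo.warehouse", "s3a://bucket/warehouse")
  .getOrCreate()

// Iceberg's commit is a metadata-pointer swap, not a file rename,
// so writing to S3 needs no special committer.
spark.range(10).writeTo("demo.db.numbers").createOrReplace()
```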
Everything you always wanted to know: https://github.com/steveloughran/zero-rename-committer/releases/tag/tag_release_2021-05-17