EMRFS S3-optimized committer when using RDD and Datasets


I want to use the EMRFS S3-optimized committer. I set "spark.sql.parquet.fs.optimized.committer.optimization-enabled" to true when running a new step in Spark on EMR, but I don't think the optimized committer is actually being used (_SUCCESS is 0 bytes). How does EMR choose which committer to use? Can it use the optimized committer for Datasets and the unoptimized one for RDDs? I have both in the same Spark run.

CodePudding user response:

The EMRFS S3-optimized committer is built into EMR and used by default, but it is only activated when its requirements are met. Until EMR 6.4.0 it applied only under certain conditions: "Starting with Amazon EMR 6.4.0, this committer can be used for all common formats including parquet, ORC, and text-based formats (including CSV and JSON). For release versions prior to Amazon EMR 6.4.0, only the Parquet format is supported" - from the AWS docs.
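As a minimal sketch, an eligible write on releases before EMR 6.4.0 looks like this: a Dataset/DataFrame written as Parquet through EMRFS (the s3:// scheme). The bucket and paths are placeholders, and setting the flag explicitly is only needed if it was disabled elsewhere (it defaults to true on EMR):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("emrfs-optimized-committer-demo")
  // This is the flag from the question; on EMR it is already true by default.
  .config("spark.sql.parquet.fs.optimized.committer.optimization-enabled", "true")
  .getOrCreate()

import spark.implicits._

val ds = Seq(("a", 1), ("b", 2)).toDS()

// Dataset written as Parquet over EMRFS (s3://, not s3a://) - the write path
// the optimized committer supports on releases before EMR 6.4.0.
ds.write.mode("overwrite").parquet("s3://my-bucket/demo/parquet-out/")
```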

In my case, this gave a 50-60 percent improvement in execution time.

See the optimized committer's requirements in the AWS documentation; a quick way to inspect the relevant settings is shown below.
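A small sketch for checking, at runtime, two settings that interact with those requirements (assumes `spark` is an existing SparkSession; whether dynamic partition overwrite disables the committer depends on the EMR release version):

```scala
// Prints the effective values; the second default mirrors Spark's own default ("static").
println(spark.conf.get("spark.sql.parquet.fs.optimized.committer.optimization-enabled", "unset"))
println(spark.conf.get("spark.sql.sources.partitionOverwriteMode", "static"))
```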

_SUCCESS files will still be 0 bytes with the optimized committer, so that alone doesn't tell you which committer ran. There are other OutputCommitters that do write content into _SUCCESS, such as the S3A magic committer, but they are not a good fit for EMR and AWS advises against using them.
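As for the second part of the question: as far as I know the committer choice is made per write, so in a single application a Dataset Parquet write can go through the optimized committer while a plain RDD save falls back to the default Hadoop commit path. A sketch (paths are placeholders, `spark` is an existing SparkSession):

```scala
import spark.implicits._

val ds = Seq(("a", 1), ("b", 2)).toDS()

// Dataset written as Parquet over EMRFS: eligible for the optimized committer
// (the only eligible format before EMR 6.4.0).
ds.write.mode("overwrite").parquet("s3://my-bucket/out/parquet/")

// Plain RDD save in the same run: goes through the classic Hadoop commit path instead.
spark.sparkContext
  .parallelize(Seq("a,1", "b,2"))
  .saveAsTextFile("s3://my-bucket/out/text/")
```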
