Is it possible to use a custom Hadoop version with EMR?


As of today (2022-06-28), the latest AWS EMR release is 6.6.0, which bundles Hadoop 3.2.1.

I need to use a different Hadoop version (3.2.2). I tried the following approach, but it doesn't work: run_job_flow accepts either a ReleaseLabel or explicit application versions, not both.

import boto3

client = boto3.client("emr", region_name="us-west-1")

# Fails: when ReleaseLabel is set, application versions cannot be overridden
response = client.run_job_flow(
    ReleaseLabel="emr-6.6.0",
    Applications=[{"Name": "Hadoop", "Version": "3.2.2"}],
)

Another approach that does not seem to be an option either is loading a specific hadoop-aws jar when building the SparkSession, like so:

from pyspark.sql import SparkSession

# Pulls the hadoop-aws 3.2.2 client jar onto the classpath, but does not
# replace the Hadoop installation on the cluster itself
spark = SparkSession \
        .builder \
        .config('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.2.2') \
        .config('spark.hadoop.fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem') \
        .getOrCreate()

Is it even possible to run an EMR cluster with a different Hadoop version? If so, how does one go about doing that?

CodePudding user response:

I'm afraid not. AWS doesn't want the support headache of allowing unsupported Hadoop versions, so EMR releases always lag a little behind upstream, presumably because each new release is tested for compatibility with the rest of the Hadoop stack. See the release notes: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-660-release.html
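In other words, the Hadoop version is pinned by the release label, not chosen independently. A minimal sketch of that relationship (the emr-6.6.0 entry is from this thread; any other entries would need to be filled in from the EMR Release Guide):

```python
# Hadoop version is fixed by the EMR release label; it cannot be overridden.
# Mapping taken from the EMR Release Guide (extend as needed).
EMR_HADOOP_VERSIONS = {
    "emr-6.6.0": "3.2.1",
}

def hadoop_version_for(release_label):
    """Return the Hadoop version bundled with a given EMR release label."""
    try:
        return EMR_HADOOP_VERSIONS[release_label]
    except KeyError:
        raise ValueError(f"unknown release label: {release_label}")
```

So asking for Hadoop 3.2.2 on emr-6.6.0 is simply not an expressible combination.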

You'd have to build your own cluster from scratch in EC2.

CodePudding user response:

You just need to add the script spark-patch-s3a-fix_emr-6.6.0.sh as a bootstrap action when you spin up your cluster. Amazon provided this fix only for EMR 6.6.0.
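For completeness, a bootstrap action is attached at cluster creation time. A minimal sketch of the run_job_flow request with boto3, under the assumption that you have staged the script in your own bucket (the S3 path, instance types, and role names below are illustrative placeholders, not values from this thread):

```python
# Placeholder: upload the patch script to your own bucket first.
SCRIPT_S3_PATH = "s3://my-bucket/bootstrap/spark-patch-s3a-fix_emr-6.6.0.sh"

def build_run_job_flow_request(script_s3_path):
    """Assemble keyword arguments for boto3's emr run_job_flow call,
    attaching the patch script as a bootstrap action."""
    return {
        "Name": "emr-6.6.0-with-s3a-fix",
        "ReleaseLabel": "emr-6.6.0",
        "Applications": [{"Name": "Hadoop"}, {"Name": "Spark"}],
        "BootstrapActions": [
            {
                "Name": "spark-patch-s3a-fix",
                "ScriptBootstrapAction": {"Path": script_s3_path},
            }
        ],
        # Illustrative settings; adjust to your account.
        "Instances": {
            "MasterInstanceType": "m5.xlarge",
            "InstanceCount": 1,
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

request = build_run_job_flow_request(SCRIPT_S3_PATH)
# To launch for real:
# client = boto3.client("emr", region_name="us-west-1")
# client.run_job_flow(**request)
```

The script runs on every node before applications start, which is why it can patch the installed Spark/Hadoop files.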
