Is it possible to use a custom Hadoop version with EMR?


As of today (2022-06-28), the latest AWS EMR release is 6.6.0, which bundles Hadoop 3.2.1.

I need to use a different Hadoop version (3.2.2). I tried the following approach, but it doesn't work: run_job_flow accepts either a ReleaseLabel or explicit application versions, not both.

import boto3

client = boto3.client("emr", region_name="us-west-1")

# Fails: when ReleaseLabel is set, application versions cannot be overridden
response = client.run_job_flow(
    ReleaseLabel="emr-6.6.0",
    Applications=[{"Name": "Hadoop", "Version": "3.2.2"}],
)

Another approach that does not seem to be an option either is loading a specific hadoop-aws jar when building the SparkSession, like so:

from pyspark.sql import SparkSession

# Pulls the hadoop-aws 3.2.2 client jar onto the classpath, but does not
# replace the Hadoop installation on the cluster itself
spark = SparkSession \
        .builder \
        .config('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.2.2') \
        .config('spark.hadoop.fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem') \
        .getOrCreate()

Is it even possible to run an EMR cluster with a different Hadoop version? If so, how does one go about doing that?

CodePudding user response:

I'm afraid not. AWS doesn't want the support headache of allowing unsupported Hadoop versions, so EMR releases always lag a little behind upstream, presumably because each new release is tested for compatibility with the rest of the Hadoop stack. See the release notes: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-660-release.html
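In other words, the Hadoop version is pinned by the release label, not chosen independently. A minimal sketch of that relationship (the emr-6.6.0 entry is from this thread; any other entries would need to be filled in from the EMR Release Guide):

```python
# Hadoop version is fixed by the EMR release label; it cannot be overridden.
# Mapping taken from the EMR Release Guide (extend as needed).
EMR_HADOOP_VERSIONS = {
    "emr-6.6.0": "3.2.1",
}

def hadoop_version_for(release_label):
    """Return the Hadoop version bundled with a given EMR release label."""
    try:
        return EMR_HADOOP_VERSIONS[release_label]
    except KeyError:
        raise ValueError(f"unknown release label: {release_label}")
```

So asking for Hadoop 3.2.2 on emr-6.6.0 is simply not an expressible combination.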

You'd have to build your own cluster from scratch in EC2.

CodePudding user response:

You just need to add the script spark-patch-s3a-fix_emr-6.6.0.sh as a bootstrap action when you spin up your cluster. Amazon provided this fix only for EMR 6.6.0.
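For completeness, a bootstrap action is attached at cluster creation time. A minimal sketch of the run_job_flow request with boto3, under the assumption that you have staged the script in your own bucket (the S3 path, instance types, and role names below are illustrative placeholders, not values from this thread):

```python
# Placeholder: upload the patch script to your own bucket first.
SCRIPT_S3_PATH = "s3://my-bucket/bootstrap/spark-patch-s3a-fix_emr-6.6.0.sh"

def build_run_job_flow_request(script_s3_path):
    """Assemble keyword arguments for boto3's emr run_job_flow call,
    attaching the patch script as a bootstrap action."""
    return {
        "Name": "emr-6.6.0-with-s3a-fix",
        "ReleaseLabel": "emr-6.6.0",
        "Applications": [{"Name": "Hadoop"}, {"Name": "Spark"}],
        "BootstrapActions": [
            {
                "Name": "spark-patch-s3a-fix",
                "ScriptBootstrapAction": {"Path": script_s3_path},
            }
        ],
        # Illustrative settings; adjust to your account.
        "Instances": {
            "MasterInstanceType": "m5.xlarge",
            "InstanceCount": 1,
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

request = build_run_job_flow_request(SCRIPT_S3_PATH)
# To launch for real:
# client = boto3.client("emr", region_name="us-west-1")
# client.run_job_flow(**request)
```

The script runs on every node before applications start, which is why it can patch the installed Spark/Hadoop files.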
