I have an EMR serverless app that cannot connect to S3 bucket in another region. Is there a workaround for that? Maybe a parameter to set in Job parameters or Spark parameters when submitting a new job. The error is this:
ExitCode: 1. Last few exceptions: Caused by: java.net.SocketTimeoutException: connect timed out Caused by: com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.conn.ConnectTimeoutException
CodePudding user response:
In order to connect to an S3 bucket in another region or access external services, the EMR Serverless application needs to be created with a VPC.
This is mentioned on the considerations page:
Without VPC connectivity, a job can access some AWS service endpoints in the same AWS Region. These services include Amazon S3, AWS Glue, Amazon DynamoDB, Amazon CloudWatch, AWS KMS, and AWS Secrets Manager.
Here's an example AWS CLI command to create an application in a VPC - you need to provide a list of Subnet IDs and Security Group IDs. More details can be found in configuring VPC access.
aws emr-serverless create-application \
--type SPARK \
--name etl-jobs \
--release-label "emr-6.6.0" \
--network-configuration '{
"subnetIds": ["subnet-01234567890abcdef","subnet-01234567890abcded"],
"securityGroupIds": ["sg-01234566889aabbcc"]
}'