Disable Multipart Upload to S3 On Spark


I'm trying to write to a bucket whose access is granted anonymously (a bucket policy allows our VPC). For a small workload it works fine, but for a big one I get the following exception:

22/02/08 19:22:56 INFO AWSCredentialProviderList:V3: Using credentials from AnonymousAWSCredentialsProvider
22/02/08 19:22:56 WARN ApacheUtils: NoSuchMethodException was thrown when disabling normalizeUri. This indicates you are using an old version (< 4.5.8) of Apache http client. It is recommended to use http client version >= 4.5.9 to avoid the breaking change introduced in apache client 4.5.7 and the latency in exception handling. See https://github.com/aws/aws-sdk-java/issues/1919 for more information
22/02/08 19:22:56 ERROR DatabricksS3LoggingUtils$:V3: S3 request failed with com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied; request: GET https://prd-ifood-data-lake-transient-groceries.bucket.vpce-08b663a29475cd9f4-wcex1383.s3.us-east-1.vpce.amazonaws.com  {key=[null]} Hadoop 2.7.4, aws-sdk-java/1.11.655 Linux/5.4.0-1063-azure OpenJDK_64-Bit_Server_VM/25.302-b08 java/1.8.0_302 scala/2.12.10 vendor/Azul_Systems,_Inc. com.amazonaws.services.s3.model.GetBucketLocationRequest; Request ID: 8DWH9HBP7NFTS93R, Extended Request ID: 6nX89K2Vewdfyw77iX5oi84LgEW2 ZhS4FAuCrq7u3bTXf73w2Y3kmu9xX2TLdXwlFKvWL9BxGI=, Cloud Provider: Azure, Instance ID: 18ab9b06b3644cf484af692aedccf14b (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: 8DWH9HBP7NFTS93R; S3 Extended Request ID: 6nX89K2Vewdfyw77iX5oi84LgEW2 ZhS4FAuCrq7u3bTXf73w2Y3kmu9xX2TLdXwlFKvWL9BxGI=), S3 Extended Request ID: 6nX89K2Vewdfyw77iX5oi84LgEW2 ZhS4FAuCrq7u3bTXf73w2Y3kmu9xX2TLdXwlFKvWL9BxGI=; Request ID: null, Extended Request ID: null, Cloud Provider: Azure, Instance ID: 18ab9b06b3644cf484af692aedccf14b
com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied; request: GET https://prd-ifood-data-lake-transient-groceries.bucket.vpce-08b663a29475cd9f4-wcex1383.s3.us-east-1.vpce.amazonaws.com  {key=[null]} Hadoop 2.7.4, aws-sdk-java/1.11.655 Linux/5.4.0-1063-azure OpenJDK_64-Bit_Server_VM/25.302-b08 java/1.8.0_302 scala/2.12.10 vendor/Azul_Systems,_Inc. com.amazonaws.services.s3.model.GetBucketLocationRequest; Request ID: 8DWH9HBP7NFTS93R, Extended Request ID: 6nX89K2Vewdfyw77iX5oi84LgEW2 ZhS4FAuCrq7u3bTXf73w2Y3kmu9xX2TLdXwlFKvWL9BxGI=, Cloud Provider: Azure, Instance ID: 18ab9b06b3644cf484af692aedccf14b (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: 8DWH9HBP7NFTS93R; S3 Extended Request ID: 6nX89K2Vewdfyw77iX5oi84LgEW2 ZhS4FAuCrq7u3bTXf73w2Y3kmu9xX2TLdXwlFKvWL9BxGI=), S3 Extended Request ID: 6nX89K2Vewdfyw77iX5oi84LgEW2 ZhS4FAuCrq7u3bTXf73w2Y3kmu9xX2TLdXwlFKvWL9BxGI=
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1712)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1367)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1113)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:770)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:744)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:726)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:686)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:668)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:532)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:512)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4926)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4872)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4866)
    at com.amazonaws.services.s3.AmazonS3Client.getBucketLocation(AmazonS3Client.java:1000)
    at com.amazonaws.services.s3.AmazonS3Client.getBucketLocation(AmazonS3Client.java:1006)
    at shaded.databricks.org.apache.hadoop.fs.s3a.EnforcingDatabricksS3Client.getBucketLocation(EnforcingDatabricksS3Client.scala:192)
    at shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$verifyBucketExists$1(S3AFileSystem.java:673)
    at shaded.databricks.org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:109)
    at shaded.databricks.org.apache.hadoop.fs.s3a.Invoker.lambda$retry$3(Invoker.java:265)
    at shaded.databricks.org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:322)
    at shaded.databricks.org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:261)
    at shaded.databricks.org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:236)
    at shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem.verifyBucketExists(S3AFileSystem.java:670)
    at shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:511)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at org.apache.spark.sql.execution.streaming.StreamExecution.<init>(StreamExecution.scala:104)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution.<init>(MicroBatchExecution.scala:57)
    at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:342)
    at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:386)
    at org.apache.spark.sql.streaming.DataStreamWriter.startQuery(DataStreamWriter.scala:498)
    at org.apache.spark.sql.streaming.DataStreamWriter.startInternal(DataStreamWriter.scala:446)
    at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:306)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
    at py4j.Gateway.invoke(Gateway.java:295)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:251)
    at java.lang.Thread.run(Thread.java:748)

Since the problem seemed to be with multipart upload, I tried to disable it. I've already tried the settings below (a sketch of one way to apply them follows the list):

  • set spark.hadoop.fs.s3.multipart.uploads.enabled to false
  • set spark.hadoop.fs.s3a.multipart.uploads.enabled to false
  • set spark.hadoop.fs.s3n.multipart.uploads.enabled to false
  • set spark.hadoop.fs.s3.multipart.threshold to a very large value
  • set spark.hadoop.fs.s3a.multipart.threshold to a very large value
  • set spark.hadoop.fs.s3n.multipart.threshold to a very large value
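
For reference, here is a minimal sketch of one way to apply these kinds of settings from a notebook at runtime. It assumes an existing SparkSession named spark, uses PySpark's internal _jsc handle, and simply echoes the keys from the list above with the spark.hadoop. prefix stripped (that prefix only marks a key as a Hadoop option); settings applied this way may have no effect if the S3A filesystem for the bucket has already been initialized, which is why cluster-startup configuration is the usual route on Databricks.

    # Sketch only: push the multipart-related options from the question into the
    # active session's Hadoop configuration.
    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
    hadoop_conf.set("fs.s3a.multipart.uploads.enabled", "false")
    hadoop_conf.set("fs.s3a.multipart.threshold", "2147483647")  # "a very large value"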

All of these were set at cluster startup, and none of them worked; I keep getting the same error. It is also worth noting that:

  • It is an Azure Databricks instance.
  • I'm using PySpark.
  • There's a security restriction on creating users (that is, access via accessKey/secretKey is not an option), hence the anonymous access; the relevant S3A setting is sketched after this list.
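
For context, the AnonymousAWSCredentialsProvider line in the log suggests anonymous access is selected via the S3A credentials-provider option. The sketch below shows the stock Hadoop form of that setting; whether the Databricks shaded client expects exactly this key and class name is an assumption, not something the question states.

    # Assumed stock-Hadoop configuration for anonymous S3A access; on Databricks
    # the same pair would normally be set in the cluster Spark config as
    # spark.hadoop.fs.s3a.aws.credentials.provider.
    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
    hadoop_conf.set(
        "fs.s3a.aws.credentials.provider",
        "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider",
    )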

Has anyone had a similar issue and successfully disabled multipart upload? Cheers!

CodePudding user response:

As a performance optimization for very small updates, by default the commit service sometimes pushes small updates directly from the control plane to S3. To disable this optimization, set the Spark parameter spark.hadoop.fs.s3a.databricks.s3commit.directPutFileSizeThreshold to 0. You can apply this setting in the cluster’s Spark config or set it in a global init script.
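
For reference, a minimal sketch of what that looks like as a single line in the cluster's Spark config (the key and value come straight from the answer above; check the linked docs for the exact placement):

    spark.hadoop.fs.s3a.databricks.s3commit.directPutFileSizeThreshold 0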

Link: docs.databricks.com

CodePudding user response:

The stack trace shows that the failure is in the AWS SDK call getBucketLocation. This call is used in signing and is checked when the filesystem is created; you are not going to get past it unless you can disable that check or make it available to anonymous users.

This has nothing to do with multipart uploads.

Work out what permissions are needed to ask an S3 bucket for its location, then grant them to anonymous users.
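
For what it's worth, the IAM action behind that call is s3:GetBucketLocation. A quick way to check whether anonymous callers are allowed to make it, outside Spark, is sketched below with boto3; the bucket name is taken from the log, and the test is only meaningful if it runs over the same VPC endpoint / network path.

    # Sketch: issue the same GetBucketLocation call the stack trace fails on,
    # but unsigned (anonymous) and outside of Spark.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    print(s3.get_bucket_location(Bucket="prd-ifood-data-lake-transient-groceries"))

If this is denied, the bucket policy (or the VPC endpoint policy) needs to allow s3:GetBucketLocation for the anonymous principal. On recent stock Hadoop S3A builds, setting fs.s3a.bucket.probe to 0 skips this startup bucket check, but whether the Databricks shaded client honours that option is not confirmed here.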
