What could cause AWS S3 MultiObjectDeleteException?


In our Spring Boot app, we use AmazonS3Client.deleteObjects() to delete multiple objects in a bucket. From time to time the request throws MultiObjectDeleteException and one or more objects won't be deleted. It isn't frequent, roughly 5 failures among thousands of requests, but it could still be a problem. What could lead to this exception?

I also have no idea how to debug this. Our application log follows the data flow but doesn't show much useful information; the exception simply appears after the request. Please help.

Another thing is that the exception comes back with a 200 status code. How is that possible?

com.amazonaws.services.s3.model.MultiObjectDeleteException: One or more objects could not be deleted (Service: null; Status Code: 200; Error Code: null; Request ID: xxxx; S3 Extended Request ID: yyyy; Proxy: null)

CodePudding user response:

TL;DR: A certain error rate is normal and the application should handle it. 500 and 503 errors are retriable. The MultiObjectDeleteException itself should provide a clue: getDeletedObjects() gives you the list of objects that were deleted, and the rest you should mostly retry later.

The MultiObjectDeleteException documentation says the exception should include an explanation of the issue that caused the error:

https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/model/MultiObjectDeleteException.html

Exception for partial or total failure of the multi-object delete API, including the errors that occurred. For successfully deleted objects, refer to getDeletedObjects().
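
A minimal sketch of how that can be used, assuming the SDK v1 classes from the question (the bucket name and keys are illustrative): catch the exception, treat getDeletedObjects() as the successes, and read the per-key error code and message from getErrors().

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.DeleteObjectsRequest;
    import com.amazonaws.services.s3.model.DeleteObjectsResult;
    import com.amazonaws.services.s3.model.MultiObjectDeleteException;

    public class DeleteWithDiagnostics {
        public static void main(String[] args) {
            AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

            DeleteObjectsRequest request = new DeleteObjectsRequest("my-bucket") // illustrative bucket
                    .withKeys("reports/a.csv", "reports/b.csv");                 // illustrative keys

            try {
                DeleteObjectsResult result = s3.deleteObjects(request);
                System.out.println("Deleted " + result.getDeletedObjects().size() + " objects");
            } catch (MultiObjectDeleteException e) {
                // Partial failure: the HTTP call returned 200, but some keys were not deleted.
                System.out.println("Deleted: " + e.getDeletedObjects().size());
                for (MultiObjectDeleteException.DeleteError error : e.getErrors()) {
                    // Each entry carries the key plus the S3 error code and message for that object.
                    System.out.printf("Failed: %s (%s: %s)%n",
                            error.getKey(), error.getCode(), error.getMessage());
                }
            }
        }
    }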

According to https://aws.amazon.com/s3/sla/, AWS does not guarantee 100% availability. Quoting that document:

• “Error Rate” means: (i) the total number of internal server errors returned by the Amazon S3 Service as error status “InternalError” or “ServiceUnavailable” divided by (ii) the total number of requests for the applicable request type during that 5-minute interval. We will calculate the Error Rate for each Amazon S3 Service account as a percentage for each 5-minute interval in the monthly billing cycle. The calculation of the number of internal server errors will not include errors that arise directly or indirectly as a result of any of the Amazon S3 SLA Exclusions.

We usually think about an SLA in terms of downtime, so it is easy to assume AWS means the same thing here. That is not the case: a certain number of errors is normal and should be expected. In many documents AWS suggests implementing a combination of slowdowns and retries, e.g. https://docs.aws.amazon.com/AmazonS3/latest/userguide/ErrorBestPractices.html

Some 500 and 503 errors are, again, part of normal operation: https://aws.amazon.com/premiumsupport/knowledge-center/http-5xx-errors-s3/ That document specifically says:

Because Amazon S3 is a distributed service, a very small percentage of 5xx errors is expected during normal use of the service. All requests that return 5xx errors from Amazon S3 can be retried. This means that it's a best practice to have a fault-tolerance mechanism or to implement retry logic for any applications making requests to Amazon S3. By doing so, S3 can recover from these errors.
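
Putting those two points together, here is a rough retry sketch, not an official recommendation: the attempt count and the linear back-off delay are arbitrary choices, and deleteWithRetry is a hypothetical helper name. It retries only the keys that S3 reported as failed.

    import java.util.List;
    import java.util.stream.Collectors;

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.model.DeleteObjectsRequest;
    import com.amazonaws.services.s3.model.MultiObjectDeleteException;

    public class RetryingDelete {

        // Deletes the given keys, retrying only the ones S3 reported as failed,
        // with a growing pause between attempts (a crude slow-down).
        static void deleteWithRetry(AmazonS3 s3, String bucket, List<String> keys, int maxAttempts)
                throws InterruptedException {
            List<String> remaining = keys;
            for (int attempt = 1; attempt <= maxAttempts && !remaining.isEmpty(); attempt++) {
                try {
                    s3.deleteObjects(new DeleteObjectsRequest(bucket)
                            .withKeys(remaining.toArray(new String[0])));
                    return; // everything was deleted
                } catch (MultiObjectDeleteException e) {
                    // Keep only the keys that failed and try them again after a pause.
                    remaining = e.getErrors().stream()
                            .map(MultiObjectDeleteException.DeleteError::getKey)
                            .collect(Collectors.toList());
                    Thread.sleep(1000L * attempt);
                }
            }
            if (!remaining.isEmpty()) {
                throw new IllegalStateException("Could not delete after retries: " + remaining);
            }
        }
    }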

Edit: A follow-up question was added later: "How is it possible that the API call returned status code 200 while some objects were not deleted?"

The answer to that is very simple: this is how the API is defined. From the SDK reference page for deleteObjects you can go directly to the AWS API documentation page, https://docs.aws.amazon.com/AmazonS3/latest/API/API_DeleteObjects.html, which describes exactly this behavior. Status code 200 means that the high-level call succeeded and was able to request the deletion of the listed objects. Some of those individual deletions may still fail, but the API call reports them in the response body.
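
For illustration, a partial failure inside a 200 response has roughly the following shape; the keys and the error code here are made up, while the element names follow the DeleteObjects API documentation.

    <?xml version="1.0" encoding="UTF-8"?>
    <DeleteResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
      <Deleted>
        <Key>reports/a.csv</Key>
      </Deleted>
      <Error>
        <Key>reports/b.csv</Key>
        <Code>InternalError</Code>
        <Message>We encountered an internal error. Please try again.</Message>
      </Error>
    </DeleteResult>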

Why does the Java SDK throw an exception then? The authors of the AWS Java SDK translated that response into Java idioms, and they apparently decided that, while the AWS API treats a non-zero error rate as part of the service agreement, Java developers are more used to anything less than 100% success ending in an exception.

Both abstractions are well documented, and it is the programmer who is responsible for implementing them precisely. The engineering rule is cheap, fast, reliable: choose two. AWS managed to provide a service with all three, with the reasonable concession that part of the reliability is implemented on the client side, through retries and slow-downs.
