Duplicate AWS S3 files differing only by a carriage return at the end of the Object URL

Time:12-06

I have an S3 bucket with nearly duplicate files.

If I list them with the AWS CLI, I get what look like identical file paths; only the sizes differ, by a few bytes:

2021-09-23 16:36:36     134626 Original/53866358.xml
2021-09-23 16:36:36     134675 Original/53866358.xml

If I look at the individual object pages, both appear to have the same key.

The only difference is that one has %0D (the URL-encoded ASCII carriage return) at the end of its Object URL. Presumably, this is the larger file. My question is: how can I get a unique reference to each of these using the AWS S3 CLI? I'd like to delete the ones with the carriage return at the end.

CodePudding user response:

This is an interesting problem. To lay the groundwork for how my solution helps, I recreated the issue with a simple Python script:

import boto3
s3 = boto3.client('s3')
s3.put_object(Bucket='example-bucket', Key='temp/key', Body=b'normal key')
s3.put_object(Bucket='example-bucket', Key='temp/key\r', Body=b'this is not the normal key')

From there, you can see the issue as you describe:

$ aws s3 ls s3://example-bucket/temp/
2021-12-03 20:14:45         10 key
2021-12-03 20:14:45         26 key

You can list the objects in more detail using the lower-level s3api commands of the CLI, which show the carriage return as an escaped \r in the key (some fields have been removed from the output here):

$ aws s3api list-objects --bucket example-bucket --prefix temp/
{
    "Contents": [
        {
            "Key": "temp/key",
            "Size": 10
        },
        {
            "Key": "temp/key\r",
            "Size": 26
        }
    ]
}
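The trailing carriage return is invisible in the plain listing, but it appears percent-encoded as %0D in a URL. As a small illustration using only the Python standard library (the key name is the one from the example above), the encoding can be reproduced like this:

```python
from urllib.parse import quote, unquote

normal_key = "temp/key"
cr_key = "temp/key\r"  # same name plus a trailing carriage return

# quote() leaves "/" alone by default and percent-encodes control characters.
print(quote(normal_key))  # temp/key
print(quote(cr_key))      # temp/key%0D

# repr() is a quick way to make the hidden character visible in a terminal.
print(repr(cr_key))       # 'temp/key\r'

# Decoding the URL form gives back the exact key, carriage return included.
assert unquote("temp/key%0D") == cr_key
```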

To remove the object with the CR in the key name, a script would be easiest, but you can delete it with the CLI, just with a somewhat awkward syntax:

## If you're using Unix or Mac
$ aws s3api delete-object --cli-input-json '{"Bucket": "example-bucket", "Key": "temp/key\r"}'

## If you're using Windows:
C:\> aws s3api delete-object --cli-input-json "{""Bucket"": ""example-bucket"", ""Key"": ""temp/key\r""}"

Note the syntax required to quote the JSON object and, on Windows, to escape the inner double quotes by doubling them.
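For the script route mentioned above, here is a minimal sketch assuming boto3 and a client created with boto3.client('s3'). The helper names find_cr_keys and delete_cr_keys are hypothetical, chosen for illustration; the bucket and prefix are the ones from the example. It defaults to a dry run so you can inspect the matching keys before anything is deleted:

```python
def find_cr_keys(s3, bucket, prefix=""):
    """Return every key under `prefix` that ends with a carriage return.

    `s3` is expected to be a boto3 S3 client (or anything with the same
    get_paginator interface).
    """
    paginator = s3.get_paginator("list_objects_v2")
    cr_keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when nothing matches the prefix.
        for obj in page.get("Contents", []):
            if obj["Key"].endswith("\r"):
                cr_keys.append(obj["Key"])
    return cr_keys


def delete_cr_keys(s3, bucket, prefix="", dry_run=True):
    """Delete keys ending in a carriage return; return the keys that matched.

    With dry_run=True (the default) nothing is deleted.
    """
    cr_keys = find_cr_keys(s3, bucket, prefix)
    if not dry_run:
        for key in cr_keys:
            s3.delete_object(Bucket=bucket, Key=key)
    return cr_keys


# Example usage (requires boto3 and AWS credentials):
#   import boto3
#   s3 = boto3.client("s3")
#   print([repr(k) for k in delete_cr_keys(s3, "example-bucket", "temp/")])
#   delete_cr_keys(s3, "example-bucket", "temp/", dry_run=False)
```

Printing the keys with repr() makes the trailing \r visible, which is worth checking before running with dry_run=False.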

From there, it's simple to verify this worked as expected:

$ aws s3 ls s3://example-bucket/temp/
2021-12-03 20:14:45         10 key

$ aws s3 cp s3://example-bucket/temp/key final_check.txt
download: s3://example-bucket/temp/key to ./final_check.txt

$ cat final_check.txt
normal key