How can I get ONLY files from S3 with python aioboto3 or boto3?

Time:10-16

I have this code and I want only the paths that end in a file, without intermediate empty folders. For example:

data/folder1/folder2
data/folder1/folder3/folder4/file1.txt
data/folder5/file2.txt

From those paths I only want:

data/folder1/folder3/folder4/file1.txt
data/folder5/file2.txt

I am using this code, but it also gives me paths that end in directories:

    subfolders = set()
    current_path = None

    result = await self.s3_client.list_objects(Bucket=bucket, Prefix=prefix)
    objects = result.get("Contents")

    try:
        for obj in objects:
            current_path = os.path.dirname(obj["Key"])
            if current_path not in subfolders:
                subfolders.add(current_path)
    except Exception as exc:
        print(f"Getting objects with prefix: {prefix} failed")
        raise exc

CodePudding user response:

Can't you check whether the key has an extension? By the way, you don't need to check whether the path is already in the set, since a set only keeps unique items.

list_objects does not return any indicator of whether an item is a folder or a file, so this looks like the practical way.

Please check: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.list_objects

    files = set()

    result = await self.s3_client.list_objects(Bucket=bucket, Prefix=prefix)
    # "Contents" is missing when nothing matches the prefix.
    objects = result.get("Contents", [])

    try:
        for obj in objects:
            key = obj["Key"]
            # Check the last path component, not the directory part,
            # and keep the full key so the result is a file path.
            if "." in os.path.basename(key):
                files.add(key)
    except Exception:
        print(f"Getting objects with prefix: {prefix} failed")
        raise
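
Two caveats with the snippet above: an extension check skips files that have no extension, and a single list_objects call returns at most 1000 keys. A rough sketch of an alternative follows. It assumes that "folder" entries show up as placeholder keys ending in "/" (which is how the S3 console creates them), and it pages through results with list_objects_v2 and its continuation token. The function name list_file_keys is made up for illustration, and s3_client is assumed to be an aioboto3 S3 client or anything else with an awaitable list_objects_v2.

```python
from typing import Any, List


async def list_file_keys(s3_client: Any, bucket: str, prefix: str = "") -> List[str]:
    """Collect every key under `prefix` that does not end with "/".

    Assumption: folder placeholders are objects whose key ends with "/".
    """
    keys: List[str] = []
    kwargs = {"Bucket": bucket, "Prefix": prefix}
    while True:
        page = await s3_client.list_objects_v2(**kwargs)
        # "Contents" is absent when a page has no matching objects.
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if not key.endswith("/"):
                keys.append(key)
        # Stop once S3 reports there are no further pages.
        if not page.get("IsTruncated"):
            return keys
        kwargs["ContinuationToken"] = page["NextContinuationToken"]
```

Inside the original method this would be called as something like `files = await list_file_keys(self.s3_client, bucket, prefix)`.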

CodePudding user response:

I would recommend using the boto3 Bucket resource here, because it simplifies pagination.

Here is an example of how to get a list of all files in an S3 bucket:

    import boto3

    bucket = boto3.resource("s3").Bucket("mybucket")
    objects = bucket.objects.all()

    files = [obj.key for obj in objects if not obj.key.endswith("/")]
    print("Files:", files)

It's worth noting that getting a list of all folders and subfolders in an S3 bucket is a more difficult problem to solve, mainly because folders don't typically exist in S3. They exist only logically, implied by objects with a hierarchical key such as dogs/small/corgi.png. For ideas, see retrieving subfolder names in S3 bucket.
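
To illustrate how folders are only implied by keys, here is a small sketch that derives the set of logical folders from a list of object keys. The helper name implied_folders is hypothetical, and it works purely on strings, so it is one possible approach rather than anything boto3 provides.

```python
from typing import Iterable, Set


def implied_folders(keys: Iterable[str]) -> Set[str]:
    """Return every logical folder prefix implied by the given keys."""
    folders: Set[str] = set()
    for key in keys:
        # Drop the last component: the file name, or "" for a key ending in "/".
        parts = key.split("/")[:-1]
        # Every leading run of components is a folder, e.g. "dogs/" and "dogs/small/".
        for depth in range(1, len(parts) + 1):
            folders.add("/".join(parts[:depth]) + "/")
    return folders


print(sorted(implied_folders(["dogs/small/corgi.png", "cats/tabby.png"])))
# → ['cats/', 'dogs/', 'dogs/small/']
```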
