list S3 objects till only first level

Time:04-02

I am trying to list S3 objects like this:

for key in s3_client.list_objects(Bucket='bucketname')['Contents']:
    logger.debug(key['Key'])

I just want to print the folder names or file names that are present on the first layer.

For example, if my bucket has this:

bucketname
    folder1
    folder2
        text1.txt
        text2.txt
    catalog.json

I only want to print folder1, folder2 and catalog.json. I don't want to include text1.txt etc.

However, my current solution also prints the names of the files inside the folders in my bucket.

How can I modify this? I saw that there's a 'Prefix' parameter, but I'm not sure how to use it.

CodePudding user response:

You can split the keys on "/" and only keep the first level:

level1 = set()  # using a set removes duplicates automatically
for key in s3_client.list_objects(Bucket='bucketname')['Contents']:
    level1.add(key["Key"].split("/")[0])  # keep only the first level of the key

# then print your level1 set
logger.debug(level1)

/!\ Warnings

  1. The list_objects method has been revised, and the AWS S3 documentation recommends using list_objects_v2 instead.
  2. A single call returns at most 1,000 keys. To make sure you get all the keys, pass the NextContinuationToken from each response as the ContinuationToken of the next call:

level1 = set()
continuation_token = ""
while continuation_token is not None:
    extra_params = {"ContinuationToken": continuation_token} if continuation_token else {}
    response = s3_client.list_objects_v2(Bucket="bucketname", Prefix="", **extra_params)
    continuation_token = response.get("NextContinuationToken")
    for obj in response.get("Contents", []):
        level1.add(obj.get("Key").split("/")[0])

logger.debug(level1)
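
To see what the split does to individual keys, here is a minimal sketch that works on plain strings and makes no AWS call. The helper name first_level and the sample keys are assumptions for illustration, modelled on the bucket layout in the question.

```python
def first_level(keys):
    """Return the sorted, de-duplicated first path component of each key."""
    # split("/", 1)[0] keeps everything before the first "/",
    # or the whole key when it contains no "/" at all.
    return sorted({key.split("/", 1)[0] for key in keys})


# Sample keys (hypothetical), as S3 might return them for the bucket above.
keys = ["folder1/", "folder2/text1.txt", "folder2/text2.txt", "catalog.json"]
print(first_level(keys))
```

Note that this approach cannot tell a folder from a top-level file, and it still has to list every key in the bucket before filtering.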

CodePudding user response:

You can use the Delimiter option, for example:

import boto3

s3 = boto3.client("s3")
BUCKET = "bucketname"

rsp = s3.list_objects_v2(Bucket=BUCKET, Delimiter="/")

# "Contents" and "CommonPrefixes" are absent from the response when empty
objects = [obj["Key"] for obj in rsp.get("Contents", [])]
folders = [fld["Prefix"] for fld in rsp.get("CommonPrefixes", [])]

for obj in objects:
    print("Object:", obj)

for folder in folders:
    print("Folder:", folder)

Result:

Object: catalog.json
Folder: folder1/
Folder: folder2/

Note that if you have more than 1,000 keys at the top level, you will need to paginate your requests.

Also, note that list_objects is essentially deprecated and you should use list_objects_v2.
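
The pagination mentioned above can be handled with boto3's built-in paginator for list_objects_v2. The following is a rough sketch rather than a definitive implementation: the function name list_first_level is made up for illustration, and the client is passed in as an argument so the function itself does not depend on credentials being configured.

```python
def list_first_level(s3_client, bucket):
    """Collect top-level object keys and folder prefixes across all pages."""
    objects, folders = [], []
    paginator = s3_client.get_paginator("list_objects_v2")
    # Delimiter="/" makes S3 group deeper keys into CommonPrefixes.
    for page in paginator.paginate(Bucket=bucket, Delimiter="/"):
        # Either key may be missing from a page, so default to an empty list.
        objects.extend(obj["Key"] for obj in page.get("Contents", []))
        folders.extend(fld["Prefix"] for fld in page.get("CommonPrefixes", []))
    return objects, folders


# Usage, assuming boto3 is installed and AWS credentials are configured:
#   import boto3
#   objects, folders = list_first_level(boto3.client("s3"), "bucketname")
```

The paginator follows the continuation tokens for you, so there is no manual loop over NextContinuationToken.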
