I have more than 1050 JSON files in an S3 location, each containing a field 'id'. I am looping over these files and reading each id with get_object. I then pass each id along with a URL to get another JSON response, which contains a snapshotLocation field, i.e. a link to download a file. I download that file and write it to an S3 location using s3_client.upload_fileobj(BytesIO(response.content), bucket_name, api_download_file_path + file_name). All good, but every time I run this I get only 1000 CSV files in the destination S3 location when I am expecting 1050. Is this due to some limit on upload_fileobj?
Full code here:
import json
from io import BytesIO

import boto3
import requests

s3_client = boto3.client('s3')

result = s3_client.list_objects(Bucket=bucket_name, Prefix=api_target_read_path)
for res in result.get('Contents'):
    # Read each JSON file from S3 and extract its id
    data = s3_client.get_object(Bucket=bucket_name, Key=res.get('Key'))
    contents = data['Body'].read().decode('utf-8')
    json_data = json.loads(contents)
    print(json_data['id'])
    json_id = json_data['id']

    # Call the API with the id to get the snapshot (download) URL
    geturl = inv_avail_get_api_url + json_id
    response = requests.get(geturl, headers=headers)
    print(response.text)
    durl = response.json()["response"]["snapshotLocation"]

    # Download the snapshot; the file name is the last URL segment
    # with any query string stripped off
    response = requests.get(durl)
    segments = durl.rpartition('/')
    file_name = str(segments[2]).split('?')[0]
    print(file_name)

    # Write the downloaded file to the destination S3 location
    s3_client.upload_fileobj(BytesIO(response.content), bucket_name, api_download_file_path + file_name)
CodePudding user response:
You need to use the paginator class if you are trying to list more than 1000 objects, as per the docs:
Some AWS operations return results that are incomplete and require subsequent requests in order to attain the entire result set. The process of sending subsequent requests to continue where a previous request left off is called pagination. For example, the list_objects operation of Amazon S3 returns up to 1000 objects at a time, and you must send subsequent requests with the appropriate Marker in order to retrieve the next page of results.
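For reference, this is roughly what that continuation loop looks like if you drive it by hand (a minimal sketch using list_objects_v2, which takes a ContinuationToken rather than the legacy Marker, and reusing your bucket_name and api_target_read_path variables):

# Manual pagination: keep requesting pages until IsTruncated is False
kwargs = {'Bucket': bucket_name, 'Prefix': api_target_read_path}
while True:
    page = s3_client.list_objects_v2(**kwargs)
    for obj in page.get('Contents', []):  # 'Contents' is absent on empty pages
        print(obj['Key'])
    if not page.get('IsTruncated'):
        break
    kwargs['ContinuationToken'] = page['NextContinuationToken']

The paginator class wraps that bookkeeping for you: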
s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket='bucket', Prefix='prefix')
for page in pages:
    for obj in page['Contents']:
        print(obj['Size'])
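Applied to your loop, it would look something like this (a sketch reusing your existing bucket_name, api_target_read_path, headers, inv_avail_get_api_url and api_download_file_path variables):

paginator = s3_client.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket=bucket_name, Prefix=api_target_read_path)

for page in pages:
    for res in page.get('Contents', []):
        # Read the id from each JSON file, as before
        data = s3_client.get_object(Bucket=bucket_name, Key=res['Key'])
        json_data = json.loads(data['Body'].read().decode('utf-8'))
        json_id = json_data['id']

        # Fetch the snapshot URL from the API and download the file
        api_response = requests.get(inv_avail_get_api_url + json_id, headers=headers)
        durl = api_response.json()["response"]["snapshotLocation"]
        download = requests.get(durl)

        # File name is the last path segment, minus any query string
        file_name = durl.rpartition('/')[2].split('?')[0]
        s3_client.upload_fileobj(BytesIO(download.content), bucket_name,
                                 api_download_file_path + file_name)

upload_fileobj itself has no 1000-object limit; the listing was simply capped at the first 1000 keys, so paginating (with the paginator or the manual loop above) is all you need to reach the remaining 50 files.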