S3 bucket: Easy way to remove all files older than 15 minutes when there are already millions of files

Time: 03-19

I am trying to delete all the files in an S3 bucket that are older than 15 minutes. The Python script below only gets the file names. The number of files is in the millions.

import boto3
import datetime

client = boto3.client('s3')

paginator = client.get_paginator("list_objects_v2")

# Current time in UTC; S3 LastModified values are timezone-aware UTC
today_date_time = datetime.datetime.now(datetime.timezone.utc)

for page in paginator.paginate(Bucket='raw-data-ingestion-us-west-2-dev'):
    for file in page.get("Contents", []):
        file_name = file.get("Key")
        modified_time = file.get("LastModified")

        difference_days_delta = today_date_time - modified_time
        # total_seconds(), not .seconds, which wraps around every 24 hours
        difference_minutes = difference_days_delta.total_seconds() / 60
        if difference_minutes > 15:
            print("difference_minutes---", difference_minutes)
            print("file more than 15 minutes old: -", file_name)
        else:
            print("file less than 15 minutes old: -", file_name)

The above script, which only prints the names of files older than 15 minutes, is itself taking hours, and I have to stop it partway through.

So, any idea how to get the deletion done without interruption?

I am storing the files as follows:

DEV001_STEL_FOOTMODE/2022/03/02/03/40/1646192437.755104-1646192439.467863-DEV001_STEL_FOOTMODE

where

2022/03/02 (the date)
03/40 (the hour and minute)

DEV001_STEL_FOOTMODE acts as a kind of main subfolder. There are many such subfolders inside the bucket; files are stored under each one every hour, and each file name is also suffixed with the name of its main subfolder.
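Since the keys embed their own timestamp, one way to avoid scanning millions of objects is to compute the minute-level prefixes covering the recent window: anything *outside* those prefixes is older than the cutoff and can be handled without checking `LastModified` at all. A minimal sketch, assuming the key layout shown above (the function name and `subfolder` value are illustrative):

```python
import datetime

def recent_minute_prefixes(subfolder, now, minutes=15):
    """Build the minute-level key prefixes (subfolder/YYYY/MM/DD/HH/MM/)
    covering the last `minutes` minutes, matching the key layout above."""
    prefixes = []
    for i in range(minutes + 1):
        t = now - datetime.timedelta(minutes=i)
        prefixes.append(t.strftime(f"{subfolder}/%Y/%m/%d/%H/%M/"))
    return prefixes

# Example: prefixes around 2022-03-02 03:40 UTC
now = datetime.datetime(2022, 3, 2, 3, 40)
for p in recent_minute_prefixes("DEV001_STEL_FOOTMODE", now, minutes=2):
    print(p)
```

Keys that do not start with any of these prefixes are, by construction of the naming scheme, more than 15 minutes old.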

CodePudding user response:

Amazon S3 offers Object Lifecycle rules that can delete objects after a specified period.

This would be the easiest way of deleting the objects. However, the resolution is only one day, and it might take 24-48 hours for the objects to actually be deleted.
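A lifecycle rule along these lines can be applied with boto3's `put_bucket_lifecycle_configuration`; a sketch (the rule ID is illustrative, and one day is the smallest expiration period available, so it cannot match the 15-minute requirement exactly):

```python
# Expire every object one day after creation; S3 lifecycle resolution is
# daily, so this is the tightest schedule a lifecycle rule can express.
lifecycle_config = {
    "Rules": [
        {
            "ID": "expire-after-1-day",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # empty prefix = whole bucket
            "Expiration": {"Days": 1},
        }
    ]
}

# Applying the rule (requires AWS credentials; not executed here):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="raw-data-ingestion-us-west-2-dev",
#     LifecycleConfiguration=lifecycle_config,
# )
```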

You have not provided any information about how the objects are created or 'used', so my other suggestions would be:

  • If the objects aren't being used, then don't create them. (Simple!)
  • If a process is 'using' the files (eg an AWS Lambda function being triggered when each object is created), then that process can also delete the object when it has finished processing it.
  • Store the objects in separate subdirectories, so you know that you can always delete the objects in a given subdirectory (eg a new one being used every 15-30 minutes).
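If you do end up deleting objects yourself, note that `delete_objects` removes up to 1,000 keys per request, which is far fewer API calls than deleting one object at a time. A minimal sketch (the helper names are illustrative, and the list of keys to delete is assumed to come from a listing like the one in the question):

```python
def chunked(seq, size=1000):
    """Yield successive slices; delete_objects accepts at most 1,000 keys per call."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

def delete_keys(bucket, keys):
    """Batch-delete the given keys, 1,000 at a time."""
    import boto3  # imported here so the helpers above work without AWS installed
    s3 = boto3.client("s3")
    for batch in chunked(keys):
        s3.delete_objects(
            Bucket=bucket,
            Delete={"Objects": [{"Key": k} for k in batch], "Quiet": True},
        )
```

`"Quiet": True` suppresses the per-key success entries in the response, which keeps the responses small when deleting millions of objects.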