I have an S3 bucket with 4 folders, one of which is input/. After my Airflow DAG runs, a few lines at the end of the Python code attempt to delete all files in input/:
response_keys = self._s3_hook.delete_objects(bucket=self.s3_bucket, keys=s3_input_keys)
deleted_keys = [x['Key'] for x in response_keys.get("Deleted", []) if x['Key'] not in ['input/']]
self.log.info("Deleted: %s", deleted_keys)
if "Errors" in response_keys:
    errors_keys = [x['Key'] for x in response_keys.get("Errors", [])]
    raise AirflowException("Errors when deleting: {}".format(errors_keys))
Now, this sometimes deletes just the files and sometimes deletes the directory itself. I am not sure why it deletes the directory even though I have specifically excluded it.
Is there any other way I can try to achieve the deletion?
PS: I tried using boto directly, but our AWS security will not let it access the buckets, so the Hook is all I've got. Please help.
CodePudding user response:
Directories do not exist in Amazon S3. Instead, the Key (filename) of an object includes the full path. For example, the Key might be invoices/january.xls, which includes the path.
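You can see this by listing a prefix with the Airflow S3Hook. A minimal sketch (the bucket name and prefix are placeholders, and the import path assumes the Amazon provider package; older installs use airflow.hooks.S3_hook):

from airflow.providers.amazon.aws.hooks.s3 import S3Hook

hook = S3Hook()  # uses the default AWS connection

# Each returned Key is the full path, not a bare filename.
for key in hook.list_keys(bucket_name="my-bucket", prefix="invoices/") or []:
    print(key)  # e.g. "invoices/january.xls"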
When an object is created in a path, the directory magically appears. If all objects in a directory are deleted, then the directory magically disappears (because it never actually existed).
However, if you click the Create Folder button in the Amazon S3 management console, a zero-byte object is created with the name of the directory. This forces the directory to 'appear' since there is an object in that path. However, the directory does not actually exist!
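You can reproduce this behaviour yourself. As a rough sketch (bucket name is a placeholder), uploading a zero-byte object whose key ends in "/" does exactly what the Create Folder button does:

from airflow.providers.amazon.aws.hooks.s3 import S3Hook

hook = S3Hook()

# Mimic the console's "Create Folder" button: a zero-byte object whose
# key ends in "/" makes the "folder" appear even when it holds no files.
hook.load_string(string_data="", key="input/", bucket_name="my-bucket", replace=True)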
So, your Airflow job might be deleting all the objects in a given path, which causes the directory to disappear. This is quite okay and nothing to be worried about. However, if the Create Folder button was used to create the folder, then the folder will still exist when all objects are deleted (assuming that the delete operation does not also delete the zero-byte object).
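Therefore, rather than filtering the response after the deletion has already happened, filter the keys before calling delete_objects. Something like this untested sketch (adjust the bucket name to your setup) should delete the files while leaving the zero-byte folder marker alone:

from airflow.providers.amazon.aws.hooks.s3 import S3Hook

hook = S3Hook()
bucket = "my-bucket"

# List everything under input/ and drop the zero-byte folder marker
# *before* the delete call -- excluding it from the response afterwards,
# as in your code, happens too late to save it.
keys = hook.list_keys(bucket_name=bucket, prefix="input/") or []
keys_to_delete = [k for k in keys if k != "input/"]

if keys_to_delete:
    hook.delete_objects(bucket=bucket, keys=keys_to_delete)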