After using the coalesce() function in PySpark (Databricks), the file was saved as a single CSV with an odd name that starts with part-00000 and ends with the .csv extension. I would like to rename it to a more user-friendly name inside a function.
I am trying the approach suggested here: https://medium.com/plusteam/move-and-rename-objects-within-an-s3-bucket-using-boto-3-58b164790b78
import boto3
s3_resource = boto3.resource('s3')
# Copy object A as object B
s3_resource.Object('bucket_name', 'newpath/to/object_B.txt').copy_from(
    CopySource='bucket_name/path/to/your/object_A.txt')
# Delete the former object A
s3_resource.Object('bucket_name', 'path/to/your/object_A.txt').delete()
The above code copies the object under the new name and then deletes the original. However, after several tries, it only works when I put the entire odd name in CopySource.
Since there is only one oddly-named file, what I would like to do is use *.csv, the way it works with pandas. I tried the endswith() function, but it does not seem to work.
The answer from this question, Rename Pyspark output files in s3, renames each partition, so there is an obvious pattern:
import datetime
import boto3
s3 = boto3.resource('s3')
for i in range(5):
    date = datetime.datetime(2019, 4, 29) + datetime.timedelta(days=i)
    date = date.strftime("%Y-%m-%d")
    print(date)
    old_date = 'file_path/FLORIDA/DATE={}/part-00000-1691d1c6-2c49-4cbe-b454-d0165a0d7bde.c000.csv'.format(date)
    print(old_date)
    date = date.replace('-', '')
    new_date = 'file_path/FLORIDA/allocation_FLORIDA_{}.csv'.format(date)
    print(new_date)
    s3.Object('my_bucket', new_date).copy_from(CopySource='my_bucket/' + old_date)
    s3.Object('my_bucket', old_date).delete()
I think with pandas it would have been (note the use of *):
import boto3
s3_resource = boto3.resource('s3')
# Copy object A as object B
s3_resource.Object('bucket_name', 'newpath/to/object_B.csv').copy_from(
    CopySource='bucket_name/path/to/your/*.csv')
# Delete the former object A
s3_resource.Object('bucket_name', 'path/to/your/*.csv').delete()
but when used within Databricks, it returns None.
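As far as I know, the S3 API has no server-side wildcard support, so a *.csv match would have to happen client-side against the listed keys. A minimal sketch of that matching logic with the stdlib fnmatch module (the keys here are made up):

```python
from fnmatch import fnmatch

# Hypothetical listing of keys under the output prefix
keys = [
    "file_path/FLORIDA/DATE=2019-04-29/part-00000-1691d1c6.c000.csv",
    "file_path/FLORIDA/DATE=2019-04-29/_SUCCESS",
]

# Keep only the keys matching the *.csv pattern
csv_keys = [k for k in keys if fnmatch(k, "*.csv")]
print(csv_keys)
```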
CodePudding user response:
You may be able to integrate this into your function, using the variables below as its arguments:
import boto3

bucket = 'bucket_name'
prefix = 'file/path/down/to/last/folder'
filename = 'new_filename'

s3 = boto3.resource('s3')
s3_bucket = s3.Bucket(bucket)

# List the objects filtered to the prefix
for file in s3_bucket.objects.filter(Prefix=prefix):
    # Look for the oddly-named file produced when saving as one CSV:
    # this condition matches the original file that ends in .csv
    if file.key.endswith('.csv'):
        # Copy that file under the new name
        s3.Object(bucket, prefix + '/' + filename + '.csv').copy_from(
            CopySource=bucket + '/' + file.key)
        # Optionally, delete the original file
        # (be careful with this line: if there is more than one
        # .csv file under the prefix, it will be deleted as well)
        s3.Object(bucket, file.key).delete()
Let me know if this does not work and we will make changes based on whatever error you get, if any.
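To make this easier to drop into a function, the key-building logic can be factored out and tested without touching S3. Only the names here are placeholders:

```python
def build_new_key(prefix, filename):
    """Target key for the renamed CSV under the same prefix."""
    return prefix + '/' + filename + '.csv'

def find_part_csv(keys):
    """Return the first key that looks like a Spark part file, else None."""
    for key in keys:
        if key.endswith('.csv'):
            return key
    return None

# The boto3 part of the function would then be roughly:
#   old_key = find_part_csv(o.key for o in s3_bucket.objects.filter(Prefix=prefix))
#   s3.Object(bucket, build_new_key(prefix, filename)).copy_from(
#       CopySource=bucket + '/' + old_key)
#   s3.Object(bucket, old_key).delete()

keys = ['file/path/part-00000-abc.c000.csv', 'file/path/_SUCCESS']
print(find_part_csv(keys))
print(build_new_key('file/path', 'new_filename'))
```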