I have a Spark workflow that writes a CSV file into its own directory with a randomized filename, along with a few accessory files that are not .csv
files. I need to read that CSV file in a separate Python workflow. If I knew the exact filename, I would use:
bucket = "bucketName"
file_name = "/user/myName/output/date/dataset/file_name.csv"
s3 = boto3.client('s3')
obj = s3.get_object(Bucket=bucket, Key=file_name)
Since I don't know the exact file name, I simply need to read the only file in that S3 path that has a .csv extension.
CodePudding user response:
You will need to provide the exact Key to S3 to access the object.
Therefore, you will first need to list the contents of the bucket. Here's some code that prints the key of the first CSV object under a given prefix:
import boto3

s3 = boto3.client('s3')
response = s3.list_objects_v2(Bucket=bucket, Prefix='folder1/')

# Keep only the keys that end in .csv; use .get() so an empty prefix doesn't raise a KeyError
objects = [obj['Key'] for obj in response.get('Contents', []) if obj['Key'].endswith('.csv')]

if objects:
    print(objects[0])
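Since the Spark output directory is expected to contain exactly one .csv file, you could also factor the filtering into a small helper that makes that assumption explicit and fails loudly otherwise. A minimal sketch; the function name and the sample keys are illustrative, not from the question:

```python
def find_single_csv(keys):
    """Return the one key ending in .csv, or raise if there isn't exactly one."""
    csv_keys = [k for k in keys if k.endswith('.csv')]
    if len(csv_keys) != 1:
        raise ValueError(f"expected exactly one .csv key, found {len(csv_keys)}")
    return csv_keys[0]

# Example listing shaped like a Spark output directory (hypothetical keys):
keys = [
    "user/myName/output/date/dataset/_SUCCESS",
    "user/myName/output/date/dataset/part-00000-abc123.csv",
    "user/myName/output/date/dataset/.part-00000-abc123.csv.crc",
]
print(find_single_csv(keys))  # prints the part-...csv key
```

You would feed this the `Key` values from the `list_objects_v2` response above, then pass its result as the `Key` argument to `get_object`.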