Say there are three files in an S3 folder. Does `spark.read.csv("s3://bucketname/folder1/*.csv")` read the files in order? If not, is there a way to order the files while reading the whole folder, when the files arrive at different time intervals?
File name | S3 upload / last-modified time
---|---
s3://bucketname/folder1/file1.csv | 01:00:00
s3://bucketname/folder1/file2.csv | 01:10:00
s3://bucketname/folder1/file3.csv | 01:20:00
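For reference, the wildcard read over the whole folder looks like this (a minimal sketch, assuming the `s3a` connector and CSVs with a header row); Spark makes no guarantee about the order in which the matched files are consumed:

```python
# Reads every CSV under folder1/ in one pass; the order in which
# Spark picks up the matched files is not guaranteed.
df = spark.read.csv("s3a://bucketname/folder1/*.csv", header=True)
```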
CodePudding user response:
You can achieve this as follows:

- Iterate over all the files in the bucket and load each CSV, adding a new column `last_modified`. Keep all the loaded DataFrames in a list, `dfs_list`. Since PySpark evaluates lazily, it will not load the data immediately.
```python
import boto3
from pyspark.sql.functions import lit

s3 = boto3.resource('s3')
my_bucket = s3.Bucket('bucketname')

dfs_list = []
for file_object in my_bucket.objects.filter(Prefix="folder1/"):
    if not file_object.key.endswith('.csv'):
        continue  # skip the folder marker and any non-CSV objects
    # file_object.last_modified is a Python datetime; lit() turns it
    # into a literal Column so it can be attached to every row
    df = (spark.read.csv('s3a://bucketname/' + file_object.key)
          .withColumn("last_modified", lit(file_object.last_modified)))
    dfs_list.append(df)
```
- Now take the union of all the DataFrames using PySpark's `unionAll` function, then sort the result by `last_modified`.
```python
from functools import reduce
from pyspark.sql import DataFrame

# Union all per-file DataFrames, then sort rows by each file's upload time
df_combined = reduce(DataFrame.unionAll, dfs_list)
df_combined = df_combined.orderBy('last_modified')
```
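As a quick sanity check, any action will trigger the lazy reads above and let you confirm the ordering; for example:

```python
# Materialize the combined DataFrame and verify the rows come back
# ordered by each source file's upload time
df_combined.select("last_modified").show(truncate=False)
```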