Say there are three files in an S3 folder. Does `spark.read.csv("s3://bucketname/folder1/*.csv")` read the files in order? If not, is there a way to order the files while reading the whole folder, when the files arrive at different time intervals?
File name | S3 upload / last-modified time
---|---
s3://bucketname/folder1/file1.csv | 01:00:00
s3://bucketname/folder1/file2.csv | 01:10:00
s3://bucketname/folder1/file3.csv | 01:20:00
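For reference, the wildcard read over the whole folder looks like this (a minimal sketch, assuming the `s3a` connector and CSVs with a header row); Spark makes no guarantee about the order in which the matched files are consumed:

```python
# Reads every CSV under folder1/ in one pass; the order in which
# Spark picks up the matched files is not guaranteed.
df = spark.read.csv("s3a://bucketname/folder1/*.csv", header=True)
```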
CodePudding user response:
You can achieve this as follows:

- Iterate over all the files in the bucket and load each CSV, adding a new column `last_modified`. Keep all the loaded DataFrames in a list, `dfs_list`. Since PySpark evaluates lazily, it will not load the data immediately.
```python
import boto3
from pyspark.sql.functions import lit

s3 = boto3.resource('s3')
my_bucket = s3.Bucket('bucketname')

dfs_list = []
for file_object in my_bucket.objects.filter(Prefix="folder1/"):
    if not file_object.key.endswith('.csv'):
        continue  # skip the folder marker and any non-CSV objects
    # file_object.last_modified is a Python datetime; lit() turns it
    # into a literal Column so it can be attached to every row
    df = (spark.read.csv('s3a://bucketname/' + file_object.key)
          .withColumn("last_modified", lit(file_object.last_modified)))
    dfs_list.append(df)
```
- Now take the union of all the DataFrames using PySpark's `unionAll` function, then sort the result by `last_modified`.
```python
from functools import reduce
from pyspark.sql import DataFrame

# Union all per-file DataFrames, then sort rows by each file's upload time
df_combined = reduce(DataFrame.unionAll, dfs_list)
df_combined = df_combined.orderBy('last_modified')
```
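As a quick sanity check, any action will trigger the lazy reads above and let you confirm the ordering; for example:

```python
# Materialize the combined DataFrame and verify the rows come back
# ordered by each source file's upload time
df_combined.select("last_modified").show(truncate=False)
```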