How to filter s3 path while reading data from s3 using pyspark


I have an S3 folder structure like this:

bucketname/20211127123456/.parquet files
bucketname/20211127456789/.parquet files
bucketname/20211126123455/.parquet files
bucketname/20211126746352/.parquet files
bucketname/20211124123455/.parquet files
bucketname/20211124746352/.parquet files

Basically, for each day there are two folders, and inside each there are multiple parquet files that I want to read. Let's say I want to read all files from the folders for 26th and 27th Nov.

Right now I have a boto3 function that returns a Python list of the complete S3 paths of all parquet files whose path contains 20211126 or 20211127, and I pass that list to spark.read. Is there a better way to achieve this?
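Since the folders embed the date at the start of their name, one alternative (a minimal sketch, not from the original post) is to let Spark's file source resolve the prefixes itself with glob patterns and skip the boto3 listing entirely. The bucket name and date prefixes below come from the question; depending on the environment, the scheme may need to be s3a:// instead of s3://.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-s3-by-prefix").getOrCreate()

# spark.read accepts Hadoop-style globs, so the date prefix can be
# matched directly in the path instead of pre-listing files with boto3.
# On open-source Spark with hadoop-aws, use "s3a://" rather than "s3://".
df = spark.read.parquet(
    "s3://bucketname/20211126*",
    "s3://bucketname/20211127*",
)
```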

CodePudding user response:

Yes, you should partition your data by date. Then your Spark queries only need to include a date filter, and only the files for that date are read.

Here's an example of how that works with Athena; it works with Glue and Spark too.
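As a rough illustration of that answer, here is a minimal PySpark sketch of writing and reading a date-partitioned layout; the partition column name ingest_date and the target prefix s3://bucketname/partitioned/ are made up for the example.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("date-partitioned-parquet").getOrCreate()

# Write side: land each batch under a date partition instead of an
# opaque timestamped folder ("ingest_date" and the target prefix are
# hypothetical names for this sketch).
incoming = spark.read.parquet("s3://bucketname/20211127123456/")
(incoming
    .withColumn("ingest_date", F.lit("20211127"))
    .write
    .mode("append")
    .partitionBy("ingest_date")
    .parquet("s3://bucketname/partitioned/"))

# Read side: filtering on the partition column prunes the directory
# listing, so only the ingest_date=20211126 and ingest_date=20211127
# directories are scanned.
two_days = (
    spark.read.parquet("s3://bucketname/partitioned/")
         .filter(F.col("ingest_date").isin("20211126", "20211127"))
)
```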
