I have an S3 directory that looks something like this:
my-data/data-events/year/month/day/hour/file.json
I have the years 2018, 2019, 2020, 2021 and 2022. I want to read in everything from 2020 onward. How would I do that without moving files around? I currently have my read function like this, which reads in all files:
spark.read.json("s3a://my-data/data-events/*/*/*/*/*")
CodePudding user response:
How about spark.read.json("s3a://my-data/data-events/{2020,2021,2022}/*/*/*/*")?
To make it dynamic based on the current year:
from datetime import datetime

# Build a comma-separated list of years from 2020 through the current year;
# range() is exclusive at the top, hence current_year + 1
current_year = datetime.now().year
years_to_query = ",".join([str(x) for x in range(2020, current_year + 1)])

path = f"s3a://my-data/data-events/{{{years_to_query}}}/*/*/*/*"
Also, if you had partitioned your data Hive-style, i.e. the path were my-data/data-events/year=YYYY/month=MM/day=DD/hour=HH/file.json, you could simply do spark.read.json("s3a://my-data/data-events") and then .filter(col("year") >= '2020'). That filter would be applied against the folder paths (partition pruning), not by scanning the JSON files themselves.
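A minimal sketch of that approach (assumes the Hive-style year=/month=/day=/hour= layout above and an existing SparkSession named spark):

from pyspark.sql.functions import col

# Partition discovery turns the year/month/day/hour directories into columns
df = spark.read.json("s3a://my-data/data-events")

# Pruned at the directory level: only year >= 2020 partitions get listed and read
recent = df.filter(col("year") >= 2020)

Note that Spark infers partition column types by default, so year comes back as an integer here; comparing against the string '2020' also works via implicit casting.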