spark.read.json() how to read files using a dynamic year parameter


I have an s3 directory that looks something like this:

my-data/data-events/year/month/day/hour/file.json

I have the years 2018, 2019, 2020, 2021 and 2022. I want to read in everything from 2020 onward. How would I do that without moving files around? I currently have my read function like this, which reads in all files:

spark.read.json("s3a://my-data/data-events/*/*/*/*/*")

CodePudding user response:

How about: spark.read.json("s3a://my-data/data-events/{2020,2021,2022}/*/*/*/*")

To make it dynamic based on the current year:

from datetime import datetime

# Build a comma-separated list of years from 2020 through the current year
current_year = datetime.now().year
years_to_query = ",".join(str(x) for x in range(2020, current_year + 1))

# Expands to e.g. "s3a://my-data/data-events/{2020,2021,2022}/*/*/*/*"
df = spark.read.json(f"s3a://my-data/data-events/{{{years_to_query}}}/*/*/*/*")

Also, if you had partitioned your data, i.e. if the paths had been my-data/data-events/year=YYYY/month=MM/day=DD/hour=HH/file.json, you could simply call spark.read.json("s3a://my-data/data-events") and then add .filter(col("year") >= 2020). Spark would apply that filter against the folder paths (partition pruning) rather than by reading the JSON files themselves; see the sketch below.
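A minimal sketch of that partitioned read, assuming the key=value directory layout above (the SparkSession setup and app name are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("read-events").getOrCreate()

# Spark infers year/month/day/hour as partition columns from the
# year=YYYY/month=MM/day=DD/hour=HH directory names.
df = spark.read.json("s3a://my-data/data-events")

# Partition pruning: only directories with year >= 2020 are listed and read;
# the JSON files under 2018 and 2019 are never opened.
recent = df.filter(col("year") >= 2020)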
