I am trying to read parquet files using Spark. If I want to read the data for June, I'll do the following:
"gs://bucket/Data/year=2021/month=6/file.parquet"
if I want to read the data for all the months, I'll do the following:
"gs://bucket/Data/year=2021/month=*/file.parquet"
if I want to read the first two days of May:
"gs://bucket/Data/year=2021/month=5/day={1,2}file.parquet"
if I want to read November and December:
"gs://bucket/Data/year=2021/month={11,12}/file.parquet"
you get the idea... but what if I have a dictionary of month/day key-value pairs,
for example {1: [1,2,3], 4: [10,11,12,13]},
which means that I need to read the days [1,2,3] from January and the days [10,11,12,13] from April?
How would I reflect that as a wildcard in the path?
Thank you
CodePudding user response:
You can pass a list of paths to DataFrameReader:
months_dict = {1: [1, 2, 3], 4: [10, 11, 12, 13]}
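# build one glob path per month, e.g. day={1,2,3} matches the listed day partitions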
paths = [
f"gs://bucket/Data/year=2021/month={k}/day={{{','.join([str(d) for d in v])}}}/*.parquet"
for k, v in months_dict.items()
]
print(paths)
# ['gs://bucket/Data/year=2021/month=1/day={1,2,3}/*.parquet', 'gs://bucket/Data/year=2021/month=4/day={10,11,12,13}/*.parquet']
df = spark.read.parquet(*paths)
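If you also want year, month and day to appear as columns in the resulting DataFrame, you can additionally set the basePath option so Spark treats those directories as partition columns (a minimal sketch, assuming the same bucket layout as above):
df = (
    spark.read
    .option("basePath", "gs://bucket/Data")  # keep year/month/day as partition columns
    .parquet(*paths)
)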