Reading parquet files in GCP using wildcards in spark


I am trying to read parquet files using spark, if I want to read the data for June, I'll do the following:

"gs://bucket/Data/year=2021/month=6/file.parquet"

if I want to read the data for all the months, I'll do the following:

"gs://bucket/Data/year=2021/month=*/file.parquet"

if I want to read the first two days of May:

"gs://bucket/Data/year=2021/month=5/day={1,2}/file.parquet"

if I want to read November and December:

"gs://bucket/Data/year=2021/month={11,12}/file.parquet"

you get the idea... but what if I have a dictionary of month → days key/value pairs, for example {1: [1,2,3], 4: [10,11,12,13]}? That means I need to read days [1,2,3] from January and days [10,11,12,13] from April. How would I express that as a wildcard in the path?

Thank you

CodePudding user response:

You can build one glob path per month and pass the whole list of paths to DataFrameReader:

months_dict = {1: [1, 2, 3], 4: [10, 11, 12, 13]}

# one glob path per month; the {d1,d2,...} alternation lists that month's days
paths = [
    f"gs://bucket/Data/year=2021/month={k}/day={{{','.join(str(d) for d in v)}}}/*.parquet"
    for k, v in months_dict.items()
]

print(paths)
# ['gs://bucket/Data/year=2021/month=1/day={1,2,3}/*.parquet', 'gs://bucket/Data/year=2021/month=4/day={10,11,12,13}/*.parquet']

df = spark.read.parquet(*paths)
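
Note that a single glob could not express this mapping: putting brace alternations in both the month and the day segments matches the cross product of the two sets, not just the pairs you asked for. A quick sketch of the over-match (plain Python, no Spark needed):

```python
from itertools import product

# A single path like .../month={1,4}/day={1,2,3,10,11,12,13}/*.parquet
# matches every (month, day) combination of the two alternations...
months = [1, 4]
days = [1, 2, 3, 10, 11, 12, 13]
matched = set(product(months, days))

# ...including pairs that were never requested, e.g. month=4/day=1:
wanted = {(1, 1), (1, 2), (1, 3), (4, 10), (4, 11), (4, 12), (4, 13)}
extra = matched - wanted
print(sorted(extra))
```

That over-matching is why the snippet above builds a separate path per month and hands them to spark.read.parquet as individual arguments.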