How to load partitioned parquet dataset with no partition names (in directory names)?


I have a list of files in Parquet format:

-- s3://my-bucket/files/14/09/12/file.pq
-- s3://my-bucket/files/14/09/11/file.pq
# 14 = day, 09 = month, 12/11 = hour.

If I pass the absolute path to Spark, it reads the file without any issue:

spark.read.parquet('s3://my-bucket/files/14/09/12/file.pq')

If I instead pass in

spark.read.parquet('s3://my-bucket/files/14')

then I will get the following error:

AnalysisException: 'Unable to infer schema for Parquet. It must be specified manually.;'

This is, I believe, because the partitions are unnamed: Spark's partition discovery only recognizes key=value directory names, so bare numeric directories like 14/09/12 aren't treated as partition columns. I have no control over the source, so unfortunately I can't change the layout.
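
For comparison, a layout that Spark's partition discovery would pick up automatically uses key=value directory names, for example:

-- s3://my-bucket/files/day=14/month=09/hour=12/file.pq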

My hacky workaround is to list all the files, take the unique set of lowest-level directory paths, and pass that list into Spark (see the sketch below).
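
For concreteness, a minimal sketch of that workaround, assuming boto3 is available and a SparkSession named spark already exists; the bucket name and prefix are taken from the paths above:

import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

# Collect the unique parent "directories" of every .pq object under the prefix
dirs = set()
for page in paginator.paginate(Bucket='my-bucket', Prefix='files/14'):
    for obj in page.get('Contents', []):
        key = obj['Key']
        if key.endswith('.pq'):
            dirs.add('s3://my-bucket/' + key.rsplit('/', 1)[0])

# spark.read.parquet accepts multiple paths
df = spark.read.parquet(*sorted(dirs))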

Is there a better workaround?

CodePudding user response:

An easier way is to use * wildcards to match all the directories within the path:

df = spark.read.parquet('s3://my-bucket/files/*/*/*/')

If you want to retrieve the day, month, and hour, follow my answer here.
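
Since the linked answer isn't reproduced here, one possible sketch: derive the columns from each row's file path with input_file_name(). The regex and the day/month/hour column names are assumptions based on the layout described in the question:

from pyspark.sql import functions as F

df = spark.read.parquet('s3://my-bucket/files/*/*/*/')

# Each path looks like .../files/<day>/<month>/<hour>/file.pq
pattern = r'files/(\d+)/(\d+)/(\d+)/'
path = F.input_file_name()
df = (df
      .withColumn('day',   F.regexp_extract(path, pattern, 1))
      .withColumn('month', F.regexp_extract(path, pattern, 2))
      .withColumn('hour',  F.regexp_extract(path, pattern, 3)))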
