I have a list of files in Parquet format:
-- s3://my-bucket/files/14/09/12/file.pq
-- s3://my-bucket/files/14/09/11/file.pq
# 14 = day, 09 = month, 12 / 11 = hour.
If I pass the absolute path to my Spark session, it can read the file without any issue:
spark.read.parquet('s3://my-bucket/files/14/09/12/file.pq')
If I pass in
spark.read.parquet('s3://my-bucket/files/14')
then I will get the following error:
AnalysisException: 'Unable to infer schema for Parquet. It must be specified manually.;'
This is, I believe, because the partitions are unnamed. I have no control over the source, so unfortunately I can't change it.
My hacky workaround is to list all the files, take the unique set of lowest-level paths, and pass those into Spark (sketched below).
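Roughly, the workaround looks like this (a simplified sketch using boto3; the bucket/prefix handling is trimmed down):

import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

# Collect the distinct lowest-level "directories" that contain parquet files
leaf_dirs = set()
for page in paginator.paginate(Bucket='my-bucket', Prefix='files/14/'):
    for obj in page.get('Contents', []):
        key = obj['Key']
        if key.endswith('.pq'):
            leaf_dirs.add('s3://my-bucket/' + key.rsplit('/', 1)[0])

# spark.read.parquet accepts multiple paths
df = spark.read.parquet(*sorted(leaf_dirs))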
Is there a better workaround?
CodePudding user response:
The easier way is to use * wildcards to match all directories within the path:
df = spark.read.parquet('s3://my-bucket/files/*/*/*/')
If you want to retrieve day, month and hour, follow my answer here.
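For reference, one common way to recover those values (a sketch, not the linked answer itself) is to parse the file path returned by input_file_name(); the regex below assumes the .../files/<day>/<month>/<hour>/file.pq layout from the question:

from pyspark.sql import functions as F

df = spark.read.parquet('s3://my-bucket/files/*/*/*/')

# Derive day/month/hour columns from each row's source file path
path_pattern = r'files/(\d+)/(\d+)/(\d+)/'
df = (df
      .withColumn('path', F.input_file_name())
      .withColumn('day',   F.regexp_extract('path', path_pattern, 1))
      .withColumn('month', F.regexp_extract('path', path_pattern, 2))
      .withColumn('hour',  F.regexp_extract('path', path_pattern, 3)))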