I am continuously adding Parquet data sets to an S3 folder with a structure like this:
s3:::my-bucket/public/data/set1
s3:::my-bucket/public/data/set2
s3:::my-bucket/public/data/set3
At the beginning I only have set1, and my crawler is configured to run on the whole bucket s3:::my-bucket. This leads to the creation of a partitioned table named my-bucket with partitions named public, data and set1. What I actually want is a table named set1 without any partitions.
I see why this happens, as it is explained under "How Does a Crawler Determine When to Create Partitions?". But when a new data set is uploaded (e.g. set2), I don't want it to become another partition, because it is completely different data with a different schema.
How can I force the Glue crawler to NOT create partitions?
I know I could define the crawler path as s3:::my-bucket/public/data/, but unfortunately I don't know where the new data sets will be created (e.g. it could also be s3:::my-bucket/other/folder/set2).
Any ideas how to solve this?
CodePudding user response:
You can use the TableLevelConfiguration option to specify at which folder level the crawler should look for tables.
More information on that here.
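As a rough sketch of how this could look with boto3: TableLevelConfiguration goes into the Grouping section of the crawler's Configuration JSON string. The helper names below (build_crawler_configuration, apply_table_level) and the crawler name are made up for illustration. The level of 4 assumes that the bucket itself counts as level 1, so that my-bucket/public/data/set1 and my-bucket/other/folder/set2 both sit at level 4. Please check that assumption against your own layout before relying on it.

```python
import json


def build_crawler_configuration(table_level: int) -> str:
    """Return a crawler Configuration JSON string with a fixed table level."""
    if table_level < 1:
        raise ValueError("table_level must be a positive integer")
    return json.dumps(
        {"Version": 1.0, "Grouping": {"TableLevelConfiguration": table_level}}
    )


def apply_table_level(crawler_name: str, table_level: int) -> None:
    """Update an existing crawler; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the helper above works without boto3

    glue = boto3.client("glue")
    glue.update_crawler(
        Name=crawler_name,
        Configuration=build_crawler_configuration(table_level),
    )


if __name__ == "__main__":
    # Assumed levels: my-bucket (1) / public (2) / data (3) / set1 (4)
    print(build_crawler_configuration(4))
    # apply_table_level("my-crawler", 4)  # hypothetical crawler name
```

Note that a single fixed level only helps if all data sets are at the same depth below the bucket, as in the two example paths from the question. If new data sets can appear at different depths, this setting alone will probably not be enough.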