Excluded folder in Glue crawler throws HIVE_BAD_DATA error in Athena

I'm trying to create a Glue crawler that crawls only a specific path pattern. I have the following paths:

bucket/inference/2022/04/28/modelling/metadata.tar.gz
bucket/inference/2022/04/28/prediction/predictions.parquet
bucket/inference/2022/04/28/extract/data.parquet

The same pattern is repeated every day, i.e. we have the same three folders under

bucket/inference/2022/04/29/*
bucket/inference/2022/04/30/*

I only want to crawl what's in the **/prediction/ folder for each day. I've set up a Glue crawler pointing to bucket/inference/, with the following exclude patterns:

**/modelling/**
**/extract/**

The crawler logs show that bucket/inference/2022/04/28/modelling/metadata.tar.gz and bucket/inference/2022/04/28/extract/data.parquet are being excluded as expected, and the DDL metadata shows the correct number of objects and rows in the data.

However, when I go to SELECT * in Athena, I get the following error:

HIVE_BAD_DATA: Not valid Parquet file: s3://bucket/inference/2022/04/28/modelling/metadata.tar.gz expected magic number: PAR1
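
For context, the message refers to the 4-byte marker PAR1 that an unencrypted Parquet file carries at both its start and its end. A gzip archive such as metadata.tar.gz does not have it, so the Parquet reader rejects the file. A minimal sketch of that check against a local copy of a file (the name looks_like_parquet is made up for illustration, and this is not a claim about how Athena implements it):

```python
import os

# Marker found at the start and end of an unencrypted Parquet file.
PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(path):
    """Return True if the local file starts and ends with the Parquet magic bytes."""
    # Too small to hold both markers, so it cannot be a Parquet file.
    if os.path.getsize(path) < 2 * len(PARQUET_MAGIC):
        return False
    with open(path, "rb") as f:
        head = f.read(len(PARQUET_MAGIC))
        f.seek(-len(PARQUET_MAGIC), os.SEEK_END)
        tail = f.read(len(PARQUET_MAGIC))
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

Running this on a downloaded copy of each object under the table location is one way to find which files Athena would choke on.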

I've tried every combination of the above exclude patterns, but the query always picks up what's in the modelling folder, even though the crawler logs explicitly show it being excluded. Am I missing something here?

Many thanks.

CodePudding user response:

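This is a documented limitation of Athena rather than a problem with your exclude patterns. The patterns only control which objects the crawler inspects; at query time Athena reads every object under the table's S3 location. From the AWS troubleshooting documentation: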

Athena does not recognize exclude patterns that you specify for an AWS Glue crawler. For example, if you have an Amazon S3 bucket that contains both .csv and .json files and you exclude the .json files from the crawler, Athena queries both groups of files. To avoid this, place the files that you want to exclude in a different location.

Reference: Athena reads files that I excluded from the AWS Glue crawler (AWS)
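
One way to follow that advice is to move the modelling and extract objects out of bucket/inference/ so that only the prediction files remain under the table location. The sketch below is only an illustration under some assumptions: the destination prefix inference-excluded/ and the helper names are hypothetical, s3_client is assumed to be a boto3 S3 client (boto3.client("s3")), and the keys to process are assumed to have been listed already. It uses the client's copy_object and delete_object calls, since S3 has no native rename.

```python
# Prefix the crawler and the Athena table point at (from the question).
SOURCE_PREFIX = "inference/"
# Hypothetical sibling prefix that no table points at.
EXCLUDED_PREFIX = "inference-excluded/"
# Folders that should not be visible to Athena.
EXCLUDED_FOLDERS = ("modelling", "extract")


def relocated_key(key):
    """Return the new key for an object that should leave the table location, else None."""
    if not key.startswith(SOURCE_PREFIX):
        return None
    parts = key[len(SOURCE_PREFIX):].split("/")
    # Only the folder components matter, not the file name itself.
    if any(part in EXCLUDED_FOLDERS for part in parts[:-1]):
        return EXCLUDED_PREFIX + "/".join(parts)
    return None


def move_excluded(s3_client, bucket, keys):
    """Copy each excluded object to its new key, then delete the original."""
    moved = []
    for key in keys:
        new_key = relocated_key(key)
        if new_key is None:
            continue  # prediction files stay where they are
        s3_client.copy_object(
            Bucket=bucket,
            Key=new_key,
            CopySource={"Bucket": bucket, "Key": key},
        )
        s3_client.delete_object(Bucket=bucket, Key=key)
        moved.append((key, new_key))
    return moved
```

With this layout, inference/2022/04/28/modelling/metadata.tar.gz would end up at inference-excluded/2022/04/28/modelling/metadata.tar.gz, while the prediction files keep their keys. Changing the upstream job so it writes those folders to the other prefix in the first place avoids having to move anything afterwards.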