How to exclude either files or folder paths on S3 within an AWS Glue job when reading an Athena table

Time:09-27

We have an AWS Glue job that reads data from an Athena table populated by Apache Hudi. Unfortunately, we run into an error when create_dynamic_frame.from_catalog tries to read from these tables.

An error occurred while calling o82.getDynamicFrame. s3://bucket/folder/.hoodie/20220831153536771.commit is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [32, 125, 10, 125]
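
The two byte sequences in the error can be decoded to see what the reader actually found. This is a minimal sketch in plain Python (no Glue required); the variable names are illustrative only. The expected tail is the Parquet magic number, while the bytes found look like the end of a JSON text file, which is consistent with a Hudi metadata file rather than a data file.

```python
# Byte values copied from the error message above.
expected_tail = bytes([80, 65, 82, 49])
found_tail = bytes([32, 125, 10, 125])

# Parquet files end with the 4-byte magic number "PAR1".
print(expected_tail)  # b'PAR1'

# The .commit file ends with a space, '}', newline, '}' instead.
print(found_tail)     # b' }\n}'
```

So the reader is not hitting a corrupt Parquet file; it is being handed a non-Parquet file that should not be part of the scan at all.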

This appears to be a known issue on GitHub: https://github.com/apache/hudi/issues/5891

Unfortunately, no workaround was provided there. We are trying to exclude either the .hoodie folder or the *.commit file(s) through the additional_options of the create_dynamic_frame.from_catalog call, but so far we have had no success excluding either a file or a folder. Note: we have .hoodie files in the root directory as well as a .hoodie folder that contains a .commit file, among other files. We would prefer to exclude them all.

Per AWS:

"exclusions": (Optional) A string containing a JSON list of Unix-style glob patterns to exclude. For example, "[\"**.pdf\"]" excludes all PDF files. For more information about the glob syntax that AWS Glue supports, see Include and Exclude Patterns.

Question: how do we exclude both file and folder from a connection?

  • Folder

datasource0 = glueContext.create_dynamic_frame.from_catalog(database=args['ENV'] + "_some_database", table_name="some_table", transformation_ctx="datasource_x1", additional_options={"exclusions": "[\".hoodie/**\"]"})

  • File

datasource0 = glueContext.create_dynamic_frame.from_catalog(database=args['ENV'] + "_some_database", table_name="some_table", transformation_ctx="datasource_x1", additional_options={"exclusions": "[\"**.commit\"]"})

Answer:

It turns out the originally attempted solution, {"exclusions": "[\"**.commit\"]"}, worked. I wasn't paying close enough attention: there were multiple tables that needed the exclusion. After working through all of the file types involved, here are two working solutions:

  • Exclude folder

additional_options={"exclusions": "[\"s3://bucket/folder/.hoodie/*\"]"}

  • Exclude file(s)

additional_options={"exclusions": "[\"**.commit\",\"**.inflight\",\"**.properties\"]"}
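
Hand-escaping the JSON list inside a Python string is easy to get wrong, so one option is to build it with json.dumps. The sketch below assumes the same patterns as above; build_exclusions, read_table and HUDI_METADATA_PATTERNS are hypothetical names, not part of the Glue API, and the glue_context argument is assumed to be an existing GlueContext from the job.

```python
import json

# Patterns taken from the working solution above.
HUDI_METADATA_PATTERNS = ["**.commit", "**.inflight", "**.properties"]


def build_exclusions(patterns):
    """Hypothetical helper: serialize glob patterns into the JSON-list
    string that the "exclusions" option expects."""
    return json.dumps(list(patterns))


def read_table(glue_context, database, table_name):
    """Hypothetical wrapper around from_catalog; glue_context is assumed
    to be the GlueContext already created in the job."""
    return glue_context.create_dynamic_frame.from_catalog(
        database=database,
        table_name=table_name,
        transformation_ctx="datasource_x1",
        additional_options={
            "exclusions": build_exclusions(HUDI_METADATA_PATTERNS)
        },
    )


# The serialized value can be inspected without running a Glue job.
exclusions = build_exclusions(HUDI_METADATA_PATTERNS)
print(exclusions)
```

Keeping the patterns in one list also makes it easier to apply the same exclusions to every table the job reads.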
