We have an AWS Glue job that reads data from an Athena table populated by Hudi. Unfortunately, we are running into an error when create_dynamic_frame.from_catalog tries to read from these tables:
An error occurred while calling o82.getDynamicFrame. s3://bucket/folder/.hoodie/20220831153536771.commit is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [32, 125, 10, 125]
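To make the error easier to read, the two byte lists can be decoded. This is only an illustrative sketch: the variable names are made up, and the interpretation is an inference from the bytes rather than something the error states. Parquet files end with the 4-byte magic PAR1, whereas the bytes that were found look like the closing braces of a JSON document, which would be consistent with the .commit file being metadata rather than data.

```python
# Byte values copied from the error message above.
expected = bytes([80, 65, 82, 49])    # what the Parquet reader wants at the end of the file
found = bytes([32, 125, 10, 125])     # what it actually read from the .commit file

print(expected)  # b'PAR1'
print(found)     # b' }\n}'
```

So the reader is being handed a non-Parquet file from the .hoodie folder, which is why excluding those paths is the goal.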
This appears to be a known issue on GitHub: https://github.com/apache/hudi/issues/5891
Unfortunately, no workaround was provided. We are trying to exclude either the .hoodie folder or the *.commit file(s) through the additional_options of the create_dynamic_frame.from_catalog call. So far, we have had no success excluding either a file or a folder. Note: we have .hoodie files in the root directory as well as a .hoodie folder that contains a .commit file, among other files. We would prefer to exclude them all.
"exclusions": (Optional) A string containing a JSON list of Unix-style glob patterns to exclude. For example, "["**.pdf"]" excludes all PDF files. For more information about the glob syntax that AWS Glue supports, see Include and Exclude Patterns.
Question: how do we exclude both file and folder from a connection?
- Folder
datasource0 = glueContext.create_dynamic_frame.from_catalog(database=args['ENV'] + "_some_database", table_name="some_table", transformation_ctx="datasource_x1", additional_options={"exclusions": "[\".hoodie/**\"]"})
- File
datasource0 = glueContext.create_dynamic_frame.from_catalog(database=args['ENV'] + "_some_database", table_name="some_table", transformation_ctx="datasource_x1", additional_options={"exclusions": "[\"**.commit\"]"})
Answer:
Turns out the originally attempted solution of {"exclusions": "[\"**.commit\"]"} worked. Unfortunately, I wasn't paying close enough attention, and there were multiple tables that needed to be excluded. After hacking through all of the file types, here are two working solutions:
- Exclude folder
additional_options={"exclusions": "[\"s3://bucket/folder/.hoodie/*\"]"}
- Exclude file(s)
additional_options={"exclusions": "[\"**.commit\",\"**.inflight\",\"**.properties\"]"}
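The hand-escaped JSON strings above are easy to get wrong, so one option is to build the "exclusions" value with json.dumps instead. This is a minimal sketch, not part of the original solution: build_exclusions and HUDI_METADATA_PATTERNS are hypothetical names introduced here, and the patterns are the ones listed above.

```python
import json

def build_exclusions(patterns):
    """Serialize glob patterns into the JSON-list string that "exclusions" expects.

    Hypothetical helper: it only does the string escaping, nothing Glue-specific.
    """
    return json.dumps(list(patterns))

# Patterns taken from the working solution above.
HUDI_METADATA_PATTERNS = ["**.commit", "**.inflight", "**.properties"]

additional_options = {"exclusions": build_exclusions(HUDI_METADATA_PATTERNS)}
print(additional_options["exclusions"])
```

The resulting dict can then be passed as additional_options=additional_options to create_dynamic_frame.from_catalog, exactly as in the calls from the question.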