I am working on a project that uses Glue 3.0 and PySpark to process large amounts of data between S3 buckets. The data is read from an S3 bucket into a DynamicFrame with GlueContext.create_dynamic_frame_from_options, with the recurse connection option set to True because the data is heavily nested. I only want to read files that end in meta.json, so I have set the exclusions option to exclude any files that end in data.csv:
"exclusions": ['**.{txt, csv}', '**/*.data.csv', '**.data.csv', '*.data.csv']
However, I consistently get the following error:
An error occurred while calling o90.pyWriteDynamicFrame. Unable to parse file: <filename>.data.csv
Is it possible to log the full S3 URI to the output logs, or to keep track of which files have and have not been processed? And why is Glue still trying to parse this file even though it matches the exclusions?
CodePudding user response:
The exclusions option has to be a string containing a JSON list of glob patterns, not a Python list:
"exclusions": "[\"**/*.txt\", \"**/*.csv\"]",
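To avoid hand-escaping the quotes, one option is to build that string with json.dumps. The snippet below is only a sketch: the bucket path is a made-up placeholder, and the Glue call is left as a comment because it only runs inside a Glue job where a GlueContext (called glue_context here, an assumed name) already exists.

```python
import json

# Glob patterns to exclude, kept as a normal Python list for readability.
exclude_patterns = ["**/*.txt", "**/*.csv"]

connection_options = {
    # Hypothetical path; replace with the real bucket and prefix.
    "paths": ["s3://example-bucket/example-prefix/"],
    "recurse": True,
    # Glue expects a *string* holding a JSON list, so serialize the list.
    "exclusions": json.dumps(exclude_patterns),
}

print(connection_options["exclusions"])

# Inside a Glue job, the options would then be passed along these lines
# (glue_context is assumed to be an existing GlueContext):
#
# dyf = glue_context.create_dynamic_frame_from_options(
#     connection_type="s3",
#     connection_options=connection_options,
#     format="json",
# )
```

Keeping the patterns in a list and serializing them at the last moment makes it easier to add or remove patterns without breaking the escaping.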