How to prevent AWS Glue crawler from reading wrong data types?


I am running an AWS Glue crawler on a CSV file. This CSV file has a string column that holds alphanumeric values, but the crawler is setting the data type for this column to INT (instead of string). This is causing my ETL to fail. Is there any way to force Glue to correct this? I don't want to define the schema manually in the crawler, as that defeats the purpose of automatic data cataloging.

CodePudding user response:

CSV is always a tricky format to handle, especially if all of your columns are strings: the crawler cannot differentiate between the header row and the data rows. To avoid this, use a Glue classifier. Create a classifier with the format set to CSV and Column headings set to Has headings, then add the classifier to the Glue crawler.
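If you prefer to do this with boto3 instead of the console, here is a minimal sketch of that classifier setup. The classifier name, delimiter, and quote symbol are placeholders to adjust for your data:

import boto3

glue = boto3.client("glue")

# Create a CSV classifier that tells the crawler the first row is a header,
# so it no longer tries to infer header presence from the data itself.
glue.create_classifier(
    CsvClassifier={
        "Name": "my-csv-classifier",  # placeholder name
        "Delimiter": ",",
        "QuoteSymbol": '"',
        "ContainsHeader": "PRESENT",  # "Column headings: Has headings" in the console
    }
)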

Make sure to delete the crawler and re-run it: a crawler will sometimes fail to pick up modifications made after it has already run.
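A hedged boto3 sketch of that delete-and-recreate step; the crawler name, IAM role, database, and S3 path are all placeholders for your own values:

import boto3

glue = boto3.client("glue")

# Drop the old crawler so no stale schema inference is carried over.
glue.delete_crawler(Name="my-crawler")

# Recreate it with the custom classifier attached; custom classifiers
# are tried before Glue's built-in ones.
glue.create_crawler(
    Name="my-crawler",                                       # placeholder
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",   # placeholder
    DatabaseName="my_database",                              # placeholder
    Targets={"S3Targets": [{"Path": "s3://my-bucket/my-prefix/"}]},  # placeholder
    Classifiers=["my-csv-classifier"],
)
glue.start_crawler(Name="my-crawler")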

CodePudding user response:

Crawler Schema Detection

During the first crawler run, the crawler reads either the first 1,000 records or the first megabyte of each file to infer the schema. The amount of data read depends on the file format and availability of a valid record.


From the Glue developer docs:

The header row must be sufficiently different from the data rows. To determine this, one or more of the rows must parse as other than STRING type. If all columns are of type STRING, then the first row of data is not sufficiently different from subsequent rows to be used as the header.

To bypass this, create a new CSV classifier, set Column headings to Has headings, and add the classifier to the crawler.

If the classifier still does not create your AWS Glue table the way you want, edit the table definition: adjust any inferred types to STRING, set the SchemaChangePolicy to LOG, and set the partitions output configuration to InheritFromTable for future crawler runs.
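A sketch of that last step with boto3, with the database, table, and crawler names as placeholders: fetch the table, coerce the inferred column types to string, write the definition back, then configure the crawler to log schema changes instead of applying them and to inherit the partition schema from the table:

import boto3

glue = boto3.client("glue")

DATABASE, TABLE, CRAWLER = "my_database", "my_table", "my-crawler"  # placeholders

# Fetch the current table definition and force every column to string.
table = glue.get_table(DatabaseName=DATABASE, Name=TABLE)["Table"]
for col in table["StorageDescriptor"]["Columns"]:
    col["Type"] = "string"

# update_table expects a TableInput, so drop the read-only fields
# that get_table returns before writing the definition back.
read_only = {
    "DatabaseName", "CreateTime", "UpdateTime", "CreatedBy",
    "IsRegisteredWithLakeFormation", "CatalogId", "VersionId",
}
table_input = {k: v for k, v in table.items() if k not in read_only}
glue.update_table(DatabaseName=DATABASE, TableInput=table_input)

# Future runs: log schema changes rather than applying them, and have
# partitions inherit their schema from the table.
glue.update_crawler(
    Name=CRAWLER,
    SchemaChangePolicy={"UpdateBehavior": "LOG", "DeleteBehavior": "LOG"},
    Configuration=(
        '{"Version":1.0,"CrawlerOutput":'
        '{"Partitions":{"AddOrUpdateBehavior":"InheritFromTable"}}}'
    ),
)

With this configuration the crawler keeps cataloging new data automatically, but your manual STRING types are no longer overwritten on each run.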
