Reading Dynamic Data Types from S3 with AWS Glue

Time:11-24

I have JSON stored in S3. Sometimes units is stored as a string, and sometimes as an integer. Unfortunately, this was a bug, and I now have billions of records with mismatched data types in the source JSON.

Example:

{
  "other_stuff": "stuff",
  "units": 2
}
{
  "other_stuff": "stuff",
  "units": "2"
}

I want to dynamically determine whether it's a string or an integer, and then load it into AWS Redshift as an integer.

If my mapping is ("units", "string", "units", "int"), only the string values are converted correctly. If I use ("units", "int", "units", "int"), it's the opposite: only the integer values work.

How do I dynamically cast the source field and always load it as an integer into Redshift? You can assume that all values are numeric and not null, and that the attribute is guaranteed to be present.

CodePudding user response:

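You can use the ResolveChoice transform from Glue (the resolveChoice method on a DynamicFrame). With the cast:int action, every value in the ambiguous units column is cast to an integer, whichever type it arrived as.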

# df must be a Glue DynamicFrame here, not a Spark DataFrame
resolved_choices = df.resolveChoice(
    specs=[
        ('units', 'cast:int')
    ]
)
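
To make the intended result concrete, here is a minimal plain-Python sketch of the same normalization applied to the two record shapes from the question. It is not Glue code and does not call any Glue API. The names normalize_units and sample_lines are made up for illustration, and it relies on the question's assumption that units is always present, non-null, and numeric (a non-numeric string would raise ValueError here).

```python
import json


def normalize_units(record: dict) -> dict:
    """Return a copy of the record with 'units' coerced to int.

    Hypothetical helper for illustration only; assumes 'units' is
    present and holds either an int or a numeric string.
    """
    fixed = dict(record)
    fixed["units"] = int(record["units"])  # handles both 2 and "2"
    return fixed


# Two sample records mirroring the question: one int, one string.
sample_lines = [
    '{"other_stuff": "stuff", "units": 2}',
    '{"other_stuff": "stuff", "units": "2"}',
]

normalized = [normalize_units(json.loads(line)) for line in sample_lines]
print(normalized)
```

Running something like this over a small sample pulled from S3 can be a quick way to confirm that every units value really is numeric before running the Glue job against the full data set.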