Home > Net >  AWS Athena query struct property in array
AWS Athena query struct property in array

Time:08-19

I have json files in S3 bucket generated by AWS Textract service and I'm using Athena to query data from those files. Every file has the same structure and I created a table in Athena where I have column "blocks" that is array of struct:

  "blocks": [{
      "BlockType": "LINE",
      "Id": "12345",
      "Text": "Text from document",
      "Confidence": 98.7022933959961,
      "Page": "1",
      "SourceLanguage": "de",
      "TargetLanguage": "en",
    },
    ...100  blocks]

How can I query just for the "Text" property from every block that has one?

Thanks in advance!

CodePudding user response:

It looks like column stores array of rows, so you can process it as one (array functions):

select transform(
     filter(block_column, t -> t.text is not null), 
     r => cast(row(r.text) as row(text varchar))) texts
from table

CodePudding user response:

I have defined a table with exact schema of yours using sample JSON provided.

_col0
#   
array(row(blocktype varchar, id varchar, text varchar, confidence double, page varchar, sourcelanguage varchar, targetlanguage varchar))

I have used unnest operator to flatten the array of blocks and fetch the Text column from it using below query:

select block.text from <table-name> CROSS JOIN UNNEST(blocks) as t(block)
  • Related