I have json files in S3 bucket generated by AWS Textract service and I'm using Athena to query data from those files. Every file has the same structure and I created a table in Athena where I have column "blocks" that is array of struct:
"blocks": [{
"BlockType": "LINE",
"Id": "12345",
"Text": "Text from document",
"Confidence": 98.7022933959961,
"Page": "1",
"SourceLanguage": "de",
"TargetLanguage": "en",
},
...100 blocks]
How can I query just for the "Text" property from every block that has one?
Thanks in advance!
CodePudding user response:
It looks like column stores array of rows, so you can process it as one (array functions):
select transform(
filter(block_column, t -> t.text is not null),
r => cast(row(r.text) as row(text varchar))) texts
from table
CodePudding user response:
I have defined a table with exact schema of yours using sample JSON provided.
_col0
#
array(row(blocktype varchar, id varchar, text varchar, confidence double, page varchar, sourcelanguage varchar, targetlanguage varchar))
I have used unnest operator to flatten the array of blocks and fetch the Text column from it using below query:
select block.text from <table-name> CROSS JOIN UNNEST(blocks) as t(block)