Home > Blockchain >  AWS Athena regexp_extract() broken
AWS Athena regexp_extract() broken

Time:10-04

I am using AWS Athena to extract some statistics from CloudWatch logs. However, attempting to use the Presto regexp_extract() is generating empty result sets even though the rexexp looks good according to online regexp testers.

The source cloudwatch log sample is as follows:

2021-10-04 00:10:56.201 INFO 10711 --- [io-5000-exec-31] au.com.crecy.VP4CStatistics : {"atlassianLicense" : {"key" : "visio-publisher-for-confluence","version" : "1.1.5-AC","state" : "ENABLED","installedDate" : 1619695028000,"lastUpdated" : 1632692975000,"license" : {"active" : true,"type" : "COMMERCIAL","evaluation" : false,"supportEntitlementNumber" : "SEN-0123456789"},"valid" : true,"host" : {"product" : "Confluence","contacts" : [ ]},"links" : {"marketplace" : [{"href" : "https://marketplace.atlassian.com/plugins/visio-publisher-for-confluence"}],"self" : [{"href" : "https://acme.atlassian.net/wiki/rest/atlassian-connect/1/addons/visio-publisher-for-confluence"}]}},"viewAttachments" : [{"height" : "1000","width" : "100%","scrolling" : "no","frameBorder" : "hide","url" : "/download/attachments/574160906/Foo.html.zip?version=22&modificationDate=1632311039065&cacheVersion=1&api=v2","space" : "VM","page" : 574160906,"id" : "att568885320","frameBorderStyle" : "border:none;"}],"durations" : {"1" : {"method" : "ModelGenAtlassianConnectPlugin.loadHtmlAttachment","startTime" : 2542004145837271,"endTime" : 2542005346331840,"durationMillis" : 1200,"durationNanos" : 1200494569},"2" : {"method" : "AtlassianHostRestClientsHelper.getLicense","startTime" : 2542004145845740,"endTime" : 2542004523777555,"durationMillis" : 377,"durationNanos" : 377931815},"3" : {"method" : "AtlassianHostRestClientsHelper.processJwt","startTime" : 2542004523813282,"endTime" : 2542004525757229,"durationMillis" : 1,"durationNanos" : 1943947},"4" : {"method" : "AttachmentLoaderHelper.loadAttachment","startTime" : 2542004525774026,"endTime" : 2542005346321184,"durationMillis" : 820,"durationNanos" : 820547158},"5" : {"method" : "AtlassianHostRestClientsHelper.getAttachments","startTime" : 2542004525784513,"endTime" : 2542004796450920,"durationMillis" : 270,"durationNanos" : 270666407},"6" : {"method" : "AtlassianHostRestClientsHelper.getCompressedPageSource","startTime" : 2542004796503557,"endTime" : 2542005341655641,"durationMillis" : 545,"durationNanos" : 545152084},"7" : {"method" : "ChecksumHelper.checksumValid","startTime" : 2542005341695482,"endTime" : 2542005341889382,"durationMillis" : 0,"durationNanos" : 193900},"8" : {"method" : "UnzipCompressionHelper.unzipCompressedPageSource","startTime" : 2542005341899585,"endTime" : 2542005346303984,"durationMillis" : 4,"durationNanos" : 4404399},"9" : {"method" : "VP4CResponseHeaderFilter.doFilter","startTime" : 2542008074147431,"endTime" : 2542008074514454,"durationMillis" : 0,"durationNanos" : 367023}},"uncompressedSize" : 520631,"compressedSize" : 48836}

the AWS Athena / presto query is as follows:

select regexp_extract(message, '(au.com.crecy.VP4CStatistics : )({.*}$)', 2)
FROM "VP4C_Statistics_Catalog"."/aws/elasticbeanstalk/vp4c-prod/var/log/web.stdout.log"."all_log_streams" 
where message LIKE '%visio-publisher-for-confluence%'
order by time desc

In short I want to extract the JSON payload at the end of the log message. The above query is generating empty result sets.

Thanks and regards, Andrew

CodePudding user response:

Note that {, }, and . are regex metacharacters, and probably need to be escaped via backslash.

SELECT REGEXP_EXTRACT(message, 'au\.com\.crecy\.VP4CStatistics : (\{.*\})$', 1)
FROM "VP4C_Statistics_Catalog"."/aws/elasticbeanstalk/vp4c-prod/var/log/web.stdout.log"."all_log_streams" 
WHERE message LIKE '%visio-publisher-for-confluence%'
ORDER BY time DESC;

CodePudding user response:

Ok, figured it out - the regexp wants a match for multiple spaces or tabs (even though the sample log appears to have a single space. The following pattern works:

select regexp_extract(message, '(au\.com\.crecy\.VP4CStatistics[ \t]*:[ \t]*)(\{.*\})', 2)
FROM "VP4C_Statistics_Catalog"."/aws/elasticbeanstalk/vp4c-prod/var/log/web.stdout.log"."all_log_streams" 
where message LIKE '%visio-publisher-for-confluence%'
order by time desc
  • Related