Home > database >  Filter only elements that matches a regex in Athena
Filter only elements that matches a regex in Athena

Time:07-18

I have a column in a table which is of string type and the column is , separated.

Sample Input: 'ASEIAW,1245555,asda2dd,TPOIBV'
Expected output: ['ASEIAW,TPOIBV'] - An array with all matching elements which is an alphabet in upper case with exactly 6 charterers.

What I tried;

select REGEXP_EXTRACT('ASEIOW,ASDWQB,TPOIBV,2' , '(\b[A-Z]{6,6}\b) ');
Output: ASEIOW

select REGEXP_LIKE('ASEIOW,ASDWQB,TPOIBV,2' , '(\b[A-Z]{6,6}\b) ');
Output: [v]

select REGEXP_SPLIT('ASEIOW,ASDWQB,TPOIBV,2' , '(\b[A-Z]{6,6}\b) ');
Output: ['',',',',',',2']

Using NOT in front of regex

select REGEXP_SPLIT('ASEIOW,ASDWQB,TPOIBV,2' , '^(\b[A-Z]{6,6}\b) ');
Output: ['',',ASDWQB,TPOIBV,2']

select REGEXP_REPLACE('ASEIOW,ASDWQB,TPOIBV,2' , '^(\b[A-Z]{6,6}\b) ');
Output: ,ASDWQB,TPOIBV,2

CodePudding user response:

You can use a REGEX_EXTRACT_ALL with a bit simplified regex:

REGEXP_EXTRACT_ALL('ASEIOW,ASDWQB,TPOIBV,2' , '\b([A-Z]{6})\b');

The REGEXP_EXTRACT_ALL function will extract all occurrences of the pattern matches and \b([A-Z]{6})\b just matches six letters enclosed with word boundaries, no need to specify the identical min and max values in range quantifiers. Nor do you need to additionally quantify the pattern.

CodePudding user response:

You can use REGEX_EXTRACT_ALL:

select regexp_extract_all('ASEIOW,ASDWQB,TPOIBV,2', '\b([A-Z]{6})\b');

Output:

_col0
[ASEIOW, ASDWQB, TPOIBV]
  • Related