Home > other >  find all occurrences of a regex as an array
find all occurrences of a regex as an array

Time:02-17

have the following string (it's a salesforce query, but not important):

IF(OR(CONTAINS(EmailDomain,"yahoo"),CONTAINS(EmailDomain,"gmail"),
CONTAINS("protonmail.com,att.net,chpmail.com,smail.com",EmailDomain)),
"Free Mail","Business Email")

and I want to get an array of all substrings that are encapsulated between double quotes like so:

['yahoo',
'gmail',
'protonmail.com,att.net,chpmail.com,smail.com',
'Free Mail',
'Business Email']

in python I do:

re.findall(r'"(. ?)"', <my string>)

but is there a way to replicate this in Snowflake?

I've tried

SELECT 
REGEXP_SUBSTR('IF(OR(CONTAINS(EmailDomain,"yahoo"),CONTAINS(EmailDomain,"gmail"),
CONTAINS("protonmail.com,att.net,chpmail.com,smail.com",EmailDomain)),
"Free Mail","Business Email")', '"(. ?)"') as emails;

but I get this:

"yahoo"),CONTAINS(EmailDomain,"gmail"

CodePudding user response:

You can use

select split(trim(regexp_replace(regexp_replace(col, '"([^"] )"|.', '\\1|'),'\\| ','|'), '|'), '|');

Details:

  • regexp_replace(col, '"([^"] )"|.', '\\1|') - finds any strings between the closest double quotes while capturing the part inside quotes into Group 1, or matching any single char and replaces each match with Group 1 contents | char (see the regex demo)
  • regexp_replace(...,'\\| ','|') - this shrinks all consecutive pipe symbols into a single occurrence of a | char (see this regex demo)
  • trim(..., '|') - removes | chars on both ends of the string
  • split(..., '|') - splits the string with a | char.

CodePudding user response:

Wiktor's answer works great. I'm adding an alternate answer for anyone who needs to do this and their quoted strings may contain the pipe | character. Using the replacement method on strings containing pipe(s) will split the string into more than one array member. Here's a way (not the only way) to do it that will work in case the quoted strings could potentially contain pipe characters:

set col = $$IF(OR(CONTAINS(EmailDomain,"yahoo"),CONTAINS(EmailDomain,"gmail"),CONTAINS("protonmail.com,att.net,chpmail.com,smail.com",EmailDomain)),"Free Mail","Business Email | Other")$$;

create or replace function GET_QUOTED_STRINGS("s" string)
returns array
language javascript
strict immutable
as
$$

var re = /(["'])(?:\\.|[^\\])*?\1/g;
var m;
var out = [];

do {
    m = re.exec(s);
    if (m) {
        out.push(m[0].replace(/['"] /g, ''));
    }
} while (m);

return out;

$$;

select get_quoted_strings($col);
  • Related