I receive text data in a field in Snowflake in the following format :
Phone number 99999999999 was flagged by: XXXXXXXXX
Phone number 99999999999 was unflagged by: XXXXXXXXX
i.e. the word between 'was' and 'by' can be either flagged or unflagged.
How do I extract the word between 'was' and 'by'
CodePudding user response:
If you strictly want to match the content between the words "was" and "by", you can do:
var str1 = "Phone number 99999999999 was flagged by: XXXXXXXXX";
var str2 = "Phone number 99999999999 was unflagged by: XXXXXXXXX";
var regex = /(?<=\Wwas\s). (?=\sby\W)/;
console.log(str1.match(regex)[0]);
console.log(str2.match(regex)[0]);
Explanation:
(?<=)
is look-behind which means: only match if preceded by what ever regex is after the?<=
.\W
matches any character that isn't a letter or a number, this is so that "way" can't just be the end of words like "highway", but could have punctuation directly before it.was
matches the literal string "was".\s
is any whitespace character..
matches any character (the.
) 1 or more times ((?=)
is look-ahead which, as you can imagine, means only match if the regex is proceeded by the regex after the?=
.
Or if you simply need to match "flagged" or "unflagged", you can do:
var str1 = "Phone number 99999999999 was flagged by: XXXXXXXXX";
var str2 = "Phone number 99999999999 was unflagged by: XXXXXXXXX";
var regex = /(un)?flagged/;
console.log(str1.match(regex)[0]);
console.log(str2.match(regex)[0]);
Explanation:
(un)
matches "un" in a character group.?
means match the previous match zero or one times. If we hadn't put "un" in parentheses, it would match a "u" then zero or one n's.flagged
just matches the string "flagged".
CodePudding user response:
Snowflake regular expression functions do not support lookarounds. If you want to use a lookahead or lookbehind, you'd need to use a JavaScript UDF. To use Snowflake SQL, you can do this (for any word between two other words):
set str = 'Phone number 99999999999 was flagged by: XXXXXXXXX';
select split(regexp_substr($str, 'was (.*) by'), ' ')[1]::string;
The expression will also grab the start and end words, so this splits the result by space into an array and uses only the second (zero based) word.
1 to tybocopperkettle for this part:
Or if you simply need to match "flagged" or "unflagged", you can do:
(un)?flagged
In Snowflake SQL, that would be like this:
select regexp_substr($str, '(un)?flagged');