Home > Software design >  Extract word between two words
Extract word between two words

Time:06-11

I receive text data in a field in Snowflake in the following format :

Phone number 99999999999 was flagged by: XXXXXXXXX

Phone number 99999999999 was unflagged by: XXXXXXXXX

i.e. the word between 'was' and 'by' can be either flagged or unflagged.

How do I extract the word between 'was' and 'by'

CodePudding user response:

If you strictly want to match the content between the words "was" and "by", you can do:

var str1 = "Phone number 99999999999 was flagged by: XXXXXXXXX";
var str2 = "Phone number 99999999999 was unflagged by: XXXXXXXXX";
var regex = /(?<=\Wwas\s). (?=\sby\W)/;
console.log(str1.match(regex)[0]);
console.log(str2.match(regex)[0]);

Explanation:

  • (?<=) is look-behind which means: only match if preceded by what ever regex is after the ?<=.
  • \W matches any character that isn't a letter or a number, this is so that "way" can't just be the end of words like "highway", but could have punctuation directly before it.
  • was matches the literal string "was".
  • \s is any whitespace character.
  • . matches any character (the .) 1 or more times ( ).
  • (?=) is look-ahead which, as you can imagine, means only match if the regex is proceeded by the regex after the ?=.

Or if you simply need to match "flagged" or "unflagged", you can do:

var str1 = "Phone number 99999999999 was flagged by: XXXXXXXXX";
var str2 = "Phone number 99999999999 was unflagged by: XXXXXXXXX";
var regex = /(un)?flagged/;
console.log(str1.match(regex)[0]);
console.log(str2.match(regex)[0]);

Explanation:

  • (un) matches "un" in a character group.
  • ? means match the previous match zero or one times. If we hadn't put "un" in parentheses, it would match a "u" then zero or one n's.
  • flagged just matches the string "flagged".

CodePudding user response:

Snowflake regular expression functions do not support lookarounds. If you want to use a lookahead or lookbehind, you'd need to use a JavaScript UDF. To use Snowflake SQL, you can do this (for any word between two other words):

set str = 'Phone number 99999999999 was flagged by: XXXXXXXXX';
select split(regexp_substr($str, 'was (.*) by'), ' ')[1]::string;

The expression will also grab the start and end words, so this splits the result by space into an array and uses only the second (zero based) word.

1 to tybocopperkettle for this part:

Or if you simply need to match "flagged" or "unflagged", you can do:

(un)?flagged

In Snowflake SQL, that would be like this:

select regexp_substr($str, '(un)?flagged');
  • Related