Home > Enterprise >  how to extract part of a sentence to the left of one or more keywords using bigquery regexp_extract
how to extract part of a sentence to the left of one or more keywords using bigquery regexp_extract

Time:09-09

I want to return the text to the left of department of, school of to return pomona college rutgers college

I am trying to get it to work where I can pass in multile options. ie split on "school of" or "department of" I have tried the | OR operator in the line before_words_or but I am doing something wrong.

regexp_extract(institute_name,r'^(.*)department of|school of',1,1)

If there is no "school of" or "department of" in the institute_name it should return the institute_name

with t1 as 
(
    select 'pomona college department of chemistry' institute_name,
    union all select 'rutgers college school of engineering chemical engineering' 
)

select 
regexp_extract(institute_name,r'^(.*)department of',1,1) before_words1,
regexp_extract(institute_name,r'^(.*)school of',1,1) before_words2,
regexp_extract(institute_name,r'^(.*)department of|school of',1,1) before_words_or
from t1;

Actual result

before_words1 before_words2 before_words_or
pomona college pomona college
rutgers college

Desired result

before_words1 before_words2 before_words_or
pomona college pomona college department of chemistry' institute_name pomona college
rutgers college school of engineering chemical engineering rutgers college rutgers college

CodePudding user response:

You can use

r'^(.*?)\s*(?:department|school)\s of'

Or, with a word boundary to make sure of is matched as a whole word:

r'^(.*?)\s*(?:department|school)\s of\b'

See the regex demo.

Details:

  • ^ - start of string
  • (.*?) - Capturing group: any zero or more chars other than line break chars as few as possible
  • \s* - zero or more whitespaces
  • (?:department|school) - a non-capturing group that matches either department or (|) school
  • \s - one or more whitespaces
  • of - an of word
  • \b - a word boundary.
  • Related