Home > Software design >  Regex to match some words before and after a word gives capturing group error on BigQuery
Regex to match some words before and after a word gives capturing group error on BigQuery

Time:09-23

I have a regex that matches 5 words before and after the word university. But I wanted to change this regex so it may work for multiple spells of the word university like Universidad or universita.

The regex which I'm using is ((?:(?:\b[a-zA-Z_'] \b)[ ]*){0,5}.?universidad.?(?:(?:\b[a-zA-Z_'] \b)[ ]*){0,5})

I use this on BigQuery and it works as expected.

But when I add one more group, ((?:(?:\b[a-zA-Z_'] \b)[ ]*){0,5}.?(universidad|university|universidad).?(?:(?:\b[a-zA-Z_'] \b)[ ]*){0,5}) This regex shows error - Regular expressions passed into extraction functions must not have more than 1 capturing group on BigQuery.

The query which I'm using is

with t1 as (
  select "a a a a a my university hospital of besancon" as entity_name
  union all select "universidad de madrid"
  union all select "universidad de madrid some word"
  union all select "university de dublin usa ireland world word"
  union all select "bacon university"
)
select regexp_extract(
  entity_name, r"((?:(?:\b[a-zA-Z_'] \b)[ ]*){0,5}.?universidad.?(?:(?:\b[a-zA-Z_'] \b)[ ]*){0,5})"
) from t1

The expected results will be no nulls in all rows.

If you currently replace universidad with universita or university then you can see all rows can be populated. But I'm finding a correct way to accumulate all types of regex into one, which I think can be possible by using a group.

Any idea, how this can be resolved?

If anything is unclear, let me know.

CodePudding user response:

Just turn off the capture group for the alternation for the various spellings of university:

SELECT REGEXP_EXTRACT(
    entity_name,
    r"((?:(?:\b[a-zA-Z_'] \b)[ ]*){0,5}.?(?:universidad|university|universidad).?(?:(?:\b[a-zA-Z_'] \b)[ ]*){0,5})"
)
FROM t1;

You already have a capture group for the words prior to "university," so you can't add another capture group in the call to REGEXP_EXTRACT().

  • Related