I have a regex that matches 5 words before and after the word university
. But I wanted to change this regex so it may work for multiple spells of the word university
like Universidad
or universita
.
The regex which I'm using is
((?:(?:\b[a-zA-Z_'] \b)[ ]*){0,5}.?universidad.?(?:(?:\b[a-zA-Z_'] \b)[ ]*){0,5})
I use this on BigQuery and it works as expected.
But when I add one more group,
((?:(?:\b[a-zA-Z_'] \b)[ ]*){0,5}.?(universidad|university|universidad).?(?:(?:\b[a-zA-Z_'] \b)[ ]*){0,5})
This regex shows error -
Regular expressions passed into extraction functions must not have more than 1 capturing group
on BigQuery.
The query which I'm using is
with t1 as (
select "a a a a a my university hospital of besancon" as entity_name
union all select "universidad de madrid"
union all select "universidad de madrid some word"
union all select "university de dublin usa ireland world word"
union all select "bacon university"
)
select regexp_extract(
entity_name, r"((?:(?:\b[a-zA-Z_'] \b)[ ]*){0,5}.?universidad.?(?:(?:\b[a-zA-Z_'] \b)[ ]*){0,5})"
) from t1
The expected results will be no nulls in all rows.
If you currently replace universidad
with universita
or university
then you can see all rows can be populated. But I'm finding a correct way to accumulate all types of regex into one, which I think can be possible by using a group.
Any idea, how this can be resolved?
If anything is unclear, let me know.
CodePudding user response:
Just turn off the capture group for the alternation for the various spellings of university:
SELECT REGEXP_EXTRACT(
entity_name,
r"((?:(?:\b[a-zA-Z_'] \b)[ ]*){0,5}.?(?:universidad|university|universidad).?(?:(?:\b[a-zA-Z_'] \b)[ ]*){0,5})"
)
FROM t1;
You already have a capture group for the words prior to "university," so you can't add another capture group in the call to REGEXP_EXTRACT()
.