Home > front end >  Regex Contain in BigQuery
Regex Contain in BigQuery

Time:11-02

Let's say I have this name table:

select 'anya taylor' name union all
select 'anya' name union all
select 'shawn' name union all
select 'shawn mendes' name

I want to categorize this data:

select 'shawn mendes' col1 union all
select 'shawn the sheep' col1 union all
select 'anya stuart' col1 union all
select 'anya taylor joy' col1

Into the name table data. I have this code:

WITH nm as (select 'anya taylor' name union all
select 'anya' name union all
select 'shawn' name union all
select 'shawn mendes' name),

dt as (select 'shawn mendes' col1 union all
select 'shawn the sheep' col1 union all
select 'anya stuart' col1 union all
select 'anya taylor joy' col1)

select
name,
col1
from nm,dt
where REGEXP_CONTAINS(col1, CONCAT(r'(?i)', name, r'')) 

But I end up with doubled row because anya taylor joy both contains anya taylor and anya. Is there any way I could use regex and match to the longest expression? I'm new to regex and would appreciate the help.

CodePudding user response:

You could use a window function to only select the rows with the longest value for col 1, like this:

select
  name,
  col1
from nm,dt
where REGEXP_CONTAINS(col1, CONCAT(r'(?i)', name, r'')) 
QUALIFY ROW_NUMBER() OVER(PARTITION BY name ORDER BY length(col1) DESC) = 1
  • Related