Home > Back-end >  BigQuery function to remove words in string by lookup table
BigQuery function to remove words in string by lookup table

Time:01-11

Given a string, I want to create a function to remove any word/word phrase in the string if it exists in a lookup table.

For example, given a string s1 s2 s3 s4 s5 s6 and a lookup table

word
s2
s4 s5

Expected result:

select fn.remove_substring('s1 s2 s3 s4 s5 s6')

-- Expected output: 's1  s3  s6'

In PostgreSQL, I actually have a working function implemented, however, I am not sure how to rewrite it in BigQuery, as BigQuery UDF does not allow cursor and looping.

CREATE OR REPLACE FUNCTION fn.remove_substring(s text)
 RETURNS text
 LANGUAGE plpgsql
AS $function$
declare
    replaced_string text := s;
    t_cur cursor for
        select word from public.lookup order by word desc;
begin
    for row in t_cur loop
        replaced_string := regexp_replace(replaced_string, '\y'||row.word||'\y', '', 'gi');
    end loop;
    return replaced_string;
end;
$function$
;

CodePudding user response:

You might consider below.

WITH sample_table AS (
  SELECT 's1 s2 S3 S4 s5 s6' str
),
lookup_table AS (
  SELECT 's2' word UNION ALL
  SELECT 's4 s5'
)
SELECT str, 
       REGEXP_REPLACE(
         str, (SELECT '(?i)(' || STRING_AGG(word, '|' ORDER BY LENGTH(word) DESC) || ')' FROM lookup_table), ''
       ) AS removed_str
  FROM sample_table;

Query results

enter image description here

If implemented in an UDF,

CREATE TEMP TABLE lookup_table AS 
  SELECT 's2' word UNION ALL
  SELECT 's4 s5'
;

CREATE TEMP FUNCTION remove_substring(str STRING) AS (
  REGEXP_REPLACE(
    str, (SELECT '(?i)(' || STRING_AGG(word, '|' ORDER BY LENGTH(word) DESC) || ')' FROM lookup_table), ''
  )
);

SELECT remove_substring('s1 s2 s3 s4 s5 s6');

CodePudding user response:

Using the same approach of @jaytiger . However you can also create the regular expression by executing the select string_agg only once, this way if your lookup table is too big you don’t have to execute the same query for each row.
Example:

declare regex String default '';
 
create temp table main AS (
 select 's1 s2 s3 s4 s5 s6' str
);
create temp table lookup_table AS (
 select 's2' word union all
 select 's4' union all
 select 's5'
);
 
set regex = ( select string_agg(word, '|' order by length(word) desc) from lookup_table ) ;
 
select regexp_replace(str, regex, '')new_str from main;

Output:

enter image description here

N.B above query is case sensitive.You should modify it for your requirement.

  • Related