Home > other >  How to extract a substring of a URL with regex in snowflake
How to extract a substring of a URL with regex in snowflake

Time:11-08

I need some help with RegEx. I have a URL:

article/life/food/woman-who-called-911-over-pink-bbq-tells-her-side-of-the-story/275-88a8538f-4c62-4204-aed8-4381d618208d

This URL is not fixed and this changes every time. I only need to extract woman-who-called-911-over-pink-bbq-tells-her-side-of-the-story and store it in table.

Here's my attempt:

select parse_url(c1:uri):path as PATH, 
       count(1) as CNT 
from "OPE_RAW_DATA"."TAGGER_DEFAULT"."INTERACTION_20221107" 
where c1:uri like'%www.kens5.com%'
group by PATH
order by CNT desc

CodePudding user response:

One approach would be to use a JavaScript UDF to split the string into an array and find the longest one:

create or replace function PARSE_LONGEST_TEXT_PATH(URL string)
returns string
language javascript
strict immutable
as
$$
    let a = URL.split('/');
    let longest = a.reduce(
        function (a, b) {
            return a.length > b.length ? a : b;
        });
    return longest;
$$;

select parse_longest_text_path
    ('article/life/food/woman-who-called-911-over-pink-bbq-tells-her-side-of-the-story/275-88a8538f-4c62-4204-aed8-4381d618208d');

This doesn't handle the case if a the longest string is a hex value, but if you need that I can modify the code.

CodePudding user response:

If you want to get the longest string between 3rd and 4th item, you can use something like this:

IFF( length(split(c1,'/')[2]) > length(split(c1,'/')[3]), split(c1,'/')[2], split(c1,'/')[3] )

For example:

select IFF( length(split(c1,'/')[2]) > length(split(c1,'/')[3]), split(c1,'/')[2], split(c1,'/')[3] )
from values 
('article/life/food/woman-who-called-911-over-pink-bbq-tells-her-side-of-the-story/275-88a8538f-4c62-4204-aed8-4381d618208d'),
('article/life/woman-who-called-911-over-pink-bbq-tells-her-side-of-the-story/275-88a8538f-4c62-4204-aed8-4381d618208d')
tmp(c1);

CodePudding user response:

I suggest flattening the url so you have more control over the criteria for parsing the topic.

set str='article/life/food/woman-who-called-911-over-pink-bbq-tells-her-side-of-the-story/275-88a8538f-4c62-4204-aed8-4381d618208d';

select distinct 
       url,
       first_value(b.value) over (partition by url order by length(translate(b.value,'0123456789-','')) desc) as topic --pick the one that's longest after removing numbers and hyphens
from (select $str as url) a, 
     lateral split_to_table(url,'/') b
where b.index between 3 and 4 --change as needed
  • Related