I need some help with RegEx. I have a URL:
article/life/food/woman-who-called-911-over-pink-bbq-tells-her-side-of-the-story/275-88a8538f-4c62-4204-aed8-4381d618208d
This URL is not fixed and this changes every time. I only need to extract woman-who-called-911-over-pink-bbq-tells-her-side-of-the-story
and store it in table.
Here's my attempt:
select parse_url(c1:uri):path as PATH,
count(1) as CNT
from "OPE_RAW_DATA"."TAGGER_DEFAULT"."INTERACTION_20221107"
where c1:uri like'%www.kens5.com%'
group by PATH
order by CNT desc
CodePudding user response:
One approach would be to use a JavaScript UDF to split the string into an array and find the longest one:
create or replace function PARSE_LONGEST_TEXT_PATH(URL string)
returns string
language javascript
strict immutable
as
$$
let a = URL.split('/');
let longest = a.reduce(
function (a, b) {
return a.length > b.length ? a : b;
});
return longest;
$$;
select parse_longest_text_path
('article/life/food/woman-who-called-911-over-pink-bbq-tells-her-side-of-the-story/275-88a8538f-4c62-4204-aed8-4381d618208d');
This doesn't handle the case if a the longest string is a hex value, but if you need that I can modify the code.
CodePudding user response:
If you want to get the longest string between 3rd and 4th item, you can use something like this:
IFF( length(split(c1,'/')[2]) > length(split(c1,'/')[3]), split(c1,'/')[2], split(c1,'/')[3] )
For example:
select IFF( length(split(c1,'/')[2]) > length(split(c1,'/')[3]), split(c1,'/')[2], split(c1,'/')[3] )
from values
('article/life/food/woman-who-called-911-over-pink-bbq-tells-her-side-of-the-story/275-88a8538f-4c62-4204-aed8-4381d618208d'),
('article/life/woman-who-called-911-over-pink-bbq-tells-her-side-of-the-story/275-88a8538f-4c62-4204-aed8-4381d618208d')
tmp(c1);
CodePudding user response:
I suggest flattening the url so you have more control over the criteria for parsing the topic.
set str='article/life/food/woman-who-called-911-over-pink-bbq-tells-her-side-of-the-story/275-88a8538f-4c62-4204-aed8-4381d618208d';
select distinct
url,
first_value(b.value) over (partition by url order by length(translate(b.value,'0123456789-','')) desc) as topic --pick the one that's longest after removing numbers and hyphens
from (select $str as url) a,
lateral split_to_table(url,'/') b
where b.index between 3 and 4 --change as needed