I'm working in BigQuery and I have some URLs, let's say they looks like this:
URL
https://www.newssite.com/news/biden-rail-strike/12345/UTM=company-Library/
https://www.newssite.com/news/news-about-today-exiting/55690/
https://www.nytimes.com/2022/11/29/us/politics/biden-rail-strike.html
https://www.differentnewssite.com/news/news-about-yesterday-exiting/55690/UTM=facebook
https://www.newssite.com/opinion/opinion-about-today-sad-insightful/578932/
I want to extract the path section of it into a different column so it looks like this:
Path
news/biden-rail-strike/12345/
news/news-about-today-exiting/55690/
2022/11/29/us/politics/biden-rail-strike.html
news/news-about-yesterday-exiting/55690/
opinion/opinion-about-today-sad-insightful/578932
I've tried options with variations on select url, Regexp_extract(st_destination_url,'regex') as path
and have also played with splitting the URL as well, but haven't landed on a solution. Any thoughts?
CodePudding user response:
You might consider below as well
WITH sample_data AS (
select 'https://www.newssite.com/news/biden-rail-strike/12345/UTM=company-Library/' url union all
select 'https://www.newssite.com/news/news-about-today-exiting/55690/' url union all
select 'https://www.nytimes.com/2022/11/29/us/politics/biden-rail-strike.html' url union all
select 'https://www.differentnewssite.com/news/news-about-yesterday-exiting/55690/UTM=facebook&UTM=company-Library' url union all
select 'http://www.newssite.com/opinion/opinion-about-today-sad-insightful/578932/' url
)
SELECT REPLACE(REGEXP_REPLACE(url, r'(https?:\/\/|\w =[\w-] [\/\&]?)', ''), NET.HOST(url) || '/', '') Path
FROM sample_data;
Query results
CodePudding user response:
Consider approach below:
with sample_data as (
select 'https://www.newssite.com/news/biden-rail-strike/12345/UTM=company-Library/' as url
union all select 'https://www.newssite.com/news/news-about-today-exiting/55690/' as url
union all select 'https://www.nytimes.com/2022/11/29/us/politics/biden-rail-strike.html' as url
union all select 'https://www.differentnewssite.com/news/news-about-yesterday-exiting/55690/UTM=facebook' as url
union all select 'https://www.newssite.com/opinion/opinion-about-today-sad-insightful/578932/' as url
),
remove_host as (
select
url,
regexp_replace(right(url,length(url)-length(concat('https://',NET.HOST(url)))),r'(/)$','') as trimmed,
from sample_data
),
with_split as (
select
url,
trimmed,
split(trimmed,'/') splitted,
from remove_host
)
select
url,
if(regexp_contains(trimmed,r'=')=true,
array_to_string(array(
select * except(offset)
from with_split.splitted with offset
where offset < array_length(with_split.splitted) - 1
),'/'),trimmed
) as path
from with_split
Output: