Home > front end >  Extracting path from url in BigQuery with regex
Extracting path from url in BigQuery with regex

Time:11-30

I'm working in BigQuery and I have some URLs, let's say they looks like this:

URL
https://www.newssite.com/news/biden-rail-strike/12345/UTM=company-Library/
https://www.newssite.com/news/news-about-today-exiting/55690/
https://www.nytimes.com/2022/11/29/us/politics/biden-rail-strike.html
https://www.differentnewssite.com/news/news-about-yesterday-exiting/55690/UTM=facebook
https://www.newssite.com/opinion/opinion-about-today-sad-insightful/578932/

I want to extract the path section of it into a different column so it looks like this:

Path
news/biden-rail-strike/12345/
news/news-about-today-exiting/55690/
2022/11/29/us/politics/biden-rail-strike.html
news/news-about-yesterday-exiting/55690/
opinion/opinion-about-today-sad-insightful/578932

I've tried options with variations on select url, Regexp_extract(st_destination_url,'regex') as path and have also played with splitting the URL as well, but haven't landed on a solution. Any thoughts?

CodePudding user response:

You might consider below as well

WITH sample_data AS (
  select 'https://www.newssite.com/news/biden-rail-strike/12345/UTM=company-Library/' url union all
  select 'https://www.newssite.com/news/news-about-today-exiting/55690/' url union all
  select 'https://www.nytimes.com/2022/11/29/us/politics/biden-rail-strike.html' url union all
  select 'https://www.differentnewssite.com/news/news-about-yesterday-exiting/55690/UTM=facebook&UTM=company-Library' url union all
  select 'http://www.newssite.com/opinion/opinion-about-today-sad-insightful/578932/' url
)
SELECT REPLACE(REGEXP_REPLACE(url, r'(https?:\/\/|\w =[\w-] [\/\&]?)', ''), NET.HOST(url) || '/', '') Path
  FROM sample_data;

Query results

enter image description here

CodePudding user response:

Consider approach below:

with sample_data as (
select 'https://www.newssite.com/news/biden-rail-strike/12345/UTM=company-Library/' as url
union all select 'https://www.newssite.com/news/news-about-today-exiting/55690/' as url
union all select 'https://www.nytimes.com/2022/11/29/us/politics/biden-rail-strike.html' as url
union all select 'https://www.differentnewssite.com/news/news-about-yesterday-exiting/55690/UTM=facebook' as url
union all select 'https://www.newssite.com/opinion/opinion-about-today-sad-insightful/578932/' as url
),

remove_host as (
  select 
    url,
    regexp_replace(right(url,length(url)-length(concat('https://',NET.HOST(url)))),r'(/)$','') as trimmed,
  from sample_data
),
with_split as (
  select 
    url,
    trimmed,
    split(trimmed,'/') splitted,
  from remove_host
)

select 
url,
if(regexp_contains(trimmed,r'=')=true,
  array_to_string(array(
    select * except(offset) 
    from with_split.splitted with offset
    where offset < array_length(with_split.splitted) - 1
  ),'/'),trimmed 
) as path

from with_split

Output:

enter image description here

  • Related