I have a process that is scraping news articles. My headline field is a bit messy. Since the scraper works every hour, I have may have duplicate articles at different times of the day.
The issue I have is that one headline source captures the length of time that an article has been live. I need to dynamically remove these to accurately pull all unique articles from the table.
Clowns have taken over the world Author's Name 3h
The elapsed time component may look like any of the following examples:
3h, 25m, 50s, etc...
How do I dynamically remove the elapsed time from the headline
field? I can't find any questions or examples related to this.
select
date
, source
, lean
, headline --This is the issue field
from
headline_table
CodePudding user response:
You can use regular expressions to do it. Example for you:
with tbl as (
select 'Clowns have taken over the world Authors Name 3h' as ptext
union all
select 'My test My Test 25m' as ptext
union all
select 'test test test 32s test test' as ptext
)
select
ptext,
REGEXP_REPLACE(ptext, '(?=[^ ]*[a-z])(?=[^ ]*[0-9])[^ ] ', '', 'g') as newtext
from tbl
-- Result:
ptext newtext
---------------------------------------------------------------------------------------------------------------
Clowns have taken over the world Authors Name 3h Clowns have taken over the world Authors Name
My test My Test 25m My test My Test
test test test 32s test test test test test test test
P.S.
This script will be remove any characters which begining numbers. For example: 412ds
, 6521t
, 25hhgc
and etc.