Home > Back-end >  Removing similar instances of text within a text field in PostgreSQL (dynamic)
Removing similar instances of text within a text field in PostgreSQL (dynamic)

Time:04-05

I have a process that is scraping news articles. My headline field is a bit messy. Since the scraper works every hour, I have may have duplicate articles at different times of the day.

The issue I have is that one headline source captures the length of time that an article has been live. I need to dynamically remove these to accurately pull all unique articles from the table.

Clowns have taken over the world    Author's Name   3h

The elapsed time component may look like any of the following examples:

3h, 25m, 50s, etc...

How do I dynamically remove the elapsed time from the headline field? I can't find any questions or examples related to this.

select
   date
   , source
   , lean
   , headline --This is the issue field
from 
headline_table

CodePudding user response:

You can use regular expressions to do it. Example for you:

with tbl as ( 
    select 'Clowns have taken over the world    Authors Name   3h' as ptext
    union all 
    select 'My test My Test 25m'  as ptext
    union all 
    select 'test test test 32s test test'  as ptext
)
select
    ptext, 
    REGEXP_REPLACE(ptext, '(?=[^ ]*[a-z])(?=[^ ]*[0-9])[^ ] ', '', 'g') as newtext 
from tbl

-- Result: 
ptext                                                       newtext
---------------------------------------------------------------------------------------------------------------
Clowns have taken over the world    Authors Name   3h       Clowns have taken over the world    Authors Name   
My test My Test 25m                                         My test My Test 
test test test 32s test test                                test test test  test test

P.S. This script will be remove any characters which begining numbers. For example: 412ds, 6521t, 25hhgc and etc.

  • Related