Home > Software engineering >  Removing a list of keywords from string
Removing a list of keywords from string

Time:10-24

I have a data column that contains a blob of text. For example, 'John Doe lives in Emerald Street'. I would like to remove a list of texts from the string. I have some list like this (lives, in, street). I would like to select a new data column converting the text to 'John Doe Emerald'.

CodePudding user response:

First you can create an AGGREGATE function based on the replace function :

CREATE OR REPLACE FUNCTION replace(x text, y text, old_txt text, new_txt text)
RETURNS text LANGUAGE sql IMMUTABLE AS
$$
    SELECT replace(COALESCE(x,y), old_txt, new_txt)
$$ ;

DROP AGGREGATE IF EXISTS replace_agg (text, text, text) CASCADE ;
CREATE AGGREGATE replace_agg (text, text, text)
(
  sfunc = replace
, stype = text
);

Then you can iterate on the replace_agg function within a SELECT statement :

SELECT replace_agg (your_table.your_column, k.keyword, '')
FROM (SELECT 'John Doe lives in Emerald Street' as your_column) as your_table
CROSS JOIN (SELECT unnest(array['lives', 'in', 'Street']) as keyword) as k

with the following result :

'John Doe   Emerald'

CodePudding user response:

Assuming the simple case:

  • Words are delimited by a single space character - in the table and in the replacement string.
  • No punctuation like in natural language. No leading or trailing noise.
  • Case sensitive matching.
  • Remove all matches (as opposed to just the first).

And a table like:

CREATE TABLE strings(id serial PRIMARY KEY, string text);

INSERT INTO strings(string) VALUES
  ('John Doe lives in Emerald Street')
, ('John Doe lives in Emerald Street as john DOE');

A short solution with regexp_replace():

SELECT *, rtrim(regexp_replace(string, '(John|Doe|Emerald) ?', '', 'g')) FROM strings;

| separates alternative branches in a regular expression.

Related:

Or, with your original replacement string as input:

SELECT *, rtrim(regexp_replace(string, '(' || replace('John Doe Emerald', ' ', '|') || ') ?', '', 'g')) FROM strings;

Regular expressions are typically expensive. This may be faster (minimal form):

SELECT s.id, string_agg(word, ' ') AS string2
FROM   strings s, unnest(string_to_array(s.string, ' ')) word
WHERE  word <> ALL (string_to_array('John Doe Emerald', ' '))
GROUP  BY 1
ORDER  BY 1;

To avoid any ambiguity and make sure to preserve original order:

SELECT s.id, string_agg(word, ' ' ORDER BY ord) AS string2
FROM   strings s, unnest(string_to_array(s.string, ' ')) WITH ORDINALITY AS t(word, ord)
WHERE  t.word <> ALL (string_to_array('John Doe Emerald', ' '))
GROUP  BY 1
ORDER  BY 1;

See:

Which is typically faster with ORDER BY in a separate subquery:

SELECT sub.id, string_agg(sub.word, ' ') AS string2
FROM  (
   SELECT s.id, t.word
   FROM   strings s, unnest(string_to_array(s.string, ' ')) WITH ORDINALITY AS t(word, ord)
   WHERE  t.word <> ALL (string_to_array('John Doe Emerald', ' '))
   ORDER  BY s.id, t.ord
   ) sub
GROUP  BY 1
ORDER  BY 1;

db<>fiddle here

  • Related