Removing a list of keywords from string-CodePudding

I have a data column that contains a blob of text. For example, 'John Doe lives in Emerald Street'. I would like to remove a list of texts from the string. I have some list like this (lives, in, street). I would like to select a new data column converting the text to 'John Doe Emerald'.

CodePudding user response：

First you can create an AGGREGATE function based on the replace function :

CREATE OR REPLACE FUNCTION replace(x text, y text, old_txt text, new_txt text)
RETURNS text LANGUAGE sql IMMUTABLE AS
$$
    SELECT replace(COALESCE(x,y), old_txt, new_txt)
$$ ;

DROP AGGREGATE IF EXISTS replace_agg (text, text, text) CASCADE ;
CREATE AGGREGATE replace_agg (text, text, text)
(
  sfunc = replace
, stype = text
);

Then you can iterate on the replace_agg function within a SELECT statement :

SELECT replace_agg (your_table.your_column, k.keyword, '')
FROM (SELECT 'John Doe lives in Emerald Street' as your_column) as your_table
CROSS JOIN (SELECT unnest(array['lives', 'in', 'Street']) as keyword) as k

with the following result :

'John Doe   Emerald'

CodePudding user response：

Assuming the simple case:

Words are delimited by a single space character - in the table and in the replacement string.
No punctuation like in natural language. No leading or trailing noise.
Case sensitive matching.
Remove all matches (as opposed to just the first).

And a table like:

CREATE TABLE strings(id serial PRIMARY KEY, string text);

INSERT INTO strings(string) VALUES
  ('John Doe lives in Emerald Street')
, ('John Doe lives in Emerald Street as john DOE');

A short solution with regexp_replace():

SELECT *, rtrim(regexp_replace(string, '(John|Doe|Emerald) ?', '', 'g')) FROM strings;

| separates alternative branches in a regular expression.

Trim trailing spaces with PostgreSQL

Or, with your original replacement string as input:

SELECT *, rtrim(regexp_replace(string, '(' || replace('John Doe Emerald', ' ', '|') || ') ?', '', 'g')) FROM strings;

Regular expressions are typically expensive. This may be faster (minimal form):

SELECT s.id, string_agg(word, ' ') AS string2
FROM   strings s, unnest(string_to_array(s.string, ' ')) word
WHERE  word <> ALL (string_to_array('John Doe Emerald', ' '))
GROUP  BY 1
ORDER  BY 1;

To avoid any ambiguity and make sure to preserve original order:

SELECT s.id, string_agg(word, ' ' ORDER BY ord) AS string2
FROM   strings s, unnest(string_to_array(s.string, ' ')) WITH ORDINALITY AS t(word, ord)
WHERE  t.word <> ALL (string_to_array('John Doe Emerald', ' '))
GROUP  BY 1
ORDER  BY 1;

See:

PostgreSQL unnest() with element number

Which is typically faster with ORDER BY in a separate subquery:

SELECT sub.id, string_agg(sub.word, ' ') AS string2
FROM  (
   SELECT s.id, t.word
   FROM   strings s, unnest(string_to_array(s.string, ' ')) WITH ORDINALITY AS t(word, ord)
   WHERE  t.word <> ALL (string_to_array('John Doe Emerald', ' '))
   ORDER  BY s.id, t.ord
   ) sub
GROUP  BY 1
ORDER  BY 1;

db<>fiddle here