I have a data column that contains a blob of text. For example, 'John Doe lives in Emerald Street'. I would like to remove a list of texts from the string. I have some list like this (lives, in, street). I would like to select a new data column converting the text to 'John Doe Emerald'.
CodePudding user response:
First you can create an AGGREGATE
function based on the replace
function :
CREATE OR REPLACE FUNCTION replace(x text, y text, old_txt text, new_txt text)
RETURNS text LANGUAGE sql IMMUTABLE AS
$$
SELECT replace(COALESCE(x,y), old_txt, new_txt)
$$ ;
DROP AGGREGATE IF EXISTS replace_agg (text, text, text) CASCADE ;
CREATE AGGREGATE replace_agg (text, text, text)
(
sfunc = replace
, stype = text
);
Then you can iterate on the replace_agg
function within a SELECT statement :
SELECT replace_agg (your_table.your_column, k.keyword, '')
FROM (SELECT 'John Doe lives in Emerald Street' as your_column) as your_table
CROSS JOIN (SELECT unnest(array['lives', 'in', 'Street']) as keyword) as k
with the following result :
'John Doe Emerald'
CodePudding user response:
Assuming the simple case:
- Words are delimited by a single space character - in the table and in the replacement string.
- No punctuation like in natural language. No leading or trailing noise.
- Case sensitive matching.
- Remove all matches (as opposed to just the first).
And a table like:
CREATE TABLE strings(id serial PRIMARY KEY, string text);
INSERT INTO strings(string) VALUES
('John Doe lives in Emerald Street')
, ('John Doe lives in Emerald Street as john DOE');
A short solution with regexp_replace()
:
SELECT *, rtrim(regexp_replace(string, '(John|Doe|Emerald) ?', '', 'g')) FROM strings;
|
separates alternative branches in a regular expression.
Related:
Or, with your original replacement string as input:
SELECT *, rtrim(regexp_replace(string, '(' || replace('John Doe Emerald', ' ', '|') || ') ?', '', 'g')) FROM strings;
Regular expressions are typically expensive. This may be faster (minimal form):
SELECT s.id, string_agg(word, ' ') AS string2
FROM strings s, unnest(string_to_array(s.string, ' ')) word
WHERE word <> ALL (string_to_array('John Doe Emerald', ' '))
GROUP BY 1
ORDER BY 1;
To avoid any ambiguity and make sure to preserve original order:
SELECT s.id, string_agg(word, ' ' ORDER BY ord) AS string2
FROM strings s, unnest(string_to_array(s.string, ' ')) WITH ORDINALITY AS t(word, ord)
WHERE t.word <> ALL (string_to_array('John Doe Emerald', ' '))
GROUP BY 1
ORDER BY 1;
See:
Which is typically faster with ORDER BY
in a separate subquery:
SELECT sub.id, string_agg(sub.word, ' ') AS string2
FROM (
SELECT s.id, t.word
FROM strings s, unnest(string_to_array(s.string, ' ')) WITH ORDINALITY AS t(word, ord)
WHERE t.word <> ALL (string_to_array('John Doe Emerald', ' '))
ORDER BY s.id, t.ord
) sub
GROUP BY 1
ORDER BY 1;
db<>fiddle here