I'm looking for some tips to get this PostgreSQL plpgsql function to work. I would like to do a couple of regex's on each record and update only one time as doing them one by one takes 15 min. ...
CREATE OR REPLACE FUNCTION clean_column() RETURNS void AS $$
DECLARE
r record;
reg text[] := array[
['search','replace'],
['search','replace'],
['search','replace'],
['search','replace']
];
var text[];
tmp text;
BEGIN
for r in
select column from mytable
loop -- loop over all records
tmp = r;
FOREACH var SLICE 1 IN ARRAY reg
LOOP -- loop over all changes
tmp = regexp_replace(tmp,var[1],var[2]);
END LOOP;
UPDATE mytable SET r = tmp;
end loop;
END
$$ LANGUAGE plpgsql;
... there is a problem with r as it is not assigned. Probably my lack of understanding of how plpgsql works.
Maybe there is an other way to do multiple changes on a record field?
CodePudding user response:
Your function would update the row repeatedly, writing a new row version every time. That would still be hugely inefficient.
The point must be to update every row only once. And only if anything actually changes.
CREATE OR REPLACE FUNCTION clean_column(INOUT _text text)
LANGUAGE plpgsql IMMUTABLE STRICT PARALLEL SAFE AS
$func$
DECLARE
_reg CONSTANT text[] := ARRAY[
['search','replace']
, ['search','replace']
, ['search','replace']];
_var text[];
BEGIN
FOREACH _var SLICE 1 IN ARRAY _reg
LOOP
_text := regexp_replace(_text, _var[1], _var[2]);
END LOOP;
END
$func$;
The function does not run the UPDATE
itself, just the string processing. Use that function in your UPDATE
like so:
UPDATE tbl
SET col = clean_column(col)
WHERE col IS DISTINCT FROM clean_column(col) -- ① !
AND col IS NOT NULL -- ② ?
① Skip updates that would not change anything.
② Skip rows with NULL
early (without even evaluating the function). Only relevant if column can be NULL
, of course.
Performance will differ by orders of magnitude.