Do multiple regex changes in one update with the help of plpgsql-CodePudding

I'm looking for some tips to get this PostgreSQL plpgsql function to work. I would like to do a couple of regex's on each record and update only one time as doing them one by one takes 15 min. ...

CREATE OR REPLACE FUNCTION clean_column() RETURNS void  AS $$
    DECLARE
        r record;
        reg text[] := array[ 
          ['search','replace'], 
          ['search','replace'],
          ['search','replace'],
          ['search','replace'] 
        ];
        var text[];
        tmp text;
    BEGIN
        for r in 
            select column from mytable
            loop -- loop over all records
                tmp = r;
                FOREACH var SLICE 1 IN ARRAY reg
                LOOP -- loop over all changes
                    tmp = regexp_replace(tmp,var[1],var[2]);
                END LOOP;
                UPDATE mytable SET r = tmp;
        end loop;
    END
$$ LANGUAGE plpgsql;

... there is a problem with r as it is not assigned. Probably my lack of understanding of how plpgsql works.

Maybe there is an other way to do multiple changes on a record field?

CodePudding user response：

Your function would update the row repeatedly, writing a new row version every time. That would still be hugely inefficient.

The point must be to update every row only once. And only if anything actually changes.

CREATE OR REPLACE FUNCTION clean_column(INOUT _text text)
  LANGUAGE plpgsql IMMUTABLE STRICT PARALLEL SAFE AS
$func$
DECLARE
   _reg CONSTANT text[] := ARRAY[ 
       ['search','replace']
     , ['search','replace']
     , ['search','replace']];
   _var text[];
BEGIN
   FOREACH _var SLICE 1 IN ARRAY _reg
   LOOP
      _text := regexp_replace(_text, _var[1], _var[2]);
   END LOOP;
END
$func$;

The function does not run the UPDATE itself, just the string processing. Use that function in your UPDATE like so:

UPDATE tbl
SET    col = clean_column(col)
WHERE  col IS DISTINCT FROM clean_column(col)  -- ① !
AND    col IS NOT NULL  -- ② ?

① Skip updates that would not change anything.

② Skip rows with NULL early (without even evaluating the function). Only relevant if column can be NULL, of course.

Performance will differ by orders of magnitude.