Apply function to all columns in a Postgres table dynamically-CodePudding

Using Postgres 13.1, I want to apply a forward fill function to all columns of a table. The forward fill function is explained in my earlier question:

How to do forward fill as a PL/PGSQL function

However, in that case the columns and table are specified. I want to take that code and apply it to an arbitrary table, ie. specify a table and the forward fill is applied to each of the columns.

Using this table as an example:

CREATE TABLE example(row_num int, id int, str text, val integer);
INSERT INTO example VALUES
  (1, 1, '1a', NULL)
, (2, 1, NULL,    1)
, (3, 2, '2a',    2)
, (4, 2, NULL, NULL)
, (5, 3, NULL, NULL)
, (6, 3, '3a',   31)
, (7, 3, NULL, NULL)
, (8, 3, NULL,   32)
, (9, 3, '3b', NULL)
, (10,3, NULL, NULL)
;

I start with the following working base for the function. I call it passing in some variable names. Note the first is a table name not a column name. The function takes the table name and creates an array of all the column names and then outputs the names.

create or replace function col_collect(tbl text, id text, row_num text)
    returns text[]
    language plpgsql as
$func$
declare
    tmp text[];
    col text;
begin
    select array (
            select column_name
            from information_schema."columns" c
            where table_name = tbl
            ) into tmp;
    foreach col in array tmp
    loop
        raise notice 'col: %', col;
    end loop;
    return tmp;
end
$func$;

I want to apply the "forward fill" function I got from my earlier question to each column of a table. UPDATE seems to be the correct approach. So this is the preceding function where I replace raise notice by an update using execute so I can pass in the table name:

create or replace function col_collect(tbl text, id text, row_num text)
    returns void
    language plpgsql as
$func$
declare
    tmp text[];
    col text;
begin
    select array (
            select column_name
            from information_schema."columns" c
            where table_name = tbl
            ) into tmp;
    foreach col in array tmp
    loop
        execute 'update '||tbl||' 
                set '||col||' = gapfill('||col||') OVER w AS '||col||' 
                where '||tbl||'.row_num = '||col||'.row_num
                window w as (PARTITION BY '||id||' ORDER BY '||row_num||') 
                returning *;';
    end loop;
end
$func$;

-- call the function
select col_collect('example','id','row_num')

The preceding errors out with a syntax error. I have tried many variations on this but they all fail. Helpful answers on SO were here and here. The aggregate function I'm trying to apply (as window function) is:

CREATE OR REPLACE FUNCTION gap_fill_internal(s anyelement, v anyelement)
  RETURNS anyelement
  LANGUAGE plpgsql AS
$func$
BEGIN
RETURN COALESCE(v, s);  -- that's all!
END
$func$;

CREATE AGGREGATE gap_fill(anyelement) ( 
  SFUNC = gap_fill_internal, 
  STYPE = anyelement 
);

My questions are:

is this a good approach and if so what am I doing wrong; or
is there a better way to do this?

CodePudding user response：

What you ask is not a trivial task. You should be comfortable with PL/pgSQL. I do not advise this kind of dynamic SQL queries for beginners, too powerful.

That said, let's dive in. Buckle up!

CREATE OR REPLACE FUNCTION f_gap_fill_update(_tbl regclass, _id text, _row_num text, OUT nullable_columns int, OUT updated_rows int)
  LANGUAGE plpgsql AS
$func$
DECLARE
   _pk  text  := quote_ident(_row_num);
   _sql text;
BEGIN   
   SELECT INTO _sql, nullable_columns
          concat_ws(E'\n'
          , 'UPDATE ' || _tbl || ' t'
          , 'SET   (' || string_agg(        quote_ident(a.attname), ', ') || ')'
          , '    = (' || string_agg('u.' || quote_ident(a.attname), ', ') || ')'
          , 'FROM  (' 
          , '   SELECT ' || _pk
          , '        , ' || string_agg(format('gap_fill(%1$I) OVER w AS %1$I', a.attname), ', ')
          , '   FROM   ' || _tbl
          , format('   WINDOW w AS (PARTITION BY %I ORDER BY %s)', _id, _pk)
          , '   ) u'
          , format('WHERE t.%1$s = u.%1$s', _pk)
          , 'AND  (' || string_agg('t.' || quote_ident(a.attname), ', ') || ') IS DISTINCT FROM'
          , '     (' || string_agg('u.' || quote_ident(a.attname), ', ') || ')'
          )
        , count(*) -- AS _col_ct
   FROM  (
      SELECT a.attname
      FROM   pg_attribute a
      WHERE  a.attrelid = _tbl
      AND    a.attnum > 0
      AND    NOT a.attisdropped
      AND    NOT a.attnotnull
      ORDER  BY a.attnum
      ) a;

   IF nullable_columns = 0 THEN
      RAISE EXCEPTION 'No nullable columns found in table >>%<<', _tbl;
   ELSIF _sql IS NULL THEN
      RAISE EXCEPTION 'SQL string is NULL. Should not occur!';
   END IF;
   
   -- RAISE NOTICE '%', _sql;       -- debug
   EXECUTE _sql;              -- execute
   GET DIAGNOSTICS updated_rows = ROW_COUNT; 
END
$func$;

Example call:

SELECT * FROM f_gap_fill_update('example', 'id', 'row_num');

db<>fiddle here

The function is state of the art. Generates and executes a query of the form:

UPDATE tbl t
SET   (str, val, col1)
    = (u.str, u.val, u.col1)
FROM  (
   SELECT row_num
        , gap_fill(str) OVER w AS str, gap_fill(val) OVER w AS val
        , gap_fill(col1) OVER w AS col1
   FROM   tbl
   WINDOW w AS (PARTITION BY id ORDER BY row_num)
   ) u
WHERE t.row_num = u.row_num
AND  (t.str, t.val, t.col1) IS DISTINCT FROM
     (u.str, u.val, u.col1)

Using pg_catalog.pg_attribute instead of the information schema. See:

"Information schema vs. system catalogs"

Note the final WHERE clause to prevent (possibly expensive) empty updates. Only rows that actually change will be written. See:

How do I (or can I) SELECT DISTINCT on multiple columns?

Moreover, only nullable columns (not defined NOT NULL) will even be considered to avoid unnecessary work.

Using ROW syntax in UPDATE to keep the code simple. See:

SQL update fields of one table from fields of another one

The function returns two integer values: nullable_columns and updated_rows, reporting what the name suggests.

The function defends against SQL injection properly. See:

About GET DIAGNOSTICS:

Calculate number of rows affected by batch query in PostgreSQL

The above function updates, but does not return rows. Here is a basic demo how to return rows of varying type:

CREATE OR REPLACE FUNCTION f_gap_fill_select(_tbl_type anyelement, _id text, _row_num text)
  RETURNS SETOF anyelement
  LANGUAGE plpgsql AS
$func$
DECLARE
   _tbl regclass := pg_typeof(_tbl_type)::text::regclass;
   _sql text;
BEGIN   
   SELECT INTO _sql
          'SELECT ' || string_agg(CASE WHEN a.attnotnull
                                  THEN format('%I', a.attname)
                                  ELSE format('gap_fill(%1$I) OVER w AS %1$I', a.attname) END
                                , ', ' ORDER BY a.attnum)
        || E'\nFROM ' || _tbl
        || format(E'\nWINDOW w AS (PARTITION BY %I ORDER BY %I)', _id, _row_num)
   FROM   pg_attribute a
   WHERE  a.attrelid = _tbl
   AND    a.attnum > 0
   AND    NOT a.attisdropped;
   
   IF _sql IS NULL THEN
      RAISE EXCEPTION 'SQL string is NULL. Should not occur!';
   END IF;

   RETURN QUERY EXECUTE _sql;
   -- RAISE NOTICE '%', _sql;       -- debug
END
$func$;

Call (note special syntax!):

SELECT * FROM f_gap_fill_select(NULL::example, 'id', 'row_num');

db<>fiddle here

About returning a polymorphic row type:

Refactor a PL/pgSQL function to return the output of various SELECT queries