Home > Blockchain >  Running query against all schemas in Postgres
Running query against all schemas in Postgres

Time:12-10

Context

We do schema-based multi-tenancy in our Postgres DB. Each and every schema is associated to a different tenant and will have the exact same structure except for one schema named public.

To obtain the list of all the relevant schemas, we can use:

SELECT tenant_schema FROM public.tenant_schema_mappings;

Problem

We need to periodically clean up a specific table for all the tenants:

DELETE FROM a_tenant_schema.parent WHERE expiration_date_time < NOW(); 

(The actual query is a bit more complex because we need to also delete the linked children entries associated with the parent, but let us keep things simple for the sake of this question.)

Constraints

  1. We cannot use pg_cron because our DB server is hosted on Azure, and that extension isn't supported yet.
  2. We do not want to deploy an entire service/application just to execute a cron job.
  3. We have thus decided to use a CronJob pod deployed within our k8s namespace and can thus communicate directly with the DB using the psql client via shell commands.

Question

What would be the best way to execute the given DELETE statement against all of the relevant schemas using psql in the shell?

Please keep in mind: since there may potentially be hundreds of tenants, it could be interesting to run the clean up queries for each tenant in parallel.

Current potential solutions

So far there seems to mostly be 2 approaches that could be interesting (although I'm not quite sure how to parallelize the query execution):

  1. Figure out how to do everything from within a single Stored Procedure and simply call that SP using psql -c.
  2. Use psql -c "SELECT tenant_schema FROM public.tenant_schema_mappings;" to gather the list of all relevant tenant schemas, and then use shell commands to iterate through that list by dynamically constructing the appropriate queries. With the result set of queries, run them all one by one using psql -c.

Other partial solution

I've figured we can actually construct the queries using the following SQL:

SELECT 'DELETE * FROM ' || tenant_schema || '.parent WHERE expiration_date_time < NOW();' AS query
FROM public.tenant_schema_mappings;

Maybe there would be a way to tell Postgres to execute all of the resulting strings?

CodePudding user response:

You can define a Postgres procedure that uses dynamic commands, e.g.:

create or replace procedure clear_tenants()
language plpgsql as $function$
declare
    tenant text;
begin
    for tenant in
        select tenant_schema 
        from public.tenant_schema_mappings
    loop
        execute format($ex$
            delete from %I.parent 
            where expiration_date_time < now()
            $ex$, tenant);
    end loop;
end
$function$;

Then all your cron task has to do is to call the procedure:

call clear_tenants()

In Postgres 10 or earlier use a function or do block instead of procedure.


The main disadvantage of this simple solution is that all the stuff is executed in a single transaction. Unfortunately, you cannot control transactions in procedures containing dynamic queries. I would define chunk_number in the table describing schemas and call the procedure for each chunk in its own transaction.

create or replace procedure public.clear_tenants(chunk integer)
language plpgsql as $function$
declare
    tenant text;
begin
    for tenant in
        select tenant_schema 
        from public.tenant_schema_mappings
        where chunk_number = chunk
    loop
        execute format($ex$
            delete from %I.parent 
            where expiration_date_time < now()
            $ex$, tenant);
    end loop;
end
$function$;

On the client side I would have to prepare a script in the format:

-- in psql the procedures will be executed in separate transactions
-- if you do not use begin; explicitly
call clear_tenants(1);
call clear_tenants(2);
call clear_tenants(3);
...

or execute many instances of psql for individual chunks (each in a separate connection). The last option is practically the only way to enforce concurrency. It is, of course, limited by a reasonable number of concurrent connections.


The following function emits notices with number of deleted rows from each tenant and returns the total number of deleted rows for a chunk:

create or replace function public.clear_tenants_modified(chunk integer)
returns bigint language plpgsql as $function$
declare
    tenant text;
    deleted_rows bigint;
    total_deleted_rows bigint = 0;
begin
    for tenant in
        select tenant_schema 
        from public.tenant_schema_mappings
        where chunk_number = chunk
    loop
        execute format($ex$
            with delete_statement as (
                delete from %I.parent 
                where expiration_date_time < now()
                returning 1 as x)
            select count(x)
            from delete_statement
            $ex$, tenant)
        into deleted_rows;
        raise notice '%: %', tenant, deleted_rows;
        total_deleted_rows = total_deleted_rows  deleted_rows;
    end loop;
    return total_deleted_rows;
end
$function$;

select clear_tenants_modified(1);
  • Related