Postgres - Performance of select for large jsonb column


We are using the Postgres jsonb type in one of our database tables. The table structure is shown below:

CREATE TABLE IF NOT EXISTS public.draft_document (
    id bigserial NOT NULL PRIMARY KEY,
    ...
    document jsonb NOT NULL,
    ein_search character varying(11) NOT NULL
);

CREATE INDEX IF NOT EXISTS count_draft_document_idx ON public.draft_document USING btree (ein_search);
CREATE INDEX IF NOT EXISTS read_draft_document_idx ON public.draft_document USING btree (id, ein_search);

The JSON structure of the document column may vary. Below is one example of a possible schema for document:

"withholdingCredit": {  
    "type": "array",
    "items": {
        "$ref": "#/definitions/withholding"
    }
}

Where the withholding structure (the array elements) is defined as:

"withholding": {
    "properties": {
        ...
        "proportionalityIndicator": {
            "type": "boolean"
        },
        "tribute": {
            "$ref": "#/definitions/tribute"
        },
        "payingSourceEin": {
            "type": "string"
        },
        "value": {
            "type": "number"
        }
        ...
    }
    ...
},      
"tribute": {
    "type": "object",
    "properties": {
        "code": {
            "type": "number"
        },
        "additionalCode": {
            "type": "number"
        }
        ...
    }
}

Here is an example of the JSON stored in the document jsonb column:

{
   "withholdingCredit":[
      {
         "value": 15000,
         "tribute":{
            "code": 1216,
            "additionalCode": 2
         },
         "payingSourceEin": "03985506123132",
         "proportionalityIndicator": false
      },
      ...
      {
         "value": 98150,
         "tribute":{
            "code": 3155,
            "additionalCode": 1
         },
         "payingSourceEin": "04185506123163",
         "proportionalityIndicator": false
      }
   ]
}

The number of elements in the array can vary, up to a limit of 100,000 (one hundred thousand) elements. This is a business limit.

We need a paged select query that returns the withholding array disaggregated (one element per row), where each row also carries the sum of the withholding values and the array length. The query also needs to return the withholdings ordered by proportionalityIndicator, tribute->code, tribute->additionalCode and payingSourceEin. Something like:

id      sum         jsonb_array_length   jsonb_array_elements
30900   1,800,027   2300                 {"value":15000,"tribute":{"code":1216,...}, ...}
...     ...         ...                  { ... }
30900   1,800,027   2300                 {"value":98150,"tribute":{"code":3155,...}, ...}

We have defined the following query:

SELECT dft.id,
    SUM((elem->>'value')::NUMERIC),
    jsonb_array_length(dft.document->'withholdingCredit'),
    jsonb_array_elements(jsonb_agg(elem
        ORDER BY
            elem->>'proportionalityIndicator',
            (elem->'tribute'->>'code')::NUMERIC,
            (elem->'tribute'->>'additionalCode')::NUMERIC,
            elem->>'payingSourceEin'))
FROM
    draft_document dft
    CROSS JOIN LATERAL jsonb_array_elements(dft.document->'withholdingCredit') arr(elem)
WHERE (dft.document->'withholdingCredit') IS NOT NULL
    AND dft.id = :id
    AND dft.ein_search = :ein_search
GROUP BY dft.id
LIMIT :limit OFFSET :offset;

This query works, but it hits performance limitations when the jsonb array contains a large number of elements. Any suggestion on how to improve it is welcome.

BTW, we are using Postgres 9.6.

CodePudding user response:

Your weird query, which breaks the array apart, aggregates it, and breaks it apart again, does seem to trigger a pathological memory-management issue in PostgreSQL (tested on 15dev). Maybe you should file a bug report on that.

But you can avoid the problem by breaking the array apart only once. You then need window functions so that the totals you want (the sum and the element count) cover all rows, including those removed by the OFFSET and LIMIT.

SELECT dft.id,
    SUM((elem->>'value')::NUMERIC) OVER (),
    count(*) OVER (),
    elem
FROM
    draft_document dft
    CROSS JOIN LATERAL jsonb_array_elements(dft.document->'withholdingCredit') arr(elem)
WHERE (dft.document->'withholdingCredit') IS NOT NULL
    AND dft.id = 4
    AND dft.ein_search = '4'
ORDER BY
    elem->>'proportionalityIndicator',
    (elem->'tribute'->>'code')::NUMERIC,
    (elem->'tribute'->>'additionalCode')::NUMERIC,
    elem->>'payingSourceEin'
LIMIT 4 OFFSET 500;

In my hands this gives the same answer as your query, but takes 370 ms rather than 13,789 ms.
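If you want to reproduce the comparison on your own data, one option (a sketch, assuming the test data from the end of this answer is loaded) is to prefix each of the two queries with EXPLAIN (ANALYZE, BUFFERS) and compare the reported execution times. A minimal example of the form:

-- EXPLAIN ANALYZE executes the statement and reports its actual run time;
-- BUFFERS adds I/O counters. Prefix both queries above the same way to compare them.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM draft_document dft
     CROSS JOIN LATERAL jsonb_array_elements(dft.document->'withholdingCredit') arr(elem)
WHERE dft.id = 4
  AND dft.ein_search = '4';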

At offsets higher than 500, my query still works, while yours leads to a total lock-up requiring a hard reset.

If anyone wants to reproduce the poor behavior, I generated the data by:

-- Inserts one row with id = 4, ein_search = '4', and a withholdingCredit array
-- of 100,000 randomly generated elements.
insert into draft_document
select 4,
       jsonb_build_object('withholdingCredit',
           jsonb_agg(jsonb_build_object(
               'value', floor(random()*99999)::int,
               'tribute', '{"code": 1216, "additionalCode": 2}'::jsonb,
               'payingSourceEin', floor(random()*99999999)::int,
               'proportionalityIndicator', false))),
       '4'
from generate_series(1,100000)
group by 1,3;
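
A quick sanity check after loading (just a convenience, not part of the original setup) is to confirm the generated array size:

-- Should report one row with an array length of 100000.
SELECT id, ein_search, jsonb_array_length(document->'withholdingCredit') AS array_length
FROM draft_document
WHERE id = 4;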