Optimization Strategies to perform aggregate on 4 Bil records


So I have a table of 4.7 billion records on which I want to perform a GROUP BY count in Postgres using pgAdmin 4.

Obviously this is going to take a lot of time, and I want to speed the process up as much as possible.

Example query

UPDATE Target_Table TT
   SET qty = AA.cnt
  FROM (SELECT col_1, COUNT(col_1) AS cnt
          FROM Very_Large_Table
         GROUP BY col_1) AS AA
 WHERE AA.col_1 = TT.col_1
   AND AA.cnt <> TT.qty;

I have freshly created and analyzed indexes on the column col_1, yet the process still takes 2 hours.

I tried a parallel hint by adding /*+ PARALLEL (very_large_table 6) */ to the SELECT, but it seems a different syntax is required, as the explain plan still shows only 2 workers.

Partitioning the table is not really an option.

Any help is greatly appreciated as I am out of ideas now. This is choking the system and other applications are getting impacted.

Edit: Thanks everyone for all the help, but I am looking for ideas to mitigate the problem, as I am quite sure nothing I write or change directly in pgAdmin will help me here.

CodePudding user response:

Sometimes the database gives us no built-in functionality or capability to solve a problem, and we have to think of a logical workaround instead. For example, suppose we need the number of records in a table. If we don't need an exact count and an approximate count is enough, we can get very high performance by reading that count from the information schema (or the database's catalog statistics). So, while calculating the count of a table with 4 billion records might take 1-5 minutes, the statistics give it to us in about 1 millisecond.
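In Postgres specifically, for example, the approximate count lives in the pg_class catalog. A minimal sketch (reltuples is the planner's estimate, kept current by VACUUM/ANALYZE, so it is approximate by design):

-- Approximate row count from planner statistics; effectively instant
-- even on a 4.7 billion row table, but not exact.
SELECT reltuples::bigint AS approx_rows
  FROM pg_class
 WHERE relname = 'very_large_table';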

Now, the information schema will not help with the problem you describe, because it only gives the record count of the entire table; the grouping prevents us from using it here. But you can use a materialized view if you don't need an exactly current count. You can refresh that materialized view every night on a schedule and query it during the day.
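A minimal sketch of that idea, assuming the view name is yours to choose and the nightly refresh comes from whatever scheduler you already use (cron, pg_cron, etc.):

-- Precompute the per-col_1 counts once.
CREATE MATERIALIZED VIEW col_1_counts AS
SELECT col_1, COUNT(*) AS cnt
  FROM Very_Large_Table
 GROUP BY col_1;

-- A unique index allows REFRESH ... CONCURRENTLY, so readers are not
-- blocked while the nightly refresh runs.
CREATE UNIQUE INDEX ON col_1_counts (col_1);

-- Run this from the nightly schedule:
REFRESH MATERIALIZED VIEW CONCURRENTLY col_1_counts;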

Again, after getting to know the issue in depth and the business logic in detail, we could think of different alternative solutions. All I can say is that in every DBMS, COUNT is a slow-running operation on very large tables.

CodePudding user response:

From your question, I guess you need to do queries on your Target_Table where each row of your result set shows how many rows from Very_Large_Table have the same value of col_1 as the present row. Your plan seems to be populating your qty column with that number.

With respect, your present approach is impractical.

  • As you have discovered, it takes too long to populate qty.
  • The dbms write-locks your tables while you populate qty; otherwise insertions, updates and deletions would interfere with the UPDATE query in your question. The locks interfere with other workloads on your dbms.
  • Your qty values will become stale between runs of your query. Therefore you must treat them as approximate values in your application.

There are other ways to generate the qty value you need, when querying. For example, this retrieves your qty at the time of your query.

SELECT TT.*, COALESCE(q.qty, 0) AS qty
  FROM Target_Table TT
  LEFT JOIN (
             SELECT COUNT(*) AS qty,
                    col_1
               FROM Very_Large_Table
              GROUP BY col_1
       ) q ON TT.col_1 = q.col_1
 WHERE something;

It seems very likely that you always, or almost always, use a WHERE something filter clause when querying. I suspect it's very rare to query the whole table. The presence of a filter, any filter, reduces the number of col_1 values for which you need qty. If the filter is selective enough it reduces that number to a workable value. This approach has the benefit that qty is always up to date.

This index accelerates the query I propose.

CREATE INDEX col_1 ON Very_Large_Table (col_1);
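
With that index in place, a correlated-subquery variant is another way to exploit a selective filter, since it counts rows only for the col_1 values that survive the filter. The status column and its value here are purely hypothetical, standing in for your real WHERE something:

SELECT TT.*,
       (SELECT COUNT(*)
          FROM Very_Large_Table VLT
         WHERE VLT.col_1 = TT.col_1) AS qty
  FROM Target_Table TT
 WHERE TT.status = 'active';   -- hypothetical filter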

There's another, more elaborate, approach. If you decide to go this direction, give it a try and ask another question if you need help.

  • Create a separate table Col_1_Counts_From_Very_Large_Table with col_1 and qty columns. Put indexes on (col_1, qty) and (qty, col_1).

  • Populate it, during some system downtime, with a query like the one in your question.

  • Create triggers for insert, update, and delete on your Very_Large_Table. Have those triggers update the qty values in the appropriate rows in Col_1_Counts_From_Very_Large_Table. This keeps your counts table up-to-date whenever somebody changes rows in Very_Large_Table.

  • Then when you run your query do it like this:

    SELECT TT.*, COALESCE(q.qty, 0) AS qty
       FROM Target_Table TT
       LEFT JOIN Col_1_Counts_From_Very_Large_Table q ON TT.col_1 = q.col_1
      WHERE something;
    

This approach will be accurate and fast at query time, at the cost of trigger complexity and small slowdowns on inserts, updates, and deletes to your very large table. It also lets you do things like find MAX(qty) or ORDER BY qty DESC without dimming the lights in your data center.
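
If you want to see how those pieces fit together, here is a minimal sketch assuming PostgreSQL 11+ trigger syntax; the object names and the bigint column type are illustrative, so substitute your real names and types:

-- 1. The counts table. The unique key on col_1 is required for the
--    ON CONFLICT below; the (qty, col_1) index supports MAX(qty) and
--    ORDER BY qty queries.
CREATE TABLE Col_1_Counts_From_Very_Large_Table (
    col_1 bigint PRIMARY KEY,   -- use col_1's real data type
    qty   bigint NOT NULL
);
CREATE INDEX ON Col_1_Counts_From_Very_Large_Table (qty, col_1);

-- 2. One-off population during downtime.
INSERT INTO Col_1_Counts_From_Very_Large_Table (col_1, qty)
SELECT col_1, COUNT(*)
  FROM Very_Large_Table
 GROUP BY col_1;

-- 3. Keep the counts current as Very_Large_Table changes.
CREATE OR REPLACE FUNCTION maintain_col_1_counts() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        INSERT INTO Col_1_Counts_From_Very_Large_Table AS c (col_1, qty)
        VALUES (NEW.col_1, 1)
        ON CONFLICT (col_1) DO UPDATE SET qty = c.qty + 1;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE Col_1_Counts_From_Very_Large_Table
           SET qty = qty - 1
         WHERE col_1 = OLD.col_1;
    ELSIF NEW.col_1 IS DISTINCT FROM OLD.col_1 THEN   -- UPDATE moving a row to another group
        UPDATE Col_1_Counts_From_Very_Large_Table
           SET qty = qty - 1
         WHERE col_1 = OLD.col_1;
        INSERT INTO Col_1_Counts_From_Very_Large_Table AS c (col_1, qty)
        VALUES (NEW.col_1, 1)
        ON CONFLICT (col_1) DO UPDATE SET qty = c.qty + 1;
    END IF;
    RETURN NULL;   -- AFTER row trigger: the return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER very_large_table_counts
AFTER INSERT OR UPDATE OR DELETE ON Very_Large_Table
FOR EACH ROW EXECUTE FUNCTION maintain_col_1_counts();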
