PostgreSQL query takes too long to run on a big dataset


I have a token_count table:

docid | tokenid | tf | log_ave_tf
------+---------+----+-----------
    1 |       1 |  1 | null
    1 |       2 |  2 | null
    2 |       1 |  3 | null
    2 |       2 |  1 | null

ALTER TABLE token_count
ADD COLUMN log_ave_tf real;

I am trying to calculate the values for the log_ave_tf column; the formula is as follows:

log_ave_tf = (1 + log(tf)) / (1 + log(average tf for document))
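
For example, for docid 1 in the sample above the average tf is 1.5, so the row with tf = 2 should get (1 + log2(2)) / (1 + log2(1.5)) ≈ 2 / 1.585 ≈ 1.26.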

Here's the code I'm running:

UPDATE token_count tc
SET log_ave_tf = (1 + log(2, tf)) / (1 + log(2, subquery.avg_tf))
FROM (
    SELECT docid, avg(tf) as avg_tf
    FROM token_count
    GROUP BY docid
) subquery
WHERE subquery.docid = tc.docid;

It took about a minute and a half on a 1,000-document dataset. When I tried it on a 100,000-document dataset (36 million rows in token_count), I had to cancel the query after 5 hours, and I need it to work on a 4-million-document dataset. Is there a way to optimize this query so it doesn't take so much time?

explain (analyze, buffers, format text) output for the 1000-document dataset:

"Update on token_count tc  (cost=37563.77..92185.42 rows=1128913 width=94) (actual time=89287.844..89287.847 rows=0 loops=1)"
"  Buffers: shared hit=2319962 read=13056 dirtied=17040 written=922"
"  ->  Hash Join  (cost=37563.77..92185.42 rows=1128913 width=94) (actual time=768.179..83652.020 rows=1128913 loops=1)"
"        Hash Cond: (tc.docid = subquery.docid)"
"        Buffers: shared hit=32402 read=8796 dirtied=1 written=922"
"        ->  Seq Scan on token_count tc  (cost=0.00..31888.13 rows=1128913 width=30) (actual time=0.089..702.652 rows=1128913 loops=1)"
"              Buffers: shared hit=16206 read=4393 written=922"
"        ->  Hash  (cost=37552.67..37552.67 rows=888 width=96) (actual time=767.982..767.983 rows=1001 loops=1)"
"              Buckets: 1024  Batches: 1  Memory Usage: 93kB"
"              Buffers: shared hit=16196 read=4403 dirtied=1"
"              ->  Subquery Scan on subquery  (cost=37532.69..37552.67 rows=888 width=96) (actual time=766.111..767.517 rows=1001 loops=1)"
"                    Buffers: shared hit=16196 read=4403 dirtied=1"
"                    ->  HashAggregate  (cost=37532.69..37543.79 rows=888 width=36) (actual time=766.105..767.119 rows=1001 loops=1)"
"                          Group Key: token_count.docid"
"                          Batches: 1  Memory Usage: 321kB"
"                          Buffers: shared hit=16196 read=4403 dirtied=1"
"                          ->  Seq Scan on token_count  (cost=0.00..31888.13 rows=1128913 width=8) (actual time=0.010..231.895 rows=1128913 loops=1)"
"                                Buffers: shared hit=16196 read=4403 dirtied=1"
"Planning Time: 0.222 ms"
"Execution Time: 89288.014 ms"

CodePudding user response:

This is a classic backfill case, which is best handled by creating a new table rather than by an in-place update. The prerequisite is that you have 2x free space in the tablespace where you would be doing this.

The whole operation should be done in multiple rounds of 3 steps each. First, create the temporary tables:

create temp table tf_avg_tab_temp
(
  docid int,
  avg_tf real
);

create temp table tf_avg_log_tab_temp
(
  docid int,
  log_avg_tf real
);

Original table:

create table token_count
(
  docid int,
  tokenid int,
  tf int
);

Final table:

create table token_count_final
(
  docid int,
  tokenid int,
  tf int,
  log_avg_tf real
);

Each round should cover a range of docids, say 1 million:

Step 1: Load the table tf_avg_tab_temp.

insert into tf_avg_tab_temp
select docid, avg(tf) as avg_tf
from token_count
where docid between 1 and 1000000
group by docid;

[Note: To speed this step up, you can create an index on docid, as shown below.]
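
A minimal sketch of such an index (the index name here is just an example); building it once on the original table lets each round read only its docid range instead of scanning the whole table:

create index if not exists token_count_docid_idx on token_count (docid);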

Step 2: Using the averages from step 1, load tf_avg_log_tab_temp with the per-document denominator of the formula, 1 + log(2, avg_tf):

-- avg_tf is real, so cast to numeric for the two-argument log()
insert into tf_avg_log_tab_temp
select docid, 1 + log(2, avg_tf::numeric) as log_avg_tf from tf_avg_tab_temp;

Step 3: Load the final table by joining the original table with the per-document values from step 2 and applying the formula to each row:

insert into token_count_final
select a.docid, a.tokenid, a.tf, (1 + log(2, a.tf)) / b.log_avg_tf as log_avg_tf
from token_count a
inner join tf_avg_log_tab_temp b on a.docid = b.docid;

The temp tables need to be truncated after each round, and the next iteration should start with the next million docids, e.g. docid 1000001 to 2000000, as sketched below.
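
For illustration, the start of the next round could look like this (same temp tables as above; steps 2 and 3 are then repeated for the new range):

truncate tf_avg_tab_temp;
truncate tf_avg_log_tab_temp;

-- step 1 for the next docid range
insert into tf_avg_tab_temp
select docid, avg(tf) as avg_tf
from token_count
where docid between 1000001 and 2000000
group by docid;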

Once all rounds are done, you can rename the final table and drop the original one. Analyze the table after you rename it.
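
A possible wrap-up, assuming the table names used above and that nothing else (views, foreign keys, etc.) depends on the old table:

drop table token_count;
alter table token_count_final rename to token_count;
analyze token_count;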
