DISTINCT ON slow for 300000 rows


I have a table named assets. Here is the DDL:

create table assets (
    id            bigint                                 primary key,
    name          varchar(255)                           not null,
    value         double precision                       not null,
    business_time timestamp with time zone,
    insert_time   timestamp with time zone default now() not null
);

create index idx_assets_name on assets (name);

I need to extract the newest value (based on insert_time) for each asset name. This is the query I initially used:

SELECT DISTINCT ON (a.name) *
FROM home.assets a
WHERE a.name IN (
                 'USD_RLS',
                 'EUR_RLS',
                 'SEKKEH_RLS',
                 'NIM_SEKKEH_RLS',
                 'ROB_SEKKEH_RLS',
                 'BAHAR_RLS',
                 'GOLD_18_RLS',
                 'GOLD_OUNCE_USD',
                 'SILVER_OUNCE_USD',
                 'PLATINUM_OUNCE_USD',
                 'GOLD_MESGHAL_RLS',
                 'GOLD_24_RLS',
                 'STOCK_IR',
                 'AED_RLS',
                 'GBP_RLS',
                 'CAD_RLS',
                 'CHF_RLS',
                 'TRY_RLS',
                 'AUD_RLS',
                 'JPY_RLS',
                 'CNY_RLS',
                 'RUB_RLS',
                 'BTC_USD'
    )
ORDER BY a.name,
         a.insert_time DESC;

I have around 300,000 rows in the assets table. On my VPS this query takes about 800 ms, which pushes the overall response time of a specific endpoint to about 1 second. That is a bit slow, and since the assets table is growing fast, this endpoint will only get slower in the near future. I also tried to avoid the IN (...) list with this query:

SELECT DISTINCT ON (a.name) *
FROM home.assets a
ORDER BY a.name,
         a.insert_time DESC;

But I didn't notice a significant difference. Any idea how I could optimize this query?

CodePudding user response:

You may try adding the following index to your table:

CREATE INDEX idx ON assets (name, insert_time DESC);

If used, Postgres can simply scan this index to find the distinct record having the most recent insert_time for each name.
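To confirm that the planner actually uses it, you could inspect the plan. A minimal sketch, assuming the table and column names from the question (name list shortened for illustration):

EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT ON (a.name) *
FROM   home.assets a
WHERE  a.name IN ('USD_RLS', 'EUR_RLS', 'BTC_USD')  -- shortened list for illustration
ORDER  BY a.name, a.insert_time DESC;

Ideally the plan shows an index scan on the new (name, insert_time DESC) index instead of a sequential scan followed by a sort.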

CodePudding user response:

For more than a few rows per name in the table (which appears to be the case here), I expect this query to be substantially faster still:

SELECT a.*
FROM   unnest('{USD_RLS, EUR_RLS, SEKKEH_RLS, NIM_SEKKEH_RLS, ROB_SEKKEH_RLS
               , BAHAR_RLS, GOLD_18_RLS, GOLD_OUNCE_USD, SILVER_OUNCE_USD
               , PLATINUM_OUNCE_USD, GOLD_MESGHAL_RLS, GOLD_24_RLS, STOCK_IR
               , AED_RLS, GBP_RLS, CAD_RLS, CHF_RLS
               , TRY_RLS, AUD_RLS, JPY_RLS, CNY_RLS
               , RUB_RLS, BTC_USD}'::text[]) AS n(name)
CROSS  JOIN LATERAL (
   SELECT *
   FROM   home.assets a
   WHERE  a.name = n.name
   ORDER  BY a.insert_time DESC
   LIMIT  1
   ) a;

Pass your list as an array, unnest it, and then get each latest row in a LATERAL subquery. The CROSS JOIN eliminates names that are not found at all. (You might be interested in LEFT JOIN LATERAL ... ON true instead, to keep those in the result; see the sketch below.)
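A minimal sketch of that LEFT JOIN LATERAL ... ON true variant (list shortened for illustration); names without any matching rows come back with NULL asset columns:

SELECT n.name, a.*
FROM   unnest('{USD_RLS, EUR_RLS, BTC_USD}'::text[]) AS n(name)
LEFT   JOIN LATERAL (
   SELECT *
   FROM   home.assets a
   WHERE  a.name = n.name
   ORDER  BY a.insert_time DESC
   LIMIT  1
   ) a ON true;  -- ON true keeps names that have no rows in assets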

You still need the multicolumn index suggested in the first answer.

CREATE INDEX ON assets (name, insert_time DESC);

Default ascending sort order would work, too, in this case. Postgres can scan backwards:

CREATE INDEX ON assets (name, insert_time);

It is not the number of rows in the table but the number of rows per group (per name in your case) that decides whether DISTINCT ON is the best choice. See this benchmark comparing the relevant query styles.
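A quick way to gauge that for your own data (sketch, just counting rows per name):

SELECT name, count(*) AS rows_per_name
FROM   home.assets
GROUP  BY name
ORDER  BY count(*) DESC;

With only a handful of rows per name, plain DISTINCT ON is usually fine; with many rows per name, the LATERAL approach above tends to win.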
