I have two huge tables, A and B. Table A has around 500 million rows of time-series data, and table B has around 10 million rows of time-series data. To simplify, assume they consist of the following columns:
Table A
factory | machine | timestamp_1 | part | suplement |
---|---|---|---|---|
1 | 1 | 2022-01-01 23:54:01 | 1 | 1 |
1 | 1 | 2022-01-01 23:54:05 | 1 | 2 |
1 | 1 | 2022-01-01 23:54:10 | 1 | 3 |
... | ... | ... | ... | ... |
Table B
machine | timestamp_2 | measure |
---|---|---|
1 | 2022-01-01 23:54:00 | 0 |
1 | 2022-01-01 23:54:07 | 10 |
1 | 2022-01-01 23:54:08 | 0 |
... | ... | ... |
I want to create a table C that results from "joining" both tables by matching each value of timestamp_1 in table A to the nearest earlier value of timestamp_2 in table B whose measure is 0, for the same factory and machine. I only need this for the part = 1 rows of table A. For the small example above, the resulting table C would have the same number of rows as A and would look like:
Table C
machine | timestamp_1 | time_since_measure_0 |
---|---|---|
1 | 2022-01-01 23:54:01 | 1 |
1 | 2022-01-01 23:54:05 | 5 |
1 | 2022-01-01 23:54:10 | 2 |
... | ... | ... |
Some things that are also important to consider are:
- Table A has an index on columns (factory, machine, timestamp_1, part, suplement). That index is essential and works great for other queries not related to this one. Table B has an index on columns (machine, timestamp_2, measure).
- Table A is a compressed TimescaleDB table partitioned by (factory, timestamp_1), also because of other queries. Table B is a plain PostgreSQL table.
I used the following statement to create table C:
create table C (
    machine int4 not null,
    timestamp_1 timestamptz,
    time_since_measure_0 interval,
    constraint c_pkey primary key (machine, timestamp_1)
);
I then tried this code to select and insert data into table C:
insert into C (machine, timestamp_1, time_since_measure_0)
select
    machine,
    timestamp_1,
    timestamp_1 - (
        select timestamp_2
        from B
        where
            A.machine = B.machine
            and B.measure = 0
            and B.timestamp_2 <= A.timestamp_1
        order by B.timestamp_2 desc
        limit 1
    ) as "time_since_measure_0"
from A
where A.part = 1;
However, this takes a very long time. I know I am dealing with very big tables, but is there something I am missing, or how could I optimize this?
CodePudding user response:
Because we don't have access to your tables and you haven't posted a query plan, it's difficult to do more than make some general observations. The indexes you describe as being in place do not appear to be useful to this query. Looking at your query, it appears to me that you need to add the following indexes (a sketch of the DDL follows the list):
- Table A: an index on (machine, timestamp_1)
- Table B: an index on (machine, measure, timestamp_2)
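For example, the DDL could look roughly like this (a sketch only; the index names are placeholders, and whether an additional index is practical on the compressed TimescaleDB table A is something you would have to verify):

-- placeholder index names; adjust to your own conventions
create index if not exists a_machine_ts_idx on A (machine, timestamp_1);
create index if not exists b_machine_measure_ts_idx on B (machine, measure, timestamp_2);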
Give that a shot and see what happens.
CodePudding user response:
What you want is called an "as-of join": it joins each timestamp to the nearest value in the other table.
Some time-series databases, such as ClickHouse, support this directly, and that is really the only way to make it fast. It is quite similar to a merge join, with a few modifications: the engine must scan both tables in timestamp order and join to the nearest-value row instead of the equal-value row.
I've looked into it briefly, and it doesn't look like TimescaleDB supports it, but this post shows a workaround using a lateral join and a covering index. This is likely to have similar performance to your current query, because it will use a nested loop and an index-only scan to pick the nearest value for each row.
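For illustration, here is a minimal sketch of that lateral-join approach, reusing the table and column names from your question (untested, so treat it as a starting point rather than a drop-in solution):

select
    A.machine,
    A.timestamp_1,
    A.timestamp_1 - b0.timestamp_2 as time_since_measure_0
from A
left join lateral (
    -- latest measure = 0 reading at or before this A row's timestamp
    select B.timestamp_2
    from B
    where B.machine = A.machine
      and B.measure = 0
      and B.timestamp_2 <= A.timestamp_1
    order by B.timestamp_2 desc
    limit 1
) b0 on true
where A.part = 1;

A partial index such as create index on B (machine, timestamp_2) where measure = 0 is a smaller alternative to the covering (machine, measure, timestamp_2) index; either one should let each per-row lookup be a short, ideally index-only, scan.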