Home > database >  SQL - Fuzzy JOIN on Timestamp columns within X amount of time
SQL - Fuzzy JOIN on Timestamp columns within X amount of time

Time:12-12

Say I have two tables:

a:

timestamp precipitation
2015-08-03 21:00:00 UTC 3
2015-08-03 22:00:00 UTC 3
2015-08-04 3:00:00 UTC 4
2016-02-04 18:00:00 UTC 4

and b:

timestamp loc
2015-08-03 21:23:00 UTC San Francisco
2016-02-04 16:04:00 UTC New York

I want to join to get a table who has fuzzy joined entries where every row in b tries to get joined to a row in a. Criteria:

  • The time is within 60 minutes. If a match does not exist within 60 minutes, do not include that row in the output.
  • In the case of a tie where some row in b could join onto two rows in a, pick the closest one in terms of time.

Example Output:

timestamp loc precipitation
2015-08-03 21:00:00 UTC San Francisco 3

CodePudding user response:

What you need is an ASOF join. I don't think there is an easy way to do this with BigQuery. Other databases like Kinetica (and I think Clickhouse) support ASOF functions that can be used to perform 'fuzzy' joins.

The syntax for Kinetica would be something like the following.

SELECT * 
FROM a
LEFT JOIN b 
ON ASOF(a.timestamp, b.timestamp, INTERVAL '0' MINUTES, INTERVAL '60' MINUTES, MIN)

The ASOF function above sets up an interval of 60 minutes within which to look for matches on the right side table. When there are multiple matches, it selects the one that is closest (MAX would pick the one that is farthest away).

CodePudding user response:

As per my understanding and based on the data you provided I think the below query should work for your use case.

create temporary table a as(
select TIMESTAMP('2015-08-03 21:00:00 UTC') as ts,   3  as precipitation union all
select TIMESTAMP('2015-08-03 22:00:00 UTC'),   3 union all
select TIMESTAMP('2015-08-04 3:00:00 UTC'),   4 union all
select TIMESTAMP('2016-02-04 18:00:00 UTC'),   4
);
create temporary table b as(
select TIMESTAMP('2015-08-03 21:23:00 UTC') as ts,'San Francisco ' as loc union all
select TIMESTAMP('2016-02-04 14:04:00 UTC') as ts,'New York  ' as loc
);
 
select b_ts,a_ts,loc,precipitation,diff_time_sec
from(
 select b.ts b_ts,a.ts a_ts,
 ABS(TIMESTAMP_DIFF(b.ts,a.ts, SECOND)) as diff_time_sec,
 *
 from b
 inner join a on b.ts between date_sub(a.ts, interval 60 MINUTE) and  date_add(a.ts, interval 60 MINUTE)
)
qualify  RANK() OVER(partition by b_ts ORDER BY diff_time_sec) = 1
  • Related