I have 2 tables.
tab1:
| dates |
| --- |
| 2021-09-30 |
| 2021-10-01 |
| 2021-10-02 |
| 2021-10-04 |
| 2021-11-15 |
buckets:
| bucket_dates |
| --- |
| 2021-10-01 |
| 2021-10-02 |
| 2021-10-03 |
| 2021-11-03 |
I want to join these two tables to get the final result shown below: if there is a matching date in tab1, use it; if there is no matching date, take the next highest date from tab1 (e.g. for 2021-10-03 and 2021-11-03).
Result table:
| bucket_dates | final_dates |
| --- | --- |
| 2021-10-01 | 2021-10-01 |
| 2021-10-02 | 2021-10-02 |
| 2021-10-03 | 2021-10-04 |
| 2021-11-03 | 2021-11-15 |
I tried to do this with a join query:
```sql
select a.bucket_dates,
       case when b.dates is null
            then (select min(c.dates) from tab1 c where c.dates > a.bucket_dates)
            else b.dates
       end as final_dates
from buckets a
left join tab1 b
  on a.bucket_dates = b.dates;
```
but this query fails with the error below:

```
Correlated column is not allowed in a non-equality predicate
```
Any suggestions would be really helpful.
CodePudding user response:
If pyspark won't allow `>` or `>=` in a correlated sub-query, just use the `MIN()` in the main query...
```sql
SELECT
    b.bucket_dates,
    MIN(t.dates) AS final_dates
FROM
    buckets AS b
LEFT JOIN
    tab1 AS t
        ON t.dates >= b.bucket_dates
GROUP BY
    b.bucket_dates
```
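For completeness, here is a minimal, self-contained PySpark sketch of this approach. The `SparkSession` setup, the app name, and the inline sample data are assumptions reconstructed from the question's tables; only the table/view names `tab1` and `buckets` come from the question itself.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bucket-join").getOrCreate()

# Recreate the sample data from the question (assumption: dates held as
# ISO-format strings, which compare correctly with >= in Spark SQL).
tab1 = spark.createDataFrame(
    [("2021-09-30",), ("2021-10-01",), ("2021-10-02",),
     ("2021-10-04",), ("2021-11-15",)],
    ["dates"],
)
buckets = spark.createDataFrame(
    [("2021-10-01",), ("2021-10-02",), ("2021-10-03",), ("2021-11-03",)],
    ["bucket_dates"],
)
tab1.createOrReplaceTempView("tab1")
buckets.createOrReplaceTempView("buckets")

# The join from the answer: no correlated sub-query, so Spark accepts it.
result = spark.sql("""
    SELECT b.bucket_dates,
           MIN(t.dates) AS final_dates
    FROM buckets AS b
    LEFT JOIN tab1 AS t
      ON t.dates >= b.bucket_dates
    GROUP BY b.bucket_dates
    ORDER BY b.bucket_dates
""")
result.show()
```

The `LEFT JOIN ... ON t.dates >= b.bucket_dates` keeps every bucket date even when nothing qualifies (giving NULL), and `MIN()` then picks the earliest date on or after each bucket date, which is the bucket date itself whenever there is an exact match.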