I need some help with this problem.
Assume I have the following table:
contract_id | tariff_id | product_category | date (DD.MM.YYYY) | month (YYYYMM) |
---|---|---|---|---|
123456 | ABC | small | 01.01.2021 | 202101 |
123456 | ABC | medium | 01.02.2021 | 202102 |
123456 | DEF | small | 01.03.2021 | 202103 |
123456 | DEF | small | 01.04.2021 | 202104 |
123456 | ABC | big | 01.05.2021 | 202105 |
123456 | DEF | small | 01.06.2021 | 202106 |
123456 | DEF | medium | 02.06.2021 | 202106 |
123456 | DEF | medium | 01.07.2021 | 202107 |
The table is partitioned by month.
This is a part of my table, which contains multiple contract_ids.
I'm trying to figure out, for every contract_id, since when it has had its most recent tariff_id and since when it has had product_category = 'small' (if its current product category is not small, the value should be NULL).
The results will be written into a table which gets updated every month.
So for the table above my latest results should look like this:
contract_id | same_tariff_id_since | product_category_small_since |
---|---|---|
123456 | 01.06.2021 | NULL |
I'm using Hive.
So far, I could only come up with a solution for same_tariff_id_since.
The problem is that it gives me the absolute min(date) for the tariff_id and not the min(date) since the most recent tariff_id.
I think the code for product_category_small_since will have mostly the same logic.
My current code is:
SELECT q2.contract_id
     , q3.tariff_id
     , q2.date
FROM (
    SELECT contract_id
         , max(date_2) AS date
    FROM (
        SELECT contract_id
             , date
             , min(date) OVER (PARTITION BY tariff_id ORDER BY date) AS date_2
        FROM given_table
    ) q1
    WHERE date = date_2
    GROUP BY contract_id
) q2
JOIN given_table AS q3
  ON q2.contract_id = q3.contract_id
 AND q2.date = q3.date
Thanks in advance.
CodePudding user response:
One approach for solving this type of query is to group the rows into the sequences you want to track. For the tariff_id sequence grouping, you want a new "sequence grouping id" each time the tariff_id changes for a given contract_id. Since the product_category can change independently, you need a separate sequence grouping id for that change as well.
Here's code to accomplish the task. This only returns the latest version of each contract and the specific columns you described in your latest results table. This was done against PostgreSQL 9.6, but the syntax and data types can probably be modified to be compatible with Hive.
https://www.db-fiddle.com/f/qSk3Mb9Xfp1NDo5VeA1qHh/8
select q3.contract_id
     , to_char(q3.tariff_since, 'DD.MM.YYYY') as same_tariff_id_since
     , to_char(q3.small_since, 'DD.MM.YYYY') as product_category_small_since
from (
    -- compute the per-sequence minimums before filtering to the latest row:
    -- WHERE is applied before the window functions of the same query level,
    -- so filtering here would leave only one row in each window
    select q2.*
         , min(q2."date (DD.MM.YYYY)")
               over (partition by q2.contract_id, q2.contract_tariff_sequence_id) as tariff_since
         , min(case when q2.product_category = 'small' then q2."date (DD.MM.YYYY)" else null end)
               over (partition by q2.contract_id, q2.contract_product_category_sequence_id) as small_since
    from (
        -- running count of changes: one sequence id per unbroken run of equal values
        select q1.*
             , sum(case when q1.tariff_id = q1.prior_tariff_id then 0 else 1 end)
                   over (partition by q1.contract_id order by q1."date (DD.MM.YYYY)" rows unbounded preceding) as contract_tariff_sequence_id
             , sum(case when q1.product_category = q1.prior_product_category then 0 else 1 end)
                   over (partition by q1.contract_id order by q1."date (DD.MM.YYYY)" rows unbounded preceding) as contract_product_category_sequence_id
        from (
            select *
                 , lag(tariff_id) over (partition by contract_id order by "date (DD.MM.YYYY)") as prior_tariff_id
                 , lag(product_category) over (partition by contract_id order by "date (DD.MM.YYYY)") as prior_product_category
                 , row_number() over (partition by contract_id order by "date (DD.MM.YYYY)" desc) as latest_record_per_contract
            from contract_tariffs
        ) q1
    ) q2
) q3
where q3.latest_record_per_contract = 1
If you want to see all the rows and columns so you can examine how this works with the sequence grouping ids etc., you can modify the outer query slightly:
https://www.db-fiddle.com/f/qSk3Mb9Xfp1NDo5VeA1qHh/10
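If you would rather inspect the sequence grouping ids locally, here is a minimal sketch that runs the same idea against the sample rows from the question. It uses Python's built-in sqlite3 module as a stand-in for Hive or PostgreSQL, which is an assumption of this example (it needs SQLite 3.25 or newer for window functions). The column name dt, the ISO date strings, and the short names tariff_seq_id and category_seq_id are illustrative choices, not part of the original schema.

```python
import sqlite3

# Sample rows from the question. Dates are stored as ISO strings (YYYY-MM-DD),
# an assumption made here so that plain string ordering is chronological.
ROWS = [
    (123456, "ABC", "small",  "2021-01-01"),
    (123456, "ABC", "medium", "2021-02-01"),
    (123456, "DEF", "small",  "2021-03-01"),
    (123456, "DEF", "small",  "2021-04-01"),
    (123456, "ABC", "big",    "2021-05-01"),
    (123456, "DEF", "small",  "2021-06-01"),
    (123456, "DEF", "medium", "2021-06-02"),
    (123456, "DEF", "medium", "2021-07-01"),
]

QUERY = """
WITH q1 AS (
    -- previous tariff / category within the same contract
    SELECT contract_id, tariff_id, product_category, dt,
           LAG(tariff_id) OVER (PARTITION BY contract_id ORDER BY dt) AS prior_tariff_id,
           LAG(product_category) OVER (PARTITION BY contract_id ORDER BY dt) AS prior_product_category
    FROM contract_tariffs
),
q2 AS (
    -- running count of changes = one id per unbroken run of equal values
    SELECT contract_id, tariff_id, product_category, dt,
           SUM(CASE WHEN tariff_id = prior_tariff_id THEN 0 ELSE 1 END)
               OVER (PARTITION BY contract_id ORDER BY dt
                     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS tariff_seq_id,
           SUM(CASE WHEN product_category = prior_product_category THEN 0 ELSE 1 END)
               OVER (PARTITION BY contract_id ORDER BY dt
                     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS category_seq_id
    FROM q1
)
SELECT contract_id, dt, tariff_id, product_category, tariff_seq_id, category_seq_id,
       MIN(dt) OVER (PARTITION BY contract_id, tariff_seq_id) AS same_tariff_id_since,
       MIN(CASE WHEN product_category = 'small' THEN dt END)
           OVER (PARTITION BY contract_id, category_seq_id) AS product_category_small_since
FROM q2
ORDER BY contract_id, dt
"""


def sequence_rows(rows):
    """Load rows into an in-memory table and return every row with its sequence ids."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE contract_tariffs ("
        "contract_id INTEGER, tariff_id TEXT, product_category TEXT, dt TEXT)"
    )
    conn.executemany("INSERT INTO contract_tariffs VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


def latest_per_contract(all_rows):
    """Keep the two 'since' values of the chronologically last row per contract."""
    latest = {}
    for row in all_rows:  # rows arrive ordered by contract_id, dt
        latest[row[0]] = (row[6], row[7])
    return latest


if __name__ == "__main__":
    rows = sequence_rows(ROWS)
    for row in rows:
        print(row)
    print(latest_per_contract(rows))
```

For the sample data, the last row of contract 123456 ends up with same_tariff_id_since = 2021-06-01 and product_category_small_since = NULL (None in Python), which matches the expected result in the question. Note that the per-sequence minimums are computed over all rows first, and the latest row is picked only afterwards.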
If this works for you, please mark as correct answer.