Home > Software engineering >  Join using ID and date, and select only relevant rows that fall within a date range using Impala
Join using ID and date, and select only relevant rows that fall within a date range using Impala

Time:10-04

I have two tables in HDFS and I want to join them using Impala. One table is called 'cust' and the other is called 'items'. The code to create these tables is shared below, at the end.

cust:

 id     date_start
212175  20/05/2022
286170  18/05/2022
361739  10/08/2021
297438  20/01/2022
415712  30/07/2021

Items:

 id      item    date_sent
212175  PX002   21/05/2022
212175  PY005   10/05/2022
212175  PX003   01/06/2022
212175  PX003   01/07/2021
212175  NX002   15/08/2022
286170  PX002   19/05/2022
286170  PY005   10/07/2022
286170  PX003   15/07/2022
286170  PX003   10/04/2022
286170  NX002   15/06/2022

Desired Output:

id       item      date
212175  PX002   21/05/2022
212175  PX003   01/06/2022
212175  PX005   10/05/2022
286170  PX002   19/05/2022
286170  PY005   10/07/2022
286170  PX003   15/07/2022
286170  NX002   15/06/2022

So I need to find all the items that were sent to all customers between two dates. The start date should be taken from the customers table (each customer have a different start date) and it's called 'date_start', whereas the the end date is 12/08/2022 for everyone.

In the output table we can see that for ID = 212175, only 3 items are showing. That's because only 3 items were sent during the window 20/05/2022 - 12/08/2022.

For ID = 286170, only 1 item 'PX003' isn't showing in the desired output and that's because the date this item was sent was on '10/04/2022', which is before the date_start (18/05/2022) for customer 286170.

The query I have written so far is:

select c.id, type, c.date_start ,i.date_sent from cust c
left join items   i
on c.id = i.id
where i.date_sent <= '12/08/2022'

This query takes into account the end date but not the start dates. How can I change this so my results are same as in the desired output? Any help appreciated, thanks.

Code to create the tables:

CREATE EXTERNAL TABLE cust (
id BIGINT, date_start STRING )   

INSERT INTO cust (id, date_start)
VALUES 
( (212175,'20/05/2022'),
  (286170,'18/05/2022'),
  (361739,'10/08/2021'),
  (297438,'20/01/2022'),
  (415712,'30/07/2021')

);

-- Create the items table 
CREATE EXTERNAL TABLE items (
id BIGINT, type STRING, date_sent STRING )   


-- Types table 
INSERT INTO items (id, type, date_sent)
VALUES 
( (212175,'PX002', '21/05/2022'),
(212175,'PY005', '10/05/2022'),
(212175,'PX003', '01/06/2022'),
(212175,'PX003', '01/07/2021'),
(212175,'NX002', '15/08/2022'),
(286170,'PX002', '19/05/2022'),
(286170,'PY005', '10/07/2022'),
(286170,'PX003', '15/07/2022'),
(286170,'PX003', '10/04/2022'),
(286170,'NX002', '15/06/2022')

);

CodePudding user response:

You're on right track, you just have to convert strings to date.

Pls note, you are storing dates in string format and you need to convert to timestamp using to_timestamp(str,'dd/MM/yyyy') then compare. So, your sql should be

select c.id, type, c.date_start ,i.date_sent from cust c
left join items   i
on c.id = i.id
where to_timestamp(i.date_sent,'dd/MM/yyyy') <= to_timestamp('12/08/2022','dd/MM/yyyy')

If you store data in yyyy-MM-dd format(default), impala would catch that as date and your SQL would have worked.

CodePudding user response:

Below is the solution that got me the results I wanted.

select merged.* from  
(select c.id, c.date_start, i.type, i.date_sent, to_timestamp(c.date_start, 'dd/MM/yyyy') as time_start, to_timestamp(i.date_sent, 'dd/MM/yyyy') as time_sent from cust c
inner join items   i
on c.id = i.id) merged 

where merged.time_sent <= '2022-08-14 00:00:00'
and merged.time_start <= merged.time_sent 
  • Related