I have two tables:
table 1
 ------------- ------------- ------------------
| id_product  | id_customer | start_date       |
 ------------- ------------- ------------------
| 1           | 1           | 2021-08-28T10:37 |
| 1           | 2           | 2021-08-28T11:17 |
| 1           | 3           | 2021-08-28T12:27 |
| 2           | 1           | 2021-08-28T17:00 |
 ------------- ------------- ------------------
table 2
 ------------- ------------------ ---------- --------------------------------
| id_customer | stop_date        | duration | 20 other columns like duration |
 ------------- ------------------ ---------- --------------------------------
| 1           | 2021-08-27T17:00 | 20       | ...
| 1           | 2021-08-26T17:00 | 40       | ...
| 2           | 2021-08-29T17:00 | 120      | ...
| 1           | 2021-08-30T17:00 | 40       | ...
| ...
start_date in table 1 is the date the customer started the product.
stop_date in table 2 is the date the customer stopped the product.
I want to join these two tables so that I get one row with:
- id_product
- id_customer
- start_date
- the sum of duration over all table 2 rows whose stop_date is BEFORE start_date
- the same kind of sum for the 20 remaining columns.
Example for id_product = 1, id_customer = 1:
 ------------- ------------- ------------------ --------------- -------------------------------------
| id_product  | id_customer | start_date       | sum(duration) | sum(all other columns from table 2) |
 ------------- ------------- ------------------ --------------- -------------------------------------
| 1           | 1           | 2021-08-28T10:37 | 60            | ...
These are really big tables and I am using PySpark with SQL. Do you know an optimised way to do this?
Thank you
EDIT:
There is also an id_product in table 2.
CodePudding user response:
You can express the "sum over all earlier stop_dates" directly as a join condition, then aggregate. Note that Spark SQL quotes identifiers with backticks, not square brackets, and comments start with --:

SELECT
    Table_1.id_product,
    Table_1.id_customer,
    Table_1.start_date,
    SUM(Table_2.duration) AS `sum(duration)`
    -- , SUM(Table_2.duration2)
    -- , SUM(Table_2.duration3)
    -- ... repeat for the other columns you need summed
FROM Table_1
LEFT JOIN Table_2
    ON  Table_2.id_customer = Table_1.id_customer
    AND Table_2.id_product  = Table_1.id_product
    AND Table_2.stop_date   < Table_1.start_date
GROUP BY Table_1.id_product, Table_1.id_customer, Table_1.start_date
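As a quick sanity check of the join-then-aggregate logic (the same SQL shape Spark executes), here is a minimal sketch using Python's stdlib sqlite3 on the sample rows from the question. The id_product values in table_2 are an assumption for the toy data, per the edit saying table 2 also has that column:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE table_1 (id_product INT, id_customer INT, start_date TEXT)")
cur.execute("CREATE TABLE table_2 (id_product INT, id_customer INT, stop_date TEXT, duration INT)")

cur.executemany("INSERT INTO table_1 VALUES (?,?,?)", [
    (1, 1, "2021-08-28T10:37"),
    (1, 2, "2021-08-28T11:17"),
    (1, 3, "2021-08-28T12:27"),
    (2, 1, "2021-08-28T17:00"),
])
# id_product added per the question's edit; values assumed for this toy data
cur.executemany("INSERT INTO table_2 VALUES (?,?,?,?)", [
    (1, 1, "2021-08-27T17:00", 20),
    (1, 1, "2021-08-26T17:00", 40),
    (1, 2, "2021-08-29T17:00", 120),
    (1, 1, "2021-08-30T17:00", 40),
])

# LEFT JOIN with the date comparison folded into the join condition,
# then GROUP BY the table_1 key columns and sum the durations.
rows = cur.execute("""
    SELECT t1.id_product, t1.id_customer, t1.start_date,
           SUM(t2.duration)
    FROM table_1 t1
    LEFT JOIN table_2 t2
        ON  t2.id_customer = t1.id_customer
        AND t2.id_product  = t1.id_product
        AND t2.stop_date   < t1.start_date
    GROUP BY t1.id_product, t1.id_customer, t1.start_date
""").fetchall()

result = {(p, c): s for p, c, _, s in rows}
print(result[(1, 1)])  # 60 (20 + 40, both stop_dates precede the start_date)
```

Rows with no earlier stop_date come back with a NULL sum because of the LEFT JOIN; wrap the SUM in COALESCE(..., 0) if you want 0 instead. The ISO-8601 timestamp strings compare correctly as plain strings, which is why the `<` condition works here.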