Using PySpark join on two dataframes. I have one unique_id and one non_unique

There are two dataframes. The first one is named products_purchased, the second one is named products_suggested.

In the products_purchased alias(pp) table there is a customer_id (unique) column, an item_id column, and a purchased column (with the value of 1).

The products_suggested alias (ps) table has a customer_id (non-unique) and an item_id column. This table has more customer_id's than the product_product purchased table, as not all customers who are suggested items, purchase them.

I would like to join the two tables, retaining the purchased column for the places where ps.customer_id (non-unique) and an ps.item_id match pp.customer_id (unique) column, an pp.item_id. I would also like to keep any records where pp.customer_id (unique) match ps.customer_id(non-unique).

The idea is to have a table where only the records relate to the customers who went on to purchase an item. That item would be labeled with a 1 in the purchased column, their other suggested items would be labeled 0.

Product Suggested Table

 customer_id|item_id       |   
 ---------- --------------- 
|     16413|          51654|   
|     16413|          75950|
|     16413|        1366117|
|     78450|          56107|               
|     94038|          72358|               
|     94038|        1451889| 
|    113067|          75077|       
|     89578|          53279|

Product Purchased Table

 customer_id|item_id       |purchased 
 ----------- -------------- --------- 
|      16413|         75950|        1|
|      78450|         56107|        1|
|      94038|         72358|        1|

Final Table

 customer_id|item_id       |purchased 
 ----------- -------------- --------- 
|      16413|         51654|        0|
|      16413|         75950|        1|
|      16413|       1366117|        0|
|      78450|         56107|        1|       
|      94038|         72358|        1|     
|      94038|       1451889|        0|

I tried a left join on customer_id and item_suggested. I got what I expected, a table with all the suggested items regardless of their customer purchased, then the purchased status attached:

final = products_suggested.join(
  products_purchased,on =["customer_id",'item_id'], how= 'left')

Final Table

 customer_id|item_id       |purchased 
 ----------- -------------- --------- 
|      16413|         51654|        0|
|      16413|         75950|        1|
|      16413|       1366117|        0|
|      78450|         56107|        1|       
|      94038|         72358|        1|     
|      94038|       1451889|        0|
|    113067|          75077|        0|
|     89578|          53279|        0|

I tried an inner join as well on just the customer_id. That made it so all my purchased columns were 1. I'm guessing thats because anywhere a customer_id matched the purchased version, it just placed the 1.

I also tried filtering after the left join .where(pp['customer_id']==ps['customer_id]), but that didn't seem to work either.

CodePudding user response：

I created another dataframe that only had the customer_id from the product purchased. Then I joined the merged left join table I tried above, with an inner join. This filtered out the remaining non purchasing customers.

purchase_only_customers = left_join_table.join(purchase_table, on =["customer_id"], how= 'inner')