There are two dataframes. The first one is named products_purchased
, the second one is named products_suggested
.
In the products_purchased
alias(pp) table there is a customer_id
(unique) column, an item_id
column, and a purchased
column (with the value of 1).
The products_suggested
alias (ps) table has a customer_id
(non-unique) and an item_id
column. This table has more customer_id's than the product_product purchased table, as not all customers who are suggested items, purchase them.
I would like to join the two tables, retaining the purchased
column for the places where ps.customer_id
(non-unique) and an ps.item_id
match pp.customer_id
(unique) column, an pp.item_id
. I would also like to keep any records where pp.customer_id
(unique) match ps.customer_id
(non-unique).
The idea is to have a table where only the records relate to the customers who went on to purchase an item. That item would be labeled with a 1 in the purchased column, their other suggested items would be labeled 0.
Product Suggested Table
customer_id|item_id |
---------- ---------------
| 16413| 51654|
| 16413| 75950|
| 16413| 1366117|
| 78450| 56107|
| 94038| 72358|
| 94038| 1451889|
| 113067| 75077|
| 89578| 53279|
Product Purchased Table
customer_id|item_id |purchased
----------- -------------- ---------
| 16413| 75950| 1|
| 78450| 56107| 1|
| 94038| 72358| 1|
Final Table
customer_id|item_id |purchased
----------- -------------- ---------
| 16413| 51654| 0|
| 16413| 75950| 1|
| 16413| 1366117| 0|
| 78450| 56107| 1|
| 94038| 72358| 1|
| 94038| 1451889| 0|
I tried a left join on customer_id and item_suggested. I got what I expected, a table with all the suggested items regardless of their customer purchased, then the purchased status attached:
final = products_suggested.join(
products_purchased,on =["customer_id",'item_id'], how= 'left')
Final Table
customer_id|item_id |purchased
----------- -------------- ---------
| 16413| 51654| 0|
| 16413| 75950| 1|
| 16413| 1366117| 0|
| 78450| 56107| 1|
| 94038| 72358| 1|
| 94038| 1451889| 0|
| 113067| 75077| 0|
| 89578| 53279| 0|
I tried an inner join as well on just the customer_id. That made it so all my purchased columns were 1. I'm guessing thats because anywhere a customer_id matched the purchased version, it just placed the 1.
I also tried filtering after the left join .where(pp['customer_id']==ps['customer_id])
, but that didn't seem to work either.
CodePudding user response:
I created another dataframe that only had the customer_id from the product purchased. Then I joined the merged left join table I tried above, with an inner join. This filtered out the remaining non purchasing customers.
purchase_only_customers = left_join_table.join(purchase_table, on =["customer_id"], how= 'inner')