I have two data frames
A = pd.DataFrame(
[["[email protected]","4311","3000","STR_1","1384"],
["[email protected]","4311","3000","STR_2","1440" ],
["[email protected]","4311","3000","STR_3","1300" ],
["[email protected]","4311","3000","STR_3","1300" ]],
columns=["EMAIL", "PRODUCT_ID", "POST_CODE", "STORE_NAME", "STORE_ID"],
)
B = pd.DataFrame(
[["[email protected]","4311","3000","STR_1","1384"],
["[email protected]","4311","3000","STR_3","1300" ],],
columns=["EMAIL", "PRODUCT_ID", "POST_CODE", "STORE_NAME", "STORE_ID"],
)
Now I need to remove records from dataframe A that have the same EMAIL, PRODUCT_ID, and POST_CODE as data frame B. So the expected output is
I tried using drop duplicates like:
pd.concat([A, B]).drop_duplicates(keep=False)
But this cannot drop rows based on a custom column which is the POST_CODE in this case
CodePudding user response:
use subset
to select on the columns you want to filter out
pd.concat([A, B]).drop_duplicates(subset=["EMAIL", "PRODUCT_ID", "POST_CODE"], keep=False)
CodePudding user response:
The solution for this contains the following elements:
- pandas set_index() function.
- pandas isin() function.
First we will set the index in the two dataframes to be "EMAIL", "PRODUCT_ID", "POST_CODE" then we can use these indexing to filter the dataframes using isin.
The code:
import pandas as pd
A = pd.DataFrame(
[["[email protected]","4311","3000","STR_1","1384"],
["[email protected]","4311","3000","STR_2","1440" ],
["[email protected]","4311","3000","STR_3","1300" ],
["[email protected]","4311","3000","STR_3","1300" ]],
columns=["EMAIL", "PRODUCT_ID", "POST_CODE", "STORE_NAME", "STORE_ID"],
)
B = pd.DataFrame(
[["[email protected]","4311","3000","STR_1","1384"],
["[email protected]","4311","3000","STR_3","1300" ],],
columns=["EMAIL", "PRODUCT_ID", "POST_CODE", "STORE_NAME", "STORE_ID"],
)
i1 = A.set_index(["EMAIL", "PRODUCT_ID", "POST_CODE"]).index
i2 = B.set_index(["EMAIL", "PRODUCT_ID", "POST_CODE"]).index
result = A[~i1.isin(i2)]
Output:
EMAIL PRODUCT_ID POST_CODE STORE_NAME STORE_ID
3 [email protected] 4311 3000 STR_3 1300