Home > Back-end >  Pandas droping rows based on multiple conditions
Pandas droping rows based on multiple conditions

Time:12-03

I have two data frames

A = pd.DataFrame(
    [["[email protected]","4311","3000","STR_1","1384"],
     ["[email protected]","4311","3000","STR_2","1440"  ],
     ["[email protected]","4311","3000","STR_3","1300"  ],
     ["[email protected]","4311","3000","STR_3","1300"  ]],
    columns=["EMAIL",   "PRODUCT_ID",   "POST_CODE",    "STORE_NAME",   "STORE_ID"],
)

enter image description here

B = pd.DataFrame(
    [["[email protected]","4311","3000","STR_1","1384"],
     ["[email protected]","4311","3000","STR_3","1300"  ],],
    columns=["EMAIL",   "PRODUCT_ID",   "POST_CODE",    "STORE_NAME",   "STORE_ID"],
)

enter image description here

Now I need to remove records from dataframe A that have the same EMAIL, PRODUCT_ID, and POST_CODE as data frame B. So the expected output is

enter image description here

I tried using drop duplicates like:

pd.concat([A, B]).drop_duplicates(keep=False)

But this cannot drop rows based on a custom column which is the POST_CODE in this case

CodePudding user response:

use subset to select on the columns you want to filter out

pd.concat([A, B]).drop_duplicates(subset=["EMAIL",   "PRODUCT_ID",   "POST_CODE"], keep=False)

CodePudding user response:

The solution for this contains the following elements:

  1. pandas set_index() function.
  2. pandas isin() function.

First we will set the index in the two dataframes to be "EMAIL", "PRODUCT_ID", "POST_CODE" then we can use these indexing to filter the dataframes using isin.

The code:

import pandas as pd

A = pd.DataFrame(
    [["[email protected]","4311","3000","STR_1","1384"],
     ["[email protected]","4311","3000","STR_2","1440"  ],
     ["[email protected]","4311","3000","STR_3","1300"  ],
     ["[email protected]","4311","3000","STR_3","1300"  ]],
    columns=["EMAIL",   "PRODUCT_ID",   "POST_CODE",    "STORE_NAME",   "STORE_ID"],
)

B = pd.DataFrame(
    [["[email protected]","4311","3000","STR_1","1384"],
     ["[email protected]","4311","3000","STR_3","1300"  ],],
    columns=["EMAIL",   "PRODUCT_ID",   "POST_CODE",    "STORE_NAME",   "STORE_ID"],
)

i1 = A.set_index(["EMAIL", "PRODUCT_ID", "POST_CODE"]).index
i2 = B.set_index(["EMAIL", "PRODUCT_ID", "POST_CODE"]).index
result = A[~i1.isin(i2)]

Output:

     EMAIL         PRODUCT_ID    POST_CODE  STORE_NAME  STORE_ID
3   [email protected]    4311          3000       STR_3       1300
  • Related