pandas subtract rows in dataframe according to a few columns-CodePudding

I have the following dataframe

data = [
 {'col1': 11, 'col2': 111, 'col3': 1111},
 {'col1': 22, 'col2': 222, 'col3': 2222},
 {'col1': 33, 'col2': 333, 'col3': 3333},
 {'col1': 44, 'col2': 444, 'col3': 4444}
]

and the following list:

lst = [(11, 111), (22, 222), (99, 999)]

I would like to get out of my data only rows that col1 and col2 do not exist in the lst

result for above example would be:

[
 {'col1': 33, 'col2': 333, 'col3': 3333},
 {'col1': 44, 'col2': 444, 'col3': 4444}
]

how can I achieve that?

import pandas as pd

df = pd.DataFrame(data)

list_df = pd.DataFrame(lst)

# command like ??
# df.subtract(list_df)

CodePudding user response：

You can create a tuple out of your col1 and col2 columns and then check if those tuples are in the lst list. Then drop the fines with True values.

df.drop(df.apply(lambda x: (x['col1'], x['col2']), axis =1)
          .isin(lst)
          .loc[lambda x: x==True]
          .index)

With this solution you don't even have to make the second list a dataframe

CodePudding user response：

If need test by pairs is possible compare MultiIndex created by both columns in Index.isin with inverted mask by ~ in boolean indexing:

df = df[~df.set_index(['col1','col2']).index.isin(lst)]
print (df)
   col1  col2  col3
2    33   333  3333
3    44   444  4444

Or with left join by merge with indicator parameter:

m = df.merge(list_df, 
             left_on=['col1','col2'],
             right_on=[0,1], 
             indicator=True, 
             how='left')['_merge'].eq('left_only')
df = df[mask]
print (df)
   col1  col2  col3
2    33   333  3333
3    44   444  4444

CodePudding user response：

You can create the tuples of col1 and col2 by .apply() with tuple. Then test these tuples whether in lst by .isin() (add ~ for the negation/opposite condition).

Finally, locate the rows with .loc, as follows:

df.loc[~df[['col1', 'col2']].apply(tuple, axis=1).isin(lst)]

Result:

   col1  col2  col3
2    33   333  3333
3    44   444  4444

CodePudding user response：

You can extract the list of values using zip and slice using a mask generated with isna:

a,b = zip(*lst)
data[~(data['col1'].isin(a)|data['col2'].isin(b))]

output:

   col1  col2  col3
2    33   333  3333
3    44   444  4444

Or if you need both conditions to be true to drop:

data[~(data['col1'].isin(a)&data['col2'].isin(b))]

NB. if you have many columns, you can automate the process:

mask = sum(data[col].isin(v) for col,v in zip(data, zip(*lst))).eq(0)
df[mask]