Home > Back-end >  Python - Remove rows of dataframe contained in list (without using a loop)
Python - Remove rows of dataframe contained in list (without using a loop)

Time:01-25

Problem Explanation

I have a dataframe with two columns 'A' and 'B'. I also have a list of tuples where the first element of the tuple is an element in the column 'A', and the second is in the column 'B'. I would like to remove all rows of the dataframe coinciding with the tuples.

Of course, I could just use a loop, but I want a smarter solution that would be faster and cleaner.

Minimal Working Example

import pandas as pd
df = pd.DataFrame(
    {
        'A': ['a', 'b', 'c', 'd', 'a', 'd', 'a', 'c'],
        'B': [4, 2, 2, 1, 3, 4, 3, 2],
    }
)
rows_to_remove = [('a', 4), ('c', 2), ('d', 4), ('a', 3)]

CodePudding user response:

I would use boolean indexing with pandas.Series.isin :

m = df.agg(tuple, axis=1).isin(rows_to_remove)

df = df.loc[~m]

Output :

print(df)

   A  B
1  b  2
3  d  1

CodePudding user response:

You can use a merge with indicator:

out = (df.merge(pd.DataFrame(rows_to_remove, columns=['A', 'B']), indicator=True, how='left')
         .query('_merge == "left_only"')
         #.drop(columns='_merge') # commented to see the logic
       )

Output:

   A  B     _merge
1  b  2  left_only
3  d  1  left_only

Or, combined with drop:

idx = (df.merge(pd.DataFrame(rows_to_remove, columns=['A', 'B']), how='left', indicator=True)
         .query('_merge == "both"').index
      )

out = df.drop(idx)

Output:

   A  B
1  b  2
3  d  1

CodePudding user response:

If you import numpy as np you can:

df[~(df.values==np.array(rows_to_remove, dtype=object)[:,None]).any(0).all(-1)]

It is entertaining but also useful because I tested on google-colab and the performance is 274 µs ± 11.6 µs vs 1.44 ms ± 37.7µs from @Timeless 's solution, that I think is the best and it's the same one that came to my mind as soon as I read the question.

It would be interesting to see the difference for bigger DataFrames.

CodePudding user response:

You can create a dataframe from rows_to_remove then append to your original dataframe and remove duplicates:

>>> (pd.concat([df, pd.DataFrame(rows_to_remove, columns=['A', 'B'])])
       .drop_duplicates(['A', 'B'], keep=False))

   A  B
1  b  2
3  d  1

CodePudding user response:

Hope this helps. you can use 'df.drop' to drop the rows/columns by index. please have a look at the below mentioned:

DataFrame.drop(labels=None, *, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')[https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html]

  • Related