Problem Explanation
I have a dataframe with two columns 'A'
and 'B'
. I also have a list of tuples where the first element of the tuple is an element in the column 'A'
, and the second is in the column 'B'
. I would like to remove all rows of the dataframe coinciding with the tuples.
Of course, I could just use a loop, but I want a smarter solution that would be faster and cleaner.
Minimal Working Example
import pandas as pd
df = pd.DataFrame(
{
'A': ['a', 'b', 'c', 'd', 'a', 'd', 'a', 'c'],
'B': [4, 2, 2, 1, 3, 4, 3, 2],
}
)
rows_to_remove = [('a', 4), ('c', 2), ('d', 4), ('a', 3)]
CodePudding user response:
I would use boolean indexing with pandas.Series.isin
:
m = df.agg(tuple, axis=1).isin(rows_to_remove)
df = df.loc[~m]
Output :
print(df)
A B
1 b 2
3 d 1
CodePudding user response:
You can use a merge
with indicator:
out = (df.merge(pd.DataFrame(rows_to_remove, columns=['A', 'B']), indicator=True, how='left')
.query('_merge == "left_only"')
#.drop(columns='_merge') # commented to see the logic
)
Output:
A B _merge
1 b 2 left_only
3 d 1 left_only
Or, combined with drop
:
idx = (df.merge(pd.DataFrame(rows_to_remove, columns=['A', 'B']), how='left', indicator=True)
.query('_merge == "both"').index
)
out = df.drop(idx)
Output:
A B
1 b 2
3 d 1
CodePudding user response:
If you import numpy as np
you can:
df[~(df.values==np.array(rows_to_remove, dtype=object)[:,None]).any(0).all(-1)]
It is entertaining but also useful because I tested on google-colab and the performance is 274 µs ± 11.6 µs
vs 1.44 ms ± 37.7µs
from @Timeless 's solution, that I think is the best and it's the same one that came to my mind as soon as I read the question.
It would be interesting to see the difference for bigger DataFrames.
CodePudding user response:
You can create a dataframe from rows_to_remove
then append to your original dataframe and remove duplicates:
>>> (pd.concat([df, pd.DataFrame(rows_to_remove, columns=['A', 'B'])])
.drop_duplicates(['A', 'B'], keep=False))
A B
1 b 2
3 d 1
CodePudding user response:
Hope this helps. you can use 'df.drop' to drop the rows/columns by index. please have a look at the below mentioned:
DataFrame.drop(labels=None, *, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')[https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html]