I need to delete the elements of one pandas Series (ser1) that also appear in another pandas Series (ser2).
The closest thing I found was NumPy's np.intersect1d() function on arrays. It finds the common values, but when I try to drop the indexes equal to those values, I get a bunch of errors.
I've tried several other approaches that did not work, and after 3 hours I am about to give up.
Here are the two series:
ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])
The result should be:
print(ser1)
0 1
1 2
2 3
I am sure there is a simple solution. Could anyone help me with this? Please don't downvote my question if a similar one has been asked before: I have searched for 3 hours without finding an answer. Thank you for understanding.
CodePudding user response:
Use .isin:
>>> ser1[~ser1.isin(ser2)]
0 1
1 2
2 3
dtype: int64
The numpy version is .setdiff1d (and not .intersect1d):
>>> np.setdiff1d(ser1, ser2)
array([1, 2, 3])
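The two approaches above give the same values for the example data, but they behave differently in general. Here is a minimal sketch of the difference, using made-up data with a repeated value and no particular order (this data is an assumption, not from the question): the .isin mask keeps the original order, duplicates, and index labels, while np.setdiff1d returns a sorted NumPy array of unique values.

```python
import numpy as np
import pandas as pd

# Hypothetical data: unsorted, with the value 3 repeated.
ser1 = pd.Series([3, 1, 3, 5, 2])
ser2 = pd.Series([5, 6])

# Boolean mask: rows of ser1 whose value is not in ser2.
kept = ser1[~ser1.isin(ser2)]

# Set difference: sorted unique values of ser1 that are not in ser2.
unique_sorted = np.setdiff1d(ser1, ser2)

print(kept.tolist())           # [3, 1, 3, 2]
print(kept.index.tolist())     # [0, 1, 2, 4]
print(unique_sorted.tolist())  # [1, 2, 3]
```

If the positions or repeated values in ser1 matter, the mask is probably the safer choice.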
CodePudding user response:
A numpy alternative, np.isin:
import pandas as pd
import numpy as np
ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])
res = ser1[~np.isin(ser1, ser2)]
print(res)
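The question mentions errors when dropping by the common values. One likely cause (an assumption, since the failing code is not shown) is that Series.drop expects index labels, not values. A sketch of a drop-based variant that first turns the mask into labels:

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

# Index labels of the rows whose value also appears in ser2.
labels = ser1.index[ser1.isin(ser2)]

# drop() removes rows by label, so pass the labels rather than the values.
res = ser1.drop(labels)

print(res.tolist())        # [1, 2, 3]
print(res.index.tolist())  # [0, 1, 2]
```

This relies on the index labels being unique. With duplicate labels, drop removes every row that shares a dropped label.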
Micro-Benchmark
import pandas as pd
import numpy as np
ser1 = pd.Series([1, 2, 3, 4, 5] * 100)
ser2 = pd.Series([4, 5, 6, 7, 8] * 10)
%timeit res = ser1[~np.isin(ser1, ser2)]
136 µs ± 2.56 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit res = ser1[~ser1.isin(ser2)]
209 µs ± 1.66 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit pd.Index(ser1).difference(ser2).to_series()
277 µs ± 1.31 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
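The %timeit lines above only run in IPython or Jupyter. A rough equivalent for a plain Python script, using the standard timeit module, might look like this. The repeat counts and the candidates dictionary are illustrative choices, and the numbers will vary by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4, 5] * 100)
ser2 = pd.Series([4, 5, 6, 7, 8] * 10)

# The three expressions from the benchmark, wrapped as callables.
candidates = {
    "np.isin mask": lambda: ser1[~np.isin(ser1, ser2)],
    "Series.isin mask": lambda: ser1[~ser1.isin(ser2)],
    "Index.difference": lambda: pd.Index(ser1).difference(ser2).to_series(),
}

calls = 200  # calls per timing run (arbitrary)
timings = {}
for name, func in candidates.items():
    # Best total time over 3 runs, converted to microseconds per call.
    best = min(timeit.repeat(func, number=calls, repeat=3))
    timings[name] = best / calls * 1e6
    print(f"{name}: {timings[name]:.1f} µs per call")
```

Keep in mind that on this repeated data Index.difference returns only the unique remaining values, so it is not producing the same result as the two masks.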
CodePudding user response:
You can use set operations on an Index. I am not sure how the speed compares to isin, though:
pd.Index(ser1).difference(ser2).to_series()
Out[35]:
1 1
2 2
3 3
dtype: int64
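Note that the output above is indexed by the values themselves (1, 2, 3) rather than by the original positions (0, 1, 2), because to_series() uses the Index as both index and values. If a fresh positional index is wanted, one option is reset_index(drop=True), sketched here:

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

# Set difference on an Index; to_series() reuses the values as the index.
diff = pd.Index(ser1).difference(ser2).to_series()
print(diff.index.tolist())  # [1, 2, 3]

# Replace the value-based index with a default 0..n-1 index.
renumbered = diff.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1, 2]
print(renumbered.tolist())        # [1, 2, 3]
```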