How to delete values from one pandas series that are common to another?

Time:10-30

So I have a specific problem that needs to be solved. I need to DELETE elements present in one pandas series (ser1) that are common to another pandas series (ser2).

I have tried a bunch of things that do not work, and the closest thing I was able to find was NumPy's np.intersect1d() function for arrays. This works to find the common values, but when I try to drop the indexes that are equal to these values, I get a bunch of errors.

I've tried a bunch of other things that did not really work and have been at it for 3 hours now so about to give up.

Here are the two series:

import pandas as pd

ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

The result should be:

print(ser1)
0   1
1   2
2   3

I am sure there is a simple solution. Could anyone help me with this? I am desperate. Please don't downvote my question if you know a similar question has been asked before, because I have looked for 3 hours and haven't found shit, so I would be pretty pissed. Thank you for understanding.

CodePudding user response:

Use .isin:

>>> ser1[~ser1.isin(ser2)]
0    1
1    2
2    3
dtype: int64
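Since the question asks to delete the values from ser1 itself, note that boolean indexing returns a new Series, so the result has to be assigned back. A minimal sketch of that step; the reset_index call is an optional addition of mine, for the case where the original index labels are not needed:

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

# isin builds a boolean mask that is True where a value of ser1 occurs in ser2.
mask = ser1.isin(ser2)

# ~ inverts the mask, so only values absent from ser2 are kept.
# Assigning back to ser1 is what makes the "deletion" stick.
ser1 = ser1[~mask]

# Optional: renumber the index from 0 and discard the old labels.
ser1 = ser1.reset_index(drop=True)

print(ser1.tolist())
```

Without the reassignment, ser1 is left unchanged, because the mask only selects rows and does not modify the Series.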

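The NumPy equivalent is np.setdiff1d (not np.intersect1d, which returns the common values instead of removing them). Note that it returns a sorted array of unique values, not a Series: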

>>> np.setdiff1d(ser1, ser2)
array([1, 2, 3])
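The two approaches agree on the example data, but they are not interchangeable in general. The sketch below uses made-up input (unsorted, with a repeated value) to show where they differ:

```python
import numpy as np
import pandas as pd

# Hypothetical data chosen to expose the difference: unsorted, with a repeated 1.
ser1 = pd.Series([3, 1, 1, 5, 2])
ser2 = pd.Series([5, 6])

# Mask-based filtering keeps order, duplicates and the original index labels.
kept = ser1[~ser1.isin(ser2)]

# setdiff1d treats the inputs as sets: the result is a sorted, de-duplicated ndarray.
diff = np.setdiff1d(ser1, ser2)

print(kept.tolist())        # values in their original order, repeated 1 retained
print(list(kept.index))     # original labels of the surviving rows
print(diff.tolist())        # sorted unique values
```

If the order, the duplicates or the index matter, the .isin mask is probably the safer choice; np.setdiff1d fits when only the set of remaining values is needed.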

CodePudding user response:

A NumPy alternative is np.isin:

import pandas as pd
import numpy as np

ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

res = ser1[~np.isin(ser1, ser2)]
print(res)

Micro-Benchmark

import pandas as pd
import numpy as np
ser1 = pd.Series([1, 2, 3, 4, 5] * 100)
ser2 = pd.Series([4, 5, 6, 7, 8] * 10)
%timeit res = ser1[~np.isin(ser1, ser2)]
136 µs ± 2.56 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit res = ser1[~ser1.isin(ser2)]
209 µs ± 1.66 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit pd.Index(ser1).difference(ser2).to_series()
277 µs ± 1.31 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
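%timeit only works inside IPython or Jupyter. As a rough sketch of how the same comparison could be run from a plain Python script, the standard timeit module can be used; the candidates dictionary, the repeat counts and the reporting format are my own choices, and absolute numbers will vary by machine and library version:

```python
import timeit

import numpy as np
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4, 5] * 100)
ser2 = pd.Series([4, 5, 6, 7, 8] * 10)

# The three expressions from the benchmark, wrapped as zero-argument callables.
candidates = {
    "np.isin": lambda: ser1[~np.isin(ser1, ser2)],
    "Series.isin": lambda: ser1[~ser1.isin(ser2)],
    "Index.difference": lambda: pd.Index(ser1).difference(ser2).to_series(),
}

timings = {}
for name, func in candidates.items():
    # Run 200 calls per repeat, 3 repeats, and keep the best per-call time in seconds.
    best = min(timeit.repeat(func, number=200, repeat=3)) / 200
    timings[name] = best
    print(f"{name}: {best * 1e6:.1f} µs per call")
```

With inputs this small, the differences are dominated by fixed overhead, so it is worth rerunning with data of a realistic size before picking a method on speed alone.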

CodePudding user response:

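You can use set operations via pd.Index.difference. I am not sure of the speed though, compared to isin. Note that the result is indexed by the values themselves, so the original index of ser1 is not kept: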

pd.Index(ser1).difference(ser2).to_series()
Out[35]: 
1    1
2    2
3    3
dtype: int64
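One caveat with the set-based approach: Index.difference works on unique values, so on data with repeats it does not return the same rows as the .isin mask. A small sketch using the benchmark inputs from the previous answer to make that visible:

```python
import pandas as pd

# Same inputs as the micro-benchmark: every value in ser1 appears 100 times.
ser1 = pd.Series([1, 2, 3, 4, 5] * 100)
ser2 = pd.Series([4, 5, 6, 7, 8] * 10)

# Mask-based filtering keeps every matching row of ser1.
by_mask = ser1[~ser1.isin(ser2)]

# Set difference collapses repeats and uses the values as the new index.
by_set = pd.Index(ser1).difference(ser2).to_series()

print(len(by_mask))            # one entry per surviving row
print(len(by_set))             # one entry per distinct surviving value
print(by_set.index.tolist())   # index labels are the values themselves
```

So the two are equivalent only when ser1 has no duplicate values and its index does not need to be preserved.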