Say I have a pandas Series which contains some runs of duplicate values, like this:
pd.Series([
1,2,3,0,0,4,5,6,0,0,0,0,7,1,1,1,8,9,0,10
])
My goal is to find an easy method for removing, or replacing with NaN, all runs of duplicates that are at least a chosen length L. For instance, with L=3 the series would look like this:
pd.Series([
1,2,3,0,0,4,5,6,nan,nan,nan,nan,7,nan,nan,nan,8,9,0,10
])
(or with the NaNs removed entirely).
Also, it would be interesting to alter the code so that the first element of each run is always kept:
pd.Series([
1,2,3,0,0,4,5,6,0,nan,nan,nan,7,1,nan,nan,8,9,0,10
])
The next challenge would be to have this work efficiently on a DataFrame along axis=0.
Thanks!
Best, JZ
CodePudding user response:
Create consecutive groups by comparing the values against their shift with Series.ne and taking the cumulative sum into a helper Series g. Then count each group's size with Series.map and Series.value_counts, compare it against the threshold with Series.ge (greater or equal), and set the matching positions to NaN with Series.mask:
import pandas as pd

s = pd.Series([1,2,3,0,0,4,5,6,0,0,0,0,7,1,1,1,8,9,0,10])

g = s.ne(s.shift()).cumsum()               # new group id wherever the value changes
s = s.mask(g.map(g.value_counts()).ge(3))  # mask groups of size >= 3
print(s)
0 1.0
1 2.0
2 3.0
3 0.0
4 0.0
5 4.0
6 5.0
7 6.0
8 NaN
9 NaN
10 NaN
11 NaN
12 7.0
13 NaN
14 NaN
15 NaN
16 8.0
17 9.0
18 0.0
19 10.0
dtype: float64
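Making the run-length threshold a variable and dropping the NaNs instead of keeping them, both of which the question asks about, are small variations. A minimal sketch, restarting from the original series:

import pandas as pd

s = pd.Series([1,2,3,0,0,4,5,6,0,0,0,0,7,1,1,1,8,9,0,10])
L = 3                                       # chosen run-length threshold
g = s.ne(s.shift()).cumsum()
out = s.mask(g.map(g.value_counts()).ge(L))
print(out.dropna().reset_index(drop=True))  # drop the masked values entirely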
An alternative solution that keeps the first element of each run chains in Series.duplicated with & (bitwise AND), so only the repeats after a group's first value are masked:
s = pd.Series([1,2,3,0,0,4,5,6,0,0,0,0,7,1,1,1,8,9,0,10])   # restart from the original values
s = s.mask(g.map(g.value_counts()).ge(3) & g.duplicated())   # g.duplicated() spares each run's first row
print(s)
0 1.0
1 2.0
2 3.0
3 0.0
4 0.0
5 4.0
6 5.0
7 6.0
8 0.0
9 NaN
10 NaN
11 NaN
12 7.0
13 1.0
14 NaN
15 NaN
16 8.0
17 9.0
18 0.0
19 10.0
dtype: float64
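Both variants fit in one small helper; mask_long_runs is a hypothetical name, and this is a sketch of the approach above rather than a canonical implementation:

import pandas as pd

def mask_long_runs(s, L=3, keep_first=False):
    g = s.ne(s.shift()).cumsum()        # consecutive-duplicate groups, as above
    m = g.map(g.value_counts()).ge(L)   # True where the run is long enough
    if keep_first:
        m &= g.duplicated()             # spare the first element of each run
    return s.mask(m)

s = pd.Series([1,2,3,0,0,4,5,6,0,0,0,0,7,1,1,1,8,9,0,10])
print(mask_long_runs(s, L=3, keep_first=True))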
The solution changes for multiple columns of a DataFrame. The original post never defines df; a sample frame with three identical columns built from the question's series reproduces the printed output below:
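import pandas as pd

vals = [1,2,3,0,0,4,5,6,0,0,0,0,7,1,1,1,8,9,0,10]
df = pd.DataFrame({'a': vals, 'b': vals, 'c': vals})   # assumed sample data, not from the original post

With that in place, the per-column version is: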
cols = ['a','b','c']
f = lambda x: x.map(x.value_counts())   # run length per cell, column by column
df1 = df.mask(df[cols].ne(df[cols].shift()).cumsum().apply(f).ge(3))
print(df1)
a b c
0 1.0 1.0 1.0
1 2.0 2.0 2.0
2 3.0 3.0 3.0
3 0.0 0.0 0.0
4 0.0 0.0 0.0
5 4.0 4.0 4.0
6 5.0 5.0 5.0
7 6.0 6.0 6.0
8 NaN NaN NaN
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 7.0 7.0 7.0
13 NaN NaN NaN
14 NaN NaN NaN
15 NaN NaN NaN
16 8.0 8.0 8.0
17 9.0 9.0 9.0
18 0.0 0.0 0.0
19 10.0 10.0 10.0
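A variant of the same idea computes the run lengths with groupby(...).transform('size') instead of mapping value_counts. Whether it is faster depends on the data, so treat it as an untested alternative sketch (it assumes cols covers every column of df):

g = df[cols].ne(df[cols].shift()).cumsum()
run_len = g.apply(lambda col: col.groupby(col).transform('size'))   # run length per cell
df1 = df.mask(run_len.ge(3))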
And for the second solution, which keeps the first element of each run:
cols = ['a','b','c']
f = lambda x: x.map(x.value_counts()).ge(3) & x.duplicated()   # mask all but the first of each long run
df1 = df.mask(df[cols].ne(df[cols].shift()).cumsum().apply(f))
print(df1)
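If the mask_long_runs helper sketched earlier is available, the DataFrame case also reduces to a plain column-wise apply, since DataFrame.apply forwards extra keyword arguments to the function; this is a convenience restatement, not a performance claim:

df1 = df.apply(mask_long_runs, L=3, keep_first=True)
print(df1)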