Say I have a pandas Series which contains some runs of duplicate values, like this:
pd.Series([
1,2,3,0,0,4,5,6,0,0,0,0,7,1,1,1,8,9,0,10
])
My goal is to find an easy method for removing, or replacing with NaN, all runs of duplicates that are at least a chosen length L. For instance, with L=3 the series would look like this:
pd.Series([
1,2,3,0,0,4,5,6,nan,nan,nan,nan,7,nan,nan,nan,8,9,0,10
])
(or with the NaNs removed entirely).
Also, it would be interesting to alter the code so that the first element of each run is always kept:
pd.Series([
1,2,3,0,0,4,5,6,0,nan,nan,nan,7,1,nan,nan,8,9,0,10
])
The next challenge would be to have this work efficiently on a DataFrame along axis=0.
Thanks!
Best, JZ
CodePudding user response:
Create consecutive groups by comparing the values against their shift with Series.ne and taking the cumulative sum into a helper Series g. Then count each group's size with Series.map and Series.value_counts, compare it against the threshold with Series.ge (greater or equal), and set the matching positions to NaN with Series.mask:
import pandas as pd

s = pd.Series([1,2,3,0,0,4,5,6,0,0,0,0,7,1,1,1,8,9,0,10])

g = s.ne(s.shift()).cumsum()               # new group id wherever the value changes
s = s.mask(g.map(g.value_counts()).ge(3))  # mask groups of size >= 3
print(s)
0 1.0
1 2.0
2 3.0
3 0.0
4 0.0
5 4.0
6 5.0
7 6.0
8 NaN
9 NaN
10 NaN
11 NaN
12 7.0
13 NaN
14 NaN
15 NaN
16 8.0
17 9.0
18 0.0
19 10.0
dtype: float64
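Making the run-length threshold a variable and dropping the NaNs instead of keeping them, both of which the question asks about, are small variations. A minimal sketch, restarting from the original series:

import pandas as pd

s = pd.Series([1,2,3,0,0,4,5,6,0,0,0,0,7,1,1,1,8,9,0,10])
L = 3                                       # chosen run-length threshold
g = s.ne(s.shift()).cumsum()
out = s.mask(g.map(g.value_counts()).ge(L))
print(out.dropna().reset_index(drop=True))  # drop the masked values entirely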
An alternative solution that keeps the first element of each run chains in Series.duplicated with & (bitwise AND), so only the repeats after a group's first value are masked:
s = pd.Series([1,2,3,0,0,4,5,6,0,0,0,0,7,1,1,1,8,9,0,10])   # restart from the original values
s = s.mask(g.map(g.value_counts()).ge(3) & g.duplicated())   # g.duplicated() spares each run's first row
print(s)
0 1.0
1 2.0
2 3.0
3 0.0
4 0.0
5 4.0
6 5.0
7 6.0
8 0.0
9 NaN
10 NaN
11 NaN
12 7.0
13 1.0
14 NaN
15 NaN
16 8.0
17 9.0
18 0.0
19 10.0
dtype: float64
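Both variants fit in one small helper; mask_long_runs is a hypothetical name, and this is a sketch of the approach above rather than a canonical implementation:

import pandas as pd

def mask_long_runs(s, L=3, keep_first=False):
    g = s.ne(s.shift()).cumsum()        # consecutive-duplicate groups, as above
    m = g.map(g.value_counts()).ge(L)   # True where the run is long enough
    if keep_first:
        m &= g.duplicated()             # spare the first element of each run
    return s.mask(m)

s = pd.Series([1,2,3,0,0,4,5,6,0,0,0,0,7,1,1,1,8,9,0,10])
print(mask_long_runs(s, L=3, keep_first=True))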
The solution changes for multiple columns of a DataFrame. The original post never defines df; a sample frame with three identical columns built from the question's series reproduces the printed output below:
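import pandas as pd

vals = [1,2,3,0,0,4,5,6,0,0,0,0,7,1,1,1,8,9,0,10]
df = pd.DataFrame({'a': vals, 'b': vals, 'c': vals})   # assumed sample data, not from the original post

With that in place, the per-column version is: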
cols = ['a','b','c']
f = lambda x: x.map(x.value_counts())   # run length per cell, column by column
df1 = df.mask(df[cols].ne(df[cols].shift()).cumsum().apply(f).ge(3))
print(df1)
a b c
0 1.0 1.0 1.0
1 2.0 2.0 2.0
2 3.0 3.0 3.0
3 0.0 0.0 0.0
4 0.0 0.0 0.0
5 4.0 4.0 4.0
6 5.0 5.0 5.0
7 6.0 6.0 6.0
8 NaN NaN NaN
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 7.0 7.0 7.0
13 NaN NaN NaN
14 NaN NaN NaN
15 NaN NaN NaN
16 8.0 8.0 8.0
17 9.0 9.0 9.0
18 0.0 0.0 0.0
19 10.0 10.0 10.0
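A variant of the same idea computes the run lengths with groupby(...).transform('size') instead of mapping value_counts. Whether it is faster depends on the data, so treat it as an untested alternative sketch (it assumes cols covers every column of df):

g = df[cols].ne(df[cols].shift()).cumsum()
run_len = g.apply(lambda col: col.groupby(col).transform('size'))   # run length per cell
df1 = df.mask(run_len.ge(3))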
And for the second solution, which keeps the first element of each run:
cols = ['a','b','c']
f = lambda x: x.map(x.value_counts()).ge(3) & x.duplicated()   # mask all but the first of each long run
df1 = df.mask(df[cols].ne(df[cols].shift()).cumsum().apply(f))
print(df1)
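If the mask_long_runs helper sketched earlier is available, the DataFrame case also reduces to a plain column-wise apply, since DataFrame.apply forwards extra keyword arguments to the function; this is a convenience restatement, not a performance claim:

df1 = df.apply(mask_long_runs, L=3, keep_first=True)
print(df1)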