Delete all columns for which a value repeats consecutively more than 3 times

Time:02-21

I have a DataFrame that looks like this:

date stock1 stock2 stock3 stock4 stock5 stock6 stock7 stock8 stock9 stock10
10/20 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.9
11/20 0.1 0.9 0.3 0.4 0.3 0.5 0.3 0.2 0.4 0.1
12/20 0.1 0.6 0.9 0.5 0.6 0.7 0.8 0.7 0.9 0.1
10/20 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.9
11/20 0.8 0.9 0.3 0.4 0.3 0.5 0.3 0.2 0.9 0.1
12/20 0.3 0.6 0.9 0.5 0.6 0.7 0.8 0.7 0.9 0.1
10/20 0.1 0.2 0.3 0.4 0.5 0.7 0.7 0.8 0.9 0.9
11/20 0.8 0.9 0.3 0.4 0.3 0.7 0.3 0.2 0.4 0.1
12/20 0.3 0.6 0.9 0.5 0.6 0.7 0.8 0.7 0.9 0.1

I want to delete all columns in which the same value repeats, consecutively, more than 3 times. In this example, the columns "stock1", "stock6" and "stock9" should be deleted. In the other columns, values also repeat more than 3 times, but not one after the other. I think I can adapt the code from the question "Removing values that repeat more than 5 times in Pandas DataFrame", but I have not managed to do that yet.
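For reference, the sample frame can be rebuilt from the table above (a minimal sketch; parsing the whitespace-separated text with `io.StringIO` is just one convenient way to do it):

```python
import io

import pandas as pd

# Sample data copied verbatim from the table in the question
data = """\
date stock1 stock2 stock3 stock4 stock5 stock6 stock7 stock8 stock9 stock10
10/20 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.9
11/20 0.1 0.9 0.3 0.4 0.3 0.5 0.3 0.2 0.4 0.1
12/20 0.1 0.6 0.9 0.5 0.6 0.7 0.8 0.7 0.9 0.1
10/20 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.9
11/20 0.8 0.9 0.3 0.4 0.3 0.5 0.3 0.2 0.9 0.1
12/20 0.3 0.6 0.9 0.5 0.6 0.7 0.8 0.7 0.9 0.1
10/20 0.1 0.2 0.3 0.4 0.5 0.7 0.7 0.8 0.9 0.9
11/20 0.8 0.9 0.3 0.4 0.3 0.7 0.3 0.2 0.4 0.1
12/20 0.3 0.6 0.9 0.5 0.6 0.7 0.8 0.7 0.9 0.1
"""
df = pd.read_csv(io.StringIO(data), sep=" ")
print(df.shape)  # (9, 11)
```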

CodePudding user response:

You can set "date" aside as the index, then flag where each value differs from the previous one and take the cumulative sum of that mask to label consecutive runs; a groupby on those labels with cumcount then counts how many times each value has repeated so far.

Then compute the max count per column; if it is greater than N-1 (i.e., the value occurred more than N times in a row), drop the column:

df2 = df.set_index('date')
N = 3
# a value seen k+1 times in a row reaches cumcount k, so keep only
# the columns whose max cumcount stays below N
df2.loc[:, df2.apply(lambda c: c.groupby(c.ne(c.shift()).cumsum()).cumcount()).max().lt(N)]

output:

        stock2   stock3   stock4   stock5   stock7   stock8   stock10
date                                                                 
10/20       0.2      0.3      0.4      0.5      0.7      0.8      0.9
11/20       0.9      0.3      0.4      0.3      0.3      0.2      0.1
12/20       0.6      0.9      0.5      0.6      0.8      0.7      0.1
10/20       0.2      0.3      0.4      0.5      0.7      0.8      0.9
11/20       0.9      0.3      0.4      0.3      0.3      0.2      0.1
12/20       0.6      0.9      0.5      0.6      0.8      0.7      0.1
10/20       0.2      0.3      0.4      0.5      0.7      0.8      0.9
11/20       0.9      0.3      0.4      0.3      0.3      0.2      0.1
12/20       0.6      0.9      0.5      0.6      0.8      0.7      0.1

Intermediate result, the per-column count of successive identical values:

>>> df2.apply(lambda c: c.groupby(c.ne(c.shift()).cumsum()).cumcount())

        stock1   stock2   stock3   stock4   stock5   stock6   stock7   stock8   stock9   stock10
date                                                                                            
10/20         0        0        0        0        0        0        0        0        0        0
11/20         1        0        1        1        0        0        0        0        0        0
12/20         2        0        0        0        0        0        0        0        0        1
10/20         3        0        0        0        0        0        0        0        1        0
11/20         0        0        1        1        0        0        0        0        2        0
12/20         0        0        0        0        0        0        0        0        3        1
10/20         0        0        0        0        0        1        0        0        4        0
11/20         0        0        1        1        0        2        0        0        0        0
12/20         0        0        0        0        0        3        0        0        0        1
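The mechanics behind this table are easiest to see on a single Series (a minimal sketch; `s`, `run_id` and `run_pos` are illustrative names):

```python
import pandas as pd

s = pd.Series([0.1, 0.1, 0.1, 0.1, 0.8, 0.3])
run_id = s.ne(s.shift()).cumsum()       # new label each time the value changes
run_pos = s.groupby(run_id).cumcount()  # 0-based position within each run

print(run_id.tolist())   # [1, 1, 1, 1, 2, 3]
print(run_pos.tolist())  # [0, 1, 2, 3, 0, 0]
```

A max `run_pos` of 3 means the value occurred 4 times in a row, so with N = 3 this column would be dropped.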

CodePudding user response:

You may want to avoid apply here:

N = 3
df.loc[:,
   df.ne(df.shift()).cumsum()   # label consecutive runs in every column
     .stack()
     .groupby(level=1)          # group by column name
     .value_counts()            # length of each run
     .groupby(level=0).max()    # longest run per column
     .le(N)]                    # keep columns whose longest run is <= N

Note: the original shortcut Series.max(level=0) is deprecated (removed in pandas 2.0), hence the explicit groupby(level=0).max().
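Putting this approach to work on the sample data (a sketch; the frame is rebuilt inline so the snippet is self-contained):

```python
import io

import pandas as pd

data = """\
date stock1 stock2 stock3 stock4 stock5 stock6 stock7 stock8 stock9 stock10
10/20 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.9
11/20 0.1 0.9 0.3 0.4 0.3 0.5 0.3 0.2 0.4 0.1
12/20 0.1 0.6 0.9 0.5 0.6 0.7 0.8 0.7 0.9 0.1
10/20 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.9
11/20 0.8 0.9 0.3 0.4 0.3 0.5 0.3 0.2 0.9 0.1
12/20 0.3 0.6 0.9 0.5 0.6 0.7 0.8 0.7 0.9 0.1
10/20 0.1 0.2 0.3 0.4 0.5 0.7 0.7 0.8 0.9 0.9
11/20 0.8 0.9 0.3 0.4 0.3 0.7 0.3 0.2 0.4 0.1
12/20 0.3 0.6 0.9 0.5 0.6 0.7 0.8 0.7 0.9 0.1
"""
df = pd.read_csv(io.StringIO(data), sep=" ")

N = 3
# run lengths per column, then the longest run per column
runs = df.ne(df.shift()).cumsum().stack().groupby(level=1).value_counts()
keep = runs.groupby(level=0).max().le(N)
out = df.loc[:, keep]

print(sorted(set(df.columns) - set(out.columns)))  # ['stock1', 'stock6', 'stock9']
```

The dropped columns match the ones the question asks to delete; "date" survives because its values never repeat more than 3 times in a row.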