I have adf
that looks like this:
date | stock1 | stock2 | stock3 | stock4 | stock5 | stock6 | stock7 | stock8 | stock9 | stock10 |
---|---|---|---|---|---|---|---|---|---|---|
10/20 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 0.9 |
11/20 | 0.1 | 0.9 | 0.3 | 0.4 | 0.3 | 0.5 | 0.3 | 0.2 | 0.4 | 0.1 |
12/20 | 0.1 | 0.6 | 0.9 | 0.5 | 0.6 | 0.7 | 0.8 | 0.7 | 0.9 | 0.1 |
10/20 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 0.9 |
11/20 | 0.8 | 0.9 | 0.3 | 0.4 | 0.3 | 0.5 | 0.3 | 0.2 | 0.9 | 0.1 |
12/20 | 0.3 | 0.6 | 0.9 | 0.5 | 0.6 | 0.7 | 0.8 | 0.7 | 0.9 | 0.1 |
10/20 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.7 | 0.7 | 0.8 | 0.9 | 0.9 |
11/20 | 0.8 | 0.9 | 0.3 | 0.4 | 0.3 | 0.7 | 0.3 | 0.2 | 0.4 | 0.1 |
12/20 | 0.3 | 0.6 | 0.9 | 0.5 | 0.6 | 0.7 | 0.8 | 0.7 | 0.9 | 0.1 |
I want to delete all columns for which the same value repeats, consecutively, more than 3 times. In this example, the columns "stock1", "stock6" and "stock9" should be deleted. In the other columns, we have repeating values more than 3 times, but not one after the other. I think I can adapt the code from that question Removing values that repeat more than 5 times in Pandas DataFrame, but I could not manage to do that yet.
CodePudding user response:
You can set "date" aside as index, then check if the rows are different from the next one as use it to groupby
cumcount
.
Then compute the max count per column, if greater than N-1, drop the column:
df2 = df.set_index('date')
N = 3
df2.loc[:, df2.apply(lambda c: c.groupby(c.ne(c.shift()).cumsum()).cumcount()).max().lt(N-1)]
output:
stock2 stock3 stock4 stock5 stock7 stock8 stock10
date
10/20 0.2 0.3 0.4 0.5 0.7 0.8 0.9
11/20 0.9 0.3 0.4 0.3 0.3 0.2 0.1
12/20 0.6 0.9 0.5 0.6 0.8 0.7 0.1
10/20 0.2 0.3 0.4 0.5 0.7 0.8 0.9
11/20 0.9 0.3 0.4 0.3 0.3 0.2 0.1
12/20 0.6 0.9 0.5 0.6 0.8 0.7 0.1
10/20 0.2 0.3 0.4 0.5 0.7 0.8 0.9
11/20 0.9 0.3 0.4 0.3 0.3 0.2 0.1
12/20 0.6 0.9 0.5 0.6 0.8 0.7 0.1
intermediate count of successive values:
>>> df2.apply(lambda c: c.groupby(c.ne(c.shift()).cumsum()).cumcount())
stock1 stock2 stock3 stock4 stock5 stock6 stock7 stock8 stock9 stock10
date
10/20 0 0 0 0 0 0 0 0 0 0
11/20 1 0 1 1 0 0 0 0 0 0
12/20 2 0 0 0 0 0 0 0 0 1
10/20 3 0 0 0 0 0 0 0 1 0
11/20 0 0 1 1 0 0 0 0 2 0
12/20 0 0 0 0 0 0 0 0 3 1
10/20 0 0 0 0 0 1 0 0 4 0
11/20 0 0 1 1 0 2 0 0 0 0
12/20 0 0 0 0 0 3 0 0 0 1
CodePudding user response:
You could want avoid apply here:
N = 3
df.loc[:,
df.ne(df.shift()).cumsum()
.stack()
.groupby(level=1)
.value_counts()
.max(level=0).le(N)]