I have a dataframe df
like this
| Time | variable one |
| -----------| -------------|
| 2022-11-09 | 0 |
| 2022-11-10 | 0 |
| 2022-11-11 | 2 |
| 2022-11-12 | 7 |
| 2022-11-13 | 0 |
| 2022-11-14 | 5 |
| 2022-11-15 | 3 |
| 2022-11-16 | 0 |
| 2022-11-17 | 0 |
I need to remove all the zeros before the first non zero element and to remove all the zeros after the last non zero element. Zeros in between non zero elements should remain zero.
I solved with two while loops:
i=0
while df.loc[i,'variable one']==0:
df.loc[i,'variable one'] = np.nan
i=i 1
i=len(df['variable one'])-1
while df.loc[i,'variable one']==0:
df.loc[i,'variable one'] = np.nan
i=i-1
This code works, but when dealing with hundreds of columns and many thousands rows it becomes very slow. I am looking for an optimization, even removing while loops.
CodePudding user response:
You can use a boolean mask for boolean indexing by combining cummax
in a forward and reverse order:
m = df['variable one'].ne(0)
df.loc[~(m.cummax()&m[::-1].cummax()), 'variable one'] = np.nan
# or
# df['variable one'] = df['variable one'].where(m.cummax()&m[::-1].cummax())
Equivalent with cummin
:
m = df['variable one'].eq(0)
df.loc[(m.cummin()|m[::-1].cummin()), 'variable one'] = np.nan
Output:
Time variable one
0 2022-11-09 NaN
1 2022-11-10 NaN
2 2022-11-11 2.0
3 2022-11-12 7.0
4 2022-11-13 0.0
5 2022-11-14 5.0
6 2022-11-15 3.0
7 2022-11-16 NaN
8 2022-11-16 NaN
Intermediates:
Time variable one m cummax rev_cummax & ~
0 2022-11-09 0 False False True False True
1 2022-11-10 0 False False True False True
2 2022-11-11 2 True True True True False
3 2022-11-12 7 True True True True False
4 2022-11-13 0 False True True True False
5 2022-11-14 5 True True True True False
6 2022-11-15 3 True True True True False
7 2022-11-16 0 False True False False True
8 2022-11-16 0 False True False False True