Home > database >  Iterate over multiple columns and replace the values in these columns after a row (increment) with n
Iterate over multiple columns and replace the values in these columns after a row (increment) with n

Time:08-26

Given a dataframe df as follows:

          date      value     20211003     20211010     20211017
0    2021-9-19  3613.9663          NaN          NaN          NaN
1    2021-9-26  3613.0673          NaN          NaN          NaN
2    2021-10-3  3568.1668          NaN          NaN          NaN
3   2021-10-10  3592.1666  3510.221000          NaN          NaN
4   2021-10-17  3572.3662  3465.737012  3534.220800          NaN
5   2021-10-24  3582.6036  3479.107035  3539.856801  3514.420400
6   2021-10-31  3547.3361  3421.161235  3481.911001  3456.474600
7    2021-11-7  3491.5677  3370.140147  3439.284539  3416.621024
8   2021-11-14  3539.1002  3319.289523  3391.930037  3370.079953
9   2021-11-21  3560.3734  3261.343723  3333.984237  3312.134153
10  2021-11-28  3564.0894  3255.328902  3338.967086  3305.054247
11   2021-12-5  3607.4320  3313.274702  3396.912886  3363.000047
12  2021-12-12  3666.3479  3371.220502  3450.172564  3412.234440
13  2021-12-19  3632.3638          NaN  3466.930383  3428.683490
14  2021-12-26  3618.0535          NaN          NaN  3370.737690

Let's say the columns after value column (20211003, 20211010 and 20211017) are rolling forecast result of value, instead of 10 values for each column, I'll need to keep 3 values only and get the result as follows:

         date      value     20211003     20211010     20211017
0   2021-9-19  3613.9663          NaN          NaN          NaN
1   2021-9-26  3613.0673          NaN          NaN          NaN
2   2021-10-3  3568.1668          NaN          NaN          NaN
3  2021-10-10  3592.1666  3510.221000          NaN          NaN
4  2021-10-17  3572.3662  3465.737012  3534.220800          NaN
5  2021-10-24  3582.6036  3479.107035  3539.856801  3514.420400
6  2021-10-31  3547.3361          NaN  3481.911001  3456.474600
7   2021-11-7  3491.5677          NaN          NaN  3416.621024

How could I achieve that in Pandas? Thanks.

CodePudding user response:

Assuming there is a single stretch per column, use a cumcount of the non-NA values to determine which ones to keep, then slice and reindex each column:

cols = ['20211003', '20211010', '20211017']
N = 3

cc = df[cols].apply(lambda s: s.groupby((m:=s.notna())).cumcount().where(m))
m1 = ~cc.min(1).gt(N)
masks = ~cc.ge(N)

out = df.apply(lambda s: s[masks[s.name] if s.name in masks else m1]
                         .reset_index(drop=True))

output:

         date      value     20211003     20211010     20211017
0   2021-9-19  3613.9663          NaN          NaN          NaN
1   2021-9-26  3613.0673          NaN          NaN          NaN
2   2021-10-3  3568.1668          NaN          NaN          NaN
3  2021-10-10  3592.1666  3510.221000          NaN          NaN
4  2021-10-17  3572.3662  3465.737012  3534.220800          NaN
5  2021-10-24  3582.6036  3479.107035  3539.856801  3514.420400
6  2021-10-31  3547.3361          NaN  3481.911001  3456.474600
7   2021-11-7  3491.5677          NaN          NaN  3416.621024
8  2021-11-14  3539.1002          NaN          NaN          NaN
  • Related