Can't update specific rows with different value on specific indexes of the dataframe?

Time:09-22

I'm trying to calculate, within each user group, the time difference from the last row where Flag == 1:

import pandas as pd

df = pd.DataFrame({'user': ['x','x','x','x','x','y','y','y','y','y'],
                   'Flag': [0,0,1,0,1,0,1,0,1,0],
                   'time': [10, 34, 40, 43, 44, 12, 20, 46, 51, 71]})

This is how I compute the difference from the last 1:

(df.assign(mask=df['Flag'].eq(1),
           group=lambda d: d.groupby('user')['mask'].cumsum(),
           # diff from last 1
           diff=lambda d: d.groupby(['user', 'group'])['time'].apply(lambda g: g -(g.iloc[0] if g.name[1]>0 else float('nan'))),
           )
 
   # mask 1s with their own diff
   .assign(diff= lambda d: d['diff'].mask(d['mask'],
                                          (d[d['mask'].groupby(d['user']).cumsum().eq(0)|d['mask']].groupby('user')['time'].diff())
                                         )
          )
)

I expect NaN at indexes 2 and 6 (the first Flag == 1 row for each user), but for some reason I get values there instead:

    user    Flag    time    mask    group   diff
0   x       0       10      False   0       NaN
1   x       0       34      False   0       NaN
2   x       1       40      True    1       6.0
3   x       0       43      False   1       3.0
4   x       1       44      True    2       4.0
5   y       0       12      False   0       NaN
6   y       1       20      True    1       8.0
7   y       0       46      False   1       26.0
8   y       1       51      True    2       31.0
9   y       0       71      False   2       20.0

I tried overwriting those rows with df.loc[[2,6],'diff'] = 1000000000, but got:

    user    Flag    time    diff
0   x       0       10      NaN
1   x       0       34      NaN
2   x       1       40      100000000.0
3   x       0       43      NaN
4   x       1       44      NaN
5   y       0       12      NaN
6   y       1       20      100000000.0
7   y       0       46      NaN
8   y       1       51      NaN
9   y       0       71      NaN

CodePudding user response:

IIUC, the idea is to replace time with NaN wherever Flag is not 1, then group by user and combine shift and ffill to carry the time of the last 1 down to the following rows. Shifting first means a flagged row never sees its own time, so the first 1 in each group stays NaN. Finally, subtract the result from the original time column:

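df['diff'] = (
    df['time']
    # keep time only on rows where Flag == 1, NaN elsewhere
    - df['time'].where(df['Flag'].astype(bool))
        .groupby(df['user'])
        # transform keeps the original index, so the subtraction aligns row by row
        .transform(lambda x: x.shift().ffill())
)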
print(df)
  user  Flag  time  diff
0    x     0    10   NaN
1    x     0    34   NaN
2    x     1    40   NaN
3    x     0    43   3.0
4    x     1    44   4.0
5    y     0    12   NaN
6    y     1    20   NaN
7    y     0    46  26.0
8    y     1    51  31.0
9    y     0    71  20.0
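
The same steps can also be written without a lambda, which makes each intermediate easier to inspect. This is only a sketch of the approach above, not a different method: the names flag_time, last_flag_time and result are made up for illustration, and it assumes the groupby shift and ffill methods behave like their Series counterparts within each user.

```python
import pandas as pd

df = pd.DataFrame({'user': ['x','x','x','x','x','y','y','y','y','y'],
                   'Flag': [0,0,1,0,1,0,1,0,1,0],
                   'time': [10, 34, 40, 43, 44, 12, 20, 46, 51, 71]})

# Time of each flagged row; NaN where Flag != 1.
flag_time = df['time'].where(df['Flag'].eq(1))

# Shift within each user so a flagged row does not see its own time,
# then forward-fill within each user to carry the last flagged time down.
last_flag_time = (
    flag_time.groupby(df['user']).shift()
             .groupby(df['user']).ffill()
)

# assign() returns a new DataFrame; df itself is left unchanged.
result = df.assign(diff=df['time'] - last_flag_time)
print(result)
```

Note that assign returns a new frame rather than modifying df. If the chained assign in the question was never stored back, a later df.loc[[2, 6], 'diff'] = ... would create a fresh diff column on the original df with NaN everywhere else, which looks consistent with the second output in the question. Any manual override would then need to target the stored result instead.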