Home > Software design >  How to calculate the time difference between every 2 adjacent datapoints by each user
How to calculate the time difference between every 2 adjacent datapoints by each user

Time:12-22

I have the following dataframe:

df = pd.DataFrame(
    {'user_id': [53, 53, 53, 53, 53, 53, 53, 53, 54, 54, 54, 54, 54, 54, 54], 
     'timestamp': [10, 15, 20, 25, 30, 31, 34, 37, 14, 16, 18, 20, 22, 25, 28], 
     'activity': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A',
                  'D', 'D', 'D', 'D', 'D', 'D', 'D']}
)

df
    user_id     timestamp   activity
0    53            10         A
1    53            15         A
2    53            20         A
3    53            25         A
4    53            30         A
5    53            31         A
6    53            34         A
7    53            37         A
8    54            14         D
9    54            16         D
10   54            18         D
11   54            20         D
12   54            22         D
13   54            25         D
14   54            28         D

I want to calculate the time difference between every 2 adjacent datapoints (rows) in each user_id and plot the CDF, per activity. Assuming each user starts new activity from 0 seconds. timestamp column represents unix timestamp, I give last 2 digits for brevity.

Target df (required result):

    user_id     timestamp   activity    timestamp_diff
0    53           10          A              0
1    53           15          A              5
2    53           20          A              5
3    53           25          A              5
4    53           30          A              5
5    53           31          A              1
6    53           34          A              3
7    53           37          A              3
8    54           14          D              0
9    54           16          D              2
10   54           18          D              2
11   54           20          D              2
12   54           22          D              2
13   54           25          D              3
14   54           28          D              3

My attempts (to calculate the time differences):

df['shift1'] = df.groupby('user_id')['timestamp'].shift(1, fill_value=0)
df['shift2'] = df.groupby('user_id')['timestamp'].shift(-1, fill_value=0)

df['diff1'] = df.timestamp - df.shift1
df['diff2'] = df.shift2 - df.timestamp

df['shift3'] = df.groupby('user_id')['timestamp'].shift(-1)
df['shift3'].fillna(method='ffill', inplace=True)
df['diff3'] = df.shift3 - df.timestamp

df
    user_id     timestamp   activity    shift1  shift2  diff1   diff2   shift3  diff3
0     53           10          A          0       15     10      5       15.0    5.0
1     53           15          A         10       20      5      5       20.0    5.0
2     53           20          A         15       25      5      5       25.0    5.0
3     53           25          A         20       30      5      5       30.0    5.0
4     53           30          A         25       31      5      1       31.0    1.0
5     53           31          A         30       34      1      3       34.0    3.0
6     53           34          A         31       37      3      3       37.0    3.0
7     53           37          A         34        0      3    -37       37.0    0.0
8     54           14          D          0       16     14      2       16.0    2.0
9     54           16          D         14       18      2      2       18.0    2.0
10    54           18          D         16       20      2      2       20.0    2.0
11    54           20          D         18       22      2      2       22.0    2.0
12    54           22          D         20       25      2      3       25.0    3.0
13    54           25          D         22       28      3      3       28.0    3.0
14    54           28          D         25        0      3    -28       28.0    0.0

I cannot reach to the target, none of diff1, diff2 or diff3 columns match the timestamp_diff.

CodePudding user response:

IIUC you are looking for diff:

df['timestamp_diff'] = df.groupby('user_id')['timestamp'].diff().fillna(0).astype(int)
  • Related