I have the following dataframe:
df = pd.DataFrame(
{'user_id': [53, 53, 53, 53, 53, 53, 53, 53, 54, 54, 54, 54, 54, 54, 54],
'timestamp': [10, 15, 20, 25, 30, 31, 34, 37, 14, 16, 18, 20, 22, 25, 28],
'activity': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A',
'D', 'D', 'D', 'D', 'D', 'D', 'D']}
)
df
user_id timestamp activity
0 53 10 A
1 53 15 A
2 53 20 A
3 53 25 A
4 53 30 A
5 53 31 A
6 53 34 A
7 53 37 A
8 54 14 D
9 54 16 D
10 54 18 D
11 54 20 D
12 54 22 D
13 54 25 D
14 54 28 D
I want to calculate the time difference between every
2 adjacent datapoints (rows) in each user_id
and plot the CDF,
per activity
. Assuming each user starts new activity from 0 seconds. timestamp
column represents unix
timestamp, I give last 2 digits for brevity.
Target df
(required result):
user_id timestamp activity timestamp_diff
0 53 10 A 0
1 53 15 A 5
2 53 20 A 5
3 53 25 A 5
4 53 30 A 5
5 53 31 A 1
6 53 34 A 3
7 53 37 A 3
8 54 14 D 0
9 54 16 D 2
10 54 18 D 2
11 54 20 D 2
12 54 22 D 2
13 54 25 D 3
14 54 28 D 3
My attempts (to calculate the time differences):
df['shift1'] = df.groupby('user_id')['timestamp'].shift(1, fill_value=0)
df['shift2'] = df.groupby('user_id')['timestamp'].shift(-1, fill_value=0)
df['diff1'] = df.timestamp - df.shift1
df['diff2'] = df.shift2 - df.timestamp
df['shift3'] = df.groupby('user_id')['timestamp'].shift(-1)
df['shift3'].fillna(method='ffill', inplace=True)
df['diff3'] = df.shift3 - df.timestamp
df
user_id timestamp activity shift1 shift2 diff1 diff2 shift3 diff3
0 53 10 A 0 15 10 5 15.0 5.0
1 53 15 A 10 20 5 5 20.0 5.0
2 53 20 A 15 25 5 5 25.0 5.0
3 53 25 A 20 30 5 5 30.0 5.0
4 53 30 A 25 31 5 1 31.0 1.0
5 53 31 A 30 34 1 3 34.0 3.0
6 53 34 A 31 37 3 3 37.0 3.0
7 53 37 A 34 0 3 -37 37.0 0.0
8 54 14 D 0 16 14 2 16.0 2.0
9 54 16 D 14 18 2 2 18.0 2.0
10 54 18 D 16 20 2 2 20.0 2.0
11 54 20 D 18 22 2 2 22.0 2.0
12 54 22 D 20 25 2 3 25.0 3.0
13 54 25 D 22 28 3 3 28.0 3.0
14 54 28 D 25 0 3 -28 28.0 0.0
I cannot reach to the target, none of diff1, diff2
or diff3
columns match the timestamp_diff
.
CodePudding user response:
IIUC you are looking for diff
:
df['timestamp_diff'] = df.groupby('user_id')['timestamp'].diff().fillna(0).astype(int)