I have my users' event-log data as Unix timestamps, as follows:
import pandas as pd

data = {'id': [533815001, 533815001, 533815001, 533815001, 533815001, 533815001,
               533815003, 533815003, 533815003, 533815003, 533815003, 533815003],
        'start_time': [1459181400, 1459182480, 1459185780, 1459494634, 1459496914, 1459498354,
                       1459837648, 1459838908, 1459840108, 1461578728, 1461579508, 1461581428]
        }
df = pd.DataFrame(data)
df
id start_time
0 533815001 1459181400
1 533815001 1459182480
2 533815001 1459185780
3 533815001 1459494634
4 533815001 1459496914
5 533815001 1459498354
6 533815003 1459837648
7 533815003 1459838908
8 533815003 1459840108
9 533815003 1461578728
10 533815003 1461579508
11 533815003 1461581428
I want to know the total number of hours each user was active within a given date range. So I add a datetime column derived from start_time, like so:
df['date'] = pd.to_datetime(df.start_time, unit='s')
df
id start_time date
0 533815001 1459181400 2016-03-28 16:10:00
1 533815001 1459182480 2016-03-28 16:28:00
2 533815001 1459185780 2016-03-28 17:23:00
3 533815001 1459494634 2016-04-01 07:10:34
4 533815001 1459496914 2016-04-01 07:48:34
5 533815001 1459498354 2016-04-01 08:12:34
6 533815003 1459837648 2016-04-05 06:27:28
7 533815003 1459838908 2016-04-05 06:48:28
8 533815003 1459840108 2016-04-05 07:08:28
9 533815003 1461578728 2016-04-25 10:05:28
10 533815003 1461579508 2016-04-25 10:18:28
11 533815003 1461581428 2016-04-25 10:50:28
As can be seen, user 533815001 was active on 2016-03-28 from 16:10 through 17:23 (1 hour 13 minutes), and again on 2016-04-01 from 07:10 through 08:12 (1 hour 2 minutes). Similarly, user 533815003 was logged in on 2016-04-05 from 06:27 through 07:08 (41 minutes), and on 2016-04-25 from 10:05 through 10:50 (45 minutes).
- If I want to know the number of hours users spent before 2016-03-31, that would be 1h 13min (only user 533815001 was active).
- Total hours users were active until 2016-04-02 would then be 1h 13min + 1h 02min = 2h 15min.
- Between 2016-04-01 and 2016-04-10 it would be 1h 02min + 41min = 1h 43min.
- And so on...
How do I do this kind of analysis?
CodePudding user response:
It seems like you want to aggregate at the user and date level. For that I suggest adding a date column as well:
df['datetime'] = pd.to_datetime(df.start_time, unit='s')
df['date'] = df['datetime'].dt.strftime('%Y-%m-%d')  # same 'YYYY-MM-DD' string, without the lambda
id start_time datetime date
0 533815001 1459181400 2016-03-28 16:10:00 2016-03-28
1 533815001 1459182480 2016-03-28 16:28:00 2016-03-28
2 533815001 1459185780 2016-03-28 17:23:00 2016-03-28
3 533815001 1459494634 2016-04-01 07:10:34 2016-04-01
4 533815001 1459496914 2016-04-01 07:48:34 2016-04-01
5 533815001 1459498354 2016-04-01 08:12:34 2016-04-01
6 533815003 1459837648 2016-04-05 06:27:28 2016-04-05
7 533815003 1459838908 2016-04-05 06:48:28 2016-04-05
8 533815003 1459840108 2016-04-05 07:08:28 2016-04-05
9 533815003 1461578728 2016-04-25 10:05:28 2016-04-25
10 533815003 1461579508 2016-04-25 10:18:28 2016-04-25
11 533815003 1461581428 2016-04-25 10:50:28 2016-04-25
Now you can group by id and date to get each user's earliest and latest timestamp on each day, and take the difference as that day's duration:
dfx = df.groupby(['id', 'date']).agg({'datetime': ['min', 'max']}).reset_index()
dfx['duration'] = dfx['datetime']['max'] - dfx['datetime']['min']
dfx
id date (datetime, min) (datetime, max) duration
0 533815001 2016-03-28 2016-03-28 16:10:00 2016-03-28 17:23:00 0 days 01:13:00
1 533815001 2016-04-01 2016-04-01 07:10:34 2016-04-01 08:12:34 0 days 01:02:00
2 533815003 2016-04-05 2016-04-05 06:27:28 2016-04-05 07:08:28 0 days 00:41:00
3 533815003 2016-04-25 2016-04-25 10:05:28 2016-04-25 10:50:28 0 days 00:45:00
You can then aggregate this duration per user as well, if you want:
dfx.groupby('id').agg({('duration', ''): 'sum'})
id duration
533815001 0 days 02:15:00
533815003 0 days 01:26:00
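To answer the date-range questions from the original post, you can filter these per-day durations by date before summing. A minimal sketch, assuming the dfx frame built above (the column names are the tuple keys the groupby produced, and the 'YYYY-MM-DD' strings in the date column compare lexicographically in date order):
# total active time across all users between 2016-04-01 and 2016-04-10
start, end = '2016-04-01', '2016-04-10'
mask = (dfx[('date', '')] >= start) & (dfx[('date', '')] <= end)
dfx.loc[mask, ('duration', '')].sum()
# Timedelta('0 days 01:43:00')
which matches the 1h 43min worked out in the question.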
CodePudding user response:
IIUC, you can use groupby with a custom pd.Grouper, then keep the min and max entries for each day and compute the difference:
out = (df.groupby(['id', pd.Grouper(freq='D', key='date')])['date']
.agg(['min', 'max']).diff(axis=1)['max']
.rename('active_time').reset_index())
print(out)
# Output
id date active_time
0 533815001 2016-03-28 0 days 01:13:00
1 533815001 2016-04-01 0 days 01:02:00
2 533815003 2016-04-05 0 days 00:41:00
3 533815003 2016-04-25 0 days 00:45:00
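The date-range totals asked for in the question then reduce to boolean filtering on out before summing. A minimal sketch reusing the out frame above (the date column is datetime64, so the string dates are parsed automatically in the comparisons):
# total active time before 2016-03-31
out.loc[out['date'] < '2016-03-31', 'active_time'].sum()
# Timedelta('0 days 01:13:00')

# total active time between 2016-04-01 and 2016-04-10 (inclusive)
in_range = out['date'].between('2016-04-01', '2016-04-10')
out.loc[in_range, 'active_time'].sum()
# Timedelta('0 days 01:43:00')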