I have my users' event-log data as Unix timestamps, as follows:
import pandas as pd

data = {'id': [533815001, 533815001, 533815001, 533815001, 533815001, 533815001,
               533815003, 533815003, 533815003, 533815003, 533815003, 533815003],
        'start_time': [1459181400, 1459182480, 1459185780, 1459494634, 1459496914, 1459498354,
                       1459837648, 1459838908, 1459840108, 1461578728, 1461579508, 1461581428]
        }
df = pd.DataFrame(data)
df
id start_time
0 533815001 1459181400
1 533815001 1459182480
2 533815001 1459185780
3 533815001 1459494634
4 533815001 1459496914
5 533815001 1459498354
6 533815003 1459837648
7 533815003 1459838908
8 533815003 1459840108
9 533815003 1461578728
10 533815003 1461579508
11 533815003 1461581428
I want to know the total number of hours each user was active within a given date range. So I add a datetime column derived from start_time, like so:
df['date'] = pd.to_datetime(df.start_time, unit='s')
df
id start_time date
0 533815001 1459181400 2016-03-28 16:10:00
1 533815001 1459182480 2016-03-28 16:28:00
2 533815001 1459185780 2016-03-28 17:23:00
3 533815001 1459494634 2016-04-01 07:10:34
4 533815001 1459496914 2016-04-01 07:48:34
5 533815001 1459498354 2016-04-01 08:12:34
6 533815003 1459837648 2016-04-05 06:27:28
7 533815003 1459838908 2016-04-05 06:48:28
8 533815003 1459840108 2016-04-05 07:08:28
9 533815003 1461578728 2016-04-25 10:05:28
10 533815003 1461579508 2016-04-25 10:18:28
11 533815003 1461581428 2016-04-25 10:50:28
As can be seen, user 533815001 was active on 2016-03-28 from 16:10 through 17:23 (1 hour 13 minutes), and again on 2016-04-01 from 07:10 through 08:12 (1 hour 2 minutes). Similarly, user 533815003 was logged in on 2016-04-05 from 06:27 through 07:08 (41 minutes), and on 2016-04-25 from 10:05 through 10:50 (45 minutes).
- If I want to know the number of hours users spent before 2016-03-31, that would be 1h 13min (only user 533815001 was active).
- Total hours users were active until 2016-04-02 would then be 1h 13min + 1h 02min = 2h 15min.
- Between 2016-04-01 and 2016-04-10 it would be 1h 02min + 41min = 1h 43min.
- And so on...
How do I do this kind of analysis?
CodePudding user response:
It seems like you want to aggregate at the user and date level. For that I suggest adding a date column as well:
df['datetime'] = pd.to_datetime(df.start_time, unit='s')
df['date'] = df['datetime'].dt.strftime('%Y-%m-%d')  # same 'YYYY-MM-DD' string, without the lambda
id start_time datetime date
0 533815001 1459181400 2016-03-28 16:10:00 2016-03-28
1 533815001 1459182480 2016-03-28 16:28:00 2016-03-28
2 533815001 1459185780 2016-03-28 17:23:00 2016-03-28
3 533815001 1459494634 2016-04-01 07:10:34 2016-04-01
4 533815001 1459496914 2016-04-01 07:48:34 2016-04-01
5 533815001 1459498354 2016-04-01 08:12:34 2016-04-01
6 533815003 1459837648 2016-04-05 06:27:28 2016-04-05
7 533815003 1459838908 2016-04-05 06:48:28 2016-04-05
8 533815003 1459840108 2016-04-05 07:08:28 2016-04-05
9 533815003 1461578728 2016-04-25 10:05:28 2016-04-25
10 533815003 1461579508 2016-04-25 10:18:28 2016-04-25
11 533815003 1461581428 2016-04-25 10:50:28 2016-04-25
Now you can group by id and date to get each user's earliest and latest timestamp on each day, and take the difference as that day's duration:
dfx = df.groupby(['id', 'date']).agg({'datetime': ['min', 'max']}).reset_index()
dfx['duration'] = dfx['datetime']['max'] - dfx['datetime']['min']
dfx
id date (datetime, min) (datetime, max) duration
0 533815001 2016-03-28 2016-03-28 16:10:00 2016-03-28 17:23:00 0 days 01:13:00
1 533815001 2016-04-01 2016-04-01 07:10:34 2016-04-01 08:12:34 0 days 01:02:00
2 533815003 2016-04-05 2016-04-05 06:27:28 2016-04-05 07:08:28 0 days 00:41:00
3 533815003 2016-04-25 2016-04-25 10:05:28 2016-04-25 10:50:28 0 days 00:45:00
You can then aggregate this duration per user as well, if you want:
dfx.groupby('id').agg({('duration', ''): 'sum'})
id duration
533815001 0 days 02:15:00
533815003 0 days 01:26:00
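To answer the date-range questions from the original post, you can filter these per-day durations by date before summing. A minimal sketch, assuming the dfx frame built above (the column names are the tuple keys the groupby produced, and the 'YYYY-MM-DD' strings in the date column compare lexicographically in date order):
# total active time across all users between 2016-04-01 and 2016-04-10
start, end = '2016-04-01', '2016-04-10'
mask = (dfx[('date', '')] >= start) & (dfx[('date', '')] <= end)
dfx.loc[mask, ('duration', '')].sum()
# Timedelta('0 days 01:43:00')
which matches the 1h 43min worked out in the question.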
CodePudding user response:
IIUC, you can use groupby with a custom pd.Grouper, then keep the min and max entries for each day and compute the difference:
out = (df.groupby(['id', pd.Grouper(freq='D', key='date')])['date']
.agg(['min', 'max']).diff(axis=1)['max']
.rename('active_time').reset_index())
print(out)
# Output
id date active_time
0 533815001 2016-03-28 0 days 01:13:00
1 533815001 2016-04-01 0 days 01:02:00
2 533815003 2016-04-05 0 days 00:41:00
3 533815003 2016-04-25 0 days 00:45:00
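The date-range totals asked for in the question then reduce to boolean filtering on out before summing. A minimal sketch reusing the out frame above (the date column is datetime64, so the string dates are parsed automatically in the comparisons):
# total active time before 2016-03-31
out.loc[out['date'] < '2016-03-31', 'active_time'].sum()
# Timedelta('0 days 01:13:00')

# total active time between 2016-04-01 and 2016-04-10 (inclusive)
in_range = out['date'].between('2016-04-01', '2016-04-10')
out.loc[in_range, 'active_time'].sum()
# Timedelta('0 days 01:43:00')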