Calculating the total hours users are logged in for a date range

Time:07-06

I have my data about users' event logs as Unix timestamps, as follows:

import pandas as pd

data = {'id': [533815001,533815001,533815001,533815001,533815001,533815001,
              533815003,533815003,533815003,533815003,533815003,533815003],
 'start_time': [1459181400, 1459182480, 1459185780,1459494634,1459496914,1459498354,
                1459837648, 1459838908, 1459840108,1461578728,1461579508,1461581428,]
}

df = pd.DataFrame(data)
df

        id      start_time
0   533815001   1459181400
1   533815001   1459182480
2   533815001   1459185780
3   533815001   1459494634
4   533815001   1459496914
5   533815001   1459498354
6   533815003   1459837648
7   533815003   1459838908
8   533815003   1459840108
9   533815003   1461578728
10  533815003   1461579508
11  533815003   1461581428

And I want to know the total number of hours users are active in a date range. So I add a datetime column derived from start_time, like so:

df['date'] = pd.to_datetime(df.start_time, unit='s')
df
id  start_time  date
0   533815001   1459181400  2016-03-28 16:10:00
1   533815001   1459182480  2016-03-28 16:28:00
2   533815001   1459185780  2016-03-28 17:23:00
3   533815001   1459494634  2016-04-01 07:10:34
4   533815001   1459496914  2016-04-01 07:48:34
5   533815001   1459498354  2016-04-01 08:12:34
6   533815003   1459837648  2016-04-05 06:27:28
7   533815003   1459838908  2016-04-05 06:48:28
8   533815003   1459840108  2016-04-05 07:08:28
9   533815003   1461578728  2016-04-25 10:05:28
10  533815003   1461579508  2016-04-25 10:18:28
11  533815003   1461581428  2016-04-25 10:50:28

As can be seen, user 533815001 was active on 2016-03-28 from 16:10 through 17:23 (1 hour 13 minutes), and again on 2016-04-01 from 07:10 through 08:12 (1 hour 2 minutes).

Similarly, user 533815003 was logged in on 2016-04-05 from 06:27 through 07:08 (41 minutes), and on 2016-04-25 from 10:05 through 10:50 (45 minutes).

  • If I want to know the number of hours users spent before 2016-03-31, that would be 1 hour 13 minutes (only user 533815001 was active).
  • The total time users were active until 2016-04-02 would then be 1 h 13 min + 1 h 02 min = 2 h 15 min.
  • Between 2016-04-01 and 2016-04-10 it would be 1 h 02 min + 41 min = 1 h 43 min.
  • and so on...

How do I do this kind of analysis?

CodePudding user response:

It seems like you want to aggregate at the user and date level. For that I suggest adding a date column as well:

df['datetime'] = pd.to_datetime(df.start_time, unit='s')
df['date'] = df['datetime'].dt.date  # calendar date only
    id          start_time  datetime             date
0   533815001   1459181400  2016-03-28 16:10:00  2016-03-28
1   533815001   1459182480  2016-03-28 16:28:00  2016-03-28
2   533815001   1459185780  2016-03-28 17:23:00  2016-03-28
3   533815001   1459494634  2016-04-01 07:10:34  2016-04-01
4   533815001   1459496914  2016-04-01 07:48:34  2016-04-01
5   533815001   1459498354  2016-04-01 08:12:34  2016-04-01
6   533815003   1459837648  2016-04-05 06:27:28  2016-04-05
7   533815003   1459838908  2016-04-05 06:48:28  2016-04-05
8   533815003   1459840108  2016-04-05 07:08:28  2016-04-05
9   533815003   1461578728  2016-04-25 10:05:28  2016-04-25
10  533815003   1461579508  2016-04-25 10:18:28  2016-04-25
11  533815003   1461581428  2016-04-25 10:50:28  2016-04-25

Now you can group by id and date to get the min and max timestamp for each user on each day:

dfx = df.groupby(['id', 'date']).agg({'datetime':['min', 'max']}).reset_index()
dfx['duration'] = dfx['datetime']['max'] - dfx['datetime']['min']
dfx
          id        date             datetime                          duration
                                          min                  max
0  533815001  2016-03-28  2016-03-28 16:10:00  2016-03-28 17:23:00  0 days 01:13:00
1  533815001  2016-04-01  2016-04-01 07:10:34  2016-04-01 08:12:34  0 days 01:02:00
2  533815003  2016-04-05  2016-04-05 06:27:28  2016-04-05 07:08:28  0 days 00:41:00
3  533815003  2016-04-25  2016-04-25 10:05:28  2016-04-25 10:50:28  0 days 00:45:00

You can aggregate this duration per user as well if you want:

dfx.groupby('id').agg({('duration', ''): 'sum'})
                  duration
id
533815001  0 days 02:15:00
533815003  0 days 01:26:00
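The question also asks for totals restricted to a date range. One way, building on the same per-day min/max idea, is to filter the events before aggregating and then express the summed Timedelta in hours. A minimal self-contained sketch (the cutoff dates are just the examples from the question):

```python
import pandas as pd

data = {'id': [533815001, 533815001, 533815001, 533815001, 533815001, 533815001,
               533815003, 533815003, 533815003, 533815003, 533815003, 533815003],
        'start_time': [1459181400, 1459182480, 1459185780, 1459494634, 1459496914, 1459498354,
                       1459837648, 1459838908, 1459840108, 1461578728, 1461579508, 1461581428]}
df = pd.DataFrame(data)
df['datetime'] = pd.to_datetime(df.start_time, unit='s')
df['date'] = df['datetime'].dt.date

# Keep only events between 2016-04-01 and 2016-04-10 (inclusive), then
# compute the first-to-last span per user per day, as above.
start, end = pd.Timestamp('2016-04-01'), pd.Timestamp('2016-04-10')
spans = (df[df['datetime'].between(start, end)]
         .groupby(['id', 'date'])['datetime']
         .agg(['min', 'max']))
spans['duration'] = spans['max'] - spans['min']

# Total active time in the range, as a Timedelta and in fractional hours.
total = spans['duration'].sum()
hours = total.total_seconds() / 3600
print(total, round(hours, 2))  # 0 days 01:43:00 1.72
```

This matches the third bullet in the question: 1 h 02 min + 41 min = 1 h 43 min, i.e. about 1.72 hours.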

CodePudding user response:

IIUC, you can use groupby with a custom Grouper then keep the min and the max entries for a day and compute the difference:

out = (df.groupby(['id', pd.Grouper(freq='D', key='date')])['date']
         .agg(['min', 'max']).diff(axis=1)['max']
         .rename('active_time').reset_index())
print(out)

# Output
          id       date     active_time
0  533815001 2016-03-28 0 days 01:13:00
1  533815001 2016-04-01 0 days 01:02:00
2  533815003 2016-04-05 0 days 00:41:00
3  533815003 2016-04-25 0 days 00:45:00
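The cumulative "until a given date" figures from the question can then be read off by filtering `out` on its `date` column and summing. A self-contained sketch reusing the data from the question (the cutoffs are the question's examples):

```python
import pandas as pd

data = {'id': [533815001, 533815001, 533815001, 533815001, 533815001, 533815001,
               533815003, 533815003, 533815003, 533815003, 533815003, 533815003],
        'start_time': [1459181400, 1459182480, 1459185780, 1459494634, 1459496914, 1459498354,
                       1459837648, 1459838908, 1459840108, 1461578728, 1461579508, 1461581428]}
df = pd.DataFrame(data)
df['date'] = pd.to_datetime(df.start_time, unit='s')

# Per-user, per-day active time (first event to last event of the day).
out = (df.groupby(['id', pd.Grouper(freq='D', key='date')])['date']
         .agg(['min', 'max']).diff(axis=1)['max']
         .rename('active_time').reset_index())

# Sum the per-day spans that fall before each cutoff.
before_mar31 = out.loc[out['date'] < '2016-03-31', 'active_time'].sum()
until_apr02 = out.loc[out['date'] < '2016-04-02', 'active_time'].sum()
print(before_mar31)  # 0 days 01:13:00
print(until_apr02)   # 0 days 02:15:00
```

Because the Grouper normalizes `date` to the start of each day, a simple `<` comparison against a cutoff date selects whole days, matching the first two bullets of the question.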