I have a table df which contains the login and logout times of users:
| Login time | Logout time |
| --- | --- |
| 2022-08-01 11:30:00 | 2022-08-01 11:50:00 |
| 2022-08-01 11:35:00 | 2022-08-01 11:55:00 |
| 2022-08-01 11:35:00 | 2022-08-01 11:57:00 |
I have another table df2 which contains the datetimes at which jobs are created:
| created time |
| --- |
| 2022-08-01 11:45:00 |
| 2022-08-01 11:51:00 |
| 2022-08-01 11:56:00 |
| 2022-08-01 11:57:00 |
| 2022-08-01 12:00:00 |
I am struggling with this and would appreciate any help on how to create the resulting dataframe result_df, where a user counts as online at a given created time if that time falls within their login/logout window (a user whose logout time equals the created time counts as offline):
| created time | Online users | Offline users |
| --- | --- | --- |
| 2022-08-01 11:45:00 | 3 | 0 |
| 2022-08-01 11:51:00 | 2 | 1 |
| 2022-08-01 11:56:00 | 1 | 2 |
| 2022-08-01 11:57:00 | 0 | 3 |
| 2022-08-01 12:00:00 | 0 | 3 |
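For reproducibility, the two tables can be built like this (assuming all columns are parsed as datetimes):

import pandas as pd

# sample data from the tables above, parsed as datetime64
df = pd.DataFrame({
    'Login time': pd.to_datetime([
        '2022-08-01 11:30:00', '2022-08-01 11:35:00', '2022-08-01 11:35:00']),
    'Logout time': pd.to_datetime([
        '2022-08-01 11:50:00', '2022-08-01 11:55:00', '2022-08-01 11:57:00']),
})

df2 = pd.DataFrame({
    'created time': pd.to_datetime([
        '2022-08-01 11:45:00', '2022-08-01 11:51:00', '2022-08-01 11:56:00',
        '2022-08-01 11:57:00', '2022-08-01 12:00:00']),
})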
CodePudding user response:
Here is one way to do it using pandasql. It is possible to do this with a pandas merge as well, but that requires building the Cartesian product of the two DataFrames and then filtering to the rows that meet the criteria (a sketch of that approach is at the end of this answer). If one is familiar with SQL, pandasql makes this simpler to solve.
# https://pypi.org/project/pandasql/
from pandasql import sqldf

pysqldf = lambda q: sqldf(q, globals())

# pandasql (SQLite) is awkward with spaces in column names, so rename first
df = df.rename(columns={'Login time': 'Login_time', 'Logout time': 'Logout_time'})
df2 = df2.rename(columns={'created time': 'created_time'})

# Query to select the rows where the created time falls in the login/logout
# window. The logout boundary is a strict <, not BETWEEN: the expected result
# counts a user as offline at their exact logout time, and BETWEEN is
# inclusive on both ends.
qry = """
select *
from df2
left join df
on df2.created_time >= df.Login_time
and df2.created_time < df.Logout_time
"""

result = pysqldf(qry)  # capture the joined result
# group by to count the logged-in users per created time;
# the left join keeps every created time, and count() ignores the
# NULL Login_time values produced by unmatched rows
df3 = result.groupby(['created_time'])['Login_time'].agg(online_user='count').reset_index()

# offline users = total number of users minus the logged-in users
cnt = df['Login_time'].count()
df3['offline_user'] = cnt - df3['online_user']
df3
          created_time  online_user  offline_user
0  2022-08-01 11:45:00            3             0
1  2022-08-01 11:51:00            2             1
2  2022-08-01 11:56:00            1             2
3  2022-08-01 11:57:00            0             3
4  2022-08-01 12:00:00            0             3
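For comparison, here is a sketch of the plain-pandas cross-merge approach mentioned at the top of this answer. It materialises the full Cartesian product of the two frames, so it is only practical for small data; the underscore column names follow the renamed frames used above.

# cross join (pandas >= 1.2): every created time paired with every session
cart = df2.merge(df, how='cross')

# keep the pairs where the job was created while the user was online
# (login inclusive, logout exclusive, matching the expected output)
online = cart[
    (cart['created_time'] >= cart['Login_time'])
    & (cart['created_time'] < cart['Logout_time'])
]

# count online users per created time; reindex so that created times with
# no online users show up as 0, then derive the offline count
counts = (
    online.groupby('created_time')
    .size()
    .reindex(df2['created_time'], fill_value=0)
    .rename('online_user')
    .reset_index()
)
counts['offline_user'] = len(df) - counts['online_user']
counts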
CodePudding user response:
This is a range join, which is a common type of inequality join. It can be solved with conditional_join from pyjanitor, which is efficient and, for large data, should be more performant than a naive Cartesian join:
# pip install pyjanitor
# for more performance, if you have numba installed,
# you can install the development version:
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor as jn
Compute the range join, then count the matches where each created time from df2 falls within a login/logout window in df:
out = (df
    .conditional_join(
        df2,
        # variadic arguments:
        # (left column, right column, join operator)
        ('Login time', 'created time', '<'),
        ('Logout time', 'created time', '>'),
        how = "inner")
    .groupby('created time')
    .size()
    .rename('online_users'))
out
created time
2022-08-01 11:45:00    3
2022-08-01 11:51:00    2
2022-08-01 11:56:00    1
Name: online_users, dtype: int64
Join back to df2 to get the offline users as well:
(
    df2
    .merge(
        out,
        on = 'created time',
        how = 'left')
    .assign(
        # unmatched created times have no online users
        online_users = lambda d: d.online_users.fillna(0, downcast='infer'),
        # offline = total users in df minus the online count
        offline_users = lambda d: len(df) - d.online_users)
)
          created time  online_users  offline_users
0  2022-08-01 11:45:00             3              0
1  2022-08-01 11:51:00             2              1
2  2022-08-01 11:56:00             1              2
3  2022-08-01 11:57:00             0              3
4  2022-08-01 12:00:00             0              3