I have a table df which contains the login and logout times of users:
| Login time | Logout time |
| --- | --- |
| 2022-08-01 11:30:00 | 2022-08-01 11:50:00 |
| 2022-08-01 11:35:00 | 2022-08-01 11:55:00 |
| 2022-08-01 11:35:00 | 2022-08-01 11:57:00 |
I have another table df2 which contains the datetimes at which jobs are created:
| created time |
| --- |
| 2022-08-01 11:45:00 |
| 2022-08-01 11:51:00 |
| 2022-08-01 11:56:00 |
| 2022-08-01 11:57:00 |
| 2022-08-01 12:00:00 |
I am struggling with this and would appreciate any help on how to create the resulting dataframe result_df, where a user counts as online at a given created time if that time falls within their login/logout window (a user whose logout time equals the created time counts as offline):
| created time | Online users | Offline users |
| --- | --- | --- |
| 2022-08-01 11:45:00 | 3 | 0 |
| 2022-08-01 11:51:00 | 2 | 1 |
| 2022-08-01 11:56:00 | 1 | 2 |
| 2022-08-01 11:57:00 | 0 | 3 |
| 2022-08-01 12:00:00 | 0 | 3 |
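For reproducibility, the two tables can be built like this (assuming all columns are parsed as datetimes):

import pandas as pd

# sample data from the tables above, parsed as datetime64
df = pd.DataFrame({
    'Login time': pd.to_datetime([
        '2022-08-01 11:30:00', '2022-08-01 11:35:00', '2022-08-01 11:35:00']),
    'Logout time': pd.to_datetime([
        '2022-08-01 11:50:00', '2022-08-01 11:55:00', '2022-08-01 11:57:00']),
})

df2 = pd.DataFrame({
    'created time': pd.to_datetime([
        '2022-08-01 11:45:00', '2022-08-01 11:51:00', '2022-08-01 11:56:00',
        '2022-08-01 11:57:00', '2022-08-01 12:00:00']),
})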
CodePudding user response:
Here is one way to do it using pandasql. It is possible to do this with a pandas merge as well, but that requires building the Cartesian product of the two DataFrames and then filtering to the rows that meet the criteria (a sketch of that approach is at the end of this answer). If one is familiar with SQL, pandasql makes this simpler to solve.
# https://pypi.org/project/pandasql/
from pandasql import sqldf

pysqldf = lambda q: sqldf(q, globals())

# pandasql (SQLite) is awkward with spaces in column names, so rename first
df = df.rename(columns={'Login time': 'Login_time', 'Logout time': 'Logout_time'})
df2 = df2.rename(columns={'created time': 'created_time'})

# Query to select the rows where the created time falls in the login/logout
# window. The logout boundary is a strict <, not BETWEEN: the expected result
# counts a user as offline at their exact logout time, and BETWEEN is
# inclusive on both ends.
qry = """
select *
from df2
left join df
on df2.created_time >= df.Login_time
and df2.created_time < df.Logout_time
"""

result = pysqldf(qry)  # capture the joined result
# group by to count the logged-in users per created time;
# the left join keeps every created time, and count() ignores the
# NULL Login_time values produced by unmatched rows
df3 = result.groupby(['created_time'])['Login_time'].agg(online_user='count').reset_index()

# offline users = total number of users minus the logged-in users
cnt = df['Login_time'].count()
df3['offline_user'] = cnt - df3['online_user']
df3
          created_time  online_user  offline_user
0  2022-08-01 11:45:00            3             0
1  2022-08-01 11:51:00            2             1
2  2022-08-01 11:56:00            1             2
3  2022-08-01 11:57:00            0             3
4  2022-08-01 12:00:00            0             3
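For comparison, here is a sketch of the plain-pandas cross-merge approach mentioned at the top of this answer. It materialises the full Cartesian product of the two frames, so it is only practical for small data; the underscore column names follow the renamed frames used above.

# cross join (pandas >= 1.2): every created time paired with every session
cart = df2.merge(df, how='cross')

# keep the pairs where the job was created while the user was online
# (login inclusive, logout exclusive, matching the expected output)
online = cart[
    (cart['created_time'] >= cart['Login_time'])
    & (cart['created_time'] < cart['Logout_time'])
]

# count online users per created time; reindex so that created times with
# no online users show up as 0, then derive the offline count
counts = (
    online.groupby('created_time')
    .size()
    .reindex(df2['created_time'], fill_value=0)
    .rename('online_user')
    .reset_index()
)
counts['offline_user'] = len(df) - counts['online_user']
counts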
CodePudding user response:
This is a range join, which is a common type of inequality join. It can be solved with conditional_join from pyjanitor, which is efficient and, for large data, should be more performant than a naive Cartesian join:
# pip install pyjanitor
# for more performance, if you have numba installed,
# you can install the development version:
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor as jn
Compute the range join, then count the matches where each created time from df2 falls within a login/logout window in df:
out = (df
    .conditional_join(
        df2,
        # variadic arguments:
        # (left column, right column, join operator)
        ('Login time', 'created time', '<'),
        ('Logout time', 'created time', '>'),
        how = "inner")
    .groupby('created time')
    .size()
    .rename('online_users'))
out
created time
2022-08-01 11:45:00    3
2022-08-01 11:51:00    2
2022-08-01 11:56:00    1
Name: online_users, dtype: int64
Join back to df2 to get the offline users as well:
(
    df2
    .merge(
        out,
        on = 'created time',
        how = 'left')
    .assign(
        # unmatched created times have no online users
        online_users = lambda d: d.online_users.fillna(0, downcast='infer'),
        # offline = total users in df minus the online count
        offline_users = lambda d: len(df) - d.online_users)
)
          created time  online_users  offline_users
0  2022-08-01 11:45:00             3              0
1  2022-08-01 11:51:00             2              1
2  2022-08-01 11:56:00             1              2
3  2022-08-01 11:57:00             0              3
4  2022-08-01 12:00:00             0              3