I have a table of logins and logouts by user.
The table looks like this, but has a few hundred thousand rows:
import time
import pandas as pd

data = [['aa', '2020-05-31 00:00:01', '2020-05-31 00:00:31'],
['bb','2020-05-31 00:01:01', '2020-05-31 00:02:01'],
['aa','2020-05-31 00:02:01', '2020-05-31 00:06:03'],
['cc','2020-05-31 00:03:01', '2020-05-31 00:04:01'],
['dd','2020-05-31 00:04:01', '2020-05-31 00:34:01'],
['aa', '2020-05-31 00:05:01', '2020-05-31 00:07:31'],
['bb','2020-05-31 00:05:01', '2020-05-31 00:06:01'],
['aa','2020-05-31 00:05:01', '2020-05-31 00:08:03'],
['cc','2020-05-31 00:10:01', '2020-05-31 00:40:01'],
['dd','2020-05-31 00:20:01', '2020-05-31 00:35:01']]
df_test = pd.DataFrame(data, columns=['user_id', 'login', 'logout'])
df_test['login'] = pd.to_datetime(df_test['login'])
df_test['logout'] = pd.to_datetime(df_test['logout'])
I was able to solve this problem in a hacky way using a for loop. It works fine on a smaller dataset but takes hours on 300k rows.
Basically, this code calculates, for each session (each row), how many other users were logged in at the same time.
Here is my solution; it gives the result I need. I was also able to do the same with apply and a lambda, but that takes even longer.
# create a new column for simultaneous
df_test['simultaneous'] = 0

start_time = time.time()

# loop through dataframe and check condition
for i in df_test.index:
    login, logout = df_test.loc[i, 'login'], df_test.loc[i, 'logout']
    this_index = df_test.index.isin([i])
    df_test.loc[i, 'simultaneous'] = int(sum(
        (df_test[~this_index]['login'] <= logout) & (df_test[~this_index]['logout'] >= login)
    ))

print("--- %s seconds ---" % (time.time() - start_time))
Could you please take a look and let me know if there is a much better way of getting the same result? Maybe I'm missing something obvious.
Thanks in advance!
CodePudding user response:
This algorithm takes a streaming approach, relying on the fact that the data is sorted by login time. For each session, it keeps track of all sessions whose logout time hasn't yet passed (by simply storing the logout times in a list). I decided to count sess1.logout == sess2.login as occurring simultaneously, but you can change the >= to > if you disagree.
#!/usr/bin/python
import datetime
import random
import time
from statistics import mean, stdev
def generate_data(numsessions):
    start_time = datetime.datetime(2020, 5, 13, 0, 0, 1)
    data = []
    while len(data) < numsessions:
        # start between 0 and 3 sessions each minute
        for cnt in range(random.choice([0, 0, 0, 1, 1, 2, 3])):
            user_id = chr(ord("a") + cnt) * 2
            duration = random.choice([30, 30, 60, 90, 90, 900, 1800])
            logout_time = start_time + datetime.timedelta(seconds=duration)
            data.append(
                (
                    user_id,
                    start_time.strftime("%Y-%m-%d %H:%M:%S"),
                    logout_time.strftime("%Y-%m-%d %H:%M:%S"),
                )
            )
        start_time += datetime.timedelta(minutes=1)
    return data
start_time = time.time()
print("generating data")
data = generate_data(3 * 1e5)
print(f"sample data=[{data[0]}]")
print("--- %.2f seconds ---" % (time.time() - start_time))
start_time = time.time()
print("calculating simultanous sessions")
active_sessions = []
simultaneous_sessions = []
for user_id, login, logout in data:
    active_sessions = [ts for ts in active_sessions if ts >= login]
    simultaneous_sessions.append(len(active_sessions))
    active_sessions.append(logout)

print(
    f"for {len(simultaneous_sessions)} sessions have max={max(simultaneous_sessions)} "
    f"min={min(simultaneous_sessions)} mean={mean(simultaneous_sessions):.2f} stdev={stdev(simultaneous_sessions):.2f} "
)
print("--- %.2f seconds ---" % (time.time() - start_time))
From a performance perspective, I walk the list only once, and while I rebuild the active_sessions list on every iteration, that stays cheap as long as active_sessions is small. There are other optimizations you could make with a more efficient active_sessions structure, but this should be much faster than searching all of the data for every session. Even if the data weren't sorted by login time, it would still be more efficient to sort by login time and then run this algorithm than to scan every session for every session.
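If active_sessions ever grew large, a min-heap of logout times would make that cleanup cheaper than rebuilding the list each iteration. Here is a minimal sketch of the same idea applied to df_test from the question; the sort step and the 'simultaneous' column name are borrowed from the question, and the count kept here is "sessions still open at this login", the same definition as above (untested sketch):

import heapq

df_sorted = df_test.sort_values('login')

active = []   # min-heap of logout times for sessions still in progress
counts = []
for login, logout in zip(df_sorted['login'], df_sorted['logout']):
    # drop sessions that ended before this one started
    while active and active[0] < login:
        heapq.heappop(active)
    counts.append(len(active))
    heapq.heappush(active, logout)

df_sorted['simultaneous'] = counts
df_test = df_sorted.sort_index()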
UPDATE: I've added a synthetic data generator, which creates a bunch of sessions based on some random variables. It shows that this algorithm takes less than a second for 300k rows.
For 300k sessions it takes 0.4 seconds:
generating data
sample data=[('aa', '2020-05-13 00:01:01', '2020-05-13 00:31:01')]
--- 1.92 seconds ---
calculating simultaneous sessions
for 300001 sessions have max=21 min=0 mean=7.42 stdev=2.78
--- 0.40 seconds ---
For 3 million sessions it takes ~4 seconds:
generating data
sample data=[('aa', '2020-05-13 00:00:01', '2020-05-13 00:00:31')]
--- 20.85 seconds ---
calculating simultaneous sessions
for 3000001 sessions have max=26 min=0 mean=7.43 stdev=2.77
--- 4.08 seconds ---
The pass over the data is O(N) in the number of sessions.
CodePudding user response:
Try this solution. On your data repeated 30,000 times (about 300k rows) it took ~1900 seconds to compute the result (AMD 3700X / Python 3.9.7), but I'm not sure how it will perform on real data:
mn = df_test["login"].min()
mx = df_test["logout"].max()
tmp = pd.Series(0, index=pd.date_range(mn, mx, freq="S"), dtype=object)
def fn1(x):
tmp[x["login"] : x["logout"]] = [
v | (1 << x.name) for v in tmp[x["login"] : x["logout"]]
]
def fn2(x):
out = 0
for v in tmp[x["login"] : x["logout"]]:
out |= v
# If you use Python 3.10 you can use this answer
# https://stackoverflow.com/a/64848298/10035985
# which should be ~6x faster instead of this:
return bin(out).count("1") - 1
df_test.apply(fn1, axis=1)
df_test["sim"] = df_test.apply(fn2, axis=1)