I have a large data set that I need to calculate the number of checked out items vs the number of checked in items.
Sample data where rollingTotalCheckedOut
describes the expected value. While items are checked out, the number of checked out items increases. When items are checked back in, the number of checked out items decreases.
df = pd.DataFrame([
['A', 1624990605, 1627102404, 1],
['A', 1624990635, 1625015061, 2],
['A', 1624990790, 1624991096, 3],
['A', 1624990790, 1624990913, 4],
['A', 1624990822, 1624991711, 5],
['A', 1624990945, 1624991096, 5],
['A', 1624991036, 1624991066, 6],
['A', 1624991067, 1624991188, 6],
],
columns = ['ID', 'out_ts', 'in_ts', 'rollingTotalCheckedOut'])
# some helpers
df['checkoutTime'] = pd.to_datetime(df['out_ts'], unit='s', origin='unix')
df['checkinTime'] = pd.to_datetime(df['in_ts'], unit='s', origin='unix')
I am not even sure how to best describe this problem. What is my strategy here / how to frame and tackle this problem? A rolling window does not seem suitable because, in this case, the first row is "checked out" for a very long time.
CodePudding user response:
Here is what I got. not exactly your calculation but I can't immediately see an error. Will check again. But honestly I am not sure why you have a 5 in the end. Previous period ended but new just started.
import pandas as pd
df = pd.DataFrame([
['A', 1624990605, 1627102404, 1],
['A', 1624990635, 1625015061, 2],
['A', 1624990790, 1624991096, 3],
['A', 1624990790, 1624990913, 4],
['A', 1624990822, 1624991711, 5],
['A', 1624990945, 1624991096, 5],
['A', 1624991036, 1624991066, 6],
['A', 1624991067, 1624991188, 5],
],
columns = ['ID', 'out_ts', 'in_ts', 'rollingTotalCheckedOut'])
df["full_interval"] = df["out_ts"].astype("str") "_" df["in_ts"].astype("str")
df_out= df.drop(columns = ["in_ts"])
df_out["ts"] = df_out["out_ts"]
df_out["op"] = "out"
df_out["op_val"] = 1
df_in= df.drop(columns = ["out_ts"])
df_in["ts"] = df_in["in_ts"]
df_in["op"] = "in"
df_in["op_val"] = -1
df_stacked = pd.concat([df_out, df_in]).sort_values("ts")
df_stacked["rollingTotalCheckedOut"] = df_stacked["op_val"].cumsum()
df_stacked = df_stacked.sort_values("out_ts").dropna(subset=["out_ts"])
df = df.merge(df_stacked.loc[:,["ID","full_interval", "rollingTotalCheckedOut"]], how="left", on=["ID", "full_interval"])
df = df.drop(columns=["full_interval"])
df
Output:
ID out_ts in_ts rollingTotalCheckedOut_x rollingTotalCheckedOut_y
0 A 1624990605 1627102404 1 1
1 A 1624990635 1625015061 2 2
2 A 1624990790 1624991096 3 3
3 A 1624990790 1624990913 4 4
4 A 1624990822 1624991711 5 5
5 A 1624990945 1624991096 5 5
6 A 1624991036 1624991066 6 6
7 A 1624991067 1624991188 5 6