Pandas: rolling total of checked out vs checked in items-CodePudding

I have a large data set that I need to calculate the number of checked out items vs the number of checked in items.

Sample data where rollingTotalCheckedOut describes the expected value. While items are checked out, the number of checked out items increases. When items are checked back in, the number of checked out items decreases.

df = pd.DataFrame([
            ['A', 1624990605, 1627102404, 1],
            ['A', 1624990635, 1625015061, 2],
            ['A', 1624990790, 1624991096, 3],
            ['A', 1624990790, 1624990913, 4],
            ['A', 1624990822, 1624991711, 5],
            ['A', 1624990945, 1624991096, 5],
            ['A', 1624991036, 1624991066, 6],
            ['A', 1624991067, 1624991188, 6],
            ],
        columns = ['ID', 'out_ts', 'in_ts', 'rollingTotalCheckedOut'])

# some helpers
df['checkoutTime'] = pd.to_datetime(df['out_ts'], unit='s', origin='unix')
df['checkinTime'] = pd.to_datetime(df['in_ts'], unit='s', origin='unix')

I am not even sure how to best describe this problem. What is my strategy here / how to frame and tackle this problem? A rolling window does not seem suitable because, in this case, the first row is "checked out" for a very long time.

CodePudding user response：

Here is what I got. not exactly your calculation but I can't immediately see an error. Will check again. But honestly I am not sure why you have a 5 in the end. Previous period ended but new just started.

import pandas as pd
df = pd.DataFrame([
            ['A', 1624990605, 1627102404, 1],
            ['A', 1624990635, 1625015061, 2],
            ['A', 1624990790, 1624991096, 3],
            ['A', 1624990790, 1624990913, 4],
            ['A', 1624990822, 1624991711, 5],
            ['A', 1624990945, 1624991096, 5],
            ['A', 1624991036, 1624991066, 6],
            ['A', 1624991067, 1624991188, 5],
            ],
        columns = ['ID', 'out_ts', 'in_ts', 'rollingTotalCheckedOut'])

df["full_interval"] = df["out_ts"].astype("str")   "_"   df["in_ts"].astype("str")

df_out= df.drop(columns = ["in_ts"])
df_out["ts"] = df_out["out_ts"]
df_out["op"] = "out"
df_out["op_val"] = 1
df_in= df.drop(columns = ["out_ts"])
df_in["ts"] = df_in["in_ts"]
df_in["op"] = "in"
df_in["op_val"] = -1



df_stacked = pd.concat([df_out, df_in]).sort_values("ts")

df_stacked["rollingTotalCheckedOut"] = df_stacked["op_val"].cumsum()
df_stacked = df_stacked.sort_values("out_ts").dropna(subset=["out_ts"])

df = df.merge(df_stacked.loc[:,["ID","full_interval", "rollingTotalCheckedOut"]], how="left", on=["ID", "full_interval"])
df = df.drop(columns=["full_interval"])
df

Output:

ID  out_ts  in_ts   rollingTotalCheckedOut_x    rollingTotalCheckedOut_y
0   A   1624990605  1627102404  1   1
1   A   1624990635  1625015061  2   2
2   A   1624990790  1624991096  3   3
3   A   1624990790  1624990913  4   4
4   A   1624990822  1624991711  5   5
5   A   1624990945  1624991096  5   5
6   A   1624991036  1624991066  6   6
7   A   1624991067  1624991188  5   6