Create table of records from subsets of a dataframe in python using datetime column-CodePudding

Be the following pandas DataFrame in Python:

ID	date	direction
0	2022-01-03 10:00:01	IN
0	2022-01-03 11:00:01	IN
0	2022-01-03 11:10:01	OUT
0	2022-01-03 12:00:03	IN
0	2022-01-03 14:32:01	OUT
1	2022-01-03 10:32:01	OUT
1	2022-01-04 11:32:01	IN
1	2022-01-04 14:32:01	OUT
2	2022-01-02 08:00:01	OUT
2	2022-01-02 08:02:01	IN

I need to get the check-in and check-out for each group by ID. Considering that an entry_date is registered by direction = IN and an exit_date by direction = OUT. The idea is to create the record as follows:

ID	entry_date	exit_date
0	2022-01-03 10:00:01	2022-01-03 11:10:01
0	2022-01-03 12:00:03	2022-01-03 14:32:01
1	2022-01-04 11:32:01	2022-01-04 14:32:01

They are only recorded if a row with IN appears first and closes the record with a row with OUT as value for direction column (there may be more rows in between with a value of direction IN, as can be seen in the example).

I hope you can help me, thank you in advance.

CodePudding user response：

I'm not sure I completely understand, but here's an attempt.

After sorting along the date column (date should be datetime): Group on ID and then eliminate in three steps all rows that don't fit into the required pattern: 1. eliminate OUT-blocks at the beginning of every group. 2. reduce consecutive blocks of IN/OUTs in each group to the first one. 3. Remove the the last row of the group if it is an IN-row. The rest is essentially splitting the resulting dataframe in 2 interlaced parts and concatenating them next to another.

def connect(df):
    df = df[df["direction"].cummax()]
    df = df[df["direction"].diff().fillna(True)]
    if df.shape[0] % 2:
        df = df.iloc[:-1]
    return df

result = (
    df
    .assign(direction=df["direction"].replace({"IN": True, "OUT": False}))
    .sort_values(["ID", "date"])
    .groupby("ID").apply(connect)
    .drop(columns="direction")
    .droplevel(-1, axis=0)
)
result = pd.concat(
    [result.iloc[0::2], result[["date"]].iloc[1::2]], axis=1
).reset_index(drop=True)
result.columns = ["ID", "entry_date", "exit_date"]

Result:

   ID          entry_date           exit_date
0   0 2022-01-03 10:00:01 2022-01-03 11:10:01
1   0 2022-01-03 12:00:03 2022-01-03 14:32:01
2   1 2022-01-04 11:32:01 2022-01-04 14:32:01

Another solution would be to group by ID, and then, in each group, split the directions in 2 stacks, the ins and outs, and run through them to gather the connections:

def connect(df):
    mask = df["direction"].eq("IN")
    ins = list(reversed(df.loc[mask, "date"].values))
    outs = list(reversed(df.loc[~mask, "date"].values))
    connects = []
    if ins:
        dt_out = ins[-1]   pd.Timedelta(days=-1)
    while ins and outs:
        dt_in = None
        while ins:
            dt = ins.pop()
            if dt > dt_out:
                dt_in = dt
                break
        while dt_in and outs:
            dt_out = outs.pop()
            if dt_in < dt_out:
                connects.append([dt_in, dt_out])
                break
    return pd.DataFrame(
        connects, columns=["entry_date", "exit_date"]
    )

result = (
    df.sort_values(["ID", "date"]).groupby("ID").apply(connect)
      .droplevel(1, axis=0).reset_index()
)