Be the following pandas DataFrame in Python:
ID | date | direction |
---|---|---|
0 | 2022-01-03 10:00:01 | IN |
0 | 2022-01-03 11:00:01 | IN |
0 | 2022-01-03 11:10:01 | OUT |
0 | 2022-01-03 12:00:03 | IN |
0 | 2022-01-03 14:32:01 | OUT |
1 | 2022-01-03 10:32:01 | OUT |
1 | 2022-01-04 11:32:01 | IN |
1 | 2022-01-04 14:32:01 | OUT |
2 | 2022-01-02 08:00:01 | OUT |
2 | 2022-01-02 08:02:01 | IN |
I need to get the check-in and check-out for each group by ID. Considering that an entry_date
is registered by direction = IN and an exit_date
by direction = OUT. The idea is to create the record as follows:
ID | entry_date | exit_date |
---|---|---|
0 | 2022-01-03 10:00:01 | 2022-01-03 11:10:01 |
0 | 2022-01-03 12:00:03 | 2022-01-03 14:32:01 |
1 | 2022-01-04 11:32:01 | 2022-01-04 14:32:01 |
They are only recorded if a row with IN appears first and closes the record with a row with OUT as value for direction column (there may be more rows in between with a value of direction IN, as can be seen in the example).
I hope you can help me, thank you in advance.
CodePudding user response:
I'm not sure I completely understand, but here's an attempt.
After sorting along the date
column (date
should be datetime
): Group on ID
and then eliminate in three steps all rows that don't fit into the required pattern: 1. eliminate OUT
-blocks at the beginning of every group. 2. reduce consecutive blocks of IN
/OUT
s in each group to the first one. 3. Remove the the last row of the group if it is an IN
-row. The rest is essentially splitting the resulting dataframe in 2 interlaced parts and concatenating them next to another.
def connect(df):
df = df[df["direction"].cummax()]
df = df[df["direction"].diff().fillna(True)]
if df.shape[0] % 2:
df = df.iloc[:-1]
return df
result = (
df
.assign(direction=df["direction"].replace({"IN": True, "OUT": False}))
.sort_values(["ID", "date"])
.groupby("ID").apply(connect)
.drop(columns="direction")
.droplevel(-1, axis=0)
)
result = pd.concat(
[result.iloc[0::2], result[["date"]].iloc[1::2]], axis=1
).reset_index(drop=True)
result.columns = ["ID", "entry_date", "exit_date"]
Result:
ID entry_date exit_date
0 0 2022-01-03 10:00:01 2022-01-03 11:10:01
1 0 2022-01-03 12:00:03 2022-01-03 14:32:01
2 1 2022-01-04 11:32:01 2022-01-04 14:32:01
Another solution would be to group by ID
, and then, in each group, split the directions
in 2 stacks, the ins
and outs
, and run through them to gather the connections:
def connect(df):
mask = df["direction"].eq("IN")
ins = list(reversed(df.loc[mask, "date"].values))
outs = list(reversed(df.loc[~mask, "date"].values))
connects = []
if ins:
dt_out = ins[-1] pd.Timedelta(days=-1)
while ins and outs:
dt_in = None
while ins:
dt = ins.pop()
if dt > dt_out:
dt_in = dt
break
while dt_in and outs:
dt_out = outs.pop()
if dt_in < dt_out:
connects.append([dt_in, dt_out])
break
return pd.DataFrame(
connects, columns=["entry_date", "exit_date"]
)
result = (
df.sort_values(["ID", "date"]).groupby("ID").apply(connect)
.droplevel(1, axis=0).reset_index()
)