i've already posted here my problem and since then i've been trying to find solution to help me optimise my results , in the previous post ,Yaloa understood what i wanted to do but sadly i always end up in a dead end My previous Post
the fact is that I want to improve my results in order to visualize them This is my dataframe:
ID TimeandDate Date Time
10 2020-08-07 07:40:09 2022-08-07 07:40:09
10 2020-08-07 08:50:00 2022-08-07 08:50:00
10 2020-08-07 12:40:09 2022-08-07 12:40:09
10 2020-08-08 07:40:09 2022-08-08 07:40:09
10 2020-08-08 17:40:09 2022-08-08 17:40:09
12 2020-08-07 08:03:09 2022-08-07 08:03:09
12 2020-08-07 10:40:09 2022-08-07 10:40:09
12 2020-08-07 14:40:09 2022-08-07 14:40:09
12 2020-08-07 16:40:09 2022-08-07 16:40:09
13 2020-08-07 09:22:45 2022-08-07 09:22:45
13 2020-08-07 17:57:06 2022-08-07 17:57:06
first of all the data is collected from time clock , i want to create new dataframe with 2 new columns the first one is df["Check-in"]
, as you can see my data doesnt have any indicator to show what time the id
has checked in , so i will suppose that the first time for every id
is a check-in
, and the next row is a check-out and it will be inserted in df["Check-out"]
, also if a check-in
doesnt have a check-out
time it has to be registred as the check-out
for the previous check-out
of the same day(sometimes id
forgot to check-out
) because it has to be the same number of rows for check-in
and check-out
cant have 2 check-ins
and 3 check-outs
What i've tried ?
what i meant with i need better results is because what i've tried is not the best solution , i took the min
as a check-in
and the max
is a check-out
of time
for every id
without adding the two columns , and after that i started calculating the time difference , now imagine ID=13
has entered at 07:40:09
and the he check out at 08:40:09
, later that day he returns at 19:20:00
and leave in the next 10 minutes 19:30:00
if i do that fonction it will show that he worked for 12 hours while his real working time is 1 hour
Result Desired
ID Date Check-in Check-out
10 2020-08-07 07:40:09 12:40:09
10 2020-08-08 07:40:09 17:40:09
12 2020-08-07 08:03:09 10:40:09
12 2020-08-07 14:40:09 16:40:09
13 2020-08-07 09:22:45 17:57:06
Thank you in advance
CodePudding user response:
It took me a while to correctly understand your question. One way to do this is to group your df by ID
and Date
and check the number of rows of the sub-df. Then delete the next-to-last row if you have an odd number of rows.
Finally create your checkin and checkout columns (ffill checkin and dropna to delete double entries):
df = df.drop('TimeandDate', axis=1)
df_output = pd.DataFrame()
for (id, date), subdf in df.groupby(['ID', 'Date']):
subdf = subdf.reset_index(drop=True)
# handling case where more checkins than checkouts
nb_checks = len(subdf.index)
if nb_checks % 2 and nb_checks > 1:
subdf.iloc[-2, :] = subdf.iloc[-1, :]
subdf = subdf.head(-1)
subdf['Check-in'] = subdf.loc[::2, 'Time']
subdf['Check-out'] = subdf.loc[1::2, 'Time']
subdf['Check-in'].ffill(inplace=True)
df_output = df_output.append(subdf.drop('Time', axis=1).dropna())
print(df_output.reset_index(drop=True))
Output:
ID Date Check-in Check-out
0 10 2022-08-07 07:40:09 12:40:09
1 10 2022-08-08 07:40:09 17:40:09
2 12 2022-08-07 08:03:09 10:40:09
3 12 2022-08-07 14:40:09 16:40:09
4 13 2022-08-07 09:22:45 17:57:06
Edit/Note: if you have only one entry for a given ID/Date, it won't appear in your final result. (You'd have to handle the case where nb_checks==1
separately)
CodePudding user response:
Here's a solution, not as efficient as Tranbi's, but it can deal with 1-row cases by making the checkin and checkout time the same.
df = df.drop(["TimeandDate"], axis=1)
df = df.sort_values(by=["ID", "Date"], axis=0)
unique_ids = df["ID"].unique()
# Rows will eventually become the final df.
rows = []
for id in unique_ids:
id_df = df[df["ID"] == id]
unique_dates = id_df["Date"].unique()
print(unique_dates)
for date in unique_dates:
id_date_df = id_df[id_df["Date"] == date]
length = len(id_date_df)
# Case where there are an even number of rows for that ID and date combination.
if length % 2 == 0:
for i in range(0, length, 2):
checkin_time = id_date_df["Time"].iloc[i]
checkout_time = id_date_df["Time"].iloc[i 1]
row = [id, date, checkin_time, checkout_time]
rows.append(row)
# Odd number of rows, more than 2 rows.
elif length > 2:
for i in range(0, length-3, 2):
checkin_time = id_date_df.iloc[i]
checkout_time = id_date_df.iloc[i 1]
row = [id, date, checkin_time, checkout_time]
rows.append(row)
# The last row checkin-checkout combo comes from the 3rd-from-last and last rows.
rows.append([id, date, id_date_df["Time"].iloc[length-3], id_date_df["Time"].iloc[length-1]])
# 1 row only. Dealt with by making checkin and checkout times the same.
else:
rows.append([id, date, id_date_df["Time"].iloc[0], id_date_df["Time"].iloc[0]])
# The final dataframe.
df_fin = pd.DataFrame(rows, columns=["ID", "Date", "CheckinTime", "CheckoutTime"])
Chances are, you can replace some of the for
loops with groupby
like in Tranbi's answer to make this more efficient.
CodePudding user response:
Assuming your initial dataframe is df
:
Group data by "ID" and "Date", then take the even index values (so 0, 2, ...). Turn this into a dataframe .to_frame()
, rename the column, and drop the unwanted index column (the initial index).
df2 = df.groupby(["ID", "Date"]).apply(lambda x: x["Time"].iloc[::2]).to_frame().rename(columns={"Time": "Check-in"}).droplevel(2)
#Out:
# Check-in
#ID Date
#10 2022-08-07 07:40:09
# 2022-08-07 12:40:09
# 2022-08-08 07:40:09
#12 2022-08-07 08:03:09
# 2022-08-07 14:40:09
#13 2022-08-07 09:22:45
Create second index, using odd index values, and merge.
df2 = df2.merge(df.groupby(["ID", "Date"]).apply(lambda x: x["Time"].iloc[1::2]).to_frame().rename(columns={"Time": "Check-out"}).droplevel(2),
left_index=True, right_index=True, how="left")
#Out:
# Check-in Check-out
#ID Date
#10 2022-08-07 07:40:09 08:50:00
# 2022-08-07 12:40:09 08:50:00
# 2022-08-08 07:40:09 17:40:09
#12 2022-08-07 08:03:09 10:40:09
# 2022-08-07 08:03:09 16:40:09
# 2022-08-07 14:40:09 10:40:09
# 2022-08-07 14:40:09 16:40:09
#13 2022-08-07 09:22:45 17:57:06
Change and check-out values that are greater than the check-in (which would occur when there are not an equal number of in and out) to the check-in time (so that shift's working hours are given time 0).
df2["Check-out"] = np.where(df2["Check-out"] < df2["Check-in"], df2["Check-in"], df2["Check-out"])
#Out:
# Check-in Check-out
#ID Date
#10 2022-08-07 07:40:09 08:50:00
# 2022-08-07 12:40:09 12:40:09
# 2022-08-08 07:40:09 17:40:09
#12 2022-08-07 08:03:09 10:40:09
# 2022-08-07 08:03:09 16:40:09
# 2022-08-07 14:40:09 14:40:09
# 2022-08-07 14:40:09 16:40:09
#13 2022-08-07 09:22:45 17:57:06
Calculate the difference between check-in and check-out.
df2["Time_in"] = pd.to_datetime(df2["Check-out"], format='%H:%M:%S') - pd.to_datetime(df2["Check-in"], format='%H:%M:%S')
Sum each day for each ID to give total working hours.
logged_hours = df2.groupby(["ID", "Date"])["Time_in"].sum()
#Out:
#ID Date
#10 2022-08-07 0 days 01:09:51
# 2022-08-08 0 days 10:00:00
#12 2022-08-07 0 days 13:14:00
#13 2022-08-07 0 days 08:34:21
#Name: Time_in, dtype: timedelta64[ns]