How to create an algorithm that helps me improve results and automate process?-CodePudding

i've already posted here my problem and since then i've been trying to find solution to help me optimise my results , in the previous post ,Yaloa understood what i wanted to do but sadly i always end up in a dead end My previous Post

the fact is that I want to improve my results in order to visualize them This is my dataframe:

ID           TimeandDate        Date       Time
10   2020-08-07 07:40:09  2022-08-07   07:40:09
10   2020-08-07 08:50:00  2022-08-07   08:50:00
10   2020-08-07 12:40:09  2022-08-07   12:40:09
10   2020-08-08 07:40:09  2022-08-08   07:40:09
10   2020-08-08 17:40:09  2022-08-08   17:40:09
12   2020-08-07 08:03:09  2022-08-07   08:03:09
12   2020-08-07 10:40:09  2022-08-07   10:40:09
12   2020-08-07 14:40:09  2022-08-07   14:40:09
12   2020-08-07 16:40:09  2022-08-07   16:40:09
13   2020-08-07 09:22:45  2022-08-07   09:22:45
13   2020-08-07 17:57:06  2022-08-07   17:57:06

first of all the data is collected from time clock , i want to create new dataframe with 2 new columns the first one is df["Check-in"] , as you can see my data doesnt have any indicator to show what time the id has checked in , so i will suppose that the first time for every id is a check-in , and the next row is a check-out and it will be inserted in df["Check-out"] , also if a check-in doesnt have a check-out time it has to be registred as the check-out for the previous check-out of the same day(sometimes id forgot to check-out) because it has to be the same number of rows for check-in and check-out cant have 2 check-ins and 3 check-outs

What i've tried ? what i meant with i need better results is because what i've tried is not the best solution , i took the min as a check-in and the max is a check-out of time for every id without adding the two columns , and after that i started calculating the time difference , now imagine ID=13 has entered at 07:40:09 and the he check out at 08:40:09 , later that day he returns at 19:20:00 and leave in the next 10 minutes 19:30:00 if i do that fonction it will show that he worked for 12 hours while his real working time is 1 hour

Result Desired

ID         Date   Check-in    Check-out
10   2020-08-07   07:40:09     12:40:09
10   2020-08-08   07:40:09     17:40:09
12   2020-08-07   08:03:09     10:40:09
12   2020-08-07   14:40:09     16:40:09 
13   2020-08-07   09:22:45     17:57:06

Thank you in advance

CodePudding user response：

It took me a while to correctly understand your question. One way to do this is to group your df by ID and Date and check the number of rows of the sub-df. Then delete the next-to-last row if you have an odd number of rows. Finally create your checkin and checkout columns (ffill checkin and dropna to delete double entries):

df = df.drop('TimeandDate', axis=1)
df_output = pd.DataFrame()

for (id, date), subdf in df.groupby(['ID', 'Date']):
    subdf = subdf.reset_index(drop=True)

    # handling case where more checkins than checkouts
    nb_checks = len(subdf.index)
    if nb_checks % 2 and nb_checks > 1:
        subdf.iloc[-2, :] = subdf.iloc[-1, :]
        subdf = subdf.head(-1)
    
    subdf['Check-in'] = subdf.loc[::2, 'Time']
    subdf['Check-out'] = subdf.loc[1::2, 'Time']
    subdf['Check-in'].ffill(inplace=True)

    df_output = df_output.append(subdf.drop('Time', axis=1).dropna())

print(df_output.reset_index(drop=True))

Output:

   ID        Date  Check-in Check-out
0  10  2022-08-07  07:40:09  12:40:09
1  10  2022-08-08  07:40:09  17:40:09
2  12  2022-08-07  08:03:09  10:40:09
3  12  2022-08-07  14:40:09  16:40:09
4  13  2022-08-07  09:22:45  17:57:06

Edit/Note: if you have only one entry for a given ID/Date, it won't appear in your final result. (You'd have to handle the case where nb_checks==1 separately)

CodePudding user response：

Here's a solution, not as efficient as Tranbi's, but it can deal with 1-row cases by making the checkin and checkout time the same.

df = df.drop(["TimeandDate"], axis=1)
df = df.sort_values(by=["ID", "Date"], axis=0)

unique_ids = df["ID"].unique()

# Rows will eventually become the final df.
rows = []

for id in unique_ids:
    id_df = df[df["ID"] == id]
    unique_dates = id_df["Date"].unique()
    print(unique_dates)

    for date in unique_dates:
        id_date_df = id_df[id_df["Date"] == date]
        length = len(id_date_df)

        # Case where there are an even number of rows for that ID and date combination.
        if length % 2 == 0:
            for i in range(0, length, 2):
                checkin_time = id_date_df["Time"].iloc[i]
                checkout_time = id_date_df["Time"].iloc[i   1]

                row = [id, date, checkin_time, checkout_time]
                rows.append(row)

        # Odd number of rows, more than 2 rows.
        elif length > 2:
            for i in range(0, length-3, 2):
                checkin_time = id_date_df.iloc[i]
                checkout_time = id_date_df.iloc[i   1]

                row = [id, date, checkin_time, checkout_time]
                rows.append(row)

            # The last row checkin-checkout combo comes from the 3rd-from-last and last rows.
            rows.append([id, date, id_date_df["Time"].iloc[length-3], id_date_df["Time"].iloc[length-1]])

        # 1 row only. Dealt with by making checkin and checkout times the same.
        else:
            rows.append([id, date, id_date_df["Time"].iloc[0], id_date_df["Time"].iloc[0]])

# The final dataframe.
df_fin = pd.DataFrame(rows, columns=["ID", "Date", "CheckinTime", "CheckoutTime"])

Chances are, you can replace some of the for loops with groupby like in Tranbi's answer to make this more efficient.

CodePudding user response：

Assuming your initial dataframe is df:

Group data by "ID" and "Date", then take the even index values (so 0, 2, ...). Turn this into a dataframe .to_frame(), rename the column, and drop the unwanted index column (the initial index).

df2 = df.groupby(["ID", "Date"]).apply(lambda x: x["Time"].iloc[::2]).to_frame().rename(columns={"Time": "Check-in"}).droplevel(2)
#Out: 
#               Check-in
#ID Date                
#10 2022-08-07  07:40:09
#   2022-08-07  12:40:09
#   2022-08-08  07:40:09
#12 2022-08-07  08:03:09
#   2022-08-07  14:40:09
#13 2022-08-07  09:22:45

Create second index, using odd index values, and merge.

df2 = df2.merge(df.groupby(["ID", "Date"]).apply(lambda x: x["Time"].iloc[1::2]).to_frame().rename(columns={"Time": "Check-out"}).droplevel(2),
                left_index=True, right_index=True, how="left")
#Out: 
#               Check-in Check-out
#ID Date                          
#10 2022-08-07  07:40:09  08:50:00
#   2022-08-07  12:40:09  08:50:00
#   2022-08-08  07:40:09  17:40:09
#12 2022-08-07  08:03:09  10:40:09
#   2022-08-07  08:03:09  16:40:09
#   2022-08-07  14:40:09  10:40:09
#   2022-08-07  14:40:09  16:40:09
#13 2022-08-07  09:22:45  17:57:06

Change and check-out values that are greater than the check-in (which would occur when there are not an equal number of in and out) to the check-in time (so that shift's working hours are given time 0).

df2["Check-out"] = np.where(df2["Check-out"] < df2["Check-in"], df2["Check-in"], df2["Check-out"])
#Out: 
#               Check-in Check-out
#ID Date                          
#10 2022-08-07  07:40:09  08:50:00
#   2022-08-07  12:40:09  12:40:09
#   2022-08-08  07:40:09  17:40:09
#12 2022-08-07  08:03:09  10:40:09
#   2022-08-07  08:03:09  16:40:09
#   2022-08-07  14:40:09  14:40:09
#   2022-08-07  14:40:09  16:40:09
#13 2022-08-07  09:22:45  17:57:06

Calculate the difference between check-in and check-out.

df2["Time_in"] = pd.to_datetime(df2["Check-out"], format='%H:%M:%S') - pd.to_datetime(df2["Check-in"], format='%H:%M:%S')

Sum each day for each ID to give total working hours.

logged_hours = df2.groupby(["ID", "Date"])["Time_in"].sum()

#Out: 
#ID  Date      
#10  2022-08-07   0 days 01:09:51
#    2022-08-08   0 days 10:00:00
#12  2022-08-07   0 days 13:14:00
#13  2022-08-07   0 days 08:34:21
#Name: Time_in, dtype: timedelta64[ns]