Pandas : How to create an algorithm that helps me improve results and creating new columns?-CodePudding

it's a little bit complicated , i have this dataframe :

ID           TimeandDate        Date       Time
10   2020-08-07 07:40:09  2022-08-07   07:40:09
10   2020-08-07 08:50:00  2022-08-07   08:50:00
10   2020-08-07 12:40:09  2022-08-07   12:40:09
10   2020-08-08 07:40:09  2022-08-08   07:40:09
10   2020-08-08 17:40:09  2022-08-08   17:40:09
12   2020-08-07 08:03:09  2022-08-07   08:03:09
12   2020-08-07 10:40:09  2022-08-07   10:40:09
12   2020-08-07 14:40:09  2022-08-07   14:40:09
12   2020-08-07 16:40:09  2022-08-07   16:40:09
13   2020-08-07 09:22:45  2022-08-07   09:22:45
13   2020-08-07 17:57:06  2022-08-07   17:57:06

i want to create new dataframe with 2 new columns the first one is df["Check-in"] , as you can see my data doesnt have any indicator to show what time the id has checked in , so i will suppose that the first time for every id is a check-in , and the next row is a check-out and will be inserted in df["Check-out"] , also if a check-in doesnt have a check-out time it has to be registred as the check-out for the previous check-out of the same day

i tried this but i'm afraid its not efficient because it shows the first and last one imagine if ID=13 has entered at 07:40:09 and the he check out at 08:40:09 , later that day he returns at 19:20:00 and leave in the next 10 minutes 19:30:00 if i do that fonction it will show that he worked for 12 hours

group = df.groupby(['ID', 'Date'])
def TimeDifference(df):
    in = df['TimeandDate'].min()
    out = df['TimeandDate'].max()
    df2 = p.DataFrame([in-out], columns=['TimeDiff'])
    return df2
group.apply(TimeDifference)

Result Desired

ID         Date   Check-in    Check-out
10   2020-08-07   07:40:09     12:40:09
10   2020-08-08   07:40:09     17:40:09
12   2020-08-07   08:03:09     10:40:09
12   2020-08-07   14:40:09     16:40:09 
13   2020-08-07   09:22:45     17:57:06

Thanks !!!

CodePudding user response：

If I understand correctly, you can do something like:

import pandas as pd
df["TimeandDate"] = pd.to_datetime(df["TimeandDate"])
df.set_index("TimeandDate", inplace=True)
print(df.groupby([df["ID"], df.index.year, df.index.month, df.index.day]).agg(["min", "max"]).to_markdown())

Output

(ID, Y, m, d)	('Date', 'min')	('Date', 'max')	('Time', 'min')	('Time', 'max')
(10, 2020, 8, 7)	2022-08-07	2022-08-07	07:40:09	12:40:09
(10, 2020, 8, 8)	2022-08-08	2022-08-08	07:40:09	17:40:09
(12, 2020, 8, 7)	2022-08-07	2022-08-07	08:03:09	16:40:09
(13, 2020, 8, 7)	2022-08-07	2022-08-07	09:22:45	17:57:06

CodePudding user response：

This approach is going to be verbose and not speedy but might solve the problem for now.

I first assign a suffix-pair to each ID/Date pair, then check if there is a check-in without check-out (so if the length is not even, it means there is a missing check-out).

The output is the same as your desired output

new_col = []
for i in df.ID.unique():
    for d in df.Date.unique():
        p = df.loc[(df.ID==i)&(df.Date==d)]
        suffix = sorted(list(range(1,len(p)))*2)[:len(p)]
        if len(suffix)%2!=0 and len(suffix)>1:
            suffix[-2]=np.nan
            suffix[-1]-=1
        new_col.extend(suffix)

df['new'] = new_col
df.dropna().groupby(['ID','Date','new'], as_index=False).agg({'Time':[min,max]}).drop('new', axis=1, level=0)

Output:
ID  Date    Time
                    min         max
0   10  2022-08-07  07:40:09    12:40:09
1   10  2022-08-08  07:40:09    17:40:09
2   12  2022-08-07  08:03:09    10:40:09
3   12  2022-08-07  14:40:09    16:40:09
4   13  2022-08-07  09:22:45    17:57:06

Trying a different approach:

df=df[['ID','Date','Time']]

def check(x):
    x = x.reset_index(drop=True)
    if len(x)%2!=0:
        x=x.drop(len(x)-2)
    return x

df
df.groupby(['ID','Date'], as_index=False).agg(check)

g = df.groupby(['ID','Date'], as_index=False).agg(check).explode('Time').reset_index(drop=True)

g['in'] = np.where(g.index%2==0, g.loc[g.index,'Time'], np.nan)
g['out'] = np.where(g.index%2!=0, g.loc[g.index,'Time'], np.nan)

out = g.groupby(['ID','Date'], as_index=False).agg(list)
out['in'] = out['in'].apply(lambda x: [i for i in x if str(i) != "nan"])
out['out'] = out['out'].apply(lambda x: [i for i in x if str(i) != "nan"])
out[['ID','Date','in','out']].explode(['in','out']).reset_index(drop=True)

Output:

    ID  Date        in          out
0   10  2022-08-07  07:40:09    12:40:09
1   10  2022-08-08  07:40:09    17:40:09
2   12  2022-08-07  08:03:09    10:40:09
2   12  2022-08-07  14:40:09    16:40:09
3   13  2022-08-07  09:22:45    17:57:06