It's a little bit complicated. I have this dataframe:
ID TimeandDate Date Time
10 2020-08-07 07:40:09 2022-08-07 07:40:09
10 2020-08-07 08:50:00 2022-08-07 08:50:00
10 2020-08-07 12:40:09 2022-08-07 12:40:09
10 2020-08-08 07:40:09 2022-08-08 07:40:09
10 2020-08-08 17:40:09 2022-08-08 17:40:09
12 2020-08-07 08:03:09 2022-08-07 08:03:09
12 2020-08-07 10:40:09 2022-08-07 10:40:09
12 2020-08-07 14:40:09 2022-08-07 14:40:09
12 2020-08-07 16:40:09 2022-08-07 16:40:09
13 2020-08-07 09:22:45 2022-08-07 09:22:45
13 2020-08-07 17:57:06 2022-08-07 17:57:06
I want to create a new dataframe with two new columns, df["Check-in"] and df["Check-out"]. My data doesn't have any indicator of when an ID checked in, so I will assume that the first time for every ID on a given day is a check-in, and that the next row is the check-out. Also, if a check-in doesn't have a matching check-out, it has to be registered as the check-out of the previous check-in of the same day, replacing that check-in's earlier check-out.
I tried this, but I'm afraid it's not what I need, because it only shows the first and last time of each group:
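import pandas as pd

group = df.groupby(['ID', 'Date'])

def TimeDifference(df):
    # `in` is a reserved word in Python, so the variables need other names
    check_in = df['TimeandDate'].min()
    check_out = df['TimeandDate'].max()
    # Assumes TimeandDate is already a datetime column
    df2 = pd.DataFrame([check_out - check_in], columns=['TimeDiff'])
    return df2

group.apply(TimeDifference)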
Desired result:
ID Date Check-in Check-out
10 2020-08-07 07:40:09 12:40:09
10 2020-08-08 07:40:09 17:40:09
12 2020-08-07 08:03:09 10:40:09
12 2020-08-07 14:40:09 16:40:09
13 2020-08-07 09:22:45 17:57:06
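To make the pairing rule concrete, here is a minimal sketch in plain Python for the times of one ID on one day. The function name pair_times is made up for illustration, and treating a single lone time as both check-in and check-out is an assumption, since that case is not covered above.

```python
def pair_times(times):
    """Pair one ID's times for one day as (check-in, check-out) tuples."""
    times = sorted(times)
    # Consecutive times form pairs: (1st, 2nd), (3rd, 4th), ...
    pairs = [times[i:i + 2] for i in range(0, len(times), 2)]
    # A trailing check-in without a check-out replaces the previous check-out.
    if len(pairs) > 1 and len(pairs[-1]) == 1:
        leftover = pairs.pop()[0]
        pairs[-1][1] = leftover
    # Assumption: a single lone time is reported as both check-in and check-out.
    return [(p[0], p[-1]) for p in pairs]


print(pair_times(["07:40:09", "08:50:00", "12:40:09"]))
# [('07:40:09', '12:40:09')]
print(pair_times(["08:03:09", "10:40:09", "14:40:09", "16:40:09"]))
# [('08:03:09', '10:40:09'), ('14:40:09', '16:40:09')]
```

The two calls use the rows of ID 10 and ID 12 on the 7th and give the same check-in and check-out times as the desired result.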
Thanks!
CodePudding user response:
If I understand correctly, you can do something like:
import pandas as pd
df["TimeandDate"] = pd.to_datetime(df["TimeandDate"])
df.set_index("TimeandDate", inplace=True)
print(df.groupby([df["ID"], df.index.year, df.index.month, df.index.day]).agg(["min", "max"]).to_markdown())
Output
(ID, Y, m, d) | ('Date', 'min') | ('Date', 'max') | ('Time', 'min') | ('Time', 'max') |
---|---|---|---|---|
(10, 2020, 8, 7) | 2022-08-07 | 2022-08-07 | 07:40:09 | 12:40:09 |
(10, 2020, 8, 8) | 2022-08-08 | 2022-08-08 | 07:40:09 | 17:40:09 |
(12, 2020, 8, 7) | 2022-08-07 | 2022-08-07 | 08:03:09 | 16:40:09 |
(13, 2020, 8, 7) | 2022-08-07 | 2022-08-07 | 09:22:45 | 17:57:06 |
CodePudding user response:
This approach is going to be verbose and not speedy but might solve the problem for now.
I first assign a pair number to each row within every ID/Date group, then check whether there is a check-in without a check-out: if the number of rows in the group is odd, a check-out is missing.
The output is the same as your desired output.
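import numpy as np

new_col = []
for i in df.ID.unique():
    for d in df.Date.unique():
        p = df.loc[(df.ID == i) & (df.Date == d)]
        # Pair numbers 1, 1, 2, 2, ... for the rows of this ID/Date group
        suffix = sorted(list(range(1, len(p) + 1)) * 2)[:len(p)]
        # Odd number of rows: drop the previous check-out and move the
        # last row into the previous pair
        if len(suffix) % 2 != 0 and len(suffix) > 1:
            suffix[-2] = np.nan
            suffix[-1] -= 1
        new_col.extend(suffix)
# Assumes df is sorted by ID, then Date, so new_col lines up with the rows
df['new'] = new_col
df.dropna().groupby(['ID', 'Date', 'new'], as_index=False).agg({'Time': ['min', 'max']}).drop('new', axis=1, level=0)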
Output:
ID Date Time
min max
0 10 2022-08-07 07:40:09 12:40:09
1 10 2022-08-08 07:40:09 17:40:09
2 12 2022-08-07 08:03:09 10:40:09
3 12 2022-08-07 14:40:09 16:40:09
4 13 2022-08-07 09:22:45 17:57:06
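If the loops become too slow, the same pairing idea can probably be expressed without them. The following is only a sketch under a few assumptions: the sample rows are copied from the question, Time holds zero-padded strings so that sorting them sorts chronologically, and the helper names pos, size, pair and is_leftover are made up for illustration.

```python
import pandas as pd

# Sample rows copied from the question (ID, Date and Time columns only).
df = pd.DataFrame({
    "ID": [10, 10, 10, 10, 10, 12, 12, 12, 12, 13, 13],
    "Date": ["2022-08-07"] * 3 + ["2022-08-08"] * 2 + ["2022-08-07"] * 6,
    "Time": ["07:40:09", "08:50:00", "12:40:09", "07:40:09", "17:40:09",
             "08:03:09", "10:40:09", "14:40:09", "16:40:09",
             "09:22:45", "17:57:06"],
})

df = df.sort_values(["ID", "Date", "Time"])
g = df.groupby(["ID", "Date"])
pos = g.cumcount()                  # 0, 1, 2, ... within each ID/Date group
size = g["Time"].transform("size")  # number of rows in each ID/Date group

# Rows 0 and 1 form pair 0, rows 2 and 3 form pair 1, and so on.
pair = pos // 2
# In a group with an odd number of rows (more than one), the last row has
# no partner, so it joins the previous pair and becomes its check-out.
is_leftover = (size % 2 == 1) & (size > 1) & (pos == size - 1)
pair = pair.where(~is_leftover, pair - 1)

result = (
    df.assign(pair=pair)
      .groupby(["ID", "Date", "pair"], as_index=False)
      .agg(**{"Check-in": ("Time", "first"), "Check-out": ("Time", "last")})
      .drop(columns="pair")
)
print(result)
```

Taking the first and last row of each pair, instead of dropping the middle row, means the skipped check-out (08:50:00 for ID 10) is simply ignored. A group with a single row ends up with the same time as check-in and check-out, which is an assumption rather than something the question specifies.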