dic = {'Employee ID':emp_id, 'Log Date':emp_logdate, 'Log Time':emp_logtime}
df = pd.DataFrame(dic).groupby(['Employee ID','Log Date']).agg({'Log Date':'first', 'Log Time': lambda x: ', '.join(x.unique())})['Log Time'].astype(str).str.split(', ', expand=True).reset_index()
Let's say I have a time log like this as the result on the code above:
'7:20', '11:50', '12:49', '17:20'
'7:02', '11:36', '12:59'
'11:33', '12:40', '17:06'
'11:38'
Given the above df, I want to move the row to their specific log sequence, like all 11 am logs are moved to the second column, all 12nn logs are moved to the third column, and all 5 pm logs are on the last column:
AM In AM Out PM In PM Out
'7:20', '11:50', '12:49', '17:20'
'7:02', '11:36', '12:59', ' '
' ', '11:33', '12:40', '17:06'
' ', '11:38', ' ', ' '
CodePudding user response:
One approach could be as follows:
import pandas as pd
import numpy as np
import re
data = {0: {0: '7:20', 1: '7:02', 2: '11:33', 3: '11:38'},
1: {0: '11:50', 1: '11:36', 2: '12:40', 3: np.nan},
2: {0: '12:49', 1: '12:59' , 2: '17:06', 3: np.nan},
3: {0: '17:20', 1: np.nan, 2: np.nan, 3: np.nan}}
df = pd.DataFrame(data)
d = {'AM_In': 7,
'AM_Out': 11,
'PM_In': 12,
'PM_Out': 17}
patt = re.compile('|'.join([f'(?P<{k}>{v}:.*)' for k,v in d.items()]))
res = df.stack().str.extractall(patt)
res = res.groupby(level=0).first()
print(res)
AM_In AM_Out PM_In PM_Out
0 7:20 11:50 12:49 17:20
1 7:02 11:36 12:59 None
2 None 11:33 12:40 17:06
3 None 11:38 None None
You can turn those strings into proper time values as follows:
res = res.apply(lambda x: pd.to_datetime(x).dt.time)
print(res)
AM_In AM_Out PM_In PM_Out
0 07:20:00 11:50:00 12:49:00 17:20:00
1 07:02:00 11:36:00 12:59:00 NaT
2 NaT 11:33:00 12:40:00 17:06:00
3 NaT 11:38:00 NaT NaT
Explanation steps:
- First create a
dict
with the relevant key-value pairs: e.g.'AM_In'
and7
. N.B. make sure that the keys contain only alphanumeric characters. Regex placeholders do not allow spaces (see next step). - Use a list comprehension to create the regex patterns with placeholders (generic:
(?P<name>...)
) by looping throughd.items()
. E.g. we get(?P<AM_In>7:.*)
, which means: a group named 'AM_In', which captures '7' followed by ':' and everything after that. - Use
'|'.join()
to join these patterns as alternatives. The result is'(?P<AM_In>7:.*)|(?P<AM_Out>11:.*)|(?P<PM_In>12:.*)|(?P<PM_Out>17:.*)'
which we pass tore.compile
. - For the next step we want to turn our
df
into apd.Series
. We usedf.stack
for this, and applystr.extractall
to get all the captured groups in a newpd.DataFrame
. - At this stage, our
res
initially looks like this:
print(df.stack().str.extractall(patt).head())
AM_In AM_Out PM_In PM_Out
match
0 0 0 7:20 NaN NaN NaN
1 0 NaN 11:50 NaN NaN
2 0 NaN NaN 12:49 NaN
3 0 NaN NaN NaN 17:20
1 0 0 7:02 NaN NaN NaN
We are now nearly there:
- Next, we apply
df.groupby
on the first level of the index (so:0
), and ask to get onlyfirst()
, i.e. the first non-NaN
value if available for each index value in each column. This will turn all duplicate index values into one row.
CodePudding user response:
From my understanding, this should do what you want.
First I create times_df
to store the times initially
# putting times supplied into a list
times = ['7:20','11:50','12:49','17:20','7:02','11:36','12:59','11:33','12:40','17:06','11:38']
# converts 'times' list to a df
times_df = pd.DataFrame(times,columns=["Time"])
# changes from str to datetime
times_df["Time"] = pd.to_datetime(times_df["Time"])
# sorts values, not really necessary, just to make it look nicer
times_df = times_df.sort_values("Time").reset_index(drop=True)
times_df
giving times_df
Time
0 2022-09-13 07:02:00
1 2022-09-13 07:20:00
2 2022-09-13 11:33:00
3 2022-09-13 11:36:00
4 2022-09-13 11:38:00
5 2022-09-13 11:50:00
6 2022-09-13 12:40:00
7 2022-09-13 12:49:00
8 2022-09-13 12:59:00
9 2022-09-13 17:06:00
10 2022-09-13 17:20:00
Next we have to take times_df
and split it out into columns
times_groups = [(7,"AM In"),(11,"AM Out"),(12,"PM In"),(17,"PM Out")]
# create empty dataframe for result
out_df = pd.DataFrame()
for hour,name in times_groups:
# selects the rows of times_df that match the hour that this group contains
group = times_df[times_df["Time"].dt.hour == hour][["Time"]]
# ranmes the column and resets index
group = group.rename(columns={"Time":name}).reset_index(drop=True)
# turns into series and converts data from datetime to time
group = group[name].dt.time
# appends column to end of out_df
out_df = pd.concat([out_df,group],axis=1)
out_df
giving out_df
AM In AM Out PM In PM Out
0 07:02:00 11:33:00 12:40:00 17:06:00
1 07:20:00 11:36:00 12:49:00 17:20:00
2 NaN 11:38:00 12:59:00 NaN
3 NaN 11:50:00 NaN NaN
This could also be easily amended to account for multiple hour in each group, e.g. 7a.m. & 8a.m. going into column "AM In".