Home > Net >  Move rows in dataframe according to time
Move rows in dataframe according to time

Time:09-13

dic = {'Employee ID':emp_id, 'Log Date':emp_logdate, 'Log Time':emp_logtime}
df = pd.DataFrame(dic).groupby(['Employee ID','Log Date']).agg({'Log Date':'first', 'Log Time': lambda x: ', '.join(x.unique())})['Log Time'].astype(str).str.split(', ', expand=True).reset_index()

Let's say I have a time log like this as the result on the code above:

'7:20',  '11:50', '12:49', '17:20'
'7:02',  '11:36', '12:59'   
'11:33', '12:40', '17:06'   
'11:38'

Given the above df, I want to move the row to their specific log sequence, like all 11 am logs are moved to the second column, all 12nn logs are moved to the third column, and all 5 pm logs are on the last column:

AM In   AM Out   PM In    PM Out
'7:20', '11:50', '12:49', '17:20'
'7:02', '11:36', '12:59', ' '
' ',    '11:33', '12:40', '17:06'   
' ',    '11:38', ' ',     ' '

CodePudding user response:

One approach could be as follows:

import pandas as pd
import numpy as np
import re

data = {0: {0: '7:20', 1: '7:02', 2: '11:33', 3: '11:38'}, 
        1: {0: '11:50', 1: '11:36', 2:  '12:40', 3: np.nan}, 
        2: {0: '12:49', 1: '12:59'   , 2: '17:06', 3: np.nan}, 
        3: {0: '17:20', 1: np.nan, 2: np.nan, 3: np.nan}}

df = pd.DataFrame(data)

d = {'AM_In': 7, 
     'AM_Out': 11, 
     'PM_In': 12, 
     'PM_Out': 17}

patt = re.compile('|'.join([f'(?P<{k}>{v}:.*)' for k,v in d.items()]))

res = df.stack().str.extractall(patt)
res = res.groupby(level=0).first()

print(res)

  AM_In AM_Out  PM_In PM_Out
0  7:20  11:50  12:49  17:20
1  7:02  11:36  12:59   None
2  None  11:33  12:40  17:06
3  None  11:38   None   None

You can turn those strings into proper time values as follows:

res = res.apply(lambda x: pd.to_datetime(x).dt.time)

print(res)

      AM_In    AM_Out     PM_In    PM_Out
0  07:20:00  11:50:00  12:49:00  17:20:00
1  07:02:00  11:36:00  12:59:00       NaT
2       NaT  11:33:00  12:40:00  17:06:00
3       NaT  11:38:00       NaT       NaT

Explanation steps:

  • First create a dict with the relevant key-value pairs: e.g. 'AM_In' and 7. N.B. make sure that the keys contain only alphanumeric characters. Regex placeholders do not allow spaces (see next step).
  • Use a list comprehension to create the regex patterns with placeholders (generic: (?P<name>...)) by looping through d.items(). E.g. we get (?P<AM_In>7:.*), which means: a group named 'AM_In', which captures '7' followed by ':' and everything after that.
  • Use '|'.join() to join these patterns as alternatives. The result is '(?P<AM_In>7:.*)|(?P<AM_Out>11:.*)|(?P<PM_In>12:.*)|(?P<PM_Out>17:.*)' which we pass to re.compile.
  • For the next step we want to turn our df into a pd.Series. We use df.stack for this, and apply str.extractall to get all the captured groups in a new pd.DataFrame.
  • At this stage, our res initially looks like this:
print(df.stack().str.extractall(patt).head())

          AM_In AM_Out  PM_In PM_Out
    match                           
0 0 0      7:20    NaN    NaN    NaN
  1 0       NaN  11:50    NaN    NaN
  2 0       NaN    NaN  12:49    NaN
  3 0       NaN    NaN    NaN  17:20
1 0 0      7:02    NaN    NaN    NaN

We are now nearly there:

  • Next, we apply df.groupby on the first level of the index (so: 0), and ask to get only first(), i.e. the first non-NaN value if available for each index value in each column. This will turn all duplicate index values into one row.

CodePudding user response:

From my understanding, this should do what you want.

First I create times_df to store the times initially

# putting times supplied into a list
times = ['7:20','11:50','12:49','17:20','7:02','11:36','12:59','11:33','12:40','17:06','11:38']

# converts 'times' list to a df
times_df = pd.DataFrame(times,columns=["Time"])
# changes from str to datetime
times_df["Time"] = pd.to_datetime(times_df["Time"])
# sorts values, not really necessary, just to make it look nicer
times_df = times_df.sort_values("Time").reset_index(drop=True)
times_df

giving times_df

                  Time
0  2022-09-13 07:02:00
1  2022-09-13 07:20:00
2  2022-09-13 11:33:00
3  2022-09-13 11:36:00
4  2022-09-13 11:38:00
5  2022-09-13 11:50:00
6  2022-09-13 12:40:00
7  2022-09-13 12:49:00
8  2022-09-13 12:59:00
9  2022-09-13 17:06:00
10 2022-09-13 17:20:00

Next we have to take times_df and split it out into columns

times_groups = [(7,"AM In"),(11,"AM Out"),(12,"PM In"),(17,"PM Out")]

# create empty dataframe for result
out_df = pd.DataFrame()
for hour,name in times_groups:
    # selects the rows of times_df that match the hour that this group contains
    group = times_df[times_df["Time"].dt.hour == hour][["Time"]]
    # ranmes the column and resets index
    group = group.rename(columns={"Time":name}).reset_index(drop=True)
    # turns into series and converts data from datetime to time
    group = group[name].dt.time
    # appends column to end of out_df
    out_df = pd.concat([out_df,group],axis=1)
out_df

giving out_df

      AM In    AM Out     PM In    PM Out
0  07:02:00  11:33:00  12:40:00  17:06:00
1  07:20:00  11:36:00  12:49:00  17:20:00
2       NaN  11:38:00  12:59:00       NaN
3       NaN  11:50:00       NaN       NaN

This could also be easily amended to account for multiple hour in each group, e.g. 7a.m. & 8a.m. going into column "AM In".

  • Related