I'm pretty new to Python and pandas (beginner level).
I have a pandas DataFrame and I'm trying to calculate how many hours/minutes/seconds each employee worked in the morning, afternoon, and evening.
Mornings fall within the time range 03:00 to 12:00, afternoons within 12:00 to 17:00, and evenings within 17:00 to 03:00.
import pandas as pd

data = [('employee1', '2022-10-28', '12:06', '13:00:00', '00:00:00', '00:00:23'),
        ('employee2', '2022-10-28', '10:00', '06:00:00', '00:00:00', '00:00:16'),
        ('employee3', '2022-05-06', '16:13', '08:00:00', '00:54:00', '00:00:09'),
        ('employee4', '2022-06-03', '2:33', '09:00:00', '00:19:00', '00:00:56'),
        ('employee5', '2022-08-12', '9:50', '20:00:00', '00:27:00', '00:00:22'),
        ('employee6', '2022-02-15', '6:52', '00:00:00', '00:35:00', '00:00:35')]
df = pd.DataFrame(data, columns=['Name', 'date', 'start_time', 'hours_worked', 'minutes_worked', 'seconds_worked'])
df[['enddate_time', 'time_worked_morning', 'time_worked_afternoon', 'time_worked_evening', 'total_time_wkd']] = None
Can someone please help me calculate how much of the total time worked falls within each of these 3 ranges?
CodePudding user response:
Since you mentioned you are a beginner, I will include some extra comments in my code. I will also present 2 solutions: a "slow" one that mostly uses plain Python loops and a "fast" one that uses numpy vectorized code.
Both solutions will use the following code snippet. All dates/times in your dataframe are stored as strings, which are hard to work with. Convert them to pd.Timestamp and pd.Timedelta for easier manipulation:
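import numpy as np
import pandas as pd

# Combine the date and start time strings, then parse them as timestamps
start_time = pd.to_datetime(df["date"] + " " + df["start_time"])
# Each *_worked column is an "HH:MM:SS" string, so add the three durations up
time_worked = (
    pd.to_timedelta(df["hours_worked"])
    + pd.to_timedelta(df["minutes_worked"])
    + pd.to_timedelta(df["seconds_worked"])
)
end_time = start_time + time_worked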
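To see what this conversion produces, here is a minimal sketch on a hypothetical two-row frame named `sample` that copies the employee1 and employee3 rows from the question. The names `sample`, `start`, `worked`, and `end` are only for illustration and are not part of the original code.

```python
import pandas as pd

# Hypothetical two-row sample copying employee1 and employee3 from the question
sample = pd.DataFrame({
    "date": ["2022-10-28", "2022-05-06"],
    "start_time": ["12:06", "16:13"],
    "hours_worked": ["13:00:00", "08:00:00"],
    "minutes_worked": ["00:00:00", "00:54:00"],
    "seconds_worked": ["00:00:23", "00:00:09"],
})

# Strings -> timestamps and durations
start = pd.to_datetime(sample["date"] + " " + sample["start_time"])
worked = (
    pd.to_timedelta(sample["hours_worked"])
    + pd.to_timedelta(sample["minutes_worked"])
    + pd.to_timedelta(sample["seconds_worked"])
)

# Timestamp + Timedelta gives a Timestamp, rolling over midnight when needed
end = start + worked
print(end)
```

Both shifts run past midnight, so the resulting end times land on the following calendar day.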
The slow solution
Loop over each employee, then over each shift, to see how much time they worked in that shift. Then tally up the results and assign them as new columns in your dataframe:
# You did not specify a unit for your time_worked_morning, _afternoon, etc.
# I assume you want to measure them in hours.
unit = pd.Timedelta(1, "h")
# The shift boundaries. There are 4 boundaries for 3 shifts because we need to
# account for the start and end time of each shift.
# Shift start time: 3, 12, 17
# Shift end time: 12, 17, 27
# The numbers represent hours since midnight of the day
boundaries = [pd.Timedelta(i, "h") for i in [3, 12, 17, 27]]
# Now loop through each employee's work record to break down their working hours
# into shifts
data = []
for work_st, work_et in zip(start_time, end_time):
    # Get the midnight of the day
    start_of_day = work_st.normalize()
    # Work time in each shift starts at 0
    work_time = [0] * (len(boundaries) - 1)
    # Loop through the shifts
    for i, (lb, ub) in enumerate(zip(boundaries[:-1], boundaries[1:])):
        # Calculate the start and end time of each shift
        shift_st = start_of_day + lb
        shift_et = start_of_day + ub
        # Work time in shift = (effective end time) - (effective start time)
        # Effective end time = (work end time) or (shift end time), whichever is EARLIER => use min
        # Effective start time = (work start time) or (shift start time), whichever is LATER => use max
        # The "/ unit" operation converts the pd.Timedelta object into a
        # float representing the number of hours
        t = (min(work_et, shift_et) - max(work_st, shift_st)) / unit
        # When the work period does not overlap the shift at all, the
        # `Effective end time` falls before the `Effective start time`. An
        # employee can't spend negative time in a shift, so clip it to 0.
        work_time[i] = max(0, t)
    data.append(work_time)
# Convert `data` to a numpy array for easier slicing
data = np.array(data)
# Add the extra columns to your data frame
df["enddate_time"] = end_time
df["time_worked_morning"] = data[:, 0]
df["time_worked_afternoon"] = data[:, 1]
df["time_worked_evening"] = data[:, 2]
df["total_time_wkd"] = time_worked / unit
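One caveat: the loop above only looks at the window from 03:00 on the start date to 03:00 the next morning. Time worked before 03:00 (employee4 starts at 02:33) or after 03:00 on the following day (employee5 works until 06:17) is not assigned to any shift, so the three columns can add up to less than the total. The sketch below is one possible way to cover those cases by repeating the same min/max overlap calculation for every calendar day the work period touches. The function name `split_by_shift` and the constants `BOUNDARIES` and `NAMES` are hypothetical, and it uses plain `datetime` objects; since `pd.Timestamp` subclasses `datetime`, passing values from `start_time` and `end_time` should also work.

```python
from datetime import datetime, timedelta

# Shift boundaries as hours since midnight; 27 means 03:00 on the next day
BOUNDARIES = [3, 12, 17, 27]
NAMES = ["morning", "afternoon", "evening"]


def split_by_shift(work_st, work_et):
    """Return hours worked per shift for a work period of any length."""
    totals = dict.fromkeys(NAMES, 0.0)
    # Start one day early so that work before 03:00 is counted toward the
    # evening shift that began on the previous day
    day = datetime.combine(work_st.date(), datetime.min.time()) - timedelta(days=1)
    while day < work_et:
        for name, lb, ub in zip(NAMES, BOUNDARIES[:-1], BOUNDARIES[1:]):
            shift_st = day + timedelta(hours=lb)
            shift_et = day + timedelta(hours=ub)
            # Same overlap rule as above: earlier end minus later start
            overlap = min(work_et, shift_et) - max(work_st, shift_st)
            # Dividing by a one-hour timedelta gives hours; negatives mean no overlap
            totals[name] += max(0.0, overlap / timedelta(hours=1))
        day += timedelta(days=1)
    return totals


# employee4 from the question: starts 02:33, works 09:19:56
print(split_by_shift(datetime(2022, 6, 3, 2, 33), datetime(2022, 6, 3, 11, 52, 56)))
```

With this variant the 27 minutes before 03:00 go to the evening shift, and the per-shift hours sum to the total time worked.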
The fast solution
Instead of working with one employee and one shift at a time, we will deal with 2D arrays containing all employees in all shifts. This enables us to use several vectorized operations offered by numpy.
# The boundaries here are pretty much the same as in the "slow" solution, only
# as a numpy array instead of list
boundaries = np.array([np.timedelta64(i, "h") for i in [3, 12, 17, 27]])
n = len(boundaries) - 1
# For every employee, repeat the work start and end times once per shift
work_st = np.tile(start_time.to_numpy(), (n, 1)).T
work_et = np.tile(end_time.to_numpy(), (n, 1)).T
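# Calculate each shift's start and end time for each day. The [:, None] turns
# start_of_day into a column so that adding the boundaries broadcasts to a
# 2D array of shape (number of employees, number of shifts)
start_of_day = start_time.dt.normalize().to_numpy()[:, None]
shift_st = start_of_day + boundaries[:-1]
shift_et = start_of_day + boundaries[1:]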
# The effective start and end time are calculated identically to above, only
# in vectorized form
effective_st = np.max([work_st, shift_st], axis=0)
effective_et = np.min([work_et, shift_et], axis=0)
# The time worked per shift uses the same calculation as the slow solution:
# effective end time - effective start time, with a minimum of 0
data = np.clip((effective_et - effective_st) / unit, 0, None)
# Add the extra columns to your data frame
df["enddate_time"] = end_time
df["time_worked_morning"] = data[:, 0]
df["time_worked_afternoon"] = data[:, 1]
df["time_worked_evening"] = data[:, 2]
df["total_time_wkd"] = time_worked / unit
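If the 2D broadcasting is hard to picture, the following self-contained sketch runs the same idea on two hypothetical work records (seconds dropped for brevity). It is a variation and not the code above: it uses `np.maximum` and `np.minimum`, which broadcast on their own, in place of `np.tile`.

```python
import numpy as np

# Two hypothetical work records as numpy datetime64 values
work_st = np.array(["2022-10-28T12:06", "2022-05-06T16:13"], dtype="datetime64[s]")
work_et = np.array(["2022-10-29T01:06", "2022-05-07T01:07"], dtype="datetime64[s]")

# Shift boundaries in hours since midnight, as timedelta64 values
boundaries = np.array([3, 12, 17, 27]) * np.timedelta64(1, "h")

# Truncate to the day, then go back to second resolution
start_of_day = work_st.astype("datetime64[D]").astype("datetime64[s]")

# (n_employees, 1) + (n_shifts,) broadcasts to (n_employees, n_shifts)
shift_st = start_of_day[:, None] + boundaries[:-1]
shift_et = start_of_day[:, None] + boundaries[1:]

# Element-wise later start and earlier end, broadcast the same way
effective_st = np.maximum(work_st[:, None], shift_st)
effective_et = np.minimum(work_et[:, None], shift_et)

# Convert to hours and clip negative (non-overlapping) values to 0
hours = np.clip((effective_et - effective_st) / np.timedelta64(1, "h"), 0, None)
print(hours)
```

Each row of `hours` is one work record and each column is one shift (morning, afternoon, evening), which is the same layout the `data` array has in the fast solution.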