I have a dataframe
df = pd.DataFrame([["A","9:00 AM"],["A","11:12 AM"],["A","1:03 PM"],["B","9:00 AM"],["B","12:56 PM"],["B","1:07 PM"],
["B","1:18 PM"]],columns=["id","time"])
id time
A 09:00 AM
A 11:12 AM
A 01:03 PM
B 09:00 AM
B 12:56 PM
B 01:07 PM
B 01:18 PM
I want to create a new column that holds, for each id, the difference between the last and the first value of the time column, plus an offset of 30 minutes.
Example: for id A, the difference between 01:03 PM and 09:00 AM is 4 hr 3 min. Adding the 30 min offset gives 4 hr 33 min. That value should go into a new column total_hrs for all rows of id A.
Expected Output:
df_out = pd.DataFrame([["A","9:00 AM","04:33:00"],["A","11:12 AM","04:33:00"],["A","1:03 PM","04:33:00"],["B","9:00 AM","04:48:00"],
["B","12:56 PM","04:48:00"],["B","1:07 PM","04:48:00"],["B","1:18 PM","04:48:00"]],columns=["id","time","total_hrs"])
id time total_hrs
A 09:00 AM 04:33:00
A 11:12 AM 04:33:00
A 01:03 PM 04:33:00
B 09:00 AM 04:48:00
B 12:56 PM 04:48:00
B 01:07 PM 04:48:00
B 01:18 PM 04:48:00
CodePudding user response:
Convert the column with to_datetime, then use groupby.transform (here with np.ptp as a convenience). Add the desired Timedelta and optionally convert to a string:
import numpy as np
import pandas as pd
df['total_hrs'] = (pd.to_datetime(df['time'])
                   .groupby(df['id']).transform(np.ptp)
                   .add(pd.Timedelta('30min'))
                   # optional, if you want a string
                   .astype(str).str.extract(r'(\d\d:\d\d:\d\d)', expand=False)
                  )
Output:
id time total_hrs
0 A 9:00 AM 04:33:00
1 A 11:12 AM 04:33:00
2 A 1:03 PM 04:33:00
3 B 9:00 AM 04:48:00
4 B 12:56 PM 04:48:00
5 B 1:07 PM 04:48:00
6 B 1:18 PM 04:48:00
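A minimal sketch of the same idea with an explicit parse format, assuming every value follows the 12-hour "%I:%M %p" pattern (that format string and the lambda standing in for np.ptp are assumptions, not part of the answer above). Passing the format avoids relying on format inference and keeps the result as a real timedelta column:

```python
import pandas as pd

df = pd.DataFrame(
    [["A", "9:00 AM"], ["A", "11:12 AM"], ["A", "1:03 PM"],
     ["B", "9:00 AM"], ["B", "12:56 PM"], ["B", "1:07 PM"], ["B", "1:18 PM"]],
    columns=["id", "time"],
)

# Assumption: all strings look like "9:00 AM" / "12:56 PM".
parsed = pd.to_datetime(df["time"], format="%I:%M %p")

# Per-id range (max - min), broadcast back to every row of the group,
# then shifted by the fixed 30 minute offset.
span = parsed.groupby(df["id"]).transform(lambda s: s.max() - s.min())
df["total_hrs"] = span + pd.Timedelta("30min")

print(df)
```

The original time column is left untouched here; only the new column is derived from the parsed values.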
CodePudding user response:
To improve performance, avoid np.ptp and instead subtract the per-group minimum from the per-group maximum, then assign the result to a new column:
df['time'] = pd.to_datetime(df['time'])
g = df.groupby('id')['time']
df['total_hrs'] = g.transform('max').sub(g.transform('min')).add(pd.Timedelta('30min'))
print (df)
id time total_hrs
0 A 2023-01-18 09:00:00 0 days 04:33:00
1 A 2023-01-18 11:12:00 0 days 04:33:00
2 A 2023-01-18 13:03:00 0 days 04:33:00
3 B 2023-01-18 09:00:00 0 days 04:48:00
4 B 2023-01-18 12:56:00 0 days 04:48:00
5 B 2023-01-18 13:07:00 0 days 04:48:00
6 B 2023-01-18 13:18:00 0 days 04:48:00
Or aggregate the minimum and maximum per group and map the difference back by id:
df['time'] = pd.to_datetime(df['time'])
df1 = df.groupby('id')['time'].agg(['min','max'])
df['total_hrs'] = df['id'].map(df1['max'].sub(df1['min'])).add(pd.Timedelta('30min'))
print (df)
id time total_hrs
0 A 2023-01-18 09:00:00 0 days 04:33:00
1 A 2023-01-18 11:12:00 0 days 04:33:00
2 A 2023-01-18 13:03:00 0 days 04:33:00
3 B 2023-01-18 09:00:00 0 days 04:48:00
4 B 2023-01-18 12:56:00 0 days 04:48:00
5 B 2023-01-18 13:07:00 0 days 04:48:00
6 B 2023-01-18 13:18:00 0 days 04:48:00
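As a quick sanity check, the two variants can be compared side by side. This is only a sketch: the explicit format="%I:%M %p" and the names via_transform and via_map are assumptions introduced for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [["A", "9:00 AM"], ["A", "11:12 AM"], ["A", "1:03 PM"],
     ["B", "9:00 AM"], ["B", "12:56 PM"], ["B", "1:07 PM"], ["B", "1:18 PM"]],
    columns=["id", "time"],
)

times = pd.to_datetime(df["time"], format="%I:%M %p")
offset = pd.Timedelta("30min")
g = times.groupby(df["id"])

# Variant 1: broadcast max and min to every row, then subtract.
via_transform = g.transform("max") - g.transform("min") + offset

# Variant 2: one row per id with min/max, then map the difference back by id.
span = g.agg(["min", "max"])
via_map = df["id"].map(span["max"] - span["min"]) + offset

# Both variants should yield the same timedelta for every row.
print(via_transform.tolist() == via_map.tolist())
```

Which variant is faster on a given dataset is not shown here and would need to be measured.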
Finally, to format the timedeltas, use a custom function (it also works when the result, after adding 30 minutes, is 24 hours or more):
def f(x):
    # x is a single Timedelta; split its total seconds into h, m, s
    ts = x.total_seconds()
    hours, remainder = divmod(ts, 3600)
    minutes, seconds = divmod(remainder, 60)
    return ('{:02d}:{:02d}:{:02d}').format(int(hours), int(minutes), int(seconds))
df['total_hrs'] = df['total_hrs'].apply(f)
print (df)
id time total_hrs
0 A 2023-01-18 09:00:00 04:33:00
1 A 2023-01-18 11:12:00 04:33:00
2 A 2023-01-18 13:03:00 04:33:00
3 B 2023-01-18 09:00:00 04:48:00
4 B 2023-01-18 12:56:00 04:48:00
5 B 2023-01-18 13:07:00 04:48:00
6 B 2023-01-18 13:18:00 04:48:00
Change the sample data to see the output when a timedelta exceeds 1 day:
df = pd.DataFrame([["A","9:00 AM"],["A","00:12 AM"],["A","11:53 PM"],
["B","9:00 AM"],["B","12:56 PM"],["B","1:07 PM"],
["B","1:18 PM"]],columns=["id","time"])
df['time'] = pd.to_datetime(df['time'])
df1 = df.groupby('id')['time'].agg(['min','max'])
df['total_hrs'] = df['id'].map(df1['max'].sub(df1['min'])).add(pd.Timedelta('30min'))
print (df)
id time total_hrs
0 A 2023-01-18 09:00:00 1 days 00:11:00
1 A 2023-01-18 00:12:00 1 days 00:11:00
2 A 2023-01-18 23:53:00 1 days 00:11:00
3 B 2023-01-18 09:00:00 0 days 04:48:00
4 B 2023-01-18 12:56:00 0 days 04:48:00
5 B 2023-01-18 13:07:00 0 days 04:48:00
6 B 2023-01-18 13:18:00 0 days 04:48:00
def f(x):
    # x is a single Timedelta; split its total seconds into h, m, s
    ts = x.total_seconds()
    hours, remainder = divmod(ts, 3600)
    minutes, seconds = divmod(remainder, 60)
    return ('{:02d}:{:02d}:{:02d}').format(int(hours), int(minutes), int(seconds))
df['total_hrs'] = df['total_hrs'].apply(f)
print (df)
id time total_hrs
0 A 2023-01-18 09:00:00 24:11:00
1 A 2023-01-18 00:12:00 24:11:00
2 A 2023-01-18 23:53:00 24:11:00
3 B 2023-01-18 09:00:00 04:48:00
4 B 2023-01-18 12:56:00 04:48:00
5 B 2023-01-18 13:07:00 04:48:00
6 B 2023-01-18 13:18:00 04:48:00
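If calling a Python function per row becomes a bottleneck, the same HH:MM:SS formatting can be sketched with column-wise operations. The helper name format_hms is hypothetical, and the sketch assumes the column holds no missing values and no negative timedeltas:

```python
import pandas as pd

def format_hms(td: pd.Series) -> pd.Series:
    """Format a timedelta Series as HH:MM:SS, letting hours run past 24."""
    # Assumption: no NaT and no negative values in td.
    total = td.dt.total_seconds().astype("int64")
    hours = total // 3600
    minutes = (total % 3600) // 60
    seconds = total % 60
    return (
        hours.astype(str).str.zfill(2)
        + ":" + minutes.astype(str).str.zfill(2)
        + ":" + seconds.astype(str).str.zfill(2)
    )

sample = pd.Series(pd.to_timedelta(["0 days 04:33:00", "1 days 00:11:00"]))
formatted = format_hms(sample)
print(formatted.tolist())  # ['04:33:00', '24:11:00']
```

Folding the days into the hour count matters here: extracting only the HH:MM:SS part from the string "1 days 00:11:00" would give 00:11:00 and silently drop the day.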