Create a new column which calculates the difference between last value and the first value of time column

I have a dataframe

df = pd.DataFrame([["A","9:00 AM"],["A","11:12 AM"],["A","1:03 PM"],["B","9:00 AM"],["B","12:56 PM"],["B","1:07 PM"],
                   ["B","1:18 PM"]],columns=["id","time"])

id    time
A   09:00 AM
A   11:12 AM
A   01:03 PM
B   09:00 AM
B   12:56 PM
B   01:07 PM
B   01:18 PM

I want to create a new column that holds, for each id, the difference between the last and the first value of the time column, plus an offset of 30 min.

Example: for id A, the difference between 01:03 PM and 09:00 AM is 4 hr 3 min. Adding the 30 min offset gives 4 hr 33 min. That value should go into a new column total_hrs for all the rows of id A.
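The arithmetic for id A can be checked with the standard library alone. This is only a sketch of the expected result, not the pandas solution; the names FMT, first, last and offset are made up for illustration.

```python
from datetime import datetime, timedelta

FMT = "%I:%M %p"  # 12-hour clock with AM/PM, e.g. "1:03 PM"

first = datetime.strptime("9:00 AM", FMT)  # first value for id A
last = datetime.strptime("1:03 PM", FMT)   # last value for id A

offset = timedelta(minutes=30)
total = (last - first) + offset

print(last - first)  # 4:03:00
print(total)         # 4:33:00
```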

Expected Output:

df_out = pd.DataFrame([["A","9:00 AM","04:33:00"],["A","11:12 AM","04:33:00"],["A","1:03 PM","04:33:00"],["B","9:00 AM","04:48:00"],
                       ["B","12:56 PM","04:48:00"],["B","1:07 PM","04:48:00"],["B","1:18 PM","04:48:00"]],columns=["id","time","total_hrs"])

id    time     total_hrs
A   09:00 AM    04:33:00
A   11:12 AM    04:33:00
A   01:03 PM    04:33:00
B   09:00 AM    04:48:00
B   12:56 PM    04:48:00
B   01:07 PM    04:48:00
B   01:18 PM    04:48:00

CodePudding user response:

Convert the column with to_datetime, then use groupby.transform (here with np.ptp for convenience) to get the range per id. Add the desired Timedelta and optionally convert the result to a string:

import numpy as np
import pandas as pd

df['total_hrs'] = (pd.to_datetime(df['time'])
                     # np.ptp is "peak to peak": max - min within each id
                     .groupby(df['id']).transform(np.ptp)
                     .add(pd.Timedelta('30min'))
                     # optional, if you want a string
                     .astype(str).str.extract(r'(\d\d:\d\d:\d\d)', expand=False)
                  )

Output:

  id      time total_hrs
0  A   9:00 AM  04:33:00
1  A  11:12 AM  04:33:00
2  A   1:03 PM  04:33:00
3  B   9:00 AM  04:48:00
4  B  12:56 PM  04:48:00
5  B   1:07 PM  04:48:00
6  B   1:18 PM  04:48:00
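One thing worth knowing about the optional string step: the regular expression keeps only the HH:MM:SS part of the text, so the day count is lost once a total reaches 24 hours. A small sketch with two hypothetical durations (not taken from the question's data) shows the effect:

```python
import pandas as pd

# Hypothetical totals: one below a day, one above a day.
durations = pd.Series(pd.to_timedelta(["0 days 04:33:00", "1 days 00:11:00"]))

# Same string conversion and pattern as in the answer above.
as_text = durations.astype(str)
extracted = as_text.str.extract(r"(\d\d:\d\d:\d\d)", expand=False)

print(extracted.tolist())  # ['04:33:00', '00:11:00']
```

The second value reads as 11 minutes although the duration is 24 hours and 11 minutes, so this shortcut is only safe when every total stays under one day.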

CodePudding user response:

To improve performance, avoid np.ptp and instead subtract the minimal value from the maximal value per group, assigning the result to a new column:

df['time'] = pd.to_datetime(df['time'])

g = df.groupby('id')['time']

df['total_hrs'] = g.transform('max').sub(g.transform('min')).add(pd.Timedelta('30min'))
print(df)
  id                time       total_hrs
0  A 2023-01-18 09:00:00 0 days 04:33:00
1  A 2023-01-18 11:12:00 0 days 04:33:00
2  A 2023-01-18 13:03:00 0 days 04:33:00
3  B 2023-01-18 09:00:00 0 days 04:48:00
4  B 2023-01-18 12:56:00 0 days 04:48:00
5  B 2023-01-18 13:07:00 0 days 04:48:00
6  B 2023-01-18 13:18:00 0 days 04:48:00

Or:

df['time'] = pd.to_datetime(df['time'])

df1 = df.groupby('id')['time'].agg(['min','max'])

df['total_hrs'] = df['id'].map(df1['max'].sub(df1['min'])).add(pd.Timedelta('30min'))
print(df)

  id                time       total_hrs
0  A 2023-01-18 09:00:00 0 days 04:33:00
1  A 2023-01-18 11:12:00 0 days 04:33:00
2  A 2023-01-18 13:03:00 0 days 04:33:00
3  B 2023-01-18 09:00:00 0 days 04:48:00
4  B 2023-01-18 12:56:00 0 days 04:48:00
5  B 2023-01-18 13:07:00 0 days 04:48:00
6  B 2023-01-18 13:18:00 0 days 04:48:00
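The two variants differ only in how the per-group result gets back onto the rows: transform returns a value for every original row, while agg returns one row per id that map then looks up by id. A minimal sketch comparing them on the question's data follows; the explicit format string is an assumption that every value follows the 12-hour "H:MM AM/PM" pattern, and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [["A", "9:00 AM"], ["A", "11:12 AM"], ["A", "1:03 PM"],
     ["B", "9:00 AM"], ["B", "12:56 PM"], ["B", "1:07 PM"], ["B", "1:18 PM"]],
    columns=["id", "time"],
)

# Assumption: all values match "%I:%M %p"; the date part defaults to 1900-01-01.
t = pd.to_datetime(df["time"], format="%I:%M %p")
offset = pd.Timedelta("30min")
g = t.groupby(df["id"])

# Variant 1: transform broadcasts max and min back to every row.
via_transform = g.transform("max") - g.transform("min") + offset

# Variant 2: one row per id, then looked up through the id column.
span = g.agg(["min", "max"])
via_map = df["id"].map(span["max"] - span["min"]) + offset

print((via_transform == via_map).all())  # True
print(via_transform.iloc[0])             # 0 days 04:33:00
```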

Finally, to format the timedeltas, use a custom function (it also works when the total, after adding 30 minutes, is 24 hours or more):

def f(x):
    # format a Timedelta as HH:MM:SS, letting the hours run past 24
    ts = x.total_seconds()
    hours, remainder = divmod(ts, 3600)
    minutes, seconds = divmod(remainder, 60)
    return '{:02d}:{:02d}:{:02d}'.format(int(hours), int(minutes), int(seconds))

df['total_hrs'] = df['total_hrs'].apply(f)
print(df)
  id                time total_hrs
0  A 2023-01-18 09:00:00  04:33:00
1  A 2023-01-18 11:12:00  04:33:00
2  A 2023-01-18 13:03:00  04:33:00
3  B 2023-01-18 09:00:00  04:48:00
4  B 2023-01-18 12:56:00  04:48:00
5  B 2023-01-18 13:07:00  04:48:00
6  B 2023-01-18 13:18:00  04:48:00

Changing the sample data shows the output when a timedelta is greater than 1 day:

df = pd.DataFrame([["A","9:00 AM"],["A","00:12 AM"],["A","11:53 PM"],
                   ["B","9:00 AM"],["B","12:56 PM"],["B","1:07 PM"],
                   ["B","1:18 PM"]],columns=["id","time"])
    
df['time'] = pd.to_datetime(df['time'])

df1 = df.groupby('id')['time'].agg(['min','max'])

df['total_hrs'] = df['id'].map(df1['max'].sub(df1['min'])).add(pd.Timedelta('30min'))
print(df)
  id                time       total_hrs
0  A 2023-01-18 09:00:00 1 days 00:11:00
1  A 2023-01-18 00:12:00 1 days 00:11:00
2  A 2023-01-18 23:53:00 1 days 00:11:00
3  B 2023-01-18 09:00:00 0 days 04:48:00
4  B 2023-01-18 12:56:00 0 days 04:48:00
5  B 2023-01-18 13:07:00 0 days 04:48:00
6  B 2023-01-18 13:18:00 0 days 04:48:00

def f(x):
    ts = x.total_seconds()
    hours, remainder = divmod(ts, 3600)
    minutes, seconds = divmod(remainder, 60)
    return ('{:02d}:{:02d}:{:02d}').format(int(hours), int(minutes), int(seconds))

df['total_hrs'] = df['total_hrs'].apply(f)
print(df)

  id                time total_hrs
0  A 2023-01-18 09:00:00  24:11:00
1  A 2023-01-18 00:12:00  24:11:00
2  A 2023-01-18 23:53:00  24:11:00
3  B 2023-01-18 09:00:00  04:48:00
4  B 2023-01-18 12:56:00  04:48:00
5  B 2023-01-18 13:07:00  04:48:00
6  B 2023-01-18 13:18:00  04:48:00
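Because the formatter only calls total_seconds(), its behaviour can be checked without pandas on plain datetime.timedelta objects (pd.Timedelta is a subclass of it). The sketch below uses a hypothetical name, format_hms, for the same logic and assumes non-negative durations:

```python
from datetime import timedelta

def format_hms(delta):
    """Return HH:MM:SS for a non-negative duration; hours may exceed 24."""
    total = int(delta.total_seconds())
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"

print(format_hms(timedelta(hours=4, minutes=33)))  # 04:33:00
print(format_hms(timedelta(days=1, minutes=11)))   # 24:11:00
```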