Pandas: Get first and last row of the same date and calculate time difference

I have a dataframe with a date column and a time column. Each row describes some event. I want to calculate the timespan for each distinct day and add it as a new column. The actual calculation (units etc.) is not that important; I just want to know how I can get the first and last row for each date so that I can access the time values. The dataframe is already sorted by date, and all rows of the same date are also ordered by time.

Minimal example of what I have

import pandas as pd

df = pd.DataFrame({"Date": ["01.01.2020", "01.01.2020", "01.01.2020", "02.02.2022", "02.02.2022"], 
                   "Time": ["12:00", "13:00", "14:45", "02:00", "08:00"]})
df

and what I want

EDIT: The duration column should be calculated by 14:45 - 12:00 = 2:45 for the first date and 08:00 - 02:00 = 6:00 for the second date.

I suspect this is possible with the groupby function but I am not sure how exactly to do it.

CodePudding user response：

The example below shows one way to achieve what you want.

# earliest time among all rows that share this row's date
df['Start time'] = df.apply(lambda row: df[df['Date'] == row['Date']]['Time'].min(), axis=1)
df

Update:

import datetime

def day_span(row):
    # times of all rows sharing this row's date (zero-padded HH:MM strings sort chronologically)
    times = df[df['Date'] == row['Date']]['Time']
    end = datetime.datetime.strptime(times.max(), '%H:%M')
    start = datetime.datetime.strptime(times.min(), '%H:%M')
    return str(end - start)

df['Duration'] = df.apply(day_span, axis=1)
df

CodePudding user response：

I hope you will find this helpful.

import pandas as pd

df = pd.DataFrame({"Date": ["01.01.2020", "01.01.2020", "01.01.2020", "02.02.2022", "02.02.2022"],
                   "Time": ["12:00", "13:00", "14:45", "02:00", "08:00"]})

df["Datetime"] = pd.to_datetime(df["Date"] + " " + df["Time"], format="%d.%m.%Y %H:%M")


def date_diff(df):
    df["Duration"] = df["Datetime"].max() - df["Datetime"].min()
    return df


df = df.groupby("Date").apply(date_diff)

df = df.drop("Datetime", axis=1)

Output:

         Date   Time        Duration
0  01.01.2020  12:00 0 days 02:45:00
1  01.01.2020  13:00 0 days 02:45:00
2  01.01.2020  14:45 0 days 02:45:00
3  02.02.2022  02:00 0 days 06:00:00
4  02.02.2022  08:00 0 days 06:00:00

You can then do some string styling:

df['Duration'] = df['Duration'].astype(str).map(lambda x: x[7:12])

Output:

         Date   Time Duration
0  01.01.2020  12:00    02:45
1  01.01.2020  13:00    02:45
2  01.01.2020  14:45    02:45
3  02.02.2022  02:00    06:00
4  02.02.2022  08:00    06:00
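
Since the question states that rows are already ordered by time within each date, the same result can probably be reached with `transform("first")` and `transform("last")` instead of `groupby(...).apply`, whose handling of the index and the grouping column has changed across pandas versions. The following is only a sketch under that sorting assumption; the names `stamps`, `grouped`, `duration` and `parts` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Date": ["01.01.2020", "01.01.2020", "01.01.2020", "02.02.2022", "02.02.2022"],
                   "Time": ["12:00", "13:00", "14:45", "02:00", "08:00"]})

# Parse with an explicit day-first format so the day and month are never swapped.
stamps = pd.to_datetime(df["Date"] + " " + df["Time"], format="%d.%m.%Y %H:%M")

# Assumption: rows are sorted by time within each date,
# so the first/last row of a group is its earliest/latest event.
grouped = stamps.groupby(df["Date"])
duration = grouped.transform("last") - grouped.transform("first")

# Build "HH:MM" from the timedelta components instead of slicing its string form.
parts = duration.dt.components
df["Duration"] = (parts["hours"].astype(str).str.zfill(2)
                  + ":"
                  + parts["minutes"].astype(str).str.zfill(2))

print(df)
```

If the rows were not sorted, swapping `"first"`/`"last"` for `"min"`/`"max"` should give the same durations.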

CodePudding user response：

You can use:

from datetime import timedelta
import pandas as pd

# combine date and time into one datetime column
df['xdate'] = pd.to_datetime(df['Date'] + ' ' + df['Time'], format='%d.%m.%Y %H:%M')

df['max'] = df.groupby(df['xdate'].dt.date)['xdate'].transform('max')  # latest timestamp of each date
df['min'] = df.groupby(df['xdate'].dt.date)['xdate'].transform('min')  # earliest timestamp of each date

# subtract the earliest time of day from the latest timestamp, then format the result as HH:MM
df['Duration'] = df[['min', 'max']].apply(lambda x: x['max'] - timedelta(hours=x['min'].hour, minutes=x['min'].minute, seconds=x['min'].second), axis=1).dt.strftime('%H:%M')

df = df.drop(['xdate', 'min', 'max'], axis=1)

print(df)
'''
         Date   Time Duration
0  01.01.2020  12:00    02:45
1  01.01.2020  13:00    02:45
2  01.01.2020  14:45    02:45
3  02.02.2022  02:00    06:00
4  02.02.2022  08:00    06:00
'''

CodePudding user response：

Here is one way to do it:

# groupby on Date and find the difference of max and min time in each group
# format it as HH:MM by extracting Hours and minutes
# and creating a dictionary
d=dict((df.groupby('Date')['Time'].apply(lambda x: 
                                  (pd.to_timedelta(x.max() + ':00') -
                                   pd.to_timedelta(x.min() + ':00')
                                  )
                                 ).astype(str).str.extract(r'days (..:..)')
).reset_index().values)


# map the dictionary and update the duration in DF
df['duration']=df['Date'].map(d)
df

         Date   Time duration
0  01.01.2020  12:00    02:45
1  01.01.2020  13:00    02:45
2  01.01.2020  14:45    02:45
3  02.02.2022  02:00    06:00
4  02.02.2022  08:00    06:00
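
The dictionary above is effectively a per-date lookup table. If the first and last time of each date are also needed, as the question title asks, a small summary frame may be easier to inspect. This is a rough sketch that again assumes the rows are sorted by time within each date; `summary` is a hypothetical name, and the duration is left as a timedelta rather than formatted as text.

```python
import pandas as pd

df = pd.DataFrame({"Date": ["01.01.2020", "01.01.2020", "01.01.2020", "02.02.2022", "02.02.2022"],
                   "Time": ["12:00", "13:00", "14:45", "02:00", "08:00"]})

# One row per date holding the first and last time string of that date
# (sort=False keeps the dates in their order of appearance).
summary = df.groupby("Date", sort=False)["Time"].agg(["first", "last"])

# Append ":00" so the strings have the HH:MM:SS form before converting to timedeltas.
summary["duration"] = (pd.to_timedelta(summary["last"] + ":00")
                       - pd.to_timedelta(summary["first"] + ":00"))

# Broadcast the per-date duration back onto every row of the original frame.
df["duration"] = df["Date"].map(summary["duration"])

print(summary)
print(df)
```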