Adding max and min rows to a groupby result

Time:10-21

I have a dataframe that looks something like this:

df = pd.DataFrame([[1,'A','X','1/3/22 12:00:00AM','1/1/22 12:00:00 AM'],
                   [1,'A','X','1/4/22 1:00:00AM','1/2/22 12:00:00 AM'],
                   [1,'A','Y','1/1/22 1:00:00AM','1/1/22 12:00:00 AM'],
                   [1,'B','X','1/1/22 1:00:00AM','1/1/22 12:00:00 AM'],
                   [2,'A','X','1/3/22 12:00:00AM','1/1/22 12:00:00 AM'],
                   [2,'A','X','1/4/22 12:00:00AM','1/2/22 12:00:00 AM']],
                  columns=['ID','Category','Site','Task Completed','Access Completed'])

ID  Category  Site  Task Completed      Access Completed
1   A         X     1/3/22 12:00:00AM   1/1/22 12:00:00 AM
1   A         Y     1/1/22 1:00:00AM    1/1/22 12:00:00 AM
1   A         X     1/4/22 12:00:00AM   1/2/22 12:00:00 AM
1   B         X     1/1/22 1:00:00AM    1/1/22 12:00:00 AM
2   A         X     1/3/22 12:00:00AM   1/1/22 12:00:00 AM
2   A         X     1/4/22 12:00:00AM   1/2/22 12:00:00 AM

What I want to find is the time difference (in hours) between the latest Access Completed date and the first Task Completed date for every ID/Category/Site combination in the dataset. I also want to include that first Task Completed date and the latest Access Completed date alongside the result.

I am able to get the first Task Completed date and calculate its difference from the Access Completed date on the same row. I am also able to show both of those dates alongside the result. But I am not able to get the latest Access Completed date for each ID/Category/Site combination. Here's what I have so far:

import pandas as pd

cols = ['ID','Category','Site','Task Completed','Access Completed']

df = pd.DataFrame([[1,'A','X','1/3/22 12:00:00AM','1/1/22 12:00:00 AM'],
                   [1,'A','X','1/4/22 1:00:00AM','1/2/22 12:00:00 AM'],
                   [1,'A','Y','1/1/22 1:00:00AM','1/1/22 12:00:00 AM'],
                   [1,'B','X','1/1/22 1:00:00AM','1/1/22 12:00:00 AM'],
                   [2,'A','X','1/3/22 12:00:00AM','1/1/22 12:00:00 AM'],
                   [2,'A','X','1/4/22 12:00:00AM','1/2/22 12:00:00 AM']],
                  columns=cols)

#Convert to datetime
df[['Task Completed','Access Completed']] = df[['Task Completed','Access Completed']].apply(lambda x: pd.to_datetime(x))

# Remove duplicate rows - only keep the first task completed per ID/Category/Site.
res = df.sort_values('Task Completed')\
    .drop_duplicates(subset=["ID", "Category", 'Site'], keep='first')\
    .sort_index()

# Calculate time difference
res['Time Difference'] = res['Task Completed'].sub(res['Access Completed']).dt.total_seconds().div(3600)

#Re-order and re-name columns
cols.insert(3,'Time Difference')
res = res[cols].rename(columns={"Task Completed": "First Task Completed"})

# Convert the dates back to desired format
res["First Task Completed"] = res["First Task Completed"].dt.strftime('%m/%d/%Y %I:%M:%S %p')
res["Access Completed"] = res["Access Completed"].dt.strftime('%m/%d/%Y %I:%M:%S %p')

print(res)

I've tried to add a .max() to 'Access Completed' like so:

res['Time Difference'] = res['Task Completed'].sub(res['Access Completed'].max()).dt.total_seconds().div(3600)

But that doesn't seem to give me the answer I want.
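
A likely reason: `Series.max()` returns a single scalar for the whole column, so every row is compared against the same timestamp, and by that point `res` has already been reduced to one row per group. A minimal sketch of one way around this, assuming the per-group maximum is computed on the full `df` before the duplicates are dropped. The `Last Access Completed` column name and the `keys` and `global_max` variables are illustrative, and the timestamps are rewritten in ISO form only to keep parsing unambiguous:

```python
import pandas as pd

# Same sample rows as above, with timestamps written in ISO form
df = pd.DataFrame(
    [
        [1, 'A', 'X', '2022-01-03 00:00', '2022-01-01 00:00'],
        [1, 'A', 'X', '2022-01-04 01:00', '2022-01-02 00:00'],
        [1, 'A', 'Y', '2022-01-01 01:00', '2022-01-01 00:00'],
        [1, 'B', 'X', '2022-01-01 01:00', '2022-01-01 00:00'],
        [2, 'A', 'X', '2022-01-03 00:00', '2022-01-01 00:00'],
        [2, 'A', 'X', '2022-01-04 00:00', '2022-01-02 00:00'],
    ],
    columns=['ID', 'Category', 'Site', 'Task Completed', 'Access Completed'],
)
for col in ['Task Completed', 'Access Completed']:
    df[col] = pd.to_datetime(df[col])

keys = ['ID', 'Category', 'Site']

# Series.max() collapses the whole column to one timestamp
global_max = df['Access Completed'].max()

# transform('max') keeps one value per row: the maximum within that row's group
df['Last Access Completed'] = df.groupby(keys)['Access Completed'].transform('max')

# Keep the row with the earliest task per group, as in the original approach
res = (df.sort_values('Task Completed')
         .drop_duplicates(subset=keys, keep='first')
         .sort_index())

# Hours between the first task and the group's latest access date
res['Time Difference'] = (
    (res['Task Completed'] - res['Last Access Completed']).dt.total_seconds() / 3600
)
print(res[keys + ['Time Difference', 'Task Completed', 'Last Access Completed']])
```

Because `transform` returns a Series aligned with the original rows, the per-group maximum survives the later `drop_duplicates` step.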

This is my intended result:

ID  Category  Site  Time Difference  First Task Completed  Last Access Completed
1   A         X     24               1/3/22 12:00:00AM     1/2/22 12:00:00 AM
1   A         Y     1                1/1/22 1:00:00AM      1/1/22 12:00:00 AM
1   B         X     1                1/1/22 1:00:00AM      1/1/22 12:00:00 AM
2   A         X     24               1/3/22 12:00:00AM     1/2/22 12:00:00 AM

Thanks in advance for your help.

CodePudding user response:

You can use a groupby aggregation:

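out = (df
   # one output row per ID/Category/Site combination
   .groupby(['ID', 'Category', 'Site'], as_index=False)
   # 'first' takes the first row of each group in the current row order, so this
   # assumes rows are already ordered by 'Task Completed'; 'max' is the latest access date
   .agg({'Task Completed': 'first', 'Access Completed': 'max'})
   # difference in whole hours (floordiv drops any fractional hour)
   .assign(**{'Time Difference': lambda d: d['Task Completed']
              .sub(d['Access Completed'])
              .dt.total_seconds().floordiv(3600)})
)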

output:

   ID Category Site      Task Completed Access Completed  Time Difference
0   1        A    X 2022-01-03 00:00:00       2022-01-02             24.0
1   1        A    Y 2022-01-01 01:00:00       2022-01-01              1.0
2   1        B    X 2022-01-01 01:00:00       2022-01-01              1.0
3   2        A    X 2022-01-03 00:00:00       2022-01-02             24.0
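
If the column names and order from the intended result are wanted, named aggregation is one option. The following is a sketch under a few assumptions, not the only way to do it: it uses `'min'` for the earliest task so the result does not depend on row order, it writes the sample timestamps in ISO form to keep parsing unambiguous, and the `fmt` string is an illustrative choice of 12-hour display format.

```python
import pandas as pd

# Same sample rows as in the question, with timestamps written in ISO form
df = pd.DataFrame(
    [
        [1, 'A', 'X', '2022-01-03 00:00', '2022-01-01 00:00'],
        [1, 'A', 'X', '2022-01-04 01:00', '2022-01-02 00:00'],
        [1, 'A', 'Y', '2022-01-01 01:00', '2022-01-01 00:00'],
        [1, 'B', 'X', '2022-01-01 01:00', '2022-01-01 00:00'],
        [2, 'A', 'X', '2022-01-03 00:00', '2022-01-01 00:00'],
        [2, 'A', 'X', '2022-01-04 00:00', '2022-01-02 00:00'],
    ],
    columns=['ID', 'Category', 'Site', 'Task Completed', 'Access Completed'],
)
for col in ['Task Completed', 'Access Completed']:
    df[col] = pd.to_datetime(df[col])

# Named aggregation: output column name -> (input column, aggregation)
out = (
    df.groupby(['ID', 'Category', 'Site'], as_index=False)
      .agg(**{
          'First Task Completed': ('Task Completed', 'min'),
          'Last Access Completed': ('Access Completed', 'max'),
      })
)

# Hours between the earliest task and the latest access date, placed after 'Site'
hours = (out['First Task Completed'] - out['Last Access Completed']).dt.total_seconds() / 3600
out.insert(3, 'Time Difference', hours)

# Convert the dates to strings for display (%I is the 12-hour clock, paired with %p)
fmt = '%m/%d/%Y %I:%M:%S %p'
for col in ['First Task Completed', 'Last Access Completed']:
    out[col] = out[col].dt.strftime(fmt)

print(out)
```

Formatting the dates last matters here: once the columns are strings, they can no longer be subtracted.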