Home > Enterprise >  Pandas: group and sum dataframe while adding min and max year
Pandas: group and sum dataframe while adding min and max year

Time:02-22

Consider the following dataframe:

import pandas as pd
from pandas import DataFrame

df = pd.DataFrame({'ID': ['A','A','A','B','B','B','C','C','C'], 'YEAR': [2000,2001,2002,2007,2008,2009,2015,2016,2017],
                  'ITEM-A': [100,200,300,700,800,900,1100,1200,1300], 'ITEM-B':[4000,5000,6000,3000,3500,4000,500,600,800]})

Initial dataframe

I want to group by the ID and sum ITEM-A and ITEM-B, but I also want to add 2 columns to include the min and max year (respectively) for each group. My ultimate goal is to end up with a dataframe that looks something like this:

Final dataframe example

What is the best way to accomplish this? My initial thought is to add the two new columns and populate each row with the min and max which should look something like this:

Min-max example

Then I could use a groupby/sum function to get the totals and drop the YEAR column.

I tried this code to add the columns but all I got were NaN values:

df['MIN-YR'] = df.groupby('ID')['YEAR'].min()
df['MAX-YR'] = df.groupby('ID')['YEAR'].max()

NaN screenshot

Is there a better way to accomplish what I'm trying to do? Am I on the right track? If so, what is causing the NaN values in the MIN-YR and MAX-YR columns?

Thanks!

CodePudding user response:

You could use agg and specify what function(s) to use for the aggregate per column:

(
 df
    .groupby("ID")
    .agg({"YEAR": ["min", "max"], "ITEM-A": "sum", "ITEM-B": "sum"})
)

This will generate the following table:

enter image description here

CodePudding user response:

You can do it like this, using Named Aggregation, without multiindex column headers nor renaming:

df.groupby('ID', as_index=False).agg(**{'ITEM-A':('ITEM-A','sum'),
                                        'ITEM-B':('ITEM-B','sum'),
                                        'MIN-YR':('YEAR', 'min'),
                                        'MAX-YR':('YEAR','max')})

Output:

  ID  ITEM-A  ITEM-B  MIN-YR  MAX-YR
0  A     600   15000    2000    2002
1  B    2400   10500    2007    2009
2  C    3600    1900    2015    2017
  • Related