Using Pandas to aggregate value based on another column aggregated index-CodePudding

my df looks like this:

student_id,    coll_type,  grad_year,   completed    
1001           2-year        NA            N          
1001           2-year        NA            N           
1001           2-year      2003            Y           
1001           2-year      2004            Y           
1001           4-year      2007            Y            

1002           4-year        NA            N              
1002           2-year        NA            N                   
1002           2-year        NA            N                   

1003           4-year        NA            N                    
1003           4-year        NA            N                    

1004           2-year        NA            N                   
1004           4-year      2005            Y  
1004           4-year      2007            Y

My goal is get this:

student_id,    coll_type,  year_year,  completed            
1001           2-year      2003           Y           
1001           4-year      2007           Y            

1002           4-year        NA           N              
1002           2-year        NA           N                   

1003           4-year        NA           N                    

1004           2-year        NA           N                   
1004           4-year      2005           Y

In summary:

The goal is to group by ['student_id', 'coll_type'] and then aggregrate by the earliest 'grad_year' (if any) and "choose" its corresponding 'completed' value.
if the 'grad_year' value is "only" NA for that group ['student_id', 'coll_type'], then we select NA and its corresponding 'completed' value, which is always N

What did I do?

df.groupby(['student_id', 'coll_type'])
  .agg({
grad_year = ('grad_year', 'min'),
completed = ('completed', 'max')
})

but it does not work, since I am choosing different min and max value.

CodePudding user response：

Use groupby.idxmax on the boolean Series of the NA-status of the grad_year:

out = df.loc[df['grad_year']
              .notna()
              .groupby([df['student_id'], df['coll_type']])
              .idxmax()
             ]

output:

    student_id coll_type  grad_year completed
2         1001    2-year     2003.0         Y
4         1001    4-year     2007.0         Y
6         1002    2-year        NaN         N
5         1002    4-year        NaN         N
8         1003    4-year        NaN         N
10        1004    2-year        NaN         N
11        1004    4-year     2005.0         Y

CodePudding user response：

A user-defined function returns the desired output:

import pandas as pd

def my_fun(x):
    if all(x['grad_year'].isnull()):
        return (x['grad_year'].iloc[0], x['completed'].iloc[0])
    else:
        return (x['grad_year'][x['grad_year'].notna()].iloc[0], x['completed'][x['grad_year'].notna()].iloc[0])



df = df.groupby(['student_id', 'coll_type']).apply(my_fun).reset_index(name='new')
df[['grad_year', 'completed']] = pd.DataFrame(df['new'].tolist(), index=df.index)
df.drop('new', axis = 1, inplace = True)

   student_id, coll_type,  grad_year, completed
0         1001     2-year      2003.0         Y
1         1001     4-year      2007.0         Y
2         1002     2-year         NaN         N
3         1002     4-year         NaN         N
4         1003     4-year         NaN         N
5         1004     2-year         NaN         N
6         1004     4-year      2005.0         Y