my df
looks like this:
student_id, coll_type, grad_year, completed
1001 2-year NA N
1001 2-year NA N
1001 2-year 2003 Y
1001 2-year 2004 Y
1001 4-year 2007 Y
1002 4-year NA N
1002 2-year NA N
1002 2-year NA N
1003 4-year NA N
1003 4-year NA N
1004 2-year NA N
1004 4-year 2005 Y
1004 4-year 2007 Y
My goal is get this:
student_id, coll_type, year_year, completed
1001 2-year 2003 Y
1001 4-year 2007 Y
1002 4-year NA N
1002 2-year NA N
1003 4-year NA N
1004 2-year NA N
1004 4-year 2005 Y
In summary:
- The goal is to group by
['student_id', 'coll_type']
and thenaggregrate
by the earliest'grad_year'
(if any) and "choose" its corresponding'completed'
value. - if the
'grad_year'
value is "only"NA
for that group['student_id', 'coll_type']
, then we selectNA
and its corresponding'completed'
value, which is alwaysN
What did I do?
df.groupby(['student_id', 'coll_type'])
.agg({
grad_year = ('grad_year', 'min'),
completed = ('completed', 'max')
})
but it does not work, since I am choosing different min
and max
value.
CodePudding user response:
Use groupby.idxmax
on the boolean Series of the NA-status of the grad_year:
out = df.loc[df['grad_year']
.notna()
.groupby([df['student_id'], df['coll_type']])
.idxmax()
]
output:
student_id coll_type grad_year completed
2 1001 2-year 2003.0 Y
4 1001 4-year 2007.0 Y
6 1002 2-year NaN N
5 1002 4-year NaN N
8 1003 4-year NaN N
10 1004 2-year NaN N
11 1004 4-year 2005.0 Y
CodePudding user response:
A user-defined function returns the desired output:
import pandas as pd
def my_fun(x):
if all(x['grad_year'].isnull()):
return (x['grad_year'].iloc[0], x['completed'].iloc[0])
else:
return (x['grad_year'][x['grad_year'].notna()].iloc[0], x['completed'][x['grad_year'].notna()].iloc[0])
df = df.groupby(['student_id', 'coll_type']).apply(my_fun).reset_index(name='new')
df[['grad_year', 'completed']] = pd.DataFrame(df['new'].tolist(), index=df.index)
df.drop('new', axis = 1, inplace = True)
student_id, coll_type, grad_year, completed
0 1001 2-year 2003.0 Y
1 1001 4-year 2007.0 Y
2 1002 2-year NaN N
3 1002 4-year NaN N
4 1003 4-year NaN N
5 1004 2-year NaN N
6 1004 4-year 2005.0 Y