After sorting and grouping my Pandas DataFrame, I want to retrieve values from the lower rows if the top row field is NULL:
My input DataFrame is:
import pandas

df = pandas.DataFrame({'ENr': ['EDFT2Z', 'EDFT2Z', 'EDFT2Z', 'EDFT2Z', 'EDFTDA', 'EDFTDA'],
                       'Rnk': [1, 2, 3, 4, 1, 2],
                       'StA': [None, 'W', 'F', 'F', None, None],
                       'Val': ['abc', 'def', 'ghi', 'jkl', 'mno', 'pqr']})
Sorting it by rank (note that sort_values returns a new DataFrame, so the result must be assigned) and performing a groupby:
df = df.sort_values('Rnk')
grouped = df.groupby('ENr')
I receive the grouped data:
ENr Rnk StA Val
------------------------
0 EDFT2Z 1 None abc \
1 EDFT2Z 2 W def | group 1
2 EDFT2Z 3 F ghi |
3 EDFT2Z 4 F jkl /
------------------------
4 EDFTDA 1 None mno \ group 2
5 EDFTDA 2 None pqr /
Now I want to replace the None value in the top-ranked row of each group if the lower-ranked rows have non-None entries. The replacement value should be the one that occurs most often in the group.
In my example, StA = None in row 0 should be replaced with 'F'. For row 4 the None remains, as there is no better information in the group.
Expected result:
ENr Rnk StA Val
------------------------
0 EDFT2Z 1 F abc
------------------------
4 EDFTDA 1 None mno
CodePudding user response:
You can apply a custom function to your groups that does what you describe:
def get_first(group):
    result = group.iloc[0].copy()  # copy to avoid mutating a view of the group
    if pandas.isna(result['StA']):  # catches both None and NaN
        most_common = group['StA'].value_counts(dropna=False).index[0]
        result['StA'] = most_common
    return result
df.groupby('ENr').apply(get_first).reset_index(drop=True)
gives
ENr Rnk StA Val
0 EDFT2Z 1 F abc
1 EDFTDA 1 None mno
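As a sketch of an alternative approach (not from the answer above), the same fill can be done without a custom apply by computing each group's most frequent non-null value with Series.mode, assuming Rnk == 1 always marks the top row:

```python
import pandas as pd

df = pd.DataFrame({'ENr': ['EDFT2Z', 'EDFT2Z', 'EDFT2Z', 'EDFT2Z', 'EDFTDA', 'EDFTDA'],
                   'Rnk': [1, 2, 3, 4, 1, 2],
                   'StA': [None, 'W', 'F', 'F', None, None],
                   'Val': ['abc', 'def', 'ghi', 'jkl', 'mno', 'pqr']})

# Most frequent non-null StA per group (None if the group has no non-null values).
mode_per_group = df.groupby('ENr')['StA'].agg(
    lambda s: s.mode().iloc[0] if not s.mode().empty else None)

# Keep only the top-ranked rows and fill their missing StA from the group mode.
# fillna aligns on the index and skips positions where the fill value is itself NA,
# so EDFTDA keeps its None.
top = df[df['Rnk'] == 1].copy()
top['StA'] = top['StA'].fillna(top['ENr'].map(mode_per_group))
```

This avoids the row-wise apply, at the cost of handling only one column at a time.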
If you want to avoid hard-coding the columns to check for NA values, you can dynamically check them and iterate over each to do the above:
na_cols = result.index[result.isna()]
for na_col in na_cols:
    most_common = group[na_col].value_counts(dropna=False).index[0]
    result[na_col] = most_common
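Putting the pieces together, a self-contained sketch of the generalized version (column names taken from the example above; value_counts is used with its default dropna=True here so that a column stays NA when the group has no real values to offer):

```python
import pandas as pd

df = pd.DataFrame({'ENr': ['EDFT2Z', 'EDFT2Z', 'EDFT2Z', 'EDFT2Z', 'EDFTDA', 'EDFTDA'],
                   'Rnk': [1, 2, 3, 4, 1, 2],
                   'StA': [None, 'W', 'F', 'F', None, None],
                   'Val': ['abc', 'def', 'ghi', 'jkl', 'mno', 'pqr']})

def get_first(group):
    # Top-ranked row of the group (the input must already be sorted by 'Rnk').
    result = group.iloc[0].copy()
    na_cols = result.index[result.isna()]
    for na_col in na_cols:
        counts = group[na_col].value_counts()  # dropna=True: count real values only
        if not counts.empty:                   # keep NA if the group has nothing better
            result[na_col] = counts.index[0]
    return result

out = df.sort_values('Rnk').groupby('ENr').apply(get_first).reset_index(drop=True)
print(out)
```

On the example data this yields 'F' for EDFT2Z and leaves EDFTDA's StA missing.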