Home > Mobile >  How to get values from lower rows into the first row of a pandas group?
How to get values from lower rows into the first row of a pandas group?

Time:06-09

After sorting and grouping my Pandas DataFrame, I want to retrieve values from the lower rows if the top row field is NULL:

My input DataFrame is:

df = pandas.DataFrame({'ENr': ['EDFT2Z', 'EDFT2Z', 'EDFT2Z', 'EDFT2Z', 'EDFTDA', 'EDFTDA'],
                   'Rnk': [1, 2, 3, 4, 1, 2],
                   'StA': [None, 'W', 'F', 'F', None, None],
                   'Val': ['abc', 'def', 'ghi', 'jkl', 'mno',  'pqr']})

Sorting it according to rank and perform a groupby

df.sort_values('Rnk')
grouped = df.groupby('ENr')

I recieve the grouped data

    ENr     Rnk StA  Val
------------------------ 
0   EDFT2Z  1   None abc  \
1   EDFT2Z  2   W    def  | group 1
2   EDFT2Z  3   F    ghi  |
3   EDFT2Z  4   F    jkl  /
------------------------
4   EDFTDA  1   None mno \ group 2
5   EDFTDA  2   None pqr /

Now I want to replace the None-values of the top ranked group row if the lower ranked rows have non-None entries. The replacing value should be the one that occurs most in the group.

In my example the StA = None in row 0 should be replaced with a 'F'. For row 4 the None remains, as there is no better information in the group.

Expected result:

    ENr     Rnk StA  Val
------------------------
0   EDFT2Z  1   F    abc
------------------------
4   EDFTDA  1   None mno

CodePudding user response:

You can apply a custom function to your groups that does what you describe:

def get_first(group):
    result = group.iloc[0]
    if not result['StA']:
        most_common = group['StA'].value_counts(dropna=False).index[0]
        result['StA'] = most_common
    return result

df.groupby('ENr').apply(get_first).reset_index(drop=True)

gives

      ENr  Rnk   StA  Val
0  EDFT2Z    1     F  abc
1  EDFTDA    1  None  mno

If you want to avoid hard-coding the columns to check for NA values, you can dynamically check them and iterate over each to do the above:

na_cols = result.index[result.isna()]
for na_col in na_cols:
    most_common = group[na_col].value_counts(dropna=False).index[0]
    result[na_col] = most_common
  • Related