Fill Missing values with max from a group and get rows corresponding to that max value


I have input data as shown:

import numpy as np
import pandas as pd

df = pd.DataFrame({"colony": [22, 22, 22, 33, 33, 33],
                   "measure": [np.nan, 7, 11, 13, np.nan, 9],
                   "Length": [14, 17, 13, 10, 19, 16],
                   "net/gross": [np.nan, "gross", "net", "gross", np.nan, "net"]})


    colony  measure  Length net/gross
0   22      NaN      14     NaN  
1   22      7        17     gross
2   22      11       13     net
3   33      13       10     gross
4   33      NaN      19     NaN
5   33      9        16     net

I want to fill each NaN in the measure column with the maximum measure of its colony group, then fill the NaN in the net/gross column with the net/gross value from the row where measure was maximum. The Length of that same row should go into a length_adj column. For example, the row at index 0 should get "net" in net/gross and 13 in length_adj, because the max measure for colony 22 is at index 2. I also want a remarks column that marks the filled rows as "max_filled" and all other rows as "unchanged", to arrive at the output below:

  colony  measure  net/gross   length_adj  remarks
0   22      11      net        13          max_filled
1   22      7       gross      17          unchanged
2   22      11      net        13          unchanged
3   33      13      gross      10          unchanged
4   33      13      gross      10          max_filled
5   33      9       net        16          unchanged

CodePudding user response:

Here you go:

df['measure'] = df['measure'].fillna(df.groupby('colony')['measure'].transform('max'))
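
The line above fills only the measure column. As a rough sketch of how the same groupby idea could cover the other columns too (this is not the answerer's code; the names missing, best, donor and out are made up for illustration), one option is to look up the row label of each colony's maximum with idxmax and copy that row's values into the rows that had NaN:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "colony": [22, 22, 22, 33, 33, 33],
    "measure": [np.nan, 7, 11, 13, np.nan, 9],
    "Length": [14, 17, 13, 10, 19, 16],
    "net/gross": [np.nan, "gross", "net", "gross", np.nan, "net"],
})

# Rows whose measure is missing; only these get overwritten.
missing = df["measure"].isna()

# colony -> index label of the row holding that colony's max measure
# (idxmax skips NaN by default).
best = df.groupby("colony")["measure"].idxmax()

# For each missing row, the label of the row to copy values from.
donor = df.loc[missing, "colony"].map(best).to_numpy()

out = df.copy()
for col in ["measure", "Length", "net/gross"]:
    # Copy column by column so each column keeps its own dtype.
    out.loc[missing, col] = df.loc[donor, col].to_numpy()

out["remarks"] = np.where(missing, "max_filled", "unchanged")
out = out.rename(columns={"Length": "length_adj"})
print(out)
```

If several rows in a colony tie for the maximum, idxmax picks the first one, so this sketch assumes that is the row you want.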

CodePudding user response:

One approach that allows maximum control of each step (but may be less efficient than more direct pandas methods) is to use apply (with axis=1 to iterate rows) with a custom function, passing the dataframe as an argument as well.

  1. Use np.isnan to check whether the measure value of a row is NaN.
  2. Instead of groupby, select the rows of the current row's colony directly from the dataframe, then get the index label of the maximum measure with idxmax().

def my_func(row, df):
    if np.isnan(row['measure']):
        # Label of the row holding this colony's maximum measure (idxmax skips NaN)
        max_label = df.loc[df['colony'] == row['colony'], 'measure'].idxmax()
        # idxmax returns an index label, so look it up with .loc, not .iloc
        row['measure'] = df.loc[max_label, 'measure']
        row['Length'] = df.loc[max_label, 'Length']
        row['net/gross'] = df.loc[max_label, 'net/gross']
        row['remarks'] = 'max_filled'
    else:
        row['remarks'] = 'unchanged'
    return row

df = df.apply(lambda x: my_func(x, df), axis=1)

Dataframe will be:

    colony  measure  Length  net/gross  remarks
0   22      11       13      net        max_filled
1   22      7        17      gross      unchanged
2   22      11       13      net        unchanged
3   33      13       10      gross      unchanged
4   33      13       10      gross      max_filled
5   33      9        16      net        unchanged

CodePudding user response:


Fill the NaN in the measure column with each colony's max:

s = df.groupby('colony')['measure'].transform(lambda x: x.fillna(x.max()))


0   11.0
1    7.0
2   11.0
3   13.0
4   13.0
5    9.0
Name: measure, dtype: float64

Use s as the measure column (result A):

df[['colony']].assign(measure=s)

    colony  measure
0   22      11.0
1   22      7.0
2   22      11.0
3   33      13.0
4   33      13.0
5   33      9.0


Keep only the rows without NaN, with the column order reversed (df1):

df1 = df[df.columns[::-1]].dropna()


    net/gross   Length  measure colony
1   gross       17      7.0     22
2   net         13      11.0    22
3   gross       10      13.0    33
5   net         16      9.0     33


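Merge result A with df1 (result B). With no on= argument, merge joins on the columns both frames share, here colony and measure: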

df[['colony']].assign(measure=s).merge(df1, how='left')


    colony  measure net/gross   Length
0   22      11.0    net         13
1   22      7.0     gross       17
2   22      11.0    net         13
3   33      13.0    gross       10
4   33      13.0    gross       10
5   33      9.0     net         16
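
The merge relies on the default join keys and on each (colony, measure) pair appearing only once in df1. A hedged sketch of the same step with the keys spelled out and a uniqueness check added (validate="many_to_one" is a standard merge option; the name merged is only for illustration, and df1 here is a plain df.dropna() without the column reversal):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "colony": [22, 22, 22, 33, 33, 33],
    "measure": [np.nan, 7, 11, 13, np.nan, 9],
    "Length": [14, 17, 13, 10, 19, 16],
    "net/gross": [np.nan, "gross", "net", "gross", np.nan, "net"],
})

# Measure with NaN replaced by the colony's max.
s = df.groupby("colony")["measure"].transform(lambda x: x.fillna(x.max()))

# Complete rows only; these supply Length and net/gross.
df1 = df.dropna()

# Explicit keys; validate raises MergeError if a (colony, measure)
# pair occurs more than once in df1, e.g. when the max is tied.
merged = df[["colony"]].assign(measure=s).merge(
    df1, on=["colony", "measure"], how="left", validate="many_to_one"
)
print(merged)
```

Without the check, a tied maximum in df1 would silently duplicate the filled row in the result.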


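Turn result B into the desired output (full code):

import pandas as pd
import numpy as np
# measure with NaN replaced by each colony's max
s = df.groupby('colony')['measure'].transform(lambda x: x.fillna(x.max()))
# complete rows only, used to look up net/gross and Length
df1 = df[df.columns[::-1]].dropna()
# flag the rows that had NaN in the original measure column
s2 = np.where(df['measure'].isna(), 'max_filled', 'unchanged')
(df[['colony']].assign(measure=s).merge(df1, how='left')
 .rename(columns={'Length': 'Length_adj'}).assign(remark=s2))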


    colony  measure net/gross   Length_adj  remark
0   22      11.0    net         13          max_filled
1   22      7.0     gross       17          unchanged
2   22      11.0    net         13          unchanged
3   33      13.0    gross       10          unchanged
4   33      13.0    gross       10          max_filled
5   33      9.0     net         16          unchanged