Fill missing values with the max from a group and get the rows corresponding to that max value

I have input data as shown:

import numpy as np
import pandas as pd

df = pd.DataFrame({"colony": [22, 22, 22, 33, 33, 33],
                   "measure": [np.nan, 7, 11, 13, np.nan, 9],
                   "Length": [14, 17, 13, 10, 19, 16],
                   "net/gross": [np.nan, "gross", "net", "gross", np.nan, "net"]})

df

    colony  measure  Length net/gross
0   22      NaN      14     NaN  
1   22      7        17     gross
2   22      11       13     net
3   33      13       10     gross
4   33      NaN      19     NaN
5   33      9        16     net

I want to fill the NaN values in the measure column with the maximum value from each colony group, then fill the NaN in the net/gross column with the net/gross value from the row where the measure was maximum. For example, the NaN at index 0 should be filled with "net", the value at the row where the measure is at its maximum for colony 22, and the length_adj column should take that row's Length, 13. I also want a remarks column that marks every NaN-filled row as "max_filled" and all other rows as "unchanged", to arrive at the output below:

  colony  measure  net/gross   length_adj  remarks
0   22      11      net        13          max_filled
1   22      7       gross      17          unchanged
2   22      11      net        13          unchanged
3   33      13      gross      10          unchanged
4   33      13      gross      10          max_filled
5   33      9       net        16          unchanged

CodePudding user response：

Here you go:

df['measure'] = df['measure'].fillna(df.groupby('colony')['measure'].transform('max'))
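
The line above fills only the measure column. Below is a minimal sketch of one way to carry the other columns along and add the remarks. It assumes each colony has a single row holding the maximum (idxmax returns the first one on ties). The names missing, max_label and donor_labels are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "colony": [22, 22, 22, 33, 33, 33],
    "measure": [np.nan, 7, 11, 13, np.nan, 9],
    "Length": [14, 17, 13, 10, 19, 16],
    "net/gross": [np.nan, "gross", "net", "gross", np.nan, "net"],
})

# Remember which rows need filling before anything is overwritten
missing = df["measure"].isna()

# Index label of the max-measure row per colony (NaN is skipped),
# mapped back onto every row of that colony
max_label = df["colony"].map(df.groupby("colony")["measure"].idxmax())

# Labels of the rows to copy from, one per missing row
donor_labels = max_label[missing].to_numpy()

# Copy column by column so each column keeps its own dtype
for col in ["measure", "Length", "net/gross"]:
    df.loc[missing, col] = df.loc[donor_labels, col].to_numpy()

df["remarks"] = np.where(missing, "max_filled", "unchanged")
print(df)
```

Copying one column at a time is a deliberate choice here: assigning a mixed-type 2D block in one go could change the dtype of the numeric columns.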

CodePudding user response：

One approach that gives maximum control over each step (but may be less efficient than more direct pandas methods) is to use apply with axis=1 to iterate over the rows, calling a custom function that also receives the dataframe as an argument.

You can use np.isnan to check whether a given value in a row is NaN.
Without using groupby, you can select, for each row, the sub-dataframe of the corresponding colony group, then get the index label of its maximum value with idxmax().
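
One detail worth keeping in mind: idxmax() returns an index label rather than a position, and it skips NaN by default. A small sketch with a made-up series (the index values 10, 20, 30 are only for illustration) shows why a label-based lookup with loc is the safe way to fetch that row:

```python
import numpy as np
import pandas as pd

# Hypothetical series with a non-default index, for illustration only
s = pd.Series([np.nan, 7.0, 11.0], index=[10, 20, 30])

label = s.idxmax()      # NaN is skipped; the label of the max is returned
value = s.loc[label]    # label-based lookup finds the max value

# s.iloc[label] would raise IndexError here, because 30 is not a valid position
print(label, value)
```

With the default RangeIndex the labels and positions coincide, so the difference only shows up once the dataframe has been filtered, sorted or re-indexed.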

def my_func(row, df):
    if np.isnan(row.measure):
        # Index label of the row holding the max measure within this colony
        max_index_label = df[df.colony == row.colony]['measure'].idxmax()

        # idxmax returns a label, so look the row up with loc rather than iloc
        row['measure'] = df.loc[max_index_label, 'measure']
        row['Length'] = df.loc[max_index_label, 'Length']
        row['net/gross'] = df.loc[max_index_label, 'net/gross']

        row['remarks'] = 'max_filled'
    else:
        row['remarks'] = 'unchanged'

    return row

df = df.apply(lambda x: my_func(x, df), axis=1)

Dataframe will be:

	colony	measure	Length	net/gross	remarks
0	22	11	13	net	max_filled
1	22	7	17	gross	unchanged
2	22	11	13	net	unchanged
3	33	13	10	gross	unchanged
4	33	13	10	gross	max_filled
5	33	9	16	net	unchanged

CodePudding user response：

step1

fill NaN in the measure column with the group max

s = df.groupby('colony')['measure'].transform(lambda x: x.fillna(x.max()))

s

0   11.0
1    7.0
2   11.0
3   13.0
4   13.0
5    9.0
Name: measure, dtype: float64

use s as the measure column

df[['colony']].assign(measure=s)

result A

    colony  measure
0   22      11.0
1   22      7.0
2   22      11.0
3   33      13.0
4   33      13.0
5   33      9.0

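step2

keep only the complete rows (the column order is reversed so that net/gross comes before Length after the merge)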

df1 = df[df.columns[::-1]].dropna()

df1

    net/gross   Length  measure colony
1   gross       17      7.0     22
2   net         13      11.0    22
3   gross       10      13.0    33
5   net         16      9.0     33

step3

merge result A and df1 (with no on argument, merge joins on the shared columns colony and measure)

df[['colony']].assign(measure=s).merge(df1, how='left')

result B

    colony  measure net/gross   Length
0   22      11.0    net         13
1   22      7.0     gross       17
2   22      11.0    net         13
3   33      13.0    gross       10
4   33      13.0    gross       10
5   33      9.0     net         16
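
The merge in step3 relies on pandas joining on every column the two frames share, here colony and measure. Below is a sketch that spells the keys out and adds validate="many_to_one", so pandas raises an error instead of silently duplicating rows if two complete rows ever share the same (colony, measure) pair. result_b is an illustrative name, and the column reversal is skipped here, so Length comes before net/gross.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "colony": [22, 22, 22, 33, 33, 33],
    "measure": [np.nan, 7, 11, 13, np.nan, 9],
    "Length": [14, 17, 13, 10, 19, 16],
    "net/gross": [np.nan, "gross", "net", "gross", np.nan, "net"],
})

# Step 1: fill NaN in measure with each colony's max
s = df.groupby("colony")["measure"].transform(lambda x: x.fillna(x.max()))

# Step 2: complete rows only (rows 0 and 4 are dropped)
df1 = df.dropna()

# Step 3: explicit join keys; validate checks the keys are unique in df1
result_b = df[["colony"]].assign(measure=s).merge(
    df1, how="left", on=["colony", "measure"], validate="many_to_one"
)
print(result_b)
```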

step4

turn result B into the desired output (full code included)

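import pandas as pd
import numpy as np

# df is the input dataframe from the question
# Fill NaN in measure with each colony's max
s = df.groupby('colony')['measure'].transform(lambda x: x.fillna(x.max()))
# Complete rows only; the reversed column order puts net/gross before Length
df1 = df[df.columns[::-1]].dropna()
# Flag the rows whose measure was originally NaN
s2 = np.where(df['measure'].isna(), 'max_filled', 'unchanged')
(df[['colony']].assign(measure=s).merge(df1, how='left')
 .assign(remark=s2).rename(columns={'Length':'Length_adj'}))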

output

    colony  measure net/gross   Length_adj  remark
0   22      11.0    net         13          max_filled
1   22      7.0     gross       17          unchanged
2   22      11.0    net         13          unchanged
3   33      13.0    gross       10          unchanged
4   33      13.0    gross       10          max_filled
5   33      9.0     net         16          unchanged