Fill NaN based on max value from a group and another string column with the value at the NaN row

Time:11-21

I have input data as shown:

import numpy as np
import pandas as pd

df = pd.DataFrame({"colony" : [22, 22, 22, 33, 33, 33],
                   "measure" : [np.nan, 7, 11, 13, np.nan, 9],
                   "net/gross" : [np.nan, "gross", "net", "gross", np.nan, "net"]})

df

    colony  measure  net/gross
0   22      NaN      NaN  
1   22      7        gross
2   22      11       net
3   33      13       gross
4   33      NaN      NaN
5   33      9        net

I want to fill each NaN in the measure column with the maximum measure of its colony group, then fill the NaN in the net/gross column with the net/gross value from the row where the measure was maximum (e.g. the NaN at index 0 becomes "net", the value on the row holding the group maximum). I also want a remarks column that marks every NaN-filled row as "max_filled" and every other row as "unchanged", to arrive at the output below:

   colony  measure  net/gross   remarks
0   22      11      net         max_filled
1   22      7       gross       unchanged
2   22      11      net         unchanged
3   33      13      gross       unchanged
4   33      13      gross       max_filled
5   33      9       net         unchanged

CodePudding user response:

My solution

What I would do is compute a column holding the max of each group

mx=df.groupby('colony').measure.transform('max')

and a boolean mask of the rows to be filled

f=df.measure.isna()

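And then use them, together with the index of the row holding each group's max, to fill what you want

# index label of the row holding the max of each group (computed before filling)
imx=df.groupby('colony').measure.transform('idxmax')
df['remarks']='unchanged'
df.loc[f, 'measure']=mx
df.loc[f, 'net/gross']=df.loc[imx[f], 'net/gross'].to_numpy()
df.loc[f, 'remarks']='max_filled'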
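
For reference, here is a minimal end-to-end sketch of the steps above on the question's data. It assumes the missing net/gross entries are real np.nan values (not the string "np.nan") and uses transform("idxmax") to locate the row holding each group's maximum; the variable names grouped and imx are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "colony": [22, 22, 22, 33, 33, 33],
    "measure": [np.nan, 7, 11, 13, np.nan, 9],
    "net/gross": [np.nan, "gross", "net", "gross", np.nan, "net"],
})

grouped = df.groupby("colony")["measure"]
mx = grouped.transform("max")       # per-row maximum of the row's group
imx = grouped.transform("idxmax")   # per-row index label of that maximum
f = df["measure"].isna()            # boolean mask of rows to fill

# Record which rows get filled, then fill both columns from the max row.
df["remarks"] = np.where(f, "max_filled", "unchanged")
df.loc[f, "net/gross"] = df.loc[imx[f], "net/gross"].to_numpy()
df.loc[f, "measure"] = mx[f]

print(df)
```

The .to_numpy() call drops the index of the looked-up values so they are assigned by position rather than realigned to the labels of the max rows.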

Remark on other answers to a similar question

Note that the answers to this simpler question, which was only about filling NaN with the mean value of each group, seem to unanimously promote lambda-based solutions. You could easily adapt them by replacing mean with max, but that would not help fill the two other columns.

That is generally a bad idea. I mean, I love lambda, I come from lambda calculus. But in pandas, apply and similar methods are just the next worst thing after plain old for loops over the rows (and sometimes for loops are even faster).

The unanimity of the answers made me doubt, though. But timeit settles the debate: even for that simpler problem, my solution is faster than the answers to that question.

That is, even just filling the NaN is faster this way

mx=df.groupby('colony').measure.transform('max')
f=df.measure.isna()
df.loc[f,'measure']=mx

Rather than the proposed way

df["measure"] = df.groupby("colony")["measure"].transform(lambda x: x.fillna(x.mean()))

So, at first I tried to see how that previous answer could be adapted to your more complex case (where the filled rows of measure should also drive net/gross and remarks). But there is no reason to do so, since it is faster (2.5 times faster in my timing) to compute a whole column of max values and then do simple column assignments.
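
To check the timing claim on your own machine, something like the following sketch could be used. The synthetic data (row count, number of groups, NaN fraction) and the function names fill_with_mask and fill_with_lambda are assumptions made for illustration, and the lambda version uses max instead of mean so that both functions compute the same result. No particular speed-up is implied; the ratio will depend on data shape and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data: the sizes and the 10% NaN fraction are arbitrary assumptions.
rng = np.random.default_rng(0)
n = 100_000
base = pd.DataFrame({
    "colony": rng.integers(0, 1_000, size=n),
    "measure": rng.random(n),
})
base.loc[rng.random(n) < 0.1, "measure"] = np.nan


def fill_with_mask(df):
    """Whole-column approach: group max via transform, then masked assignment."""
    out = df.copy()
    mx = out.groupby("colony")["measure"].transform("max")
    f = out["measure"].isna()
    out.loc[f, "measure"] = mx[f]
    return out


def fill_with_lambda(df):
    """Lambda approach: fillna called once per group inside transform."""
    out = df.copy()
    out["measure"] = out.groupby("colony")["measure"].transform(
        lambda x: x.fillna(x.max())
    )
    return out


t_mask = timeit.timeit(lambda: fill_with_mask(base), number=3)
t_lambda = timeit.timeit(lambda: fill_with_lambda(base), number=3)
print(f"mask: {t_mask:.3f}s  lambda: {t_lambda:.3f}s")
```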

So, the moral is

Never ever use lambda (or for loops, or apply) on dataframes when you can avoid it.

Even at the cost of computing a whole column of max values, of which only a fraction will actually be used, it is better to stick with whole-column algebra.

CodePudding user response:

Here is another way using .transform('max') and .transform('idxmax')

g = df.groupby('colony')['measure']

measure_max = g.transform('max')
ng_max = df.loc[g.transform('idxmax'),'net/gross'].reset_index(drop=True)

(df.fillna({'measure':measure_max,'net/gross':ng_max})
.assign(remarks = np.where(df['net/gross'].isna(),'max_filled','unchanged')))

Output:

   colony  measure net/gross     remarks
0      22     11.0       net  max_filled
1      22      7.0     gross   unchanged
2      22     11.0       net   unchanged
3      33     13.0     gross   unchanged
4      33     13.0     gross  max_filled
5      33      9.0       net   unchanged
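
One caveat: the reset_index(drop=True) step relies on df having the default 0..n-1 index, since fillna aligns on index labels. If the frame carries other labels, a variant along these lines might be safer. This is only a sketch: the index values 10 to 15 are hypothetical, and set_axis is swapped in for reset_index to relabel the lookup with the frame's own index.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "colony": [22, 22, 22, 33, 33, 33],
        "measure": [np.nan, 7, 11, 13, np.nan, 9],
        "net/gross": [np.nan, "gross", "net", "gross", np.nan, "net"],
    },
    index=[10, 11, 12, 13, 14, 15],  # hypothetical non-default labels
)

g = df.groupby("colony")["measure"]
filled_rows = df["measure"].isna()  # rows that will be filled

measure_max = g.transform("max")
# Look up net/gross at each group's idxmax label, then relabel the
# result with df's own index so that fillna aligns row by row.
ng_max = df.loc[g.transform("idxmax"), "net/gross"].set_axis(df.index)

result = df.fillna({"measure": measure_max, "net/gross": ng_max}).assign(
    remarks=np.where(filled_rows, "max_filled", "unchanged")
)
print(result)
```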