I have input data as shown:
import numpy as np
import pandas as pd
df = pd.DataFrame({"colony" : [22, 22, 22, 33, 33, 33],
"measure" : [np.nan, 7, 11, 13, np.nan, 9],
"net/gross" : [np.nan, "gross", "net", "gross", np.nan, "net"]})
df
   colony  measure net/gross
0      22      NaN       NaN
1      22        7     gross
2      22       11       net
3      33       13     gross
4      33      NaN       NaN
5      33        9       net
I want to fill the NaN values in the measure column with the maximum measure of each colony group, then fill the NaN values in the net/gross column with the net/gross value from the row where that maximum occurs (e.g. the NaN at index 0 becomes "net", the value on the row holding the maximum measure of colony 22). I also want a remarks column that marks every filled row as "max_filled" and every other row as "unchanged", to arrive at the output below:
   colony  measure net/gross     remarks
0      22       11       net  max_filled
1      22        7     gross   unchanged
2      22       11       net   unchanged
3      33       13     gross   unchanged
4      33       13     gross  max_filled
5      33        9       net   unchanged
CodePudding user response:
My solution
What I would do is compute a column holding the max of each group
mx=df.groupby('colony').measure.transform('max')
and a boolean mask of the rows to be filled
f=df.measure.isna()
And then use them to fill what you want
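df['remarks']='unchanged'
# Index label of the max-measure row of each colony; computed before measure is filled
imx=df.groupby('colony').measure.transform('idxmax')
df.loc[f, 'net/gross']=df.loc[imx[f], 'net/gross'].values
df.loc[f, 'measure']=mx
df.loc[f, 'remarks']='max_filled'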
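For reference, here is a minimal end-to-end sketch of this approach on the question's data. The helper name fill_with_group_max and its parameters are made up for illustration. It looks up net/gross through the index label of each group's maximum row (idxmax), and it assumes the index labels are unique and that every colony has at least one non-NaN measure.

```python
import numpy as np
import pandas as pd


def fill_with_group_max(df, group="colony", value="measure", label="net/gross"):
    """Hypothetical helper: fill NaN in `value` with the group max,
    copy `label` from the row holding that max, and add a remarks column."""
    out = df.copy()
    g = out.groupby(group)[value]
    mx = g.transform("max")        # group maximum, broadcast to every row
    imx = g.transform("idxmax")    # index label of the row holding that maximum
    f = out[value].isna()          # rows that need filling
    out["remarks"] = np.where(f, "max_filled", "unchanged")
    # Copy the label from the max row, then fill the value itself
    out.loc[f, label] = out.loc[imx[f], label].to_numpy()
    out.loc[f, value] = mx[f]
    return out


df = pd.DataFrame({
    "colony": [22, 22, 22, 33, 33, 33],
    "measure": [np.nan, 7, 11, 13, np.nan, 9],
    "net/gross": [np.nan, "gross", "net", "gross", np.nan, "net"],
})
result = fill_with_group_max(df)
print(result)
```

Both mx and imx are computed from the untouched measure column, so the filled rows can never be picked as the "max row" themselves.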
Remark on other answers to a similar question
Note that the answers to this simpler question, which only asked to fill NaN with the mean value of each group, seem to unanimously promote lambda-based solutions. You could easily adapt them by replacing mean with max, but that would not help with filling the two other columns.
That is generally a bad idea. I mean, I love lambda, I come from lambda calculus. But in pandas, apply or similar methods are just the next worst thing after bad old for loops over the rows (and sometimes for loops are even faster).
The unanimity of the answers made me doubt, though. But timeit closes the debate: even for that simpler problem, my solution is faster than the answers to that question.
That is, even just filling NaN is faster this way
mx=df.groupby('colony').measure.transform('max')
f=df.measure.isna()
df.loc[f,'measure']=mx
than the proposed way
df["measure"] = df.groupby("colony")["measure"].transform(lambda x: x.fillna(x.mean()))
So at first I tried to see how that previous answer could be adapted to your more complex case (where the transformed rows of measure should also affect net/gross and remarks). But there is no reason to do so, since it is faster (2.5 times faster) to compute a whole column of max values and then do simple column assignments.
So, the moral is:
Never ever use lambda (or for, or apply) on dataframes when you can avoid it.
Even at the cost of computing a whole column of max values, of which only a fraction will really be used, it is better to stick with whole-column algebra.
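If you want to check the timing claim on your own machine, a rough sketch along these lines should do. The synthetic data, its size and the function names are assumptions for illustration, not the benchmark I ran, and the lambda variant uses max instead of mean so that both functions produce the same result. The ratio you get will depend on the number of groups and on your pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data, made up for this sketch: 10,000 rows, 200 colonies, ~20% NaN
rng = np.random.default_rng(0)
n = 10_000
big = pd.DataFrame({
    "colony": rng.integers(0, 200, size=n),
    "measure": rng.random(n),
})
big.loc[rng.random(n) < 0.2, "measure"] = np.nan


def fill_vectorized(df):
    # Whole-column approach: broadcast the group max, then assign on a mask
    out = df.copy()
    mx = out.groupby("colony")["measure"].transform("max")
    f = out["measure"].isna()
    out.loc[f, "measure"] = mx[f]
    return out


def fill_lambda(df):
    # Lambda approach: one Python-level call per group
    out = df.copy()
    out["measure"] = out.groupby("colony")["measure"].transform(
        lambda x: x.fillna(x.max())
    )
    return out


# Check that both variants agree before comparing their speed
same = fill_vectorized(big)["measure"].equals(fill_lambda(big)["measure"])
t_vec = timeit.timeit(lambda: fill_vectorized(big), number=5)
t_lam = timeit.timeit(lambda: fill_lambda(big), number=5)
print(f"same result: {same}")
print(f"vectorized: {t_vec:.4f}s  lambda: {t_lam:.4f}s")
```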
CodePudding user response:
Here is another way, using .transform('max') and .transform('idxmax'):
g = df.groupby('colony')['measure']
measure_max = g.transform('max')
ng_max = df.loc[g.transform('idxmax'), 'net/gross'].reset_index(drop=True)
(df.fillna({'measure':measure_max,'net/gross':ng_max})
.assign(remarks = np.where(df['net/gross'].isna(),'max_filled','unchanged')))
Output:
   colony  measure net/gross     remarks
0      22     11.0       net  max_filled
1      22      7.0     gross   unchanged
2      22     11.0       net   unchanged
3      33     13.0     gross   unchanged
4      33     13.0     gross  max_filled
5      33      9.0       net   unchanged
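One caveat: .reset_index(drop=True) lines ng_max up with df only because df has the default 0..n-1 index, since fillna aligns on index labels. Below is a sketch of a variant that attaches df's own index instead, shown on a frame whose integer labels 10..15 are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "colony": [22, 22, 22, 33, 33, 33],
        "measure": [np.nan, 7, 11, 13, np.nan, 9],
        "net/gross": [np.nan, "gross", "net", "gross", np.nan, "net"],
    },
    index=[10, 11, 12, 13, 14, 15],  # hypothetical non-default index
)

g = df.groupby("colony")["measure"]
measure_max = g.transform("max")
# Values taken from each group's max row, re-labelled with df's own index
# so that fillna aligns them row by row
ng_max = pd.Series(
    df.loc[g.transform("idxmax"), "net/gross"].to_numpy(), index=df.index
)

filled = df["measure"].isna()  # rows that will be filled
out = df.fillna({"measure": measure_max, "net/gross": ng_max}).assign(
    remarks=np.where(filled, "max_filled", "unchanged")
)
print(out)
```

Here the remarks column is derived from the NaN mask of measure rather than net/gross; on the question's data the two masks are the same.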