I have input data as shown:
import numpy as np
import pandas as pd

df = pd.DataFrame({"colony": [22, 22, 22, 33, 33, 33],
                   "measure": [np.nan, 7, 11, 13, np.nan, 9],
                   "Length": [14, 17, 13, 10, 19, 16],
                   "net/gross": [np.nan, "gross", "net", "gross", np.nan, "net"]})
df
   colony  measure  Length net/gross
0      22      NaN      14       NaN
1      22        7      17     gross
2      22       11      13       net
3      33       13      10     gross
4      33      NaN      19       NaN
5      33        9      16       net
I want to fill each NaN in the measure column with the maximum measure of its colony group, then fill the NaN in the net/gross column with the net/gross value from the row where that maximum occurs. For example, the NaN at index 0 should be filled with "net", the value on the row where measure is at its maximum for colony 22, and the length_adj column should take 13 from that same row. I also want a remarks column that marks every filled row as "max_filled" and all other rows as "unchanged", giving the output below:
   colony  measure net/gross  length_adj     remarks
0      22       11       net          13  max_filled
1      22        7     gross          17   unchanged
2      22       11       net          13   unchanged
3      33       13     gross          10   unchanged
4      33       13     gross          10  max_filled
5      33        9       net          16   unchanged
CodePudding user response:
Here you go:
df['measure'] = df['measure'].fillna(df.groupby('colony')['measure'].transform('max'))
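The one-liner above fills only the measure column. A minimal sketch of how the same idea might be extended to the other columns is shown below; it assumes the maximum is unique within each colony, and the names `missing`, `max_label`, `cols` and `out` are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "colony": [22, 22, 22, 33, 33, 33],
    "measure": [np.nan, 7, 11, 13, np.nan, 9],
    "Length": [14, 17, 13, 10, 19, 16],
    "net/gross": [np.nan, "gross", "net", "gross", np.nan, "net"],
})

# Rows whose measure is missing; these are the rows to overwrite.
missing = df["measure"].isna()

# Index label of the max-measure row per colony (idxmax skips NaN),
# then broadcast back to every row through the colony column.
max_label = df["colony"].map(df.groupby("colony")["measure"].idxmax())

# Copy the values from the max row into each missing row, one column
# at a time so that every column keeps its original dtype.
cols = ["measure", "Length", "net/gross"]
out = df.copy()
for col in cols:
    out.loc[missing, col] = df.loc[max_label[missing], col].to_numpy()

out["remarks"] = np.where(missing, "max_filled", "unchanged")
out = out.rename(columns={"Length": "length_adj"})
print(out)
```

If two rows in a colony tie for the maximum, idxmax returns the first one, so the filled values come from that row.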
CodePudding user response:
One approach that allows maximum control of each step (but may be less efficient than more direct pandas methods) is to use apply (with axis=1 to iterate rows) with a custom function, passing the dataframe as an argument as well.
- You can use np.isnan to check whether a given value in a row is NaN.
- Without using groupby, you can retrieve, for each row, the rows of the dataframe that belong to the same colony group. You can then get the index label of the maximum value with idxmax()
def my_func(row, df):
    if np.isnan(row.measure):
        # idxmax returns an index label, so look the row up with .loc
        max_index_location = df[df.colony == row.colony]['measure'].idxmax()
        row['measure'] = df.loc[max_index_location, 'measure']
        row['Length'] = df.loc[max_index_location, 'Length']
        row['net/gross'] = df.loc[max_index_location, 'net/gross']
        row['remarks'] = 'max_filled'
    else:
        row['remarks'] = 'unchanged'
    return row

df = df.apply(lambda x: my_func(x, df), axis=1)
The resulting dataframe will be:
| | colony | measure | Length | net/gross | remarks |
|---|---|---|---|---|---|
| 0 | 22 | 11 | 13 | net | max_filled |
| 1 | 22 | 7 | 17 | gross | unchanged |
| 2 | 22 | 11 | 13 | net | unchanged |
| 3 | 33 | 13 | 10 | gross | unchanged |
| 4 | 33 | 13 | 10 | gross | max_filled |
| 5 | 33 | 9 | 16 | net | unchanged |
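One detail worth keeping in mind with idxmax() is that it returns an index label, not an integer position. With the default 0..n-1 index the two coincide, but they differ as soon as the index is anything else. The small sketch below uses a made-up index (10, 20, 30) purely to show the difference.

```python
import numpy as np
import pandas as pd

# Same measure values as colony 22, with a hypothetical non-default index.
s = pd.Series([np.nan, 7.0, 11.0], index=[10, 20, 30], name="measure")

label = s.idxmax()                          # index label of the maximum
position = int(np.nanargmax(s.to_numpy()))  # integer position of the maximum

print(label, position)

# Label-based and position-based lookups each need the matching accessor.
value_by_label = s.loc[label]
value_by_position = s.iloc[position]
print(value_by_label, value_by_position)
```

Here `s.iloc[label]` would raise an IndexError, because position 30 does not exist in a three-element Series.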
CodePudding user response:
Step 1: fill the NaN values in the measure column with each colony's maximum
s = df.groupby('colony')['measure'].transform(lambda x: x.fillna(x.max()))
s
0 11.0
1 7.0
2 11.0
3 13.0
4 13.0
5 9.0
Name: measure, dtype: float64
Use s as the measure column:
df[['colony']].assign(measure=s)
Result A:
colony measure
0 22 11.0
1 22 7.0
2 22 11.0
3 33 13.0
4 33 13.0
5 33 9.0
Step 2: reverse the column order and drop the rows that contain NaN
df1 = df[df.columns[::-1]].dropna()
df1
net/gross Length measure colony
1 gross 17 7.0 22
2 net 13 11.0 22
3 gross 10 13.0 33
5 net 16 9.0 33
Step 3: merge result A with df1
df[['colony']].assign(measure=s).merge(df1, how='left')
Result B:
colony measure net/gross Length
0 22 11.0 net 13
1 22 7.0 gross 17
2 22 11.0 net 13
3 33 13.0 gross 10
4 33 13.0 gross 10
5 33 9.0 net 16
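When `on` is omitted, merge joins on the columns the two frames have in common, which here are colony and measure. The sketch below spells those keys out explicitly, as an illustration rather than a replacement for the step above; the names `filled`, `complete` and `merged` are made up, and `validate="many_to_one"` is an optional guard that raises if the right-hand frame has duplicate (colony, measure) pairs.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "colony": [22, 22, 22, 33, 33, 33],
    "measure": [np.nan, 7, 11, 13, np.nan, 9],
    "Length": [14, 17, 13, 10, 19, 16],
    "net/gross": [np.nan, "gross", "net", "gross", np.nan, "net"],
})

# Left side: colony plus the group-max-filled measure (result A).
filled = df[["colony"]].assign(
    measure=df.groupby("colony")["measure"].transform(lambda x: x.fillna(x.max()))
)

# Right side: only the rows that had no missing values.
complete = df.dropna()

# Join on the shared key columns, named explicitly here.
merged = filled.merge(
    complete,
    how="left",
    on=["colony", "measure"],
    validate="many_to_one",
)
print(merged)
```

Because the lookup is by value, a colony whose rows repeat the same measure would produce duplicate keys on the right side; with the `validate` argument that case fails loudly instead of silently adding rows.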
Step 4: turn result B into the desired output (full code included)
import pandas as pd
import numpy as np
s = df.groupby('colony')['measure'].transform(lambda x: x.fillna(x.max()))
df1 = df[df.columns[::-1]].dropna()
s2 = np.where(df['measure'].isna(), 'max_filled', 'unchanged')
(df[['colony']].assign(measure=s).merge(df1, how='left')
    .assign(remark=s2).rename(columns={'Length': 'Length_adj'}))
Output:
colony measure net/gross Length_adj remark
0 22 11.0 net 13 max_filled
1 22 7.0 gross 17 unchanged
2 22 11.0 net 13 unchanged
3 33 13.0 gross 10 unchanged
4 33 13.0 gross 10 max_filled
5 33 9.0 net 16 unchanged