I have a data as shown below:
qty_min qty_max region_min region_max subj region
11 1 10 10 ab UK
21 1 nan 20 ab UK
nan nan nan 30 ab UK
nan 2 nan 34 bc US
nan 2 20 nan bc US
10 nan nan nan bc TZ
11 nan nan 47 de TZ
13 3 109 31 de TZ
df = pd.read_clipboard()
print(df)
I would like to fillna()
in each of the columns: qty_min
, qty_max
, region_min
, region_max
based on a pattern.
For example: If there is NaN
in qty_min
and qty_max
columns, we need to fillna()
using groupby
of subj
and ffill().bfill()
.
Similarly, if there is NaN
in region_max
, region_min
, we need to fillna()
using groupby
of region
and ffill().bfill()
So, I tried the below:
df['qty_min'] = df.groupby(['subj'], sort=False)['qty_min'].apply(lambda x: x.ffill().bfill())
df['qty_max'] = df.groupby(['subj'], sort=False)['qty_max'].apply(lambda x: x.ffill().bfill())
df['region_min'] = df.groupby(['region'], sort=False)['region_min'].apply(lambda x: x.ffill().bfill())
df['region_max'] = df.groupby(['region'], sort=False)['region_max'].apply(lambda x: x.ffill().bfill())
As you can see that this is not elegant. Moreover, I have 20 plus columns like this in real data which I would like to fill using the same way (groupby
column and ffill.bfill()
)
I have created a dict
like below manually to identify the corresponding groupby
column for filling NaN
.
I'm open to modifying the way we store this info. You can use whatever data structure is easy.
fillna_dict= {
"subj": ['qty_min','qty_max'],
"region": ['region_min','region_max']
}
Is there any elegant and efficient approach to do this?
I expect my output to be like the below:
CodePudding user response:
As you have separate conditions you need to have several lines.
What you would do is to refactor the code to reuse the groups and a single function:
f = lambda x: x.ffill().bfill()
g1 = df.groupby(['subj'], sort=False)
g2 = df.groupby(['region'], sort=False)
df['qty_min'] = g1['qty_min'].apply(f)
df['qty_max'] = g1['qty_max'].apply(f)
df['region_min'] = g2['region_min'].apply(f)
df['region_max'] = g2['region_max'].apply(f)
Using your dictionary:
f = lambda x: x.ffill().bfill()
fillna_dict= {
"subj": ['qty_min','qty_max'],
"region": ['region_min','region_max']
}
for k, cols in fillna_dict.items():
df[cols] = df.groupby(df[k])[cols].apply(f)
output:
qty_min qty_max region_min region_max subj region
0 11.0 1.0 10.0 10.0 ab UK
1 21.0 1.0 10.0 20.0 ab UK
2 21.0 1.0 10.0 30.0 ab UK
3 10.0 2.0 20.0 34.0 bc US
4 10.0 2.0 20.0 34.0 bc US
5 10.0 2.0 109.0 47.0 bc TZ
6 11.0 3.0 109.0 47.0 de TZ
7 13.0 3.0 109.0 31.0 de TZ
CodePudding user response:
Try to do it in a function:
for k,v in fillna_dict.items():
df[v] = df.groupby([k], sort=False)[v].apply(lambda x: x.ffill().bfill())
Output:
qty_min qty_max region_min region_max subj region
0 11.0 1.0 10.0 10.0 ab UK
1 21.0 1.0 10.0 20.0 ab UK
2 21.0 1.0 10.0 30.0 ab UK
3 10.0 2.0 20.0 34.0 bc US
4 10.0 2.0 20.0 34.0 bc US
5 10.0 2.0 109.0 47.0 bc TZ
6 11.0 3.0 109.0 47.0 de TZ
7 13.0 3.0 109.0 31.0 de TZ
CodePudding user response:
Restructure your dictionary and try:
fillna_dict= {"qty_min": "subj",
"qty_max": "subj",
"region_min": "region",
"region_max": "region"
}
df[list(fillna_dict.keys())] = df[list(fillna_dict.keys())].apply(lambda x: df.groupby(fillna_dict[x.name], sort=False)[x.name].ffill().bfill())
>>> df
qty_min qty_max region_min region_max subj region
0 11.0 1.0 10.0 10.0 ab UK
1 21.0 1.0 10.0 20.0 ab UK
2 21.0 1.0 10.0 30.0 ab UK
3 10.0 2.0 20.0 34.0 bc US
4 10.0 2.0 20.0 34.0 bc US
5 10.0 2.0 109.0 47.0 bc TZ
6 11.0 3.0 109.0 47.0 de TZ
7 13.0 3.0 109.0 31.0 de TZ