I have a huge df with missing entries in the brand column that needs to be filled according to other rows. if the all other 3 columns match, fill the blanks with the existing brand, else fill with 'Other'. if this is my starting df:
df_start = pd.DataFrame({'device_id':[1,1,1,1,2,2,3,3,3,3,4,4,4,4],
'head':['a','a','b','b','a','b','a','b','b','b','a','b','c','d'],
'supplement':['Salt','Salt','Pepper','Pepper','Pepper','Pepper','Salt','Pepper','Salt','Pepper','Pepper','Salt','Pepper','Salt'],
'brand':['white',np.nan,np.nan,'white','white','black',np.nan,np.nan,'white','black',np.nan,'white','black',np.nan]})
how to get this result:
df_end = pd.DataFrame({'device_id':[1,1,1,1,2,2,3,3,3,3,4,4,4,4],
'head':['a','a','b','b','a','b','a','b','b','b','a','b','c','d'],
'supplement':['Salt','Salt','Pepper','Pepper','Pepper','Pepper','Salt','Pepper','Salt','Pepper','Pepper','Salt','Pepper','Salt'],
'brand':['white','white','white','white','white','black','Other','black','white','black','Other','white','black','Other']})
CodePudding user response:
You could try with a groupby
on the columns that need to be the same, in your case 'device_id', 'head', 'supplement', and use forward fill ffill()
, backward fill bfill()
, and at the very end you fillna()
with 'Other', as the leftovers will be the ones with no identical rows in those 3 columns:
result = df_start.groupby(['device_id','head','supplement'])\
.apply(lambda x: x.ffill().bfill().fillna('Other'))
prints:
>>> result
device_id head supplement brand
0 1 a Salt white
1 1 a Salt white
2 1 b Pepper white
3 1 b Pepper white
4 2 a Pepper white
5 2 b Pepper black
6 3 a Salt Other
7 3 b Pepper black
8 3 b Salt white
9 3 b Pepper black
10 4 a Pepper Other
11 4 b Salt white
12 4 c Pepper black
13 4 d Salt Other
CodePudding user response:
A solution not requiring a group by (costly), based on a simple mapping.
from collections import defaultdict
# create a mapping (ddict with key ('device_id', 'head', 'supplement')
# returns 'Other' when missing key
mapping = defaultdict(lambda: 'Other')
mapping.update(df_start.dropna()\
.set_index(['device_id', 'head', 'supplement'])['brand']\
.to_dict())
# apply function using the mapping to get the brand
brand = df_start.iloc[:, :-1].apply(lambda row: mapping[tuple(row)], axis=1)
CodePudding user response:
You can replace the nan
values in the brands
column after the creation of the dataframe. This may not be be the most efficient way but is the simplest one.
df['brand'].replace(np.NaN, "Other")