conditionally filling blanks in pandas according multiple columns from other rows-CodePudding

I have a huge df with missing entries in the brand column that needs to be filled according to other rows. if the all other 3 columns match, fill the blanks with the existing brand, else fill with 'Other'. if this is my starting df:

df_start = pd.DataFrame({'device_id':[1,1,1,1,2,2,3,3,3,3,4,4,4,4],
'head':['a','a','b','b','a','b','a','b','b','b','a','b','c','d'],
'supplement':['Salt','Salt','Pepper','Pepper','Pepper','Pepper','Salt','Pepper','Salt','Pepper','Pepper','Salt','Pepper','Salt'],
'brand':['white',np.nan,np.nan,'white','white','black',np.nan,np.nan,'white','black',np.nan,'white','black',np.nan]})

how to get this result:

df_end = pd.DataFrame({'device_id':[1,1,1,1,2,2,3,3,3,3,4,4,4,4],
'head':['a','a','b','b','a','b','a','b','b','b','a','b','c','d'],
'supplement':['Salt','Salt','Pepper','Pepper','Pepper','Pepper','Salt','Pepper','Salt','Pepper','Pepper','Salt','Pepper','Salt'],
'brand':['white','white','white','white','white','black','Other','black','white','black','Other','white','black','Other']})

CodePudding user response：

You could try with a groupby on the columns that need to be the same, in your case 'device_id', 'head', 'supplement', and use forward fill ffill(), backward fill bfill(), and at the very end you fillna() with 'Other', as the leftovers will be the ones with no identical rows in those 3 columns:

result = df_start.groupby(['device_id','head','supplement'])\
        .apply(lambda x: x.ffill().bfill().fillna('Other'))

prints:

>>> result

    device_id head supplement  brand
0           1    a       Salt  white
1           1    a       Salt  white
2           1    b     Pepper  white
3           1    b     Pepper  white
4           2    a     Pepper  white
5           2    b     Pepper  black
6           3    a       Salt  Other
7           3    b     Pepper  black
8           3    b       Salt  white
9           3    b     Pepper  black
10          4    a     Pepper  Other
11          4    b       Salt  white
12          4    c     Pepper  black
13          4    d       Salt  Other

CodePudding user response：

A solution not requiring a group by (costly), based on a simple mapping.

from collections import defaultdict

# create a mapping (ddict with key ('device_id', 'head', 'supplement')
# returns 'Other' when missing key

mapping = defaultdict(lambda: 'Other')
mapping.update(df_start.dropna()\
       .set_index(['device_id', 'head', 'supplement'])['brand']\
       .to_dict())

# apply function using the mapping to get the brand
brand = df_start.iloc[:, :-1].apply(lambda row: mapping[tuple(row)], axis=1)

CodePudding user response：

You can replace the nan values in the brands column after the creation of the dataframe. This may not be be the most efficient way but is the simplest one.

df['brand'].replace(np.NaN, "Other")