Fill NaN of selected columns based on a dictionary whose keys are column names and values are content

I have a dataframe df1 as follows:

         id  products  black metal  non-ferrous metals  precious metal
0  M0066350    copper          NaN                 NaN             NaN
1  M0066352  aluminum          NaN                 NaN             NaN
2  M0066353      gold          NaN                 NaN             NaN
3  M0066354    silver          NaN                 NaN             NaN
4  S0200837   soybean          NaN                 NaN             NaN
5  S0212350     Apple          NaN                 NaN             NaN
6  S0212351  iron ore          NaN                 NaN             NaN
7  S0212352      coke          NaN                 NaN             NaN
8  S0212353    others          1.0                 NaN             1.0

I want to fill the NaNs in the columns cols = ['black metal', 'non-ferrous metals', 'precious metal'] with 1s based on customized_dict:

customized_dict = {
    'black metal': ['iron ore', 'coke'], 
    'non-ferrous metals': ['copper', 'aluminum'],
    'precious metal': ['gold', 'silver']
}

Please note that the keys are column names of df1 and the values are entries from the products column of df1.
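For anyone who wants to reproduce this, the input can be rebuilt roughly as follows. This is only a sketch: the values are copied from the table above, and the dtypes (strings and floats) are my assumption about how the real data was loaded.

```python
import numpy as np
import pandas as pd

# Rebuild df1 from the table in the question (dtypes are assumed).
df1 = pd.DataFrame({
    'id': ['M0066350', 'M0066352', 'M0066353', 'M0066354', 'S0200837',
           'S0212350', 'S0212351', 'S0212352', 'S0212353'],
    'products': ['copper', 'aluminum', 'gold', 'silver', 'soybean',
                 'Apple', 'iron ore', 'coke', 'others'],
    'black metal': [np.nan] * 8 + [1.0],
    'non-ferrous metals': [np.nan] * 9,
    'precious metal': [np.nan] * 8 + [1.0],
})

# Keys are column names of df1, values are entries of df1['products'].
customized_dict = {
    'black metal': ['iron ore', 'coke'],
    'non-ferrous metals': ['copper', 'aluminum'],
    'precious metal': ['gold', 'silver'],
}

# The columns to fill are exactly the dictionary keys.
cols = list(customized_dict)
print(df1)
```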

So my question is how I can get the following output:

         id  products  black metal  non-ferrous metals  precious metal
0  M0066350    copper          NaN                 1.0             NaN
1  M0066352  aluminum          NaN                 1.0             NaN
2  M0066353      gold          NaN                 NaN             1.0
3  M0066354    silver          NaN                 NaN             1.0
4  S0200837   soybean          NaN                 NaN             NaN
5  S0212350     Apple          NaN                 NaN             NaN
6  S0212351  iron ore          1.0                 NaN             NaN
7  S0212352      coke          1.0                 NaN             NaN
8  S0212353    others          1.0                 NaN             1.0

EDIT: new data with duplicates in the products column:

         id  products  black metal  non-ferrous metals  precious metal
0  S0212350     Apple          NaN                 NaN             NaN
1  M0066352  aluminum          NaN                 1.0             NaN
2  S0212352      coke          1.0                 NaN             NaN
3  S0212354      coke          1.0                 NaN             NaN
4  M0066350    copper          NaN                 1.0             NaN
5  M0066353      gold          NaN                 NaN             1.0
6  S0212351  iron ore          1.0                 NaN             NaN
7  S0212353    others          1.0                 NaN             1.0
8  M0066354    silver          NaN                 NaN             1.0
9  S0200837   soybean          NaN                 NaN             NaN

CodePudding user response：

Use apply over the category columns (in effect a loop over the columns) together with update; df here is the question's df1:

customized_dict = {
    'black metal': ['iron ore', 'coke'], 
    'non-ferrous metals': ['copper', 'aluminum'],
    'precious metal': ['gold', 'silver']
}
df.update(df.iloc[:,2:].apply(lambda c: c[df['products']
                                         .isin(customized_dict[c.name])]
                                         .fillna(1)))

output:

         id  products  black metal  non-ferrous metals  precious metal
0  M0066350    copper          NaN                 1.0             NaN
1  M0066352  aluminum          NaN                 1.0             NaN
2  M0066353      gold          NaN                 NaN             1.0
3  M0066354    silver          NaN                 NaN             1.0
4  S0200837   soybean          NaN                 NaN             NaN
5  S0212350     Apple          NaN                 NaN             NaN
6  S0212351  iron ore          1.0                 NaN             NaN
7  S0212352      coke          1.0                 NaN             NaN
8  S0212353    others          1.0                 NaN             1.0
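To make it explicit what the one-liner above does, here is a rough equivalent written as a plain loop. It is only a sketch: the four rows are a subset of the question's EDIT data (so coke appears twice), and the names mask and members are mine. Like fillna(1), it only touches cells that are still NaN.

```python
import numpy as np
import pandas as pd

# Small sample taken from the question's EDIT data; 'coke' is duplicated.
df = pd.DataFrame({
    'id': ['S0212352', 'S0212354', 'M0066350', 'S0200837'],
    'products': ['coke', 'coke', 'copper', 'soybean'],
    'black metal': [np.nan] * 4,
    'non-ferrous metals': [np.nan] * 4,
    'precious metal': [np.nan] * 4,
})

customized_dict = {
    'black metal': ['iron ore', 'coke'],
    'non-ferrous metals': ['copper', 'aluminum'],
    'precious metal': ['gold', 'silver'],
}

for col, members in customized_dict.items():
    # Rows whose product belongs to this category and whose cell is still NaN.
    mask = df['products'].isin(members) & df[col].isna()
    df.loc[mask, col] = 1.0

print(df)
```

Since the mask is built row by row from the products column, repeated products do not need any special handling.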

CodePudding user response：

Use:

# build (product, category) pairs for a MultiIndex Series filled with 1
L = [(x, k) for k, v in customized_dict.items() for x in v]
# reshape to a DataFrame: products as index, categories as columns
df2 = pd.Series(1, index=pd.MultiIndex.from_tuples(L)).unstack()
# fill missing values in df1, aligned on the products column used as index
df = (df1.set_index('products')
         .combine_first(df2)
         .rename_axis('products')
         .reset_index()
         .reindex(df1.columns, axis=1))
print(df)
         id  products  black metal  non-ferrous metals  precious metal
0  M0066350    copper          NaN                 1.0             NaN
1  M0066352  aluminum          NaN                 1.0             NaN
2  M0066353      gold          NaN                 NaN             1.0
3  M0066354    silver          NaN                 NaN             1.0
4  S0200837   soybean          NaN                 NaN             NaN
5  S0212350     Apple          NaN                 NaN             NaN
6  S0212351  iron ore          1.0                 NaN             NaN
7  S0212352      coke          1.0                 NaN             NaN
8  S0212353    others          1.0                 NaN             1.0
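If the reshape step is unclear, it may help to look at the intermediate df2 on its own. A minimal sketch that uses only the dictionary from the question:

```python
import pandas as pd

customized_dict = {
    'black metal': ['iron ore', 'coke'],
    'non-ferrous metals': ['copper', 'aluminum'],
    'precious metal': ['gold', 'silver'],
}

# One (product, category) tuple per dictionary entry.
L = [(x, k) for k, v in customized_dict.items() for x in v]

# Series of 1s indexed by (product, category); unstack moves the
# category level into the columns and leaves NaN where no pair exists.
df2 = pd.Series(1, index=pd.MultiIndex.from_tuples(L)).unstack()
print(df2)
```

df2 has one row per product named in the dictionary and one column per category, which is why it can be aligned against df1 once products is the index.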

CodePudding user response：

Create a reversed mapping (product to column name), use crosstab to build the table of 1s, then pass it to fillna:

reversed_dict = {v: k for k, l in customized_dict.items() for v in l}
df1 = df1.fillna(pd.crosstab(df1.index, df1['products'].map(reversed_dict), values=1, aggfunc='mean'))
print(df1)

# Output
         id  products  black metal  non-ferrous metals  precious metal
0  M0066350    copper          NaN                 1.0             NaN
1  M0066352  aluminum          NaN                 1.0             NaN
2  M0066353      gold          NaN                 NaN             1.0
3  M0066354    silver          NaN                 NaN             1.0
4  S0200837   soybean          NaN                 NaN             NaN
5  S0212350     Apple          NaN                 NaN             NaN
6  S0212351  iron ore          1.0                 NaN             NaN
7  S0212352      coke          1.0                 NaN             NaN
8  S0212353    others          1.0                 NaN             1.0
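The key step is the reversed mapping, so here is a small sketch of just that part. The four-item products Series is a made-up sample drawn from the question's values, not the full data.

```python
import pandas as pd

customized_dict = {
    'black metal': ['iron ore', 'coke'],
    'non-ferrous metals': ['copper', 'aluminum'],
    'precious metal': ['gold', 'silver'],
}

# Invert the dictionary: each product points to its category column.
reversed_dict = {v: k for k, l in customized_dict.items() for v in l}

# Hypothetical sample of the products column.
products = pd.Series(['copper', 'gold', 'soybean', 'coke'], name='products')

# Products missing from the mapping (here 'soybean') become NaN.
category = products.map(reversed_dict)
print(category)
```

Unmapped products end up as NaN in the mapped Series, so they contribute nothing to the crosstab and their rows stay unchanged after fillna.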