Over all enter a 1 where the value exists otherwise enter a 0-CodePudding

I have a problem. I have a dataframe df_path. This gives all the values a host_id can have in a given id. I would like to make a One Hot Encoding Matrix out of it. I.e., I would like that a new coloumn Hot Water is generated. If a certain object has the value Hot Watera 1 is entered, if not a 0 should be entered.

The problem is that the host_id with the id is listed too often in df_one, it should only occur only once. How do I do this correctly?

The .melt function and so on must remain and it would be nice if it could go with pandas.

d = {'host_id': [1, 1, 2],
     'id': [10, 11, 20],
     'value': ["Hot Water,Cold Water,Kitchen,Coffee", 
               "Hot Water,Coffee,Something",
               "Hot Water,Coffee"]}
df = pd.DataFrame(data=d)
print(df)
print(df.shape)
print()
df_path = df.copy()
df_path.index = pd.MultiIndex.from_arrays(df_path[['host_id', 'id']].values.T, names=['host_id', 'id'])
df_path = df_path['value'].str.split(',', expand=True)
df_path = df_path.melt(ignore_index=False).dropna()
df_path.reset_index(inplace=True)

print(df_path)



one_hot = pd.get_dummies(df_path['value'])
df_one = df_path.drop('value',axis = 1)
df_one = df_path.join(one_hot)
print()
print(df_one)

[OUT]
# df
   host_id  id                                value
0        1  10  Hot Water,Cold Water,Kitchen,Coffee
1        1  11           Hot Water,Coffee,Something
2        2  20                     Hot Water,Coffee
(3, 3) # df.shape

# df_path
   host_id  id  variable       value
0        1  10         0   Hot Water
1        1  11         0   Hot Water
2        2  20         0   Hot Water
3        1  10         1  Cold Water
4        1  11         1      Coffee
5        2  20         1      Coffee
6        1  10         2     Kitchen
7        1  11         2   Something
8        1  10         3      Coffee

What I got

# df_one
   host_id  id  variable       value  Coffee  Cold Water  Hot Water  Kitchen  \
0        1  10         0   Hot Water       0           0          1        0   
1        1  11         0   Hot Water       0           0          1        0   
2        2  20         0   Hot Water       0           0          1        0   
3        1  10         1  Cold Water       0           1          0        0   
4        1  11         1      Coffee       1           0          0        0   
5        2  20         1      Coffee       1           0          0        0   
6        1  10         2     Kitchen       0           0          0        1   
7        1  11         2   Something       0           0          0        0   
8        1  10         3      Coffee       1           0          0        0   

   Something  
0          0  
1          0  
2          0  
3          0  
4          0  
5          0  
6          0  
7          1  
8          0

What I want

   host_id  id                                value  Coffee  Cold Water  Hot Water Kitchen Som.
0        1  10  Hot Water,Cold Water,Kitchen,Coffee       1           1          1       1    0
1        1  11           Hot Water,Coffee,Something       1           0          1       0    1
2        2  20                     Hot Water,Coffee       1           0          1       0    0  
(3, 8) # df_one.shape

CodePudding user response：

You could try something like:

grouped_df_one = df_one.groupby(['id']).max()

And merge it with your original df

df.merge(grouped_df_one, on='id')

Then select the columns you need. Or you could already select the columns you need in grouped_df_one.

In one line: df.merge(df_one.groupby(['id']).max().iloc[:,3:], on='id')

CodePudding user response：

convert value into list, explode, get dummies and merge.

df=df.assign(value=df['value'].str.split(',')).explode('value', ignore_index=True)



 df.merge(pd.crosstab([df['id'],df['host_id']], df['value']).reset_index(), how='left', on=['id','host_id'])

host_id  id       value  Coffee  Cold Water  Hot Water  Kitchen  Something
0        1  10   Hot Water       1           1          1        1          0
1        1  10  Cold Water       1           1          1        1          0
2        1  10     Kitchen       1           1          1        1          0
3        1  10      Coffee       1           1          1        1          0
4        1  11   Hot Water       1           0          1        0          1
5        1  11      Coffee       1           0          1        0          1
6        1  11   Something       1           0          1        0          1
7        2  20   Hot Water       1           0          1        0          0
8        2  20      Coffee       1           0          1        0          0