I have a problem. I have a dataframe df_path
. This gives all the values a host_id
can have in a given id
. I would like to make a One Hot Encoding Matrix out of it. I.e., I would like that a new coloumn Hot Water
is generated. If a certain object has the value Hot Water
a 1 is entered, if not a 0 should be entered.
The problem is that the host_id
with the id
is listed too often in df_one
, it should only occur only once. How do I do this correctly?
The .melt
function and so on must remain and it would be nice if it could go with pandas.
d = {'host_id': [1, 1, 2],
'id': [10, 11, 20],
'value': ["Hot Water,Cold Water,Kitchen,Coffee",
"Hot Water,Coffee,Something",
"Hot Water,Coffee"]}
df = pd.DataFrame(data=d)
print(df)
print(df.shape)
print()
df_path = df.copy()
df_path.index = pd.MultiIndex.from_arrays(df_path[['host_id', 'id']].values.T, names=['host_id', 'id'])
df_path = df_path['value'].str.split(',', expand=True)
df_path = df_path.melt(ignore_index=False).dropna()
df_path.reset_index(inplace=True)
print(df_path)
one_hot = pd.get_dummies(df_path['value'])
df_one = df_path.drop('value',axis = 1)
df_one = df_path.join(one_hot)
print()
print(df_one)
[OUT]
# df
host_id id value
0 1 10 Hot Water,Cold Water,Kitchen,Coffee
1 1 11 Hot Water,Coffee,Something
2 2 20 Hot Water,Coffee
(3, 3) # df.shape
# df_path
host_id id variable value
0 1 10 0 Hot Water
1 1 11 0 Hot Water
2 2 20 0 Hot Water
3 1 10 1 Cold Water
4 1 11 1 Coffee
5 2 20 1 Coffee
6 1 10 2 Kitchen
7 1 11 2 Something
8 1 10 3 Coffee
What I got
# df_one
host_id id variable value Coffee Cold Water Hot Water Kitchen \
0 1 10 0 Hot Water 0 0 1 0
1 1 11 0 Hot Water 0 0 1 0
2 2 20 0 Hot Water 0 0 1 0
3 1 10 1 Cold Water 0 1 0 0
4 1 11 1 Coffee 1 0 0 0
5 2 20 1 Coffee 1 0 0 0
6 1 10 2 Kitchen 0 0 0 1
7 1 11 2 Something 0 0 0 0
8 1 10 3 Coffee 1 0 0 0
Something
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 1
8 0
What I want
host_id id value Coffee Cold Water Hot Water Kitchen Som.
0 1 10 Hot Water,Cold Water,Kitchen,Coffee 1 1 1 1 0
1 1 11 Hot Water,Coffee,Something 1 0 1 0 1
2 2 20 Hot Water,Coffee 1 0 1 0 0
(3, 8) # df_one.shape
CodePudding user response:
You could try something like:
grouped_df_one = df_one.groupby(['id']).max()
And merge it with your original df
df.merge(grouped_df_one, on='id')
Then select the columns you need. Or you could already select the columns you need in grouped_df_one.
In one line:
df.merge(df_one.groupby(['id']).max().iloc[:,3:], on='id')
CodePudding user response:
convert value into list, explode, get dummies and merge.
df=df.assign(value=df['value'].str.split(',')).explode('value', ignore_index=True)
df.merge(pd.crosstab([df['id'],df['host_id']], df['value']).reset_index(), how='left', on=['id','host_id'])
host_id id value Coffee Cold Water Hot Water Kitchen Something
0 1 10 Hot Water 1 1 1 1 0
1 1 10 Cold Water 1 1 1 1 0
2 1 10 Kitchen 1 1 1 1 0
3 1 10 Coffee 1 1 1 1 0
4 1 11 Hot Water 1 0 1 0 1
5 1 11 Coffee 1 0 1 0 1
6 1 11 Something 1 0 1 0 1
7 2 20 Hot Water 1 0 1 0 0
8 2 20 Coffee 1 0 1 0 0