I have used sklearn's LabelEncoder to generate a unique encoding for each combination of two columns:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
df = pd.read_csv("data.csv", sep=",")
df
# A B
# 0 1 Yes
# 1 2 No
# 2 3 Yes
# 3 4 Yes
as follows:
df['AB'] = df.apply(lambda row: hash((row['A'], row['B'])), axis=1)
le = LabelEncoder()
df['C'] = le.fit_transform(df['AB'])
A B C
0 1 Yes 1
1 2 No 6
2 3 Yes 3
3 4 Yes 4
How can I generate a dictionary whose keys are the original column values (A, B) and whose values are the corresponding LabelEncoder classes? I can do that for the hashes in AB as:
values=le.transform(le.classes_)
keys=le.classes_
dic=dict(zip(keys,values))
What I am missing are the original (A, B) pairs behind the hashes in column AB, so that I can produce something like this:
{(1, 'Yes'): 0, (2, 'No'): 6, ...}
CodePudding user response:
One option is to set the index to A and B, then call to_dict:
out = df.set_index(['A','B'])['C'].to_dict()
Output:
{(1, 'Yes'): 3, (2, 'No'): 1, (3, 'Yes'): 0, (4, 'Yes'): 2}
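Another option, if you do not need the hash column at all, is to let pandas number the (A, B) combinations directly. The sketch below is only an illustration under a few assumptions: it builds the sample frame in memory instead of reading data.csv, and it uses groupby(...).ngroup() in place of hash plus LabelEncoder, so the codes follow the sorted order of the (A, B) pairs and will not match the hash-based codes above. That may actually be preferable, since Python salts string hashes per process by default, so codes derived from hash((A, B)) can change from one run to the next.

```python
import pandas as pd

# Sample data mirroring the frame in the question (stand-in for data.csv).
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": ["Yes", "No", "Yes", "Yes"]})

# ngroup() assigns one integer per distinct (A, B) combination; with the
# default sort=True the integers follow the sorted order of the group keys.
df["C"] = df.groupby(["A", "B"]).ngroup()

# Keep one row per (A, B) pair, then map each pair to its code.
mapping = (
    df.drop_duplicates(["A", "B"])
      .set_index(["A", "B"])["C"]
      .to_dict()
)
print(mapping)
```

The drop_duplicates call is not needed for this sample, but it keeps the index unique when the same (A, B) pair appears on several rows.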