I'd like to create class labels for a permutation of two columns using sklearn's LabelEncoder(). How do I achieve the following behavior?
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
df = pd.read_csv("data.csv", sep=",")
df
# A B
# 0 1 Yes
# 1 2 No
# 2 3 Yes
# 3 4 Yes
I'd like to encode each combination of A and B as a single class, rather than encoding the two columns separately:
df['A'].astype('category')
#Categories (4, int64): [1, 2, 3, 4]
df['B'].astype('category')
#Categories (2, object): ['No', 'Yes']
#Column C should have 4 * 2 classes:
(1,Yes)=1 (1,No)=5
(2,Yes)=2 (2,No)=6
(3,Yes)=3 (3,No)=7
(4,Yes)=4 (4,No)=8
#Newdf
# A B C
# 0 1 Yes 1
# 1 2 No 6
# 2 3 Yes 3
# 3 4 Yes 4
CodePudding user response:
You can build the mapping DataFrame with a cross merge (how='cross', available in pandas 1.2+) and then merge it back onto df:
out = df.merge(df[['B']].drop_duplicates().merge(df.A.drop_duplicates(),how='cross').assign(C=lambda x : x.index + 1))
Out[415]:
A B C
0 1 Yes 1
1 2 No 6
2 3 Yes 3
3 4 Yes 4
For reference, the intermediate mapping table looks like this:
df[['B']].drop_duplicates().merge(df.A.drop_duplicates(),how='cross').assign(C=lambda x : x.index + 1)
Out[417]:
B A C
0 Yes 1 1
1 Yes 2 2
2 Yes 3 3
3 Yes 4 4
4 No 1 5
5 No 2 6
6 No 3 7
7 No 4 8
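The same numbering can also be computed arithmetically from per-column codes, without materialising the mapping table. This is only a minimal sketch. It assumes the desired order is the order of first appearance in each column, as in the table above. The names `a_codes`, `b_codes`, and `n_a` are illustrative, and the sample frame is built inline instead of being read from data.csv:

```python
import pandas as pd

# Sample data matching the question (assumed contents of data.csv)
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": ["Yes", "No", "Yes", "Yes"]})

# factorize assigns 0-based codes in order of first appearance
a_codes, a_uniques = pd.factorize(df["A"])
b_codes, b_uniques = pd.factorize(df["B"])
n_a = len(a_uniques)

# (B code) * (number of distinct A values) + (A code) + 1
# reproduces the row numbering of the cross-merged mapping table
df["C"] = b_codes * n_a + a_codes + 1
print(df)
```

Unlike the cross merge, this never builds the full 4 * 2 table, which may matter when the columns have many distinct values.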
CodePudding user response:
You can create an additional column that merges the values of the two columns into one tuple. LabelEncoder cannot encode tuples directly, so you additionally need to take the hash() of each tuple:
df['AB'] = df.apply(lambda row: hash((row['A'], row['B'])), axis=1)
le = LabelEncoder()
df['C'] = le.fit_transform(df['AB'])
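Two things are worth keeping in mind with this approach. The built-in hash() of a tuple that contains strings changes between Python processes unless PYTHONHASHSEED is fixed. LabelEncoder also only numbers the combinations that actually occur, starting at 0, so the labels will not match the 1 to 8 numbering from the question. A possible alternative that skips hashing is to encode a joined string key instead. In this sketch, the underscore-separated key format is an assumption made for illustration, and the sample frame is built inline:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample data matching the question (assumed contents of data.csv)
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": ["Yes", "No", "Yes", "Yes"]})

# Build a deterministic string key per row, e.g. "1_Yes"
df["AB"] = df["A"].astype(str) + "_" + df["B"]

le = LabelEncoder()
# Labels run from 0 to k-1 over the observed keys, in sorted key order
df["C"] = le.fit_transform(df["AB"])

print(df)
# inverse_transform maps a label back to its key
print(le.inverse_transform([0]))
```

Because the key is a plain string, the fitted encoder gives the same labels on every run and can translate labels back to readable (A, B) pairs.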