LabelEncoding a permutation of combination of columns

Time:04-21

I'd like to create class labels for a permutation of two columns using sklearn's LabelEncoder(). How do I achieve the following behavior?

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("data.csv", sep=",")
df
#    A    B    
# 0  1  Yes 
# 1  2   No 
# 2  3  Yes 
# 3  4  Yes

I'd like one class label per combination of A and B, rather than encoding the two columns separately:

df['A'].astype('category')
#Categories (4, int64): [1, 2, 3, 4]

df['B'].astype('category')
#Categories (2, object): ['Yes','No']

#Column C should have 4 * 2 classes:
(1,Yes)=1  (1,No)=5
(2,Yes)=2  (2,No)=6
(3,Yes)=3  (3,No)=7
(4,Yes)=4  (4,No)=8

#Newdf
#    A    B  C    
# 0  1  Yes  1
# 1  2   No  6
# 2  3  Yes  3
# 3  4  Yes  4

CodePudding user response:

We can build the mapping DataFrame with a cross merge and then merge it back onto df:

out = df.merge(df[['B']].drop_duplicates().merge(df.A.drop_duplicates(), how='cross').assign(C=lambda x: x.index + 1))
Out[415]: 
   A    B  C
0  1  Yes  1
1  2   No  6
2  3  Yes  3
3  4  Yes  4

More info: the mapping table on its own

df[['B']].drop_duplicates().merge(df.A.drop_duplicates(), how='cross').assign(C=lambda x: x.index + 1)
Out[417]: 
     B  A  C
0  Yes  1  1
1  Yes  2  2
2  Yes  3  3
3  Yes  4  4
4   No  1  5
5   No  2  6
6   No  3  7
7   No  4  8
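
For readers without data.csv, here is a minimal self-contained sketch of the same cross-merge idea. The inline DataFrame is an assumption that mirrors the sample in the question, the intermediate names (a_values, b_values, mapping) are only illustrative, and how='cross' needs pandas 1.2 or later.

```python
import pandas as pd

# Assumed stand-in for data.csv, copied from the sample in the question
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": ["Yes", "No", "Yes", "Yes"]})

# Distinct values of each column, in order of first appearance
a_values = df[["A"]].drop_duplicates()
b_values = df[["B"]].drop_duplicates()

# Cross join: one row per (B, A) pair, numbered from 1
mapping = b_values.merge(a_values, how="cross")
mapping["C"] = mapping.index + 1

# Look up the label for each original row; how="left" keeps df's row order
out = df.merge(mapping, on=["A", "B"], how="left")
print(out)
```

Note that the numbering follows the order in which values first appear in the data, so 'Yes' comes before 'No' here only because row 0 holds 'Yes'.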

CodePudding user response:

You can create an additional column that merges the values of the two columns into one tuple. LabelEncoder cannot encode the tuples directly, so you also need to take the hash() of each tuple:

df['AB'] = df.apply(lambda row: hash((row['A'], row['B'])), axis=1)
le = LabelEncoder()
df['C'] = le.fit_transform(df['AB'])
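
One thing to keep in mind with hash(): Python randomizes string hashes per process unless PYTHONHASHSEED is set, so the labels above can differ from run to run. A possible alternative, sketched here under the assumption that a plain string key is acceptable, is to concatenate the two columns and encode that instead. The inline DataFrame and the "_" separator are illustrative choices, not part of the original answer.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Assumed stand-in for data.csv, copied from the sample in the question
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": ["Yes", "No", "Yes", "Yes"]})

# One string key per row; "_" is an arbitrary separator chosen for this sketch
key = df["A"].astype(str) + "_" + df["B"]

le = LabelEncoder()
df["C"] = le.fit_transform(key)  # labels start at 0, sorted by key

print(df)
print(le.classes_)  # the keys behind each label, for decoding later
```

Like the hash version, this only assigns labels to combinations that actually occur in the data (4 here, not 4 * 2), and le.inverse_transform() recovers the original key for a given label.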