LabelEncoder and OneHotEncoder within the same for loop

I am writing a for loop to encode all of the categorical columns in a dataset. I have plenty of categorical columns, and the loop initially worked with only the label encoder, but I am trying to include a OneHotEncoder instead of calling get_dummies on a separate line.

sample data:

               STYP_DESC  Gender       RACE_DESC DEGREE               MAJR_DESC1 FTPT  Target
0                   New  Female           White     BA  Business Administration   FT       1
1  New 1st Time Freshmn  Female           White     BA               Studio Art   FT       1
2                   New    Male           White   MBAX  Business Administration   FT       1
3                   New  Female         Unknown     JD             Juris Doctor   PT       1
4                   New  Female  Asian-American   MBAX  Business Administration   PT       1

from sklearn.preprocessing import OneHotEncoder, LabelEncoder

le = LabelEncoder()
enc = OneHotEncoder(handle_unknown='ignore',drop='first')
le_count = 0
enc_count = 0
for col in X_train.columns[1:]:
    if X_train[col].dtype == 'object':
        if len(list(X_train[col].unique())) <= 2:
            le.fit(X_train[col])
            X_train[col] = le.transform(X_train[col])
            le_count += 1
        else:
            enc.fit(X_train[[col]])
            X_train[[col]] = enc.transform(X_train[[col]])
            enc_count += 1
print('{} columns were label encoded and {} columns were 1-hot encoded'.format(le_count, enc_count))

When I run it I don't get any errors, but the encoding is very odd: a slew of tuples gets inserted into my new dataset.

When I run the code without everything in the else clause, it runs fine, and I can simply use get_dummies to encode the other variables.

The only issue is that when I use get_dummies with drop_first set to True, I lose track of what is supposed to be 0 and what is supposed to be 1 (this is a major issue for tracking Gender and FTPT).
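For the two-category columns, one way to keep track of which value became 0 and which became 1 is to read the mapping back from the fitted LabelEncoder. A minimal sketch, assuming a toy Gender column like the one in the sample data (the `mapping` dict is just an illustrative name):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column mirroring the Gender values in the sample data.
gender = pd.Series(['Female', 'Female', 'Male', 'Female', 'Female'])

le = LabelEncoder()
codes = le.fit_transform(gender)

# classes_ holds the labels in sorted order; a label's position is its code.
mapping = {label: int(code) for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform turns the integer codes back into the original labels.
print(le.inverse_transform(codes))
```

Keeping one fitted encoder per column (rather than refitting a shared one) makes it possible to look the mapping up again later.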

Any suggestions on this? I would use get_dummies, but since I'm doing the preprocessing after splitting my data, I'm worried about a category possibly being dropped.
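The concern about categories going missing after the split is usually handled by fitting the encoder on the training data only and reusing it on the test data, so both end up with the same output columns. A rough sketch, assuming a toy DEGREE column; the `test` frame and its 'PhD' value are hypothetical and stand in for a category never seen during fitting:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'DEGREE': ['BA', 'BA', 'MBAX', 'JD', 'MBAX']})
# Hypothetical test split: 'PhD' does not occur in the training data.
test = pd.DataFrame({'DEGREE': ['JD', 'PhD']})

# Fit on the training split only; unseen categories are encoded as all zeros.
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train)

test_encoded = enc.transform(test).toarray()
print(enc.categories_[0])  # categories learned from the training split
print(test_encoded)        # one output column per learned category
```

The sketch leaves out `drop='first'` on purpose: combining it with `handle_unknown='ignore'` is not accepted by every scikit-learn version, so check the version you have installed before relying on that combination.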

CodePudding user response:

Change the transform line in the else branch as follows:

X_train[col] = enc.transform(X_train[[col]]).toarray()

Here is the full code, which you can try directly. If it still misbehaves, the error may be in another part of your code, so please check.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

styp = ['New','New 1st Time Freshmn','New','New','New']
gend = ['Female','Female','Male','Female','Female']
race = ['White','White','Unknown','Unknown','Asian-American']
deg = ['BA','BA','MBAX','JD','MBAX']
maj = ['Business Administration','Studio Art','Business Administration','Juris Doctor','Business Administration']
ftpt = ['FT','FT','FT','PT','PT']

df = pd.DataFrame({'STYP_DESC':styp, 'Gender':gend, 'RACE_DESC':race,'DEGREE':deg,\
     'MAJR_DESC1':maj, 'FTPT':ftpt})

le = LabelEncoder()
enc = OneHotEncoder(handle_unknown='ignore',drop='first')

le_count = 0
enc_count = 0

for col in df.columns[1:]:
    if df[col].dtype == 'object':
        if len(list(df[col].unique())) <= 2:
            le.fit(df[col])
            df[col] = le.transform(df[col])
            le_count += 1
        else:
            enc.fit(df[[col]])
            df[col] = enc.transform(df[[col]]).toarray()
            enc_count += 1
print(df)
print('{} columns were label encoded and {} columns were 1-hot encoded'.format(le_count, enc_count))
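One thing to keep in mind with the loop above: when a column has more than two categories, the encoder returns several output columns, and a single DataFrame column can hold only one of them. A possible way around this, sketched here under the assumption that one named column per kept category is acceptable, is to build a separate frame from the encoded array and concatenate it (the names `kept`, `names` and `onehot` are illustrative only):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    'RACE_DESC': ['White', 'White', 'Unknown', 'Unknown', 'Asian-American'],
    'FTPT': ['FT', 'FT', 'FT', 'PT', 'PT'],
})

col = 'RACE_DESC'
enc = OneHotEncoder(drop='first')
encoded = enc.fit_transform(df[[col]]).toarray()

# categories_ is sorted; drop='first' removes the first entry, so the
# remaining categories line up with the columns of the encoded array.
kept = enc.categories_[0][1:]
names = ['{}_{}'.format(col, cat) for cat in kept]
onehot = pd.DataFrame(encoded, columns=names, index=df.index)

# Replace the original column with its named one-hot columns.
df = pd.concat([df.drop(columns=[col]), onehot], axis=1)
print(df)
```

Because each new column carries the category name, it stays clear what a 1 means, and a row of all zeros corresponds to the dropped first category.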