I am writing a for loop to encode all of the values in my dataset, which has many categorical columns. The loop works with the label encoder alone, but I am now trying to use a OneHotEncoder inside the loop instead of calling get_dummies on a separate line.
sample data:
   STYP_DESC             Gender  RACE_DESC       DEGREE  MAJR_DESC1               FTPT  Target
0  New                   Female  White           BA      Business Administration  FT    1
1  New 1st Time Freshmn  Female  White           BA      Studio Art               FT    1
2  New                   Male    White           MBAX    Business Administration  FT    1
3  New                   Female  Unknown         JD      Juris Doctor             PT    1
4  New                   Female  Asian-American  MBAX    Business Administration  PT    1
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

le = LabelEncoder()
enc = OneHotEncoder(handle_unknown='ignore', drop='first')
le_count = 0
enc_count = 0
for col in X_train.columns[1:]:
    if X_train[col].dtype == 'object':
        if len(list(X_train[col].unique())) <= 2:
            le.fit(X_train[col])
            X_train[col] = le.transform(X_train[col])
            le_count += 1
        else:
            enc.fit(X_train[[col]])
            X_train[[col]] = enc.transform(X_train[[col]])
            enc_count += 1
print('{} columns were label encoded and {} columns were 1-hot encoded'.format(le_count, enc_count))
When I run it I get no errors, but the encoding looks wrong: a slew of tuples is inserted into my new dataset.
When I run the code without the else clause, it runs fine, and I can simply use get_dummies to encode the other variables.
The only issue is that when I use get_dummies with drop_first set to True, I lose track of which category is supposed to be 0 and which is supposed to be 1 (this is a major issue for tracking Gender and FTPT).
Any suggestions on this? I would use get_dummies, but since I'm doing the preprocessing after splitting my data, I'm worried about a category possibly being dropped.
CodePudding user response:
In the else branch, change the transform line as follows:
X_train[col] = enc.transform(X_train[[col]]).toarray()
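A minimal sketch of why the tuples probably show up, using a hypothetical toy column rather than your real data: by default OneHotEncoder returns a SciPy sparse matrix, and the printed form of that matrix is a list of (row, column) entries. Calling .toarray() gives an ordinary NumPy array with one column per category.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column, not taken from the question's dataset.
toy = pd.DataFrame({'DEGREE': ['BA', 'MBAX', 'JD', 'BA']})

enc = OneHotEncoder(handle_unknown='ignore')
sparse_out = enc.fit_transform(toy[['DEGREE']])  # SciPy sparse matrix by default
dense_out = sparse_out.toarray()                 # plain 2-D NumPy array

print(type(sparse_out))   # a SciPy sparse type
print(dense_out.shape)    # one row per sample, one column per category
```

Note that the dense result here has three columns, one for each distinct DEGREE value.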
Here is the full code, which you can try directly. If it still fails, the error may be in some other part of your code, so please check.
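import pandas as pd
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

styp = ['New', 'New 1st Time Freshmn', 'New', 'New', 'New']
gend = ['Female', 'Female', 'Male', 'Female', 'Female']
race = ['White', 'White', 'Unknown', 'Unknown', 'Asian-American']
deg = ['BA', 'BA', 'MBAX', 'JD', 'MBAX']
maj = ['Business Administration', 'Studio Art', 'Business Administration', 'Juris Doctor', 'Business Administration']
ftpt = ['FT', 'FT', 'FT', 'PT', 'PT']
df = pd.DataFrame({'STYP_DESC': styp, 'Gender': gend, 'RACE_DESC': race, 'DEGREE': deg,
                   'MAJR_DESC1': maj, 'FTPT': ftpt})

le = LabelEncoder()
enc = OneHotEncoder(handle_unknown='ignore', drop='first')
le_count = 0
enc_count = 0
for col in df.columns[1:]:  # the first column (STYP_DESC) is skipped
    if df[col].dtype == 'object':
        if len(list(df[col].unique())) <= 2:
            # Two categories or fewer: encode in place as 0/1
            le.fit(df[col])
            df[col] = le.transform(df[col])
            le_count += 1
        else:
            enc.fit(df[[col]])
            # .toarray() makes the sparse result dense. It has one column per
            # kept category, so it replaces col with new columns rather than
            # being assigned back to a single column.
            dense = enc.transform(df[[col]]).toarray()
            names = enc.get_feature_names_out([col])  # needs scikit-learn >= 1.0
            dummies = pd.DataFrame(dense, columns=names, index=df.index)
            df = pd.concat([df.drop(columns=col), dummies], axis=1)
            enc_count += 1
print(df)
print('{} columns were label encoded and {} columns were 1-hot encoded'.format(le_count, enc_count))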