How do I handle multiple non-ordinal categorical variables?

Time:10-30

I grabbed a dataset online that contains data on NBA players this year. I am trying to run a linear regression on it to predict how many points a given player scores on average from these features: TeamName, Position, Age, Minutes Played Per Game. But I can't wrap my head around how to handle the first two columns, which are my categorical variables. I just started a data science course on Udemy, and the instructor hasn't explained what to do in this scenario: his OneHotEncoding examples only cover datasets with one categorical variable.

My Code:

#Import Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#Import Dataset

dataset = pd.read_csv('nba_clean.csv')
X = dataset.iloc[:, 1:-1].values
y = dataset.iloc[:, -1].values

#Encode Dataset

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers = [('encoder', OneHotEncoder(), [0, 1])], remainder = 'passthrough')
X = np.array(ct.fit_transform(X))

#Splitting the Dataset into Training set and Test Set

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state = 0)

#Perform Multiple Linear Regression on Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

#Compare predicted values to true values
y_pred = regressor.predict(X_test)
np.set_printoptions(precision = 2)
new_y_pred = y_pred.reshape(len(y_pred), 1)
new_y_test = y_test.reshape(len(y_test), 1)
print(np.concatenate((new_y_pred, new_y_test), 1))

CodePudding user response:

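Your column transformer has to handle each of the different column types. Replace

ct = ColumnTransformer(transformers = [('encoder', OneHotEncoder(), [0, 1])], remainder = 'passthrough')

with code along the following lines.

First, define a list of column names for each type. Because these lists use column names, the transformer has to be given the DataFrame itself rather than the NumPy array returned by .values: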

num_f  = ['age', 'points', ...]
ord_f  = ['bbb', 'ccc', ...]
cat_f  = ['aaa', 'ddd', ...]
drop_f = []
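If you would rather not type the lists by hand, one possible way to build them is from the column dtypes. This is only a sketch: the small DataFrame and the column names Age, MinutesPerGame and Points are assumptions standing in for your CSV, and it treats every non-numeric column as categorical, which may not suit every dataset.

```python
import pandas as pd

# Hypothetical stand-in for the NBA dataset; names and values are made up.
df = pd.DataFrame({
    "TeamName": ["Lakers", "Spurs", "Lakers"],
    "Position": ["PG", "C", "SF"],
    "Age": [25, 31, 28],
    "MinutesPerGame": [30.5, 22.1, 27.8],
    "Points": [18.2, 9.4, 12.7],
})

target = "Points"                      # assumed name of the column to predict
features = df.drop(columns=[target])   # keep the target out of the feature lists

# Numeric columns go to num_f; everything else is treated as categorical.
num_f = features.select_dtypes(include=["number"]).columns.tolist()
cat_f = features.select_dtypes(exclude=["number"]).columns.tolist()

print(cat_f)
print(num_f)
```

Ordinal columns still have to be picked out by hand, since the dtype alone does not say whether the categories have an order.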

Then create a transformer for each type of value

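from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# create a transformer for the categorical (non-ordinal) values
cat_tr = Pipeline(steps=[
    ('onehot', OneHotEncoder())])

# create a transformer for the ordinal categorical values
ord_tr = Pipeline(steps=[
    ('ordinal', OrdinalEncoder())])

# create a transformer for the numerical values
num_tr = Pipeline(steps=[
    ('scaler', StandardScaler())])

# combine them; columns not named in any list are passed through unchanged
ct = ColumnTransformer(transformers=[
    ("drop", 'drop', drop_f),
    ("cat", cat_tr, cat_f),
    ("ord", ord_tr, ord_f),
    ("num", num_tr, num_f),
    ], remainder='passthrough')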
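For the NBA case specifically, a rough end-to-end sketch might look like the following. The rows are invented for illustration, the column names Age, MinutesPerGame and Points are assumptions about your CSV, and handle_unknown="ignore" is my own addition so that a team or position missing from the training rows does not raise an error at prediction time. There are no ordinal columns here, so that transformer is left out.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows for illustration only; not real NBA statistics.
df = pd.DataFrame({
    "TeamName": ["Lakers", "Spurs", "Lakers", "Bulls", "Spurs", "Bulls"],
    "Position": ["PG", "C", "SF", "PG", "SF", "C"],
    "Age": [25, 31, 28, 22, 34, 27],
    "MinutesPerGame": [30.5, 22.1, 27.8, 18.0, 25.3, 33.2],
    "Points": [18.2, 9.4, 12.7, 6.1, 10.9, 21.5],
})

cat_f = ["TeamName", "Position"]   # both categorical columns are encoded together
num_f = ["Age", "MinutesPerGame"]

ct = ColumnTransformer(transformers=[
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_f),
    ("num", StandardScaler(), num_f),
])

# Chaining the preprocessing and the regressor keeps fit/predict consistent.
model = Pipeline(steps=[("prep", ct), ("reg", LinearRegression())])

X = df[cat_f + num_f]   # a DataFrame, so columns can be selected by name
y = df["Points"]
model.fit(X, y)

# Predict for one hypothetical player.
new_player = pd.DataFrame({
    "TeamName": ["Spurs"],
    "Position": ["PG"],
    "Age": [26],
    "MinutesPerGame": [28.0],
})
pred = model.predict(new_player)
print(pred)
```

With a real dataset you would call train_test_split on X and y first, then fit the pipeline on the training part only, so the encoder and scaler never see the test rows.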

CodePudding user response:

You can one-hot encode specific columns with the pandas function get_dummies:

pandas.get_dummies(data, columns=["TeamName", "Position"])

Like this:

import pandas as pd

df = pd.DataFrame({
    "Player": ['player1', 'player2', 'player3'],
    "TeamName": ['Lakers', 'Spurs', 'Lakers'],
    "Position": ['point guard', 'center', 'forward']
})

df
    Player TeamName     Position
0  player1   Lakers  point guard
1  player2    Spurs       center
2  player3   Lakers      forward

# dtype=int gives 0/1 columns; pandas 2.0+ returns True/False by default
pd.get_dummies(df, columns=['TeamName', 'Position'], prefix='', prefix_sep='', dtype=int)

    Player  Lakers  Spurs  center  forward  point guard
0  player1       1      0       0        0            1
1  player2       0      1       1        0            0
2  player3       1      0       0        1            0
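To feed the encoded frame into a regression, something like the sketch below could work. The data is invented and the column names Age, MinutesPerGame and Points are assumptions about your CSV. One thing to watch with get_dummies: it only creates columns for the categories it sees, so new rows have to be realigned to the training columns before predicting. Here that is done with reindex.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up rows for illustration only; not real NBA statistics.
df = pd.DataFrame({
    "TeamName": ["Lakers", "Spurs", "Lakers", "Bulls", "Spurs", "Bulls"],
    "Position": ["PG", "C", "SF", "PG", "SF", "C"],
    "Age": [25, 31, 28, 22, 34, 27],
    "MinutesPerGame": [30.5, 22.1, 27.8, 18.0, 25.3, 33.2],
    "Points": [18.2, 9.4, 12.7, 6.1, 10.9, 21.5],
})

features = ["TeamName", "Position", "Age", "MinutesPerGame"]

# Encode both categorical columns in one call; numeric columns are left alone.
X = pd.get_dummies(df[features], columns=["TeamName", "Position"], dtype=int)
y = df["Points"]
reg = LinearRegression().fit(X, y)

# A new row only produces dummy columns for its own team and position,
# so align it to the training columns and fill the missing ones with 0.
new_player = pd.DataFrame({
    "TeamName": ["Spurs"],
    "Position": ["PG"],
    "Age": [26],
    "MinutesPerGame": [28.0],
})
new_X = pd.get_dummies(new_player, columns=["TeamName", "Position"], dtype=int)
new_X = new_X.reindex(columns=X.columns, fill_value=0)

pred = reg.predict(new_X)
print(pred)
```

This realignment step is the main reason the ColumnTransformer route tends to be less error-prone once you split into training and test sets.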