I'd like to use LogisticRegression on X features that are a mix of strings and floats. This question is similar to this one: Logistic regression on One-hot encoding
There are two comments on that question:

I would like to add that your answer is partially correct. Indeed, if you only LabelEncode the strings and do not one-hot encode them, that will create false results, since some strings will be "worth" more than others. – Mornor Jun 27, 2017 at 7:19

If anyone is wondering what Mornor means, this is because label encoding produces numerical values, e.g. France = 0, Italy = 1, etc. That means that some countries are worth more than others. With one-hot encoding, each country has the same weight, e.g. France = [1, 0], Italy = [0, 1]. Also, don't forget the dummy variable trap: algosome.com/articles/dummy-variable-trap-regression.html – Juan Acevedo Jan 20,
However these are just comments. I would like to see the code that combines them as it's not intuitively obvious how to combine them.
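To make those comments concrete, here is a minimal illustration of the difference (the country names are just placeholder data, not from the original code):

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

countries = np.array(["France", "Italy", "Spain"])

# LabelEncoder maps each string to an integer, which implies an ordering
# (France=0 < Italy=1 < Spain=2) that a linear model treats as magnitude.
print(LabelEncoder().fit_transform(countries))  # [0 1 2]

# OneHotEncoder gives each category its own column, so no category
# "outweighs" another.
ohe = OneHotEncoder(sparse=False)
print(ohe.fit_transform(countries.reshape(-1, 1)))
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
# To avoid the dummy variable trap, pass drop="first" so that one
# redundant column is dropped.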
Here is the code:
def build_model(results: List[Result]) -> Tuple[LogisticRegression, OneHotEncoder]:
    home_names = np.array([r.fixture.home_team.name for r in results])
    away_names = np.array([r.fixture.away_team.name for r in results])
    home_goals = np.array([r.home_goals for r in results])
    away_goals = np.array([r.away_goals for r in results])
    home_spis = np.array([r.home_spi for r in results])
    away_spis = np.array([r.away_spi for r in results])
    home_imps = np.array([r.home_imp for r in results])
    away_imps = np.array([r.away_imp for r in results])
    team_names = np.array(list(home_names) + list(away_names)).reshape(-1, 1)
    team_encoding = OneHotEncoder(sparse=False).fit(team_names)
    encoded_home_names = team_encoding.transform(home_names.reshape(-1, 1))
    encoded_away_names = team_encoding.transform(away_names.reshape(-1, 1))
    team_spis = np.array(list(home_spis) + list(away_spis)).reshape(-1, 1)
    home_spis_reshaped = home_spis.reshape(-1, 1)
    away_spis_reshaped = away_spis.reshape(-1, 1)
    x: NDArray[float64] = np.concatenate(
        [encoded_home_names, encoded_away_names, home_spis_reshaped, away_spis_reshaped], 1)  # type: ignore
    y = np.sign(home_goals - away_goals)
    model = LogisticRegression(penalty="l2", fit_intercept=False, multi_class="multinomial", C=1)
    model.fit(x, y)
    return model, team_encoding

I then get the following error:
if n_features != self.n_features_in_:
> raise ValueError(
f"X has {n_features} features, but {self.__class__.__name__} "
f"is expecting {self.n_features_in_} features as input."
)
E ValueError: X has 1416 features, but LogisticRegression is expecting 1418 features as input.
../../env/lib/python3.10/site-packages/sklearn/base.py:400: ValueError
So it looks like I have to add the home/away SPI float scores in before calling fit on the OneHotEncoder, but I'm unclear on the best way to do this. Thanks
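For reference, the counts in the error line up: the model was fitted on 1418 columns (the two one-hot team blocks plus the two SPI floats), so whatever called predict presumably passed only the 1416 one-hot columns. A minimal sketch of a prediction helper that mirrors the training layout (predict_result and its arguments are illustrative, not from the original code):

import numpy as np

def predict_result(model, team_encoding, home, away, home_spi, away_spi):
    # Rebuild one row with the same column layout used for training:
    # [one-hot home team | one-hot away team | home_spi | away_spi]
    encoded_home = team_encoding.transform(np.array([[home]]))
    encoded_away = team_encoding.transform(np.array([[away]]))
    x = np.concatenate([encoded_home, encoded_away, [[home_spi]], [[away_spi]]], axis=1)
    return model.predict(x)  # 1 = home win, 0 = draw, -1 = away win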
Solution based upon Alexander's help:
def build_model(results: List[Result]) -> Tuple[LogisticRegression, OneHotEncoder]:
    home_names = np.array([r.fixture.home_team.name for r in results])
    away_names = np.array([r.fixture.away_team.name for r in results])
    home_goals = np.array([r.home_goals for r in results])
    away_goals = np.array([r.away_goals for r in results])
    home_spis = np.array([r.home_spi for r in results])
    away_spis = np.array([r.away_spi for r in results])
    home_imps = np.array([r.home_imp for r in results])
    away_imps = np.array([r.away_imp for r in results])
    team_names = np.array(list(home_names) + list(away_names)).reshape(-1, 1)
    team_features = [home_names, away_names, home_spis, away_spis, home_imps, away_imps]
    df = pd.DataFrame(team_features).transpose()
    df.columns = ['home_team', 'away_team', 'home_spi', 'away_spi', 'home_importance', 'away_importance']
    cat_columns = ["home_team", "away_team"]
    model = LogisticRegression(penalty="l2", fit_intercept=False, multi_class="multinomial", C=1)
    team_encoding = OneHotEncoder(sparse=False).fit(team_names)
    pipe = make_pipeline(
        ColumnTransformer(
            transformers=[
                ("encode", team_encoding, cat_columns),
            ],
            remainder="passthrough"
        ),
        SimpleImputer(),
        model
    )
    y = np.sign(home_goals - away_goals)
    pipe = pipe.fit(df, y)
    return model, team_encoding
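Note that prediction then has to go through the fitted pipeline rather than the bare model, since only the pipeline knows how to encode raw rows. A sketch of usage, assuming build_model is modified to also return pipe (the team names and numbers here are placeholders):

import pandas as pd

fixture = pd.DataFrame(
    [["Arsenal", "Chelsea", 85.3, 82.1, 40.0, 55.0]],
    columns=["home_team", "away_team", "home_spi", "away_spi",
             "home_importance", "away_importance"])
print(pipe.predict(fixture))        # 1 = home win, 0 = draw, -1 = away win
print(pipe.predict_proba(fixture))  # probabilities for each outcome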
CodePudding user response:
TL;DR: Don't reinvent pipelines, and don't do what that answer suggests (See also: why shouldn't I use LabelEncoder for categorical features?).
Here's a sample of the Titanic dataset that shows a mix of ordinal integer values (pclass), sex encoded as [male, female], and continuous age and fare variables:
pclass sex age fare
0 1 female 29.0000 211.3375
1 1 male 0.9167 151.5500
2 1 female 2.0000 151.5500
3 1 male 30.0000 151.5500
4 1 female 25.0000 151.5500
We'll OneHotEncode the categorical columns, impute missing values, and fit a LogisticRegression model as the last step in a pipeline. Notice the use of passthrough in the ColumnTransformer, which indicates that we should not apply a categorical encoding to the float features. MRE:
from sklearn.datasets import fetch_openml
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
titanic, target = fetch_openml(data_id=40945, parser="auto", return_X_y=True)
titanic = titanic[["pclass", "sex", "age", "fare"]]
cat_columns = ["sex", "pclass"]
cat_categories = [["male", "female"], [1, 2, 3]]
pipe = make_pipeline(
    ColumnTransformer(
        transformers=[
            ("encode", OneHotEncoder(categories=cat_categories), cat_columns),
        ],
        remainder="passthrough"
    ),
    SimpleImputer(),
    LogisticRegression()
)
X_train, X_test, y_train, y_test = train_test_split(titanic, target)
pipe = pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))
# 0.7926
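Once fitted, the same pipeline scores raw, unencoded rows directly; a small usage sketch (made-up passengers, with one missing age to show the imputer at work):

import numpy as np
import pandas as pd

new_passengers = pd.DataFrame({
    "pclass": [3, 1],
    "sex": ["male", "female"],
    "age": [22.0, np.nan],  # the NaN is filled in by SimpleImputer
    "fare": [7.25, 80.0],
})
print(pipe.predict(new_passengers))  # predicted survival labels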