I'd like to use LogisticRegression on X features that are a mix of strings and floats. This question is similar to this one: Logistic regression on One-hot encoding
There are two comments on that question:

I would like to add that your answer is partially correct. Indeed, if you only LabelEncode the strings and do not one-hot encode them, that will create false results, since some strings will be "worth" more than others. – Mornor Jun 27, 2017 at 7:19

If anyone is wondering what Mornor means, this is because label encoding produces numerical values, e.g. France = 0, Italy = 1, etc. That means that some countries are worth more than others. With one-hot encoding, each country has the same weight, e.g. France = [1, 0], Italy = [0, 1]. Also, don't forget the dummy variable trap: algosome.com/articles/dummy-variable-trap-regression.html – Juan Acevedo Jan 20,
However these are just comments. I would like to see the code that combines them as it's not intuitively obvious how to combine them.
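To make those comments concrete, here is a minimal illustration of the difference (the country names are just placeholder data, not from the original code):

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

countries = np.array(["France", "Italy", "Spain"])

# LabelEncoder maps each string to an integer, which implies an ordering
# (France=0 < Italy=1 < Spain=2) that a linear model treats as magnitude.
print(LabelEncoder().fit_transform(countries))  # [0 1 2]

# OneHotEncoder gives each category its own column, so no category
# "outweighs" another.
ohe = OneHotEncoder(sparse=False)
print(ohe.fit_transform(countries.reshape(-1, 1)))
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
# To avoid the dummy variable trap, pass drop="first" so that one
# redundant column is dropped.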
Here is the code:
def build_model(results: List[Result]) -> Tuple[LogisticRegression, OneHotEncoder]:
    home_names = np.array([r.fixture.home_team.name for r in results])
    away_names = np.array([r.fixture.away_team.name for r in results])
    home_goals = np.array([r.home_goals for r in results])
    away_goals = np.array([r.away_goals for r in results])
    home_spis = np.array([r.home_spi for r in results])
    away_spis = np.array([r.away_spi for r in results])
    home_imps = np.array([r.home_imp for r in results])
    away_imps = np.array([r.away_imp for r in results])
    team_names = np.array(list(home_names) + list(away_names)).reshape(-1, 1)
    team_encoding = OneHotEncoder(sparse=False).fit(team_names)
    encoded_home_names = team_encoding.transform(home_names.reshape(-1, 1))
    encoded_away_names = team_encoding.transform(away_names.reshape(-1, 1))
    team_spis = np.array(list(home_spis) + list(away_spis)).reshape(-1, 1)
    home_spis_reshaped = home_spis.reshape(-1, 1)
    away_spis_reshaped = away_spis.reshape(-1, 1)
    x: NDArray[float64] = np.concatenate(
        [encoded_home_names, encoded_away_names, home_spis_reshaped, away_spis_reshaped], 1)  # type: ignore
    y = np.sign(home_goals - away_goals)
    model = LogisticRegression(penalty="l2", fit_intercept=False, multi_class="multinomial", C=1)
    model.fit(x, y)
    return model, team_encoding

I then get the following error:
if n_features != self.n_features_in_:
> raise ValueError(
f"X has {n_features} features, but {self.__class__.__name__} "
f"is expecting {self.n_features_in_} features as input."
)
E ValueError: X has 1416 features, but LogisticRegression is expecting 1418 features as input.
../../env/lib/python3.10/site-packages/sklearn/base.py:400: ValueError
So it looks like I have to add the home/away SPI float scores in before calling fit on the OneHotEncoder, but I'm unclear on the best way to do this. Thanks
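For reference, the counts in the error line up: the model was fitted on 1418 columns (the two one-hot team blocks plus the two SPI floats), so whatever called predict presumably passed only the 1416 one-hot columns. A minimal sketch of a prediction helper that mirrors the training layout (predict_result and its arguments are illustrative, not from the original code):

import numpy as np

def predict_result(model, team_encoding, home, away, home_spi, away_spi):
    # Rebuild one row with the same column layout used for training:
    # [one-hot home team | one-hot away team | home_spi | away_spi]
    encoded_home = team_encoding.transform(np.array([[home]]))
    encoded_away = team_encoding.transform(np.array([[away]]))
    x = np.concatenate([encoded_home, encoded_away, [[home_spi]], [[away_spi]]], axis=1)
    return model.predict(x)  # 1 = home win, 0 = draw, -1 = away win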
Solution based upon Alexander's help:
def build_model(results: List[Result]) -> Tuple[LogisticRegression, OneHotEncoder]:
    home_names = np.array([r.fixture.home_team.name for r in results])
    away_names = np.array([r.fixture.away_team.name for r in results])
    home_goals = np.array([r.home_goals for r in results])
    away_goals = np.array([r.away_goals for r in results])
    home_spis = np.array([r.home_spi for r in results])
    away_spis = np.array([r.away_spi for r in results])
    home_imps = np.array([r.home_imp for r in results])
    away_imps = np.array([r.away_imp for r in results])
    team_names = np.array(list(home_names) + list(away_names)).reshape(-1, 1)
    team_features = [home_names, away_names, home_spis, away_spis, home_imps, away_imps]
    df = pd.DataFrame(team_features).transpose()
    df.columns = ['home_team', 'away_team', 'home_spi', 'away_spi', 'home_importance', 'away_importance']
    cat_columns = ["home_team", "away_team"]
    model = LogisticRegression(penalty="l2", fit_intercept=False, multi_class="multinomial", C=1)
    team_encoding = OneHotEncoder(sparse=False).fit(team_names)
    pipe = make_pipeline(
        ColumnTransformer(
            transformers=[
                ("encode", team_encoding, cat_columns),
            ],
            remainder="passthrough"
        ),
        SimpleImputer(),
        model
    )
    y = np.sign(home_goals - away_goals)
    pipe = pipe.fit(df, y)
    return model, team_encoding
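Note that prediction then has to go through the fitted pipeline rather than the bare model, since only the pipeline knows how to encode raw rows. A sketch of usage, assuming build_model is modified to also return pipe (the team names and numbers here are placeholders):

import pandas as pd

fixture = pd.DataFrame(
    [["Arsenal", "Chelsea", 85.3, 82.1, 40.0, 55.0]],
    columns=["home_team", "away_team", "home_spi", "away_spi",
             "home_importance", "away_importance"])
print(pipe.predict(fixture))        # 1 = home win, 0 = draw, -1 = away win
print(pipe.predict_proba(fixture))  # probabilities for each outcome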
CodePudding user response:
TL;DR: Don't reinvent pipelines, and don't do what that answer suggests (See also: why shouldn't I use LabelEncoder for categorical features?).
Here's a sample of the Titanic dataset that shows a mix of ordinal integer values (pclass), sex encoded as [male, female], and continuous age and fare variables:
pclass sex age fare
0 1 female 29.0000 211.3375
1 1 male 0.9167 151.5500
2 1 female 2.0000 151.5500
3 1 male 30.0000 151.5500
4 1 female 25.0000 151.5500
We'll OneHotEncode the categorical columns, impute missing values, and fit a LogisticRegression model as the last step in a pipeline. Notice the use of passthrough in the ColumnTransformer, which indicates that we should not apply a categorical encoding to the float features. MRE:
from sklearn.datasets import fetch_openml
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
titanic, target = fetch_openml(data_id=40945, parser="auto", return_X_y=True)
titanic = titanic[["pclass", "sex", "age", "fare"]]
cat_columns = ["sex", "pclass"]
cat_categories = [["male", "female"], [1, 2, 3]]
pipe = make_pipeline(
    ColumnTransformer(
        transformers=[
            ("encode", OneHotEncoder(categories=cat_categories), cat_columns),
        ],
        remainder="passthrough"
    ),
    SimpleImputer(),
    LogisticRegression()
)
X_train, X_test, y_train, y_test = train_test_split(titanic, target)
pipe = pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))
# 0.7926
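Once fitted, the same pipeline scores raw, unencoded rows directly; a small usage sketch (made-up passengers, with one missing age to show the imputer at work):

import numpy as np
import pandas as pd

new_passengers = pd.DataFrame({
    "pclass": [3, 1],
    "sex": ["male", "female"],
    "age": [22.0, np.nan],  # the NaN is filled in by SimpleImputer
    "fare": [7.25, 80.0],
})
print(pipe.predict(new_passengers))  # predicted survival labels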