How to use polars dataframes with scikit-learn?

I'm unable to use polars dataframes with scikit-learn for ML training and need help.

Currently I do all the dataframe preprocessing in polars, and at model-training time I convert the result to a pandas dataframe to get it to work. Is there a way to use a polars dataframe directly for ML training, without converting it to pandas?

CodePudding user response：

Call to_numpy() when passing a polars DataFrame to sklearn. Although sklearn can sometimes work directly on a polars Series, it is still good type hygiene to convert to the type the host library expects.

import numpy as np
import polars as pl
from sklearn.linear_model import LinearRegression

data = pl.DataFrame(
    np.random.randn(100, 5)
)

x = data.select([
    pl.all().exclude("column_0"),
])

y = data.select(pl.col("column_0").alias("y"))


x_train = x[:80]
y_train = y[:80]

x_test = x[80:]
y_test = y[80:]


m = LinearRegression()

m.fit(X=x_train.to_numpy(), y=y_train.to_numpy())
m.predict(x_test.to_numpy())

CodePudding user response：

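from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Normalizer, OneHotEncoder

encoding_transformer1 = ColumnTransformer(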
    [("Normalizer", Normalizer(), ['Age', 'Fare']),
     ("One-hot encoder",
      OneHotEncoder(dtype=int, handle_unknown='infrequent_if_exist'),
      ['Pclass', 'Sex', 'SibSp', 'Parch', 'Embarked'])],
    n_jobs=-1,
    verbose=True,
    verbose_feature_names_out=True)

encoding_transformer1.fit(xtrain)
train_data = encoding_transformer1.transform(xtrain).tocsr()
test_data = encoding_transformer1.transform(xtest).tocsr()

I'm getting this error:

ValueError: Specifying the columns using strings is only supported for pandas DataFrames

What should I do?