I'm unable to use polars DataFrames with scikit-learn for ML training.
Currently I do all the DataFrame preprocessing in polars, and at model training time I convert the result to a pandas DataFrame so that it works.
Is there a way to use a polars DataFrame directly for ML training, without converting it to pandas?
CodePudding user response:
You must call to_numpy when passing a DataFrame to sklearn. Though sometimes sklearn can work on polars Series, it is still good type hygiene to transform to the type the host library expects.
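import numpy as np
import polars as pl
from sklearn.linear_model import LinearRegression

# 100 rows, 5 columns; polars names them column_0 ... column_4
data = pl.DataFrame(
    np.random.randn(100, 5)
)
# features: every column except column_0
x = data.select([
    pl.all().exclude("column_0"),
])
# target: column_0, renamed to y
y = data.select(pl.col("column_0").alias("y"))
# first 80 rows for training, the remaining 20 for testing
x_train = x[:80]
y_train = y[:80]
x_test = x[80:]
y_test = y[80:]
# convert to NumPy at the sklearn boundary
m = LinearRegression()
m.fit(X=x_train.to_numpy(), y=y_train.to_numpy())
m.predict(x_test.to_numpy())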
CodePudding user response:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Normalizer, OneHotEncoder

# xtrain and xtest are polars DataFrames
encoding_transformer1 = ColumnTransformer(
    [("Normalizer", Normalizer(), ['Age', 'Fare']),
     ("One-hot encoder",
      OneHotEncoder(dtype=int, handle_unknown='infrequent_if_exist'),
      ['Pclass', 'Sex', 'SibSp', 'Parch', 'Embarked'])],
    n_jobs=-1,
    verbose=True,
    verbose_feature_names_out=True)
encoding_transformer1.fit(xtrain)
train_data = encoding_transformer1.transform(xtrain).tocsr()
test_data = encoding_transformer1.transform(xtest).tocsr()
I'm getting this error:
ValueError: Specifying the columns using strings is only supported for pandas DataFrames
What should I do?