How do I turn preprocessed data from pipelines into dataframes?


I have a preprocessing script for my data. Everything works until I have to feed the preprocessed data into a fit function that takes pandas DataFrames and arrays. How can I turn the transformed training data into a DataFrame? After pipeline.fit(), the object I have is a ColumnTransformer, not a pandas DataFrame.


import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

# generate the data
data = pd.DataFrame({
    'y':  [1, 2, 3, 4, 5],
    'x1': [6, 7, 8, np.nan, np.nan],
    'x2': [9, 10, 11, np.nan, np.nan],
    'x3': ['a', 'b', 'c', np.nan, np.nan],
    'x4': [np.nan, np.nan, 'd', 'e', 'f']
})

# extract the features and target
x = data.drop(labels=['y'], axis=1)
y = data['y']

# split the data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

# map the features to the corresponding types (numerical or categorical)
numerical_features = x_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = x_train.select_dtypes(include=['object']).columns.tolist()

# define the numerical features pipeline
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# define the categorical features pipeline
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# define the overall pipeline
preprocessor_pipeline = ColumnTransformer(transformers=[
    ('num', numerical_transformer, numerical_features),
    ('cat', categorical_transformer, categorical_features)
])

# fit the pipeline to the training data
preprocessor_pipeline.fit(x_train)

# apply the pipeline to the training and test data
x_train_ = preprocessor_pipeline.transform(x_train)
x_test_ = preprocessor_pipeline.transform(x_test)

Bonus: Do I need to preprocess my labels (y_train) as well?

To turn your pipeline output into DataFrames, wrap the transformed arrays like this:

x_train_df = pd.DataFrame(data=x_train_)
x_test_df = pd.DataFrame(data=x_test_)

Your labels y are already numeric, so in most cases no further preprocessing is needed. It does depend on the ML model you want to use in the next step, though.
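
For the case where labels are not numeric (not your situation here), a rough sketch of encoding them with scikit-learn's LabelEncoder could look like this. The string labels are hypothetical, and whether you need this at all depends on the estimator, since some accept string class labels directly:

```python
from sklearn.preprocessing import LabelEncoder

# hypothetical string labels, not taken from the question's data
y_labels = ['cat', 'dog', 'cat', 'bird']

encoder = LabelEncoder()
# map each class to an integer code (classes are sorted before coding)
y_encoded = encoder.fit_transform(y_labels)
# map the integer codes back to the original strings
y_decoded = encoder.inverse_transform(y_encoded)

print(list(y_encoded), list(encoder.classes_))
```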

