Pandas: Linear Regression, apply standard scaler to some columns

So I have the following dataset:

import pandas as pd
new_data = pd.read_csv('https://raw.githubusercontent.com/michalis0/DataMining_and_MachineLearning/master/data/regression_sales.csv')

I then did a simple linear regression :

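import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

y = np.array(new_data['sales_per_day'])
X = np.array(new_data[['number_orders', 'number_items', 'number_segments', 'year', 'month', 'day']])
X.shape, y.shape
# Split features and target together so X_train / X_test exist for the later steps
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, train_size=0.8, random_state=77)
regression_model = LinearRegression(fit_intercept=True)
regression_model.fit(X_train, y_train)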

I now want to standardize 'number_orders', 'number_items', 'number_segments', and here is what I tried :

from sklearn.preprocessing import StandardScaler
Std_Scaler = StandardScaler()
Std_data = Std_Scaler.fit_transform(X_train)
Std_data = pd.DataFrame(Std_Scaler.transform(X_test), columns=['number_items', 'number_orders', 'number_segments'])

However, I get the following error: ValueError: Wrong number of items passed 6, placement implies 3.

The thing is, I only want to standardize those three columns and not the other three (year, month, day), but I can't seem to make it work.

Would you know of any way to standardize only part of the dataset?

CodePudding user response:

You can split your data frame like this:

from sklearn.model_selection import train_test_split

X = new_data[['number_orders', 'number_items', 'number_segments', 'year', 'month', 'day']]
y = new_data['sales_per_day']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    train_size=0.8, random_state=77)

Define columns to scale:

Cols = ['number_items', 'number_orders', 'number_segments']

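Then make a copy of each split, since you are modifying the data frames, and fit the scaler on the training set only, reusing its mean and standard deviation for the test set. Something like:

from sklearn.preprocessing import StandardScaler
Std_Scaler = StandardScaler()
X_train = X_train.copy()
X_test = X_test.copy()
# Learn mean and standard deviation from the training columns only
X_train[Cols] = Std_Scaler.fit_transform(X_train[Cols])
# Reuse the training statistics on the test set; do not refit here
X_test[Cols] = Std_Scaler.transform(X_test[Cols])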
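
As an alternative that is not part of the original answer, scaling only some columns can also be expressed with scikit-learn's ColumnTransformer inside a Pipeline. The sketch below is a minimal illustration under stated assumptions: the `demo` frame is synthetic stand-in data with the same column names as the question's CSV (the values are made up), and the names `cols_to_scale`, `preprocess`, and `model` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for regression_sales.csv (assumption: values are invented).
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "number_orders": rng.integers(1, 50, n),
    "number_items": rng.integers(1, 200, n),
    "number_segments": rng.integers(1, 5, n),
    "year": rng.integers(2015, 2020, n),
    "month": rng.integers(1, 13, n),
    "day": rng.integers(1, 29, n),
})
# Made-up linear target so the regression has something to fit.
demo["sales_per_day"] = (
    3.0 * demo["number_orders"] + 0.5 * demo["number_items"] + rng.normal(0, 1, n)
)

cols_to_scale = ["number_orders", "number_items", "number_segments"]
X = demo.drop(columns="sales_per_day")
y = demo["sales_per_day"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=77
)

# Scale only the listed columns; remainder="passthrough" keeps year/month/day as-is.
preprocess = ColumnTransformer(
    [("scale", StandardScaler(), cols_to_scale)],
    remainder="passthrough",
)
model = Pipeline([("preprocess", preprocess), ("regression", LinearRegression())])

# fit() learns the scaling statistics from the training rows only;
# score() reuses those statistics when transforming the test rows.
model.fit(X_train, y_train)
r2 = model.score(X_test, y_test)

# Transformed training features: the scaled columns come first,
# followed by the passthrough columns in their original order.
X_train_scaled = model.named_steps["preprocess"].transform(X_train)
```

With this layout the scaler never sees the test rows during fitting, and the unscaled columns reach the regression unchanged. Note that the transformer returns a NumPy array, so column names are lost unless you rebuild a data frame yourself.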