I have the following dataset:
import numpy as np
import pandas as pd
new_data = pd.read_csv('https://raw.githubusercontent.com/michalis0/DataMining_and_MachineLearning/master/data/regression_sales.csv')
I then fit a simple linear regression:
y = np.array(new_data['sales_per_day'])
X = np.array(new_data[['number_orders', 'number_items', 'number_segments', 'year', 'month', 'day']])
X.shape, y.shape
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, train_size=0.8, random_state=77)
from sklearn.linear_model import LinearRegression
regression_model = LinearRegression(fit_intercept=True)
regression_model.fit(X, y)
I now want to standardize 'number_orders', 'number_items' and 'number_segments', and here is what I tried:
from sklearn.preprocessing import StandardScaler
Std_Scaler = StandardScaler()
Std_data = Std_Scaler.fit_transform(X_train)
Std_data = pd.DataFrame(Std_Scaler.transform(X_test), columns=['number_items', 'number_orders', 'number_segments'])
However, I get the following error: ValueError: Wrong number of items passed 6, placement implies 3
The thing is, I only want to standardize those three columns and leave the other three (year, month, day) unchanged, but I can't seem to make it work.
Is there a way to standardize only part of the dataset?
Answer:
You can split your data frame like this:
X = new_data[['number_orders', 'number_items', 'number_segments', 'year', 'month', 'day']]
y = new_data['sales_per_day']
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,
train_size=0.8, random_state = 77)
Define columns to scale:
Cols = ['number_items', 'number_orders', 'number_segments']
Then you need to make a copy since you are modifying the data frame, so something like:
X_train = X_train.copy()
X_test = X_test.copy()
Std_Scaler = StandardScaler()
# Fit the scaler on the training set only, then reuse it on the test set
X_train[Cols] = Std_Scaler.fit_transform(X_train[Cols])
X_test[Cols] = Std_Scaler.transform(X_test[Cols])
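As an alternative to assigning the scaled columns back by hand, scikit-learn's ColumnTransformer can apply StandardScaler to a subset of columns and pass the rest through untouched. The following is only a minimal sketch: it uses made-up random data with the same column names (an assumption, so that it runs without downloading the CSV), and cols_to_scale and other_cols are illustrative names, not part of the original code.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the sales data (made-up values, same column names).
rng = np.random.default_rng(0)
n = 50
X = pd.DataFrame({
    'number_orders': rng.integers(1, 100, n),
    'number_items': rng.integers(1, 500, n),
    'number_segments': rng.integers(1, 10, n),
    'year': rng.integers(2018, 2021, n),
    'month': rng.integers(1, 13, n),
    'day': rng.integers(1, 29, n),
})

cols_to_scale = ['number_orders', 'number_items', 'number_segments']
other_cols = ['year', 'month', 'day']

X_train, X_test = train_test_split(X, test_size=0.2, random_state=77)

# Scale only cols_to_scale; the remaining columns pass through unchanged.
ct = ColumnTransformer(
    [('scale', StandardScaler(), cols_to_scale)],
    remainder='passthrough',
)

# The output puts the scaled columns first, then the passthrough columns.
out_cols = cols_to_scale + other_cols

# Fit on the training split only, then reuse the fitted scaler on the test split.
X_train_std = pd.DataFrame(ct.fit_transform(X_train),
                           columns=out_cols, index=X_train.index)
X_test_std = pd.DataFrame(ct.transform(X_test),
                          columns=out_cols, index=X_test.index)
```

Note that the column order of the result follows the transformer list (scaled columns first), which is why the column names are rebuilt explicitly instead of reusing X.columns.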