Scaler Transform help sklearn

I'm working on a logistic regression assignment and my professor has this code example.

What is the new_x variable and why are we transforming it as a matrix?

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8],
    'Label': ['green', 'green', 'green', 'green', 'red', 'red', 'red', 'red'],
    'Height': [5, 5.5, 5.33, 5.75, 6.00, 5.92, 5.58, 5.92],
    'Weight': [100, 150, 130, 150, 180, 190, 170, 165], 'Foot': [6, 8, 7, 9, 13, 11, 12, 10]},
    columns=['id', 'Height', 'Weight', 'Foot', 'Label'])

X = data[['Height', 'Weight']].values
scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X)
Y = data['Label'].values
log_reg_classifier = LogisticRegression()
log_reg_classifier.fit(X, Y)
new_x = scaler.transform(np.asmatrix([6, 160]))
predicted = log_reg_classifier.predict(new_x)
accuracy = log_reg_classifier.score(X, Y)

CodePudding user response：

Let's take it step by step.

data = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8],
    'Label': ['green', 'green', 'green', 'green', 'red', 'red', 'red', 'red'],
    'Height': [5, 5.5, 5.33, 5.75, 6.00, 5.92, 5.58, 5.92],
    'Weight': [100, 150, 130, 150, 180, 190, 170, 165], 'Foot': [6, 8, 7, 9, 13, 11, 12, 10]},
    columns=['id', 'Height', 'Weight', 'Foot', 'Label'])

You create a DataFrame with the columns ['id', 'Height', 'Weight', 'Foot', 'Label']. This is the full raw data set, not yet the feature matrix the model is trained on.

X = data[['Height', 'Weight']].values

You then obtain a NumPy array that contains only height and weight using data[['Height', 'Weight']].values. See the pandas docs on indexing and selecting data for more info. You can check the size of the feature matrix with X.shape, i.e. (n, 2), where n is the number of samples (8 here).
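To make the shape concrete, here is a minimal sketch that rebuilds the DataFrame from the question and inspects the selected columns. The compact way of writing the labels is only a shorthand for this example.

```python
import pandas as pd

# Same data as in the question, labels written compactly.
data = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7, 8],
    "Height": [5, 5.5, 5.33, 5.75, 6.00, 5.92, 5.58, 5.92],
    "Weight": [100, 150, 130, 150, 180, 190, 170, 165],
    "Foot": [6, 8, 7, 9, 13, 11, 12, 10],
    "Label": ["green"] * 4 + ["red"] * 4,
})

# Selecting with a list of column names keeps the result two-dimensional;
# .values turns it into a plain NumPy array.
X = data[["Height", "Weight"]].values

print(type(X).__name__)  # ndarray
print(X.shape)           # (8, 2): 8 samples, 2 features
```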

X = data[['Height', 'Weight']].values
scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X)
Y = data['Label'].values
log_reg_classifier = LogisticRegression()
log_reg_classifier.fit(X, Y)

After standardization, you use only those two features to train the logistic regression. That is, your classifier is learned on two features (height and weight) only, but on multiple samples. scaler.fit(X) computes the mean and standard deviation of each column, and scaler.transform(X) uses them to rescale the columns. Every classifier in sklearn implements the fit() method to fit the classifier to the training data.
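If it helps to see what the standardization step does, the following sketch compares StandardScaler with the per-column formula (x - mean) / std computed by hand. It uses the Height and Weight values from the question; the variable names X_scaled and manual are made up for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Height and Weight columns from the question.
X = np.array([[5, 100], [5.5, 150], [5.33, 130], [5.75, 150],
              [6.00, 180], [5.92, 190], [5.58, 170], [5.92, 165]])

scaler = StandardScaler().fit(X)   # learns column means and standard deviations
X_scaled = scaler.transform(X)     # applies (x - mean) / std per column

# The same computation by hand; np.std uses the population standard
# deviation by default, which is what StandardScaler uses as well.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_scaled, manual))             # True
print(np.allclose(X_scaled.mean(axis=0), 0.0))   # True: each column has mean 0
print(np.allclose(X_scaled.std(axis=0), 1.0))    # True: each column has std 1
```

Without this step, Weight (values in the hundreds) would be on a much larger scale than Height (values around 5 to 6).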

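As your model is trained on a feature matrix with two features, the sample that you want to predict (new_x) also needs two features. It also has to be two-dimensional, because transform() and predict() expect input of shape (n_samples, n_features). That is why the list is turned into a matrix: np.asmatrix([6, 160]) has shape (1, 2), i.e. one sample with elements [height=6, weight=160]. You scale it with the same scaler that was fitted on the training data, so the new sample is on the scale the model was trained on, and pass it to your trained model. log_reg_classifier.predict(new_x) returns the predicted label. log_reg_classifier.score(X, Y) then assesses the classifier by comparing its predictions for X with the true labels Y and calculating the mean accuracy; since X is the training data, this is the accuracy on the training set. Note that NumPy discourages np.matrix in new code and newer scikit-learn releases may reject it; np.array([[6, 160]]) gives the same (1, 2) shape. Et voilà.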

new_x = scaler.transform(np.asmatrix([6, 160])) 
predicted = log_reg_classifier.predict(new_x) 
accuracy = log_reg_classifier.score(X, Y)
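As a rough sketch of the same prediction step with a plain 2-D array instead of np.asmatrix, assuming the Height and Weight data from the question (the names X_raw, new_x_raw, clf and flat_input_rejected are made up for this example):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Height/Weight values and labels from the question.
X_raw = np.array([[5, 100], [5.5, 150], [5.33, 130], [5.75, 150],
                  [6.00, 180], [5.92, 190], [5.58, 170], [5.92, 165]])
Y = np.array(["green"] * 4 + ["red"] * 4)

scaler = StandardScaler().fit(X_raw)
X = scaler.transform(X_raw)
clf = LogisticRegression().fit(X, Y)

# One sample with two features: the double brackets give shape (1, 2).
new_x_raw = np.array([[6, 160]])
new_x = scaler.transform(new_x_raw)   # reuse the scaler fitted on the training data

predicted = clf.predict(new_x)        # array holding one predicted label
accuracy = clf.score(X, Y)            # mean accuracy on the training data

# A flat 1-D array is rejected, because it is ambiguous whether it is
# one sample with two features or two samples with one feature.
try:
    scaler.transform(np.array([6, 160]))
    flat_input_rejected = False
except ValueError:
    flat_input_rejected = True

print(new_x_raw.shape, predicted.shape, flat_input_rejected)
```

np.array([6, 160]).reshape(1, -1) is another common way to get the same (1, 2) shape from a flat list.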