I have implemented logistic regression from scratch, but when I run the script the algorithm always predicts the wrong label.
I've tried flipping the training outputs and test_output (switching every 1 to 0 and vice versa), but it still predicts the wrong label.
I also noticed that changing the "-" sign to "+" when updating the weights and the bias makes the script predict the label correctly.
What am I doing wrong?
This is the code I've written:
# IMPORTS
import numpy as np

# HYPERPARAMETERS
EPOCHS = 1000
LEARNING_RATE = 0.1

# FUNCTIONS
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(y_pred, training_outputs, m):
    j = - np.sum(training_outputs * np.log(y_pred) + (1 - training_outputs) * np.log(1 - y_pred)) / m
    return j

# ENTRY
if __name__ == "__main__":

    # Training input and output
    x = np.array([[1, 1, 1], [0, 0, 0], [1, 0, 1]])
    training_outputs = np.array([1, 0, 1])

    # Test input and output
    test_input = np.array([[0, 1, 1]])
    test_output = np.array([0])

    # Weights
    w = np.array([0.3, 0.3, 0.3])

    # Bias
    b = 0
    m = 3

    # Training
    for iteration in range(EPOCHS):
        print("Iteration n.", iteration, end="\r")

        # Compute log odds
        z = np.dot(x, w) + b

        # Compute predicted probability
        y_pred = sigmoid(z)

        # Back propagation
        dz = y_pred - training_outputs
        dw = np.dot(x, dz) / m
        db = np.sum(dz) / m

        # Update weights and bias according to the gradient descent algorithm
        w = w - LEARNING_RATE * dw
        b = b - LEARNING_RATE * db

    print("Model trained. Proceeding with model evaluation...")

    # Test
    # Compute log odds
    z = np.dot(test_input, w) + b

    # Compute predicted probability
    y_pred = sigmoid(z)
    print(y_pred)

    # Compute cost
    cost = cost(y_pred, test_output, m)
    print(cost)
Answer:
There was an incorrect assumption pointed out by @J_H:
>>> from sklearn.linear_model import LogisticRegression
>>> import numpy as np
>>> x = np.array([[1, 1, 1], [0, 0, 0], [1, 0, 1]])
>>> y = np.array([1, 0, 1])
>>> clf = LogisticRegression().fit(x, y)
>>> clf.predict([[0, 1, 1]])
array([1])
scikit-learn appears to believe that test_output should be a 1 rather than a 0.
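A quick way to confirm this, assuming the same fitted clf from the session above, is to inspect the predicted probabilities:

>>> clf.predict_proba([[0, 1, 1]])  # columns are P(y=0), P(y=1); the second exceeds 0.5, consistent with predict() returning 1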
A few more recommendations:

- m should be fine to remove (it's a constant, so it could be folded into LEARNING_RATE)
- w should be initialized with one entry per column of x (i.e., x.shape[1])
- dw = np.dot(x, dz) should be np.dot(dz, x) (see the shape check after this list)
- Prediction in logistic regression depends on a threshold, usually 0.5
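The dw point is easiest to see with a non-square design matrix; a minimal shape check (using a hypothetical 4x3 X_demo, not data from the question) looks like this:

X_demo = np.ones((4, 3))   # 4 samples, 3 features
dz_demo = np.ones(4)       # one residual per sample
print(np.dot(dz_demo, X_demo).shape)  # (3,) -- one gradient entry per weight, as required
# np.dot(X_demo, dz_demo) raises a shape error here; it only runs in the question
# because x happens to be 3x3, and even then it sums over the wrong axis.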
Taking this into account would look something like the following.
# Initialize weights and bias
w, b = np.zeros(x.shape[1]), 0

for _ in range(EPOCHS):
    # Compute log odds
    z = np.dot(x, w) + b

    # Compute predicted probability
    y_pred = sigmoid(z)

    # Back propagation
    dz = y_pred - training_outputs
    dw = np.dot(dz, x)
    db = np.sum(dz)

    # Update
    w = w - LEARNING_RATE * dw
    b = b - LEARNING_RATE * db

# Test
z = np.dot(test_input, w) + b
test_pred = sigmoid(z) >= 0.5
print(test_pred)
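For completeness, the question's cost helper can still be used after training; a minimal corrected sketch, averaging with np.mean instead of the hard-coded m (which would be wrong for the 1-sample test set), would be:

def cost(y_pred, y_true):
    # Average binary cross-entropy over however many samples are passed in
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))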
And a complete example on random train/test sets created with sklearn.datasets.make_classification could look like the following, which usually lands within a few decimal places of the scikit-learn implementation's accuracy as well:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np

EPOCHS = 100
LEARNING_RATE = 0.01

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

if __name__ == "__main__":
    X, y = make_classification(n_samples=1000, n_features=5)
    X_train, X_test, y_train, y_test = train_test_split(X, y)

    # Initialize `w` and `b`
    w, b = np.zeros(X.shape[1]), 0

    for _ in range(EPOCHS):
        z = np.dot(X_train, w) + b
        y_pred = sigmoid(z)
        dz = y_pred - y_train
        dw = np.dot(dz, X_train)
        db = np.sum(dz)
        w = w - LEARNING_RATE * dw
        b = b - LEARNING_RATE * db

    # Test
    z = np.dot(X_test, w) + b
    test_pred = sigmoid(z) >= 0.5
    print(accuracy_score(y_test, test_pred))
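To sanity-check the "within a few decimal places" claim, a direct comparison against scikit-learn on the same split could be appended at the end of the __main__ block (a sketch assuming the variables above are in scope; LogisticRegression is the only extra import):

    # Fit scikit-learn's implementation on the same split and compare accuracies
    from sklearn.linear_model import LogisticRegression
    clf = LogisticRegression().fit(X_train, y_train)
    print(accuracy_score(y_test, clf.predict(X_test)))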