Given an array of numbers from 1 to 20 (X_train) and an array of binary values 0 or 1 (y_train), I pass them to the logistic regression algorithm and train the model. Trying to predict with the X_test below gives me incorrect results.
I created the sample train and test data as shown below. Please suggest what's wrong with the code.
import numpy as np
X_train = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], dtype=float).reshape(-1, 1)
y_train = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0], dtype=float)
X_test = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 55, 88, 99, 100], dtype=float).reshape(-1, 1)
from sklearn import linear_model
logreg = linear_model.LogisticRegression()
logreg.fit(X_train, y_train)
y_predict = logreg.predict(X_test)
print(y_predict)
Output:
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 0.]
CodePudding user response:
A similar question was already asked here. I would like to use this post as inspiration for my solution.
But first let me mention two things:
Logistic regression is very attractive in terms of training time, performance and explainability if there is an approximately linear relationship between your feature(s) and the log-odds of the label, but that is obviously not the case in your example. You want to estimate a discontinuous function that equals one if your input is odd and zero otherwise, which is not easily achieved with a single linear decision boundary.
Your data representation is not good. I think this point is the more critical one for your prediction goal, since a better data representation directly leads to a better prediction (see the short sketch below).
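Just to illustrate that point, here is a quick sketch with a hand-crafted x % 2 parity feature (this is only my own toy example, not the representation I use further down); a single engineered feature already makes the classes trivially separable:
import numpy as np
from sklearn.linear_model import LogisticRegression
X_train = np.arange(1, 21).reshape(-1, 1)
y_train = (X_train.ravel() % 2).astype(float)  # 1 for odd, 0 for even, as in the question
X_test = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 55, 88, 99, 100]).reshape(-1, 1)
# Hypothetical feature engineering: replace each number by its parity
clf = LogisticRegression()
clf.fit(X_train % 2, y_train)
print(clf.predict(X_test % 2))  # alternates correctly: [1. 0. 1. 0. ...]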
Next, I would like to share an alternative data representation. This new representation yields perfect prediction results, even for a simple, untuned logistic regression.
Code:
import numpy as np
from sklearn import linear_model
X_train = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20])
y_train = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0], dtype=float)
X_test = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 55, 88, 99, 100])
def convert_repr(x):
    # Represent each integer by the list of its 16 binary digits;
    # the last digit directly encodes whether the number is odd or even.
    return [int(bit) for bit in format(x, '016b')]
# Change data representation
X_train = np.array(list(map(convert_repr, X_train)))
X_test = np.array(list(map(convert_repr, X_test)))
logreg = linear_model.LogisticRegression()
logreg.fit(X_train, y_train)
y_predict = logreg.predict(X_test)
print(y_predict)
Output:
[1. 0. 1. 0. 1. 0. 1. 0. 1. 0. 1. 0. 1. 0.]
As you can see, the data is more important than the actual model.
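To make the new representation concrete, here is what convert_repr returns for a couple of inputs; only the last binary digit differs between odd and even numbers, which is exactly the signal the classifier needs:
print(convert_repr(5))   # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1] -> last digit 1: odd
print(convert_repr(88))  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0] -> last digit 0: even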
CodePudding user response:
It is an ill-posed task: for logistic regression to work, there would have to be some point on the x axis that separates the high-probability region from the low-probability region of the target class (which is exactly the behaviour you see in your test output). With alternating odd/even labels there is obviously no such point, so the fitted model cannot be correct and learning is completely unstable.
You either need better features, or a more complex model that can capture the more complicated decision space.
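A quick way to see this (a sketch, assuming the same X_train and y_train as in the question) is to inspect the fitted coefficient and the predicted probabilities, which hover around 0.5 over the whole training range:
import numpy as np
from sklearn.linear_model import LogisticRegression
X_train = np.arange(1, 21, dtype=float).reshape(-1, 1)
y_train = X_train.ravel() % 2  # 1 for odd, 0 for even
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
# A single coefficient can only place one threshold on the x axis,
# so the model has no way to reflect the alternating labels.
print(logreg.coef_, logreg.intercept_)
print(logreg.predict_proba(X_train)[:, 1].round(3))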
CodePudding user response:
I don't think your code is wrong. Your model is just too simple to learn such a complex non-linear behaviour. If you instead use a decision tree, which is more flexible, it can fit the training data better:
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
import graphviz
import numpy as np
X_train = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], dtype=float).reshape(-1, 1)
y_train = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0], dtype=float)
X_test = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 55, 88, 99, 100], dtype=float).reshape(-1, 1)
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
y_predict = clf.predict(X_test)
print(y_predict)
Output:
[1. 0. 1. 0. 1. 0. 1. 0. 1. 0. 0. 0. 0. 0.]
You can also visualize the tree with:
dot_data = tree.export_graphviz(clf, out_file=None)
graph = graphviz.Source(dot_data)
graph.render("Odd-Even")
It will generate a PDF file of the tree. For this you need to install the graphviz package with pip install graphviz (the Graphviz binaries also have to be available on your system).
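If you would rather avoid the Graphviz system dependency, sklearn's built-in plot_tree can also render the fitted tree (a quick sketch, assuming matplotlib is installed):
import matplotlib.pyplot as plt
from sklearn import tree
plt.figure(figsize=(10, 6))
tree.plot_tree(clf, filled=True)  # clf is the fitted DecisionTreeClassifier from above
plt.savefig("odd_even_tree.png")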