Given an array of numbers from 1 to 20 (X_train) and an array of binary values 0 or 1 (y_train), I pass them to the logistic regression algorithm and train the model. Trying to predict with the X_test below gives me incorrect results.
I created the sample train and test data as shown below. Please suggest what's wrong with the code.
import numpy as np
X_train = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], dtype=float).reshape(-1, 1)
y_train = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0], dtype=float)
X_test = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 55, 88, 99, 100], dtype=float).reshape(-1, 1)
from sklearn import linear_model
logreg = linear_model.LogisticRegression()
logreg.fit(X_train, y_train)
y_predict = logreg.predict(X_test)
print(y_predict)
Output:
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 0.]
CodePudding user response:
A similar question was already asked here. I would like to use this post as inspiration for my solution.
But first let me mention two things:
Logistic regression is very attractive in terms of training time, performance and explainability if there is an approximately linear relationship between your feature(s) and the log-odds of the label, but that is obviously not the case in your example. You want to estimate a discontinuous function that equals one if your input is odd and zero otherwise, which is not easily achieved with a single linear decision boundary.
Your data representation is not good. I think this point is the more critical one for your prediction goal, since a better data representation directly leads to a better prediction (see the short sketch below).
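Just to illustrate that point, here is a quick sketch with a hand-crafted x % 2 parity feature (this is only my own toy example, not the representation I use further down); a single engineered feature already makes the classes trivially separable:
import numpy as np
from sklearn.linear_model import LogisticRegression
X_train = np.arange(1, 21).reshape(-1, 1)
y_train = (X_train.ravel() % 2).astype(float)  # 1 for odd, 0 for even, as in the question
X_test = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 55, 88, 99, 100]).reshape(-1, 1)
# Hypothetical feature engineering: replace each number by its parity
clf = LogisticRegression()
clf.fit(X_train % 2, y_train)
print(clf.predict(X_test % 2))  # alternates correctly: [1. 0. 1. 0. ...]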
Next, I would like to share an alternative data representation. This new representation yields perfect prediction results, even for a simple, untuned logistic regression.
Code:
import numpy as np
from sklearn import linear_model
X_train = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20])
y_train = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0], dtype=float)
X_test = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 55, 88, 99, 100])
def convert_repr(x):
    # Represent each integer by the list of its 16 binary digits;
    # the last digit directly encodes whether the number is odd or even.
    return [int(bit) for bit in format(x, '016b')]
# Change data representation
X_train = np.array(list(map(convert_repr, X_train)))
X_test = np.array(list(map(convert_repr, X_test)))
logreg = linear_model.LogisticRegression()
logreg.fit(X_train, y_train)
y_predict = logreg.predict(X_test)
print(y_predict)
Output:
[1. 0. 1. 0. 1. 0. 1. 0. 1. 0. 1. 0. 1. 0.]
As you can see, the data is more important than the actual model.
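To make the new representation concrete, here is what convert_repr returns for a couple of inputs; only the last binary digit differs between odd and even numbers, which is exactly the signal the classifier needs:
print(convert_repr(5))   # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1] -> last digit 1: odd
print(convert_repr(88))  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0] -> last digit 0: even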
CodePudding user response:
It is an ill-posed task: for logistic regression to work, there would have to be some point on the x axis that separates the high-probability region from the low-probability region of the target class (which is exactly the behaviour you see in your test output). With alternating odd/even labels there is obviously no such point, so the fitted model cannot be correct and learning is completely unstable.
You either need better features, or a more complex model that can capture the more complicated decision space.
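A quick way to see this (a sketch, assuming the same X_train and y_train as in the question) is to inspect the fitted coefficient and the predicted probabilities, which hover around 0.5 over the whole training range:
import numpy as np
from sklearn.linear_model import LogisticRegression
X_train = np.arange(1, 21, dtype=float).reshape(-1, 1)
y_train = X_train.ravel() % 2  # 1 for odd, 0 for even
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
# A single coefficient can only place one threshold on the x axis,
# so the model has no way to reflect the alternating labels.
print(logreg.coef_, logreg.intercept_)
print(logreg.predict_proba(X_train)[:, 1].round(3))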
CodePudding user response:
I don't think your code is wrong. Your model is just too simple to learn such a complex non-linear behaviour. If you instead use a decision tree, which is more flexible, it can fit the training data better:
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
import graphviz
import numpy as np
X_train = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], dtype=float).reshape(-1, 1)
y_train = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0], dtype=float)
X_test = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 55, 88, 99, 100], dtype=float).reshape(-1, 1)
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
y_predict = clf.predict(X_test)
print(y_predict)
Output:
[1. 0. 1. 0. 1. 0. 1. 0. 1. 0. 0. 0. 0. 0.]
You can also visualize the tree with:
dot_data = tree.export_graphviz(clf, out_file=None)
graph = graphviz.Source(dot_data)
graph.render("Odd-Even")
It will generate a PDF file of the tree. For this you need to install the graphviz package with pip install graphviz (the Graphviz binaries also have to be available on your system).
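If you would rather avoid the Graphviz system dependency, sklearn's built-in plot_tree can also render the fitted tree (a quick sketch, assuming matplotlib is installed):
import matplotlib.pyplot as plt
from sklearn import tree
plt.figure(figsize=(10, 6))
tree.plot_tree(clf, filled=True)  # clf is the fitted DecisionTreeClassifier from above
plt.savefig("odd_even_tree.png")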