I tried to fit a logistic regression using both the sklearn and statsmodels libraries. Their results are close, but not the same. For example, the (slope, intercept) pair obtained by sklearn is (-0.84371207, 1.43255005), while the pair obtained by statsmodels is (-0.8501, 1.4468). Why do they differ, and how can I make them match?
import pandas as pd
import statsmodels.api as sm
from sklearn import linear_model
# Part I: sklearn logistic
url = "https://github.com/pcsanwald/kaggle-titanic/raw/master/train.csv"
titanic_train = pd.read_csv(url)
train_X = titanic_train[["pclass"]].copy()  # keep as DataFrame; copy so a column can be added later
train_Y = titanic_train["survived"]
model_1 = linear_model.LogisticRegression(solver = 'lbfgs')
model_1.fit(train_X, train_Y)
print(model_1.coef_) # print slopes
print(model_1.intercept_ ) # print intercept
# Part II: statsmodels logistic
train_X['intercept'] = 1
model_2 = sm.Logit(train_Y, train_X)
result = model_2.fit(method='lbfgs')  # the solver is chosen in fit(), not in the Logit constructor
print(result.summary2())
Answer:
Sklearn's LogisticRegression applies L2 regularisation by default (with C=1.0), while statsmodels' Logit fits an unpenalised maximum-likelihood model, so their estimates differ slightly. Pass penalty=None (penalty='none' in sklearn versions before 1.2) to the sklearn model and rerun; the two sets of coefficients should then agree.