I am trying to calculate the variance of the coefficients for logistic regression using bootstrap and I am using scikit-learn and statsmodels to compare results. I am using the Default dataset from the ISLR website which can be found in the zip forlder here or here as a plain csv file. I am using the following codes to perform the bootstrap:
Import the Dataset and create the response variable
default_df = pd.read_csv("./Machine-Learning-Books-With-Python/Introduction to Statistical Learning/data/default.csv")
default_df['default_01'] = np.where(default_df.default == 'Yes', 1, 0)
Next, I am defining the boot function which will take care of the random sampling for my dataset:
def boot(X, bootSample_size=None):
'''
Sampling observations from a dataframe
Parameters
------------
X : pandas dataframe
Data to be resampled
bootSample_size: int, optional
Dimension of the bootstrapped samples
Returns
------------
bootSample_X : pandas dataframe
Resampled data
Examples
----------
To resample data from the X dataframe:
>> boot(X)
The resampled data will have length equal to len(X).
To resample data from the X dataframe in order to have length 5:
>> boot(X,5)
References
------------
http://nbviewer.jupyter.org/gist/aflaxman/6871948
'''
#assign default size if non-specified
if bootSample_size == None:
bootSample_size = len(X)
#create random integers to use as indices for bootstrap sample based on original data
bootSample_i = (np.random.rand(bootSample_size)*len(X)).astype(int)
bootSample_i = np.array(bootSample_i)
bootSample_X = X.iloc[bootSample_i]
return bootSample_X
Finally, I define two functions that will perform the logistic regression and extract the parameters:
Using statsmodels
def boot_fn(data):
lr = smf.logit(formula='default_01 ~ income balance', data=data).fit(disp=0)
return lr.params
Using scikit-learn
def boot_fn2(data):
X = data[['income', 'balance']]
y = data.default_01
logit = LogisticRegression(C = 1e9)
logit.fit(X, y)
return logit.coef_
Finally the loop to run the functions 100 times and store the results:
coef_sk = []
coef_sm = []
for _ in np.arange(100):
data = boot(default_df)
coef_sk.append(boot_fn2(data))
coef_sm.append(boot_fn(data))
Taking the mean for both coef_sk (scikit-learn) and coef_sm (statsmodels) I see that the one generated using statsmodels is much closer to the real value and also for different runs the scikit-learn coefficients appear to diverge quite a lot from the actual value. Could you please explain why this happens? I find it very confusing since I would expect that for the same datasets, results should be the same (at least marginally different). However, in this case, results differ a lot, which leads me to believe that there is something wrong with the way I am running the sk-learn version. Would appreciate any help and I would be more than happy to provide additional clarifications.
CodePudding user response:
Although you set the C parameter to be high to minimize, sklearn by default uses lbfgs
solver to find your optimal parameters while statsmodels uses newton
.
You can try doing this to get similar coefficients:
def boot_fn2(data):
X = data[['income', 'balance']]
y = data.default_01
logit = LogisticRegression(penalty="none",max_iter=1000,solver = "newton-cg")
logit.fit(X, y)
return logit.coef_
If I run this with the above function:
coef_sk = []
coef_sm = []
for _ in np.arange(50):
data = boot(default_df)
coef_sk.append(boot_fn2(data))
coef_sm.append(boot_fn(data))
You will instantly see that it throws a lot of warnings about being unable to converge:
LineSearchWarning: The line search algorithm did not converge
Although the coefficients are similar now, it points to a larger issue with your dataset, similar to this question
np.array(coef_sm)[:,1:].mean(axis=0)
array([2.14570133e-05, 5.68280785e-03])
np.array(coef_sk).mean(axis=0)
array([[2.14352318e-05, 5.68116402e-03]])
Your dependent variables are quite huge and this poses a problem for the optimization methods available in sklearn. You can just scale two of your dependent variables down if you want to interpret the coefficients:
default_df[['balance','income']] = default_df[['balance','income']]/100
Else it's always a good practice to scale your independent variables first and apply the regression:
from sklearn.preprocessing import StandardScaler
default_df[['balance','income']] = StandardScaler().fit_transform(default_df[['balance','income']])
def boot_fn(data):
lr = smf.logit(formula='default_01 ~ income balance', data=data).fit(disp=0)
return lr.params
def boot_fn2(data):
X = data[['income', 'balance']]
y = data.default_01
logit = LogisticRegression(penalty="none")
logit.fit(X, y)
return logit.coef_
coef_sk = []
coef_sm = []
for _ in np.arange(50):
data = boot(default_df)
#print(data.default_01.mean())
coef_sk.append(boot_fn2(data))
coef_sm.append(boot_fn(data))
Now you'll see the coefficients are similar:
np.array(coef_sm)[:,1:].mean(axis=0)
array([0.26517582, 2.71598194])
np.array(coef_sk).mean(axis=0)
array([[0.26517504, 2.71598548]])