I am attempting to estimate the parameters of a logistic regression model "by hand" on the iris
dataset via minimisation of the cross-entropy. Please note that the dataset has been modified so that there are
only two classes, Setosa and Other, and the four numeric columns have been normalised via the scale
function:
library(dplyr)
library(optimx)

iriso <- iris %>%
  mutate(Species = ifelse(Species == "setosa", "setosa", "other")) %>%
  mutate(Species_n = ifelse(Species == "setosa", 1, 0)) %>%
  as.data.frame()

iriso[, 1:4] <- scale(iriso[, 1:4])
My understanding is that, if everything is correct, the optimisation should converge to the same set of parameters every time, no matter the starting point of the optimisation algorithm. However, the parameter estimates are bouncing around from run to run:
First, some function definitions:
X <- model.matrix(~ ., data = iriso[, 1:4])
Y <- model.matrix(~ 0 + Species_n, data = iriso)

e <- exp(1)
sigmoid <- function(y){
  1/(1 + e^-y)
}
# w is an array of weights; x is the matrix of observations
logistique <- function(w, x){
  sigmoid(
    y = w[1]*x[,1] + w[2]*x[,2] + w[3]*x[,3] + w[4]*x[,4] + w[5]*x[,5]
  )
}
# y is the vector of observed values
entropie <- function(w, y, x){
  prob_pred <- logistique(w = w, x = x)
  -sum(
    y*log(prob_pred) + (1-y)*log(1-prob_pred)
  )
}
Optimisation step:
for(i in 1:5){
  w0 <- rnorm(n = 5) # set of initial parameters
  optimx(par = w0, fn = entropie,
         method = "Nelder-Mead",
         y = iriso$Species_n, x = X) %>%
    print()
}
I cannot seem to understand why I am not obtaining consistent answers. Is there something wrong with the code above? Is there a concept I am not aware of? Something I missed?
Thanks.
CodePudding user response:
The main issue is that you have "complete separation" in your dataset: with those predictors, you can identify Species_n
without any error at all. In this situation the logistic model has no MLE; the fit keeps improving as the estimated coefficients get more extreme in the right direction.
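A quick way to confirm this, assuming the iriso data frame built in your question: fit the same model with R's built-in glm, which typically warns "glm.fit: fitted probabilities numerically 0 or 1 occurred" under complete separation.

# Fit the same model with glm(); under complete separation it emits the
# "fitted probabilities numerically 0 or 1 occurred" warning, and the
# coefficients and their standard errors blow up.
fit <- glm(Species_n ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
           data = iriso, family = binomial)
summary(fit)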
The way to detect this is to look at the predicted probabilities or logits. When I ran your model once, I got estimates that were
[1] -21.208757 -3.827454 4.601657 -5.271226 -25.119453
These estimates give y values (the logits) with absolute values all greater than 13, so the probabilities will be essentially zero or one.
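To see that concretely, here is a quick check using the estimates printed above (that w is just the run I happened to get; yours will differ):

# Compute the logits for one set of fitted weights; X and sigmoid() are
# as defined in the question.
w <- c(-21.208757, -3.827454, 4.601657, -5.271226, -25.119453)
eta <- drop(X %*% w)   # the y values passed to sigmoid()
range(abs(eta))        # all greater than 13
range(sigmoid(eta))    # so the probabilities are numerically 0 or 1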
A couple of other minor issues:

- There's no point in calculating e^-y; you can just use exp(-y) and will get the same result a bit quicker.
- Calculating log(prob_pred) and log(1-prob_pred) directly is likely to be quite inaccurate, or give overflows or underflows. It's better to work out those expressions analytically so that you don't get rounding error at extreme values of prob_pred. For example, 1 - prob_pred = exp(-y)/(1 + exp(-y)), so log(1 - prob_pred) = -y - log(1 + exp(-y)) = -y - log1p(exp(-y)).
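Putting that together, here is a minimal sketch of a stabilised entropie (my rewrite, not part of your code) that works on the linear predictor directly, using -log(prob_pred) = log1p(exp(-y)) and -log(1 - prob_pred) = y + log1p(exp(-y)):

# log(1 + exp(z)) computed without overflow: for z > 0, factor out z so the
# exp() argument is always non-positive.
log1pexp <- function(z) ifelse(z > 0, z + log1p(exp(-z)), log1p(exp(z)))

# Stable cross-entropy: never forms prob_pred, so there is no log(0).
# -log(p) = log1pexp(-eta) and -log(1 - p) = eta + log1pexp(-eta),
# which combine to log1pexp(-eta) + (1 - y)*eta.
entropie_stable <- function(w, y, x){
  eta <- drop(x %*% w)   # linear predictor (the logits)
  sum(log1pexp(-eta) + (1 - y) * eta)
}

Note that this only fixes the numerical accuracy. The separation problem remains: the minimum is still at infinitely extreme coefficients, so Nelder-Mead will still wander to different answers depending on the starting point.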