I am attempting to estimate the parameters of a logistic regression model "by hand" on the iris
dataset via minimisation of the cross-entropy. Please note that the dataset has been modified so that there are
only two classes, Setosa and Other, and the four numeric columns have been normalised via the scale
function:
library(dplyr)
library(optimx)

iriso <- iris %>%
  mutate(Species = ifelse(Species == "setosa", "setosa", "other")) %>%
  mutate(Species_n = ifelse(Species == "setosa", 1, 0)) %>%
  as.data.frame()

iriso[, 1:4] <- scale(iriso[, 1:4])
My understanding is that, if everything is correct, the optimisation should converge to the same set of parameters every time, no matter the starting point of the optimisation algorithm. However, the parameter estimates are bouncing around from run to run:
First, some function definitions:
X <- model.matrix(~ ., data = iriso[, 1:4])
Y <- model.matrix(~ 0 + Species_n, data = iriso)

e <- exp(1)
sigmoid <- function(y){
  1/(1 + e^-y)
}
# w is an array of weights; x is the matrix of observations
logistique <- function(w, x){
  sigmoid(
    y = w[1]*x[,1] + w[2]*x[,2] + w[3]*x[,3] + w[4]*x[,4] + w[5]*x[,5]
  )
}
# y is the vector of observed values
entropie <- function(w, y, x){
  prob_pred <- logistique(w = w, x = x)
  -sum(
    y*log(prob_pred) + (1-y)*log(1-prob_pred)
  )
}
Optimisation step:
for(i in 1:5){
  w0 <- rnorm(n = 5) # set of initial parameters
  optimx(par = w0, fn = entropie,
         method = "Nelder-Mead",
         y = iriso$Species_n, x = X) %>%
    print()
}
I cannot seem to understand why I am not obtaining consistent answers. Is there something wrong with the code above? Is there a concept I am not aware of? Something I missed?
Thanks.
CodePudding user response:
The main issue is that you have "complete separation" in your dataset: with those predictors, you can identify Species_n
without any error at all. In this situation the logistic model has no MLE; the fit keeps improving as the estimated coefficients get more extreme in the right direction.
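A quick way to confirm this, assuming the iriso data frame built in your question: fit the same model with R's built-in glm, which typically warns "glm.fit: fitted probabilities numerically 0 or 1 occurred" under complete separation.

# Fit the same model with glm(); under complete separation it emits the
# "fitted probabilities numerically 0 or 1 occurred" warning, and the
# coefficients and their standard errors blow up.
fit <- glm(Species_n ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
           data = iriso, family = binomial)
summary(fit)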
The way to detect this is to look at the predicted probabilities or logits. When I ran your model once, I got estimates that were
[1] -21.208757 -3.827454 4.601657 -5.271226 -25.119453
These estimates give y values (the logits) with absolute values all greater than 13, so the probabilities will be essentially zero or one.
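To see that concretely, here is a quick check using the estimates printed above (that w is just the run I happened to get; yours will differ):

# Compute the logits for one set of fitted weights; X and sigmoid() are
# as defined in the question.
w <- c(-21.208757, -3.827454, 4.601657, -5.271226, -25.119453)
eta <- drop(X %*% w)   # the y values passed to sigmoid()
range(abs(eta))        # all greater than 13
range(sigmoid(eta))    # so the probabilities are numerically 0 or 1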
A couple of other minor issues:

- There's no point in calculating e^-y; you can just use exp(-y) and will get the same result a bit quicker.
- Calculating log(prob_pred) and log(1-prob_pred) directly is likely to be quite inaccurate, or give overflows or underflows. It's better to work out those expressions analytically so that you don't get rounding error at extreme values of prob_pred. For example, 1 - prob_pred = exp(-y)/(1 + exp(-y)), so log(1 - prob_pred) = -y - log(1 + exp(-y)) = -y - log1p(exp(-y)).
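Putting that together, here is a minimal sketch of a stabilised entropie (my rewrite, not part of your code) that works on the linear predictor directly, using -log(prob_pred) = log1p(exp(-y)) and -log(1 - prob_pred) = y + log1p(exp(-y)):

# log(1 + exp(z)) computed without overflow: for z > 0, factor out z so the
# exp() argument is always non-positive.
log1pexp <- function(z) ifelse(z > 0, z + log1p(exp(-z)), log1p(exp(z)))

# Stable cross-entropy: never forms prob_pred, so there is no log(0).
# -log(p) = log1pexp(-eta) and -log(1 - p) = eta + log1pexp(-eta),
# which combine to log1pexp(-eta) + (1 - y)*eta.
entropie_stable <- function(w, y, x){
  eta <- drop(x %*% w)   # linear predictor (the logits)
  sum(log1pexp(-eta) + (1 - y) * eta)
}

Note that this only fixes the numerical accuracy. The separation problem remains: the minimum is still at infinitely extreme coefficients, so Nelder-Mead will still wander to different answers depending on the starting point.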