Entropy calculation gives NaN - is applying na.omit a valid tweak?-CodePudding

By definition, the entropy is defined as:

entropy <- function (p) sum(-p * log(p))

I'm performing LCA using the poLCA package and trying to calculate entropy, which for some of my models are outputting NaN.

error_prior <- entropy(lca2$P) # Class proportions model 2
error_post <- mean(apply(lca2$posterior, 1, entropy), na.rm = TRUE)
results[2,8] <- round(((error_prior - error_post) / error_prior), 3)

From the answer to this question: Entropy output is NaN for some class solutions and not others, I learnt that it is caused by zeros in p and it can be resolved by adding na.omit to the function as follows:

entropy <- function (p) sum(na.omit(-p * log(p)))

My question is - is this technical tweak mathematically valid without affecting the integrity of the calculation?

In my case, around 1/3 of the values in p are zeros. I'm really unsure if I should use na.omit or find another way to resolve this problem.

CodePudding user response：

It is valid, but not transparent at first glance. The reason is that the mathematical limit of xlog(x) as x -> 0 is 0 (we can prove this using L'Hospital Rule). In this regard, the most robust definition of the function should be

entropy.safe <- function (p) {
  if (any(p > 1 | p < 0)) stop("probability must be between 0 and 1")
  log.p <- numeric(length(p))
  safe <- p != 0
  log.p[safe] <- log(p[safe])
  sum(-p * log.p)
}

But simply dropping p = 0 cases gives identical results, because the result at p = 0 is 0 and contributes nothing to the sum anyway.

entropy.brutal <- function (p) {
  if (any(p > 1 | p < 0)) stop("probability must be between 0 and 1")
  log.p <- log(p)
  ## as same as sum(na.omit(-p * log.p))
  sum(-p * log.p, na.rm = TRUE)
}

## p has a single 0
( p <- seq(0, 1, by = 0.1) )
#[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
entropy.brutal(p)
#[1] 2.455935
entropy.safe(p)
#[1] 2.455935

## half of p are zeros
p[1:5] <- 0
p
#[1] 0.0 0.0 0.0 0.0 0.0 0.5 0.6 0.7 0.8 0.9 1.0
entropy.brutal(p)
#[1] 1.176081
entropy.safe(p)
#[1] 1.176081

In conclusion, we can use either entropy.brutal or entropy.safe.