Home > Blockchain >  Negative Binomial model offset seems to be creating a 2 level factor
Negative Binomial model offset seems to be creating a 2 level factor

Time:08-02

I am trying to fit some data to a negative binomial model and run a pairwise comparison using emmeans. The data has two different sample sizes, 15 and 20 (num_sample in the example below).
I have set up two data frames: good.data which produces the expected result of offset() using random sample sizes between 15 and 20, and bad.data using a sample size of either 15 or 20, which seems to produce a factor of either 15 or 20. The bad.data pairwise comparison produces way too many comparisons compared to the good.data, even though they should produce the same number?

set.seed(1)
library(dplyr)
library(emmeans)
library(MASS)
# make data that works
data.frame(site=c(rep("A",24),
                  rep("B",24),
                  rep("C",24),
                  rep("D",24),
                  rep("E",24)),
           trt_time=rep(rep(c(10,20,30),8),5),
           pre_trt=rep(rep(c(rep("N",3),rep("Y",3)),4),5),
           storage_time=rep(c(rep(0,6),rep(30,6),rep(60,6),rep(90,6)),5),
           num_sample=sample(c(15,17,20),24*5,T),# more than 2 sample sizes...
           bad=sample(c(1:7),24*5,T,c(0.6,0.1,0.1,0.05,0.05,0.05,0.05)))->good.data
# make data that doesn't work
data.frame(site=c(rep("A",24),
                  rep("B",24),
                  rep("C",24),
                  rep("D",24),
                  rep("E",24)),
           trt_time=rep(rep(c(10,20,30),8),5),
           pre_trt=rep(rep(c(rep("N",3),rep("Y",3)),4),5),
           storage_time=rep(c(rep(0,6),rep(30,6),rep(60,6),rep(90,6)),5),
           num_sample=sample(c(15,20),24*5,T),# only 2 sample sizes...
           bad=sample(c(1:7),24*5,T,c(0.6,0.1,0.1,0.05,0.05,0.05,0.05)))->bad.data
# fit models
good.data%>%
  mutate(trt_time=factor(trt_time),
         pre_trt=factor(pre_trt),
         storage_time=factor(storage_time))%>%
  MASS::glm.nb(bad~trt_time:pre_trt:storage_time offset(log(num_sample)),
               data=.)->mod.good
bad.data%>%
  mutate(trt_time=factor(trt_time),
         pre_trt=factor(pre_trt),
         storage_time=factor(storage_time))%>%
  MASS::glm.nb(bad~trt_time:pre_trt:storage_time offset(log(num_sample)),
               data=.)->mod.bad
  
# pairwise comparison
emmeans::emmeans(mod.good,pairwise~trt_time:pre_trt:storage_time offset(log(num_sample)))$contrasts%>%as.data.frame()
emmeans::emmeans(mod.bad,pairwise~trt_time:pre_trt:storage_time offset(log(num_sample)))$contrasts%>%as.data.frame()

CodePudding user response:

To be honest I'm not exactly sure what's going on with the expansion (276, the 'correct' number of contrasts, is choose(24,2), the 'incorrect' number of contrasts is 1128 = choose(48,2)), but I would say that you should probably be following the guidance in the "offsets" section of one of the emmeans vignettes where it says

If a model is fitted and its formula includes an offset() term, then by default, the offset is computed and included in the reference grid. ... However, many users would like to ignore the offset for this kind of model, because then the estimates we obtain are rates per unit value of the (logged) offset. This may be accomplished by specifying an offset parameter in the call ...

The most natural choice for setting the offset is to 0 (i.e. make predictions etc. for a sample size of 1), but in this case I don't think it matters.

get_contr <- function(x) as_tibble(x$contrasts)
cfun <- function(m) {
   emmeans::emmeans(m,
      pairwise~trt_time:pre_trt:storage_time, offset=0) |>
    get_contr()
}
nrow(cfun(mod.good))  ## 276
nrow(cfun(mod.bad))   ## 276

From a statistical point of view I question the wisdom of looking at 276 pairwise comparisons, but that's a different issue ...

CodePudding user response:

First , I think you should look up how to use emmeans.The intent is not to give a duplicate of the model formula, but rather to specify which factors you want the marginal means of.

However, that is not the issue here. What emmeans does first is to setup a reference grid that consists of all combinations of

  • the levels of each factor
  • the average of each numeric predictor; except if a numeric predictor has just two different values, then both its values are included.

It is that exception you have run against. Since num_samples has just 2 values of 15 and 20, both levels are kept separate rather than averaged. If you want them averaged, add cov.keep = 1 to the emmeans call. It has nothing to do with offsets you specify in emmeans-related functions; it has to do with the fact that num_samples is a predictor in your model.

The reason for the exception is that a lot of people specify models with indicator variables (e.g., female having values of 1 if true and 0 if false) in place of factors. We generally want those treated like factors rather than numeric predictors.

  • Related