I have been using glm to run logistic regressions, but I need to switch to svyglm because my weights are probability weights. However, I don't really understand the svyglm documentation.
As an example, I am using this data setup, where pweight equals each cylinder group's share of the sample, so the weights say the sample is perfectly representative:
data("mtcars")
mtcars$pweight <- ifelse(mtcars$cyl == 4, (11/32), ifelse(mtcars$cyl == 6, (7/32), (14/32)))
mtcars$goodcar <- rbinom(n=32, size=1, prob=0.3)
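A quick sanity check on those weights (an illustration, not part of the original setup): each pweight value is just the sampling fraction of its cylinder group, which you can confirm from the group counts.

```r
# Group shares of the 32-row sample; these match the pweight values above
# (11/32, 7/32, 14/32 for 4, 6 and 8 cylinders respectively).
data("mtcars")
table(mtcars$cyl) / nrow(mtcars)
#>       4       6       8
#> 0.34375 0.21875 0.43750
```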
If I use the ordinary (non-survey) glm command for the regression, I get the following result:
mylogit <- glm(goodcar ~ mpg + disp + hp, data = mtcars, family = "binomial", weights = pweight)
summary(mylogit)
Call:
glm(formula = goodcar ~ mpg + disp + hp, family = "binomial",
    data = mtcars)
Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.6633 -0.7404 -0.4651  0.8658  1.9408
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.279806   4.800398  -0.058   0.9535
mpg          0.016486   0.145431   0.113   0.9097
disp         0.020622   0.009257   2.228   0.0259 *
hp          -0.041091   0.021168  -1.941   0.0522 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 39.75 on 31 degrees of freedom
Residual deviance: 31.27 on 28 degrees of freedom
AIC: 39.27
Number of Fisher Scoring iterations: 5
Based on the above, I tried to set it up like this:
design <- svydesign(
  id = ~0,            # no clusters
  data = mtcars,
  weights = ~pweight
)
mysvyglm <- svyglm(goodcar ~ mpg + disp + hp, data = mtcars, design, family = "binomial")
summary(mysvyglm)
Call:
svyglm(formula = goodcar ~ mpg + disp + hp, design = design,
    family = "binomial", data = mtcars)
Survey design:
svydesign(id = ~0, data = mtcars, weights = ~pweight)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.756867   4.655046   0.163   0.8720
mpg         -0.019180   0.150333  -0.128   0.8994
disp         0.017538   0.008974   1.954   0.0607 .
hp          -0.037779   0.017197  -2.197   0.0365 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 0.8997231)
Number of Fisher Scoring iterations: 5
Am I wrong to think that the glm and the svyglm should be the same in this case, as the pweights show the sample is representative? I feel like I'm missing something and getting this setup wrong.
CodePudding user response:
The mtcars dataset is small: it has only 32 observations. Let's generate a larger dataset instead. The simulation is kept simple and does not sample observations in proportion to the inverse of the weights.
I've also added a third model, geepack::geeglm, which can fit a GLM with inverse-probability weighting.
With the larger dataset, all three models have the same estimates for the coefficients and their standard errors.
Part 1. Generate data
library("survey")
library("geepack")
library("broom")      # provides tidy(); installed with but not attached by the tidyverse
library("tidyverse")
set.seed(1234)
inv_logit <- function(x) {
plogis(x)
}
n <- 100
# Generate dataset
mydata <-
tibble(
id = seq(n),
x1 = rnorm(n),
x2 = rnorm(n),
x3 = rnorm(n),
y = rbinom(n,
size = 1,
prob = inv_logit(0.1 * x1 + 0.2 * x2 + 0.3 * x3) + rnorm(n, sd = 0.1)
),
pweight = runif(n)
) %>%
# geeglm requires that the data be ordered by time (or observation) within each cluster
arrange(id)
Part 2. Fit models
fit.glm <- glm(
  y ~ x1 + x2 + x3,
weights = pweight,
family = "binomial",
data = mydata
)
#> Warning in eval(family$initialize): non-integer #successes in a binomial glm!
tidy(fit.glm)
#> # A tibble: 4 × 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) -0.319 0.340 -0.938 0.348
#> 2 x1 0.638 0.362 1.77 0.0776
#> 3 x2 0.591 0.372 1.59 0.112
#> 4 x3 0.972 0.446 2.18 0.0291
design <- svydesign(
id = ~1,
weights = ~pweight,
data = mydata
)
fit.svyglm <- svyglm(
  y ~ x1 + x2 + x3,
design = design,
family = "binomial",
data = mydata
)
#> Warning in eval(family$initialize): non-integer #successes in a binomial glm!
tidy(fit.svyglm)
#> # A tibble: 4 × 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) -0.319 0.269 -1.18 0.239
#> 2 x1 0.638 0.293 2.18 0.0317
#> 3 x2 0.591 0.297 1.99 0.0497
#> 4 x3 0.972 0.329 2.95 0.00397
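Note that glm and svyglm agree on the point estimates; only the standard errors differ, because svyglm replaces glm's model-based variance with a design-based (sandwich/linearization) one. As a rough cross-check (a sketch, assuming the sandwich package is installed; it is not used in the original answer), a heteroskedasticity-robust covariance applied to the weighted glm fit gives standard errors close to svyglm's.

```r
# Robust (HC0 sandwich) standard errors for the weighted glm fit; these are
# close to the svyglm standard errors above, up to a small finite-sample factor.
library(sandwich)
sqrt(diag(vcovHC(fit.glm, type = "HC0")))
```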
fit.geeglm <- geeglm(
  y ~ x1 + x2 + x3,
id = mydata$id,
weights = pweight,
family = "binomial",
corstr = "independence",
data = mydata
)
#> Warning in eval(family$initialize): non-integer #successes in a binomial glm!
tidy(fit.geeglm)
#> # A tibble: 4 × 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) -0.319 0.268 1.42 0.234
#> 2 x1 0.638 0.291 4.80 0.0284
#> 3 x2 0.591 0.296 3.99 0.0457
#> 4 x3 0.972 0.328 8.80 0.00301
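Finally, the three fits can be lined up side by side (a convenience sketch using the objects created above):

```r
# Combine the tidy() summaries; the estimates are identical across models,
# and svyglm and geeglm report essentially the same robust standard errors.
bind_rows(
  glm    = tidy(fit.glm),
  svyglm = tidy(fit.svyglm),
  geeglm = tidy(fit.geeglm),
  .id = "model"
) %>%
  select(model, term, estimate, std.error) %>%
  arrange(term, model)
```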
Created on 2022-03-16 by the reprex package (v2.0.1)