Home > database >  Incorrect results using the new `svyquantile()` with `svyby()`
Incorrect results using the new `svyquantile()` with `svyby()`

Time:12-26

In July of 2021, the survey package overhauled the svyquantile() function. Quoting from the package's NEWS:

svyquantile() has been COMPLETELY REWRITTEN. The old version is available as oldsvyquantile().

The svyquantile() was being used in one of my packages when this change occurred. A co-author updated the code to retain use of the old version of the function for continuity. Now that it's been 1.5 years since the release, we're updating my code to use the new version. However, we're am experiencing an issue with the new version when combining its usage with svyby().

When we use the updated function, the results for all levels of the by variable come out equivalent. Also, the returned data frame duplicates each of the columns (once for each level of the by variable).

My guess is that rather than showing the results for the by variable in each row, the results are mistakenly being added as new columns (with repeated column names, and duplicated rows).

Am I doing something wrong with svyby(), or is this a bug in the code? The survey package is so widely used and I am no expert in its use, so I lean toward there being an issue with my code. But I am having trouble diagnosing the problem.

Thank you!

suppressPackageStartupMessages(library(survey))
packageVersion("survey")
#> [1] '4.1.1'
data(api)

# OLD svyquantile, results as expected
svyby(
  # svyby args
  formula = ~api00,
  by = ~both,
  design = svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc),
  FUN = oldsvyquantile,
  # args being passed to oldsvyquantile
  na.rm = TRUE,
  keep.var = FALSE,
  quantiles = 0.5
)
#>     both statistic
#> No    No     631.0
#> Yes  Yes     653.5


# NEW svyquantile, I don't understand what I am doing wrong
svyby(
  # svyby args
  formula = ~api00,
  by = ~both,
  design = svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc),
  FUN = svyquantile,
  # args being passed to svyquantile
  na.rm = TRUE,
  keep.var = FALSE,
  quantiles = 0.5
)
#>     both statistic.statistic.api00.quantile statistic.statistic.api00.ci.2.5
#> No    No                                631                              547
#> Yes  Yes                                631                              547
#>     statistic.statistic.api00.ci.97.5 statistic.statistic.api00.se
#> No                                722                     39.75493
#> Yes                               722                     39.75493
#>     statistic.statistic.api00.quantile statistic.statistic.api00.ci.2.5
#> No                                 655                              566
#> Yes                                655                              566
#>     statistic.statistic.api00.ci.97.5 statistic.statistic.api00.se
#> No                                717                     34.94774
#> Yes                               717                     34.94774

# results are as expected when used outside `svyby()`
svyquantile(
  x = ~api00,
  design = svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc),
  na.rm = TRUE,
  keep.var = FALSE,
  quantiles = 0.5
)
#> $api00
#>     quantile ci.2.5 ci.97.5       se
#> 0.5      652    561     714 35.66788
#> 
#> attr(,"hasci")
#> [1] TRUE
#> attr(,"class")
#> [1] "newsvyquantile"

Created on 2022-12-25 with reprex v2.0.2

CodePudding user response:

It's a bug. It's fixed in the development version, on r-forge (https://r-forge.r-project.org/R/?group_id=1788) but that's currently not building for an unrelated reason. The dev version should work again in a day or two.

CodePudding user response:

As the author Thomas Lumley put it on the release news:

There’s a new version of the survey package on CRAN (yay!). A lot of it is minor or relatively esoteric fixes. There’s one big change, which will break some people’s code. The svyquantile function has been completely rewritten.

It is mostly evident with the function itself:

library(survey)
data(api)
design = svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)
svyquantile(~api00, quantiles = 0.5, design = design)

#$api00
#   quantile ci.2.5 ci.97.5       se
#0.5      652    561     714 35.66788

#attr(,"hasci")
#[1] TRUE
#attr(,"class")
#[1] "newsvyquantile"

...the old output was more concise:

oldsvyquantile(~api00, quantiles = 0.5, design = design)
#      0.5
#api00 652 

There are some tweaks needed to update the output. In your case, you only need to remove keep.var = FALSE:

svyby(
   # svyby args
   formula = ~api00,
   by = ~both,
   design = svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc),
   FUN = svyquantile,
   # args being passed to svyquantile
   na.rm = TRUE,
   quantiles = 0.5
 )
#    both api00 se.api00
#No    No   631 39.75493
#Yes  Yes   655 34.94774  

Caution: the calculated quantile may differ between the two functions. Not by a large amount in reasonable sized data, but it can be notorious in small sample sizes (n < 10). This is because the interpolation rule between two data points that may be caught on the data distribution limit defined by the quantile. This rule differs between oldsvyquantile and newsvyquantile. The best resource I've found to resume the interpolation mix-up is this 1997 paper from Hyndman & Fan.

The good thing about newsvyquantile is that it easily lets you choose which interpolation rule you like. The 9 rules from Hyndman & Fan's paper are available.

  • Related