In July of 2021, the survey package overhauled the svyquantile()
function. Quoting from the package's NEWS:
svyquantile()
has been COMPLETELY REWRITTEN. The old version is available asoldsvyquantile()
.
The svyquantile()
was being used in one of my packages when this change occurred. A co-author updated the code to retain use of the old version of the function for continuity. Now that it's been 1.5 years since the release, we're updating my code to use the new version. However, we're am experiencing an issue with the new version when combining its usage with svyby()
.
When we use the updated function, the results for all levels of the by
variable come out equivalent. Also, the returned data frame duplicates each of the columns (once for each level of the by
variable).
My guess is that rather than showing the results for the by
variable in each row, the results are mistakenly being added as new columns (with repeated column names, and duplicated rows).
Am I doing something wrong with svyby()
, or is this a bug in the code? The survey package is so widely used and I am no expert in its use, so I lean toward there being an issue with my code. But I am having trouble diagnosing the problem.
Thank you!
suppressPackageStartupMessages(library(survey))
packageVersion("survey")
#> [1] '4.1.1'
data(api)
# OLD svyquantile, results as expected
svyby(
# svyby args
formula = ~api00,
by = ~both,
design = svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc),
FUN = oldsvyquantile,
# args being passed to oldsvyquantile
na.rm = TRUE,
keep.var = FALSE,
quantiles = 0.5
)
#> both statistic
#> No No 631.0
#> Yes Yes 653.5
# NEW svyquantile, I don't understand what I am doing wrong
svyby(
# svyby args
formula = ~api00,
by = ~both,
design = svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc),
FUN = svyquantile,
# args being passed to svyquantile
na.rm = TRUE,
keep.var = FALSE,
quantiles = 0.5
)
#> both statistic.statistic.api00.quantile statistic.statistic.api00.ci.2.5
#> No No 631 547
#> Yes Yes 631 547
#> statistic.statistic.api00.ci.97.5 statistic.statistic.api00.se
#> No 722 39.75493
#> Yes 722 39.75493
#> statistic.statistic.api00.quantile statistic.statistic.api00.ci.2.5
#> No 655 566
#> Yes 655 566
#> statistic.statistic.api00.ci.97.5 statistic.statistic.api00.se
#> No 717 34.94774
#> Yes 717 34.94774
# results are as expected when used outside `svyby()`
svyquantile(
x = ~api00,
design = svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc),
na.rm = TRUE,
keep.var = FALSE,
quantiles = 0.5
)
#> $api00
#> quantile ci.2.5 ci.97.5 se
#> 0.5 652 561 714 35.66788
#>
#> attr(,"hasci")
#> [1] TRUE
#> attr(,"class")
#> [1] "newsvyquantile"
Created on 2022-12-25 with reprex v2.0.2
CodePudding user response:
It's a bug. It's fixed in the development version, on r-forge (https://r-forge.r-project.org/R/?group_id=1788) but that's currently not building for an unrelated reason. The dev version should work again in a day or two.
CodePudding user response:
As the author Thomas Lumley put it on the release news:
There’s a new version of the survey package on CRAN (yay!). A lot of it is minor or relatively esoteric fixes. There’s one big change, which will break some people’s code. The svyquantile function has been completely rewritten.
It is mostly evident with the function itself:
library(survey)
data(api)
design = svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)
svyquantile(~api00, quantiles = 0.5, design = design)
#$api00
# quantile ci.2.5 ci.97.5 se
#0.5 652 561 714 35.66788
#attr(,"hasci")
#[1] TRUE
#attr(,"class")
#[1] "newsvyquantile"
...the old output was more concise:
oldsvyquantile(~api00, quantiles = 0.5, design = design)
# 0.5
#api00 652
There are some tweaks needed to update the output. In your case, you only need to remove keep.var = FALSE
:
svyby(
# svyby args
formula = ~api00,
by = ~both,
design = svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc),
FUN = svyquantile,
# args being passed to svyquantile
na.rm = TRUE,
quantiles = 0.5
)
# both api00 se.api00
#No No 631 39.75493
#Yes Yes 655 34.94774
Caution: the calculated quantile may differ between the two functions. Not by a large amount in reasonable sized data, but it can be notorious in small sample sizes (n < 10). This is because the interpolation rule between two data points that may be caught on the data distribution limit defined by the quantile. This rule differs between oldsvyquantile
and newsvyquantile
. The best resource I've found to resume the interpolation mix-up is this 1997 paper from Hyndman & Fan.
The good thing about newsvyquantile
is that it easily lets you choose which interpolation rule you like. The 9 rules from Hyndman & Fan's paper are available.