I want to compute a mean and standard deviation, and run a statistical test, across a set of variables. The natural approach seems to be a function that accepts a list of column names.
One possibly complicating factor is that this particular data frame requires functions designed for survey data.
library(dplyr) #for %>%, group_by, summarise
library(radiant.data) #for weighted.sd
library(survey) #survey functions
library(srvyr) #survey functions
#building a df
df <- data.frame("GroupingFactor" = c(1, 1, 0, 0),
                 "VarofInterest1" = c(1, 1, 1, 0),
                 "VarofInterest2" = c(1, 0, 0, 0),
                 "PSU" = c(1, 2, 1, 2),
                 "SAMPWEIGHT" = c(0, 23254, 343, 5652),
                 "STRATA" = c(6133, 6131, 6145, 6152))
options(survey.adjust.domain.lonely=TRUE) #adjusting for the one PSU
options(survey.lonely.psu="adjust")
svy <- svydesign(~PSU, weights = ~SAMPWEIGHT, strata = ~STRATA, data = df, nest = TRUE, check.strata = FALSE) #the design
#here is what i would like to iterate
df %>%
  group_by(GroupingFactor) %>%
  summarise(mean = weighted.mean(VarofInterest1, SAMPWEIGHT, na.rm = TRUE),
            sd = weighted.sd(VarofInterest1, SAMPWEIGHT, na.rm = TRUE)) #for mean and SD
svychisq(~GroupingFactor + VarofInterest1, svy, statistic = 'Chisq') #the test of interest
Everything AFTER creating the svy object is what I'd ideally automate across a list of variables, e.g., VarofInterest2, VarofInterest3, and so on.
The final product is a table/tibble including all the variable names, each one's mean and standard deviation and the output of the Chi-squared test (e.g., test statistic/X-squared and p-value).
I would also take a reference for doing this on non-survey weighted data! (i.e., just running, say, a dozen t-tests using a similar premise of feeding a list of variables you'd like to run the t-test against with a grouping factor).
Edit: Intended output
GroupingFactor | Mean | SD | Statistic | p | Variable |
---|---|---|---|---|---|
0 | .25 | .25 | 341.14 | .014 | VarofInterest1 |
1 | .50 | .00 | N/A | N/A | VarofInterest1 |
OR separate functions/table generating functions, one of just the means/SDs:
GroupingFactor | Mean | SD | Variable |
---|---|---|---|
0 | .50 | .25 | VarofInterest1 |
1 | .25 | .00 | VarofInterest1 |
and then a second with the test statistic and p-values:
Variable | Statistic | p |
---|---|---|
VarofInterest1 | 4131.11 | .001 |
VarofInterest2 | 131.14 | .131 |
Answer:
You can write a function f() that takes the data, the grouping variable, and the variable of interest, and returns the statistics. You would need to modify the example below for survey data, but it might give you a starting point.
f <- function(df, g, v) {
  # capture the unquoted column names as strings for [[ ]] indexing
  v_string = quo_name(enquo(v))
  g_string = quo_name(enquo(g))
  # unweighted chi-squared test of the variable against the grouping factor
  chi_result = chisq.test(df[[v_string]], df[[g_string]])
  df %>%
    group_by({{g}}) %>%
    summarize(Mean = mean({{v}}, na.rm = TRUE), SD = sd({{v}}, na.rm = TRUE)) %>%
    mutate(variable = v_string,
           statistic = chi_result$statistic,
           pvalue = chi_result$p.value)
}
# apply f() to each variable name and stack the results
bind_rows(
  lapply(c("VarofInterest1", "VarofInterest2"), \(i) f(df, GroupingFactor, !!sym(i)))
)
Output (R also warns that the chi-squared approximation may be incorrect with so few rows):
# A tibble: 4 × 6
  GroupingFactor  Mean    SD variable       statistic pvalue
           <dbl> <dbl> <dbl> <chr>              <dbl>  <dbl>
1              0   0.5 0.707 VarofInterest1         0      1
2              1   1   0     VarofInterest1         0      1
3              0   0   0     VarofInterest2         0      1
4              1   0.5 0.707 VarofInterest2         0      1
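For the survey-weighted test, the same loop-over-names idea can be applied to svychisq(). The following is only a minimal sketch, not a tested solution for the data above: svy_chisq_table() is a hypothetical helper name, and the demonstration uses the apiclus1 example data that ships with the survey package rather than the question's df. It builds each formula from strings with reformulate() and returns one row per variable.

```r
library(survey)

# Hypothetical helper (not part of the survey package): run svychisq() of each
# variable in `vars` against `group` and collect one row per variable.
svy_chisq_table <- function(design, group, vars, statistic = "Chisq") {
  rows <- lapply(vars, function(v) {
    # reformulate() turns c("g", "v") into the one-sided formula ~g + v
    test <- svychisq(reformulate(c(group, v)), design, statistic = statistic)
    data.frame(Variable = v,
               Statistic = as.numeric(test$statistic),
               p = as.numeric(test$p.value))
  })
  do.call(rbind, rows)
}

# Demonstration on the example data bundled with the survey package
data(api)
dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)
res <- svy_chisq_table(dclus1, "stype", c("sch.wide", "comp.imp"))
print(res)
```

With the design from the question, the call would presumably be svy_chisq_table(svy, "GroupingFactor", c("VarofInterest1", "VarofInterest2")), although a four-row design with a single PSU per stratum may not give meaningful results. The weighted means and SDs can be obtained by swapping weighted.mean() and weighted.sd() into f() above.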