I have several variables with a large amount of missing data, I am trying to calculate a mean and standard deviation of Z scores across multiple columns, but only if they have >3 of the variables of interest (it does not matter which of the Z scores as long as there were three). The aim is to be able to say which subjects have a mean Z score one or more SDs away from a reference groups mean Z score. My data is structured like below (but a lot larger)
Subject | Zscore1 | Zscore2 | Zscore3 | Zscore4 | Zscore5 | Group |
---|---|---|---|---|---|---|
as4 | -2.4 | -1.2 | NA | -2.3 | 1.6 | ER |
as1 | -2.6 | -1.4 | -0.7 | -1.4 | -1.4 | TL |
as8 | -2.7 | NA | -0.1 | NA | -1.1 | TL |
as10 | NA | -1.7 | -0.9 | 2.3 | 1.7 | ER |
So i would like to code such as:
IF there are 3 or more Z scores present What is the mean and SD of the Z-scores
Any help would be appreciated.
CodePudding user response:
Do you need something like this:
library(dplyr)
df %>%
mutate(not_na_count = rowSums(!is.na(select(., -c(Subject, Group))))) %>%
filter(not_na_count >3) %>%
rowwise() %>%
mutate(avg = mean(c_across(2:6), na.rm = TRUE),
sd = sd(c_across(2:6), na.rm = TRUE)
)
Subject Zscore1 Zscore2 Zscore3 Zscore4 Zscore5 Group not_na_count avg sd
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 as4 -2.4 -1.2 NA -2.3 1.6 ER 4 -1.08 1.86
2 as1 -2.6 -1.4 -0.7 -1.4 -1.4 TL 5 -1.5 0.686
3 as10 NA -1.7 -0.9 2.3 1.7 ER 4 0.35 1.95