I have the following dataframe:
> df <- data.frame(name = c("Bill","Sally","John","Lucy","Jake","Rob","Sarah"),
score1 = c(1,2,1,4,5,3,4),
score2 = c(4,2,3,2,1,3,NA),
score3 = c(3,4,1,4,NA,NA,NA),
score4 = c(4,3,4,NA,NA,NA,NA),
score5 = c(4,4,NA,NA,NA,NA,NA),
score6 = c(1,NA,NA,NA,NA,NA,NA))
> df
name score1 score2 score3 score4 score5 score6
1 Bill 1 4 3 4 4 1
2 Sally 2 2 4 3 4 NA
3 John 1 3 1 4 NA NA
4 Lucy 4 2 4 NA NA NA
5 Jake 5 1 NA NA NA NA
6 Rob 3 3 NA NA NA NA
7 Sarah 4 NA NA NA NA NA
I would like to create a column that is the mean of the scores, only for rows with at least 4 non-NA values. I would like to do this within dplyr and do so in a way that I specify the score columns (my df has 100 columns) rather than, say, excluding the name column. The output should be:
df
name score1 score2 score3 score4 score5 score6 avg_score
1 Bill 1 4 3 4 4 1 2.833
2 Sally 2 2 4 3 4 NA 3.000
3 John 1 3 1 4 NA NA 2.250
4 Lucy 4 2 4 NA NA NA NA
5 Jake 5 1 NA NA NA NA NA
6 Rob 3 3 NA NA NA NA NA
7 Sarah 4 NA NA NA NA NA NA
A bit more detail if it's helpful: I'm calculating single-item measures of factors composed of multiple Likert-type items in a survey. I only want to include people who completed a majority of the items (i.e., answered a majority of questions) within a given factor. I have tried nesting filter() within mutate() , using mutate() and case_when(), rowwise() and group_by(), and a variety of other approaches, none of which have been successful. Any help you can provide would be greatly appreciated!
CodePudding user response:
library(dplyr)
df %>%
mutate(
row_not_na = rowSums(!is.na(across(starts_with("score")))),
row_mean = rowMeans(across(starts_with("score")), na.rm = TRUE),
result = ifelse(row_not_na >= 4, row_mean, NA)
)
# name score1 score2 score3 score4 score5 score6 row_not_na row_mean result
# 1 Bill 1 4 3 4 4 1 6 2.833333 2.833333
# 2 Sally 2 2 4 3 4 NA 5 3.000000 3.000000
# 3 John 1 3 1 4 NA NA 4 2.250000 2.250000
# 4 Lucy 4 2 4 NA NA NA 3 3.333333 NA
# 5 Jake 5 1 NA NA NA NA 2 3.000000 NA
# 6 Rob 3 3 NA NA NA NA 2 3.000000 NA
# 7 Sarah 4 NA NA NA NA NA 1 4.000000 NA
You could, of course, drop the row_not_na
and row_mean
columns, or use a long single ifelse()
line to not name them in the first place.