Home > Back-end >  Find conditional mean across rows for selected columns in dplyr
Find conditional mean across rows for selected columns in dplyr

Time:08-09

I have the following dataframe:

> df <- data.frame(name = c("Bill","Sally","John","Lucy","Jake","Rob","Sarah"),
                  score1 = c(1,2,1,4,5,3,4),
                  score2 = c(4,2,3,2,1,3,NA),
                  score3 = c(3,4,1,4,NA,NA,NA),
                  score4 = c(4,3,4,NA,NA,NA,NA),
                  score5 = c(4,4,NA,NA,NA,NA,NA),
                  score6 = c(1,NA,NA,NA,NA,NA,NA))
> df
   name score1 score2 score3 score4 score5 score6
1  Bill      1      4      3      4      4      1
2 Sally      2      2      4      3      4     NA
3  John      1      3      1      4     NA     NA
4  Lucy      4      2      4     NA     NA     NA
5  Jake      5      1     NA     NA     NA     NA
6   Rob      3      3     NA     NA     NA     NA
7 Sarah      4     NA     NA     NA     NA     NA

I would like to create a column that is the mean of the scores, only for rows with at least 4 non-NA values. I would like to do this within dplyr and do so in a way that I specify the score columns (my df has 100 columns) rather than, say, excluding the name column. The output should be:

df
   name score1 score2 score3 score4 score5 score6 avg_score
1  Bill      1      4      3      4      4      1     2.833
2 Sally      2      2      4      3      4     NA     3.000
3  John      1      3      1      4     NA     NA     2.250
4  Lucy      4      2      4     NA     NA     NA        NA
5  Jake      5      1     NA     NA     NA     NA        NA
6   Rob      3      3     NA     NA     NA     NA        NA
7 Sarah      4     NA     NA     NA     NA     NA        NA

A bit more detail if it's helpful: I'm calculating single-item measures of factors composed of multiple Likert-type items in a survey. I only want to include people who completed a majority of the items (i.e., answered a majority of questions) within a given factor. I have tried nesting filter() within mutate() , using mutate() and case_when(), rowwise() and group_by(), and a variety of other approaches, none of which have been successful. Any help you can provide would be greatly appreciated!

CodePudding user response:

library(dplyr)
df %>% 
  mutate(
    row_not_na = rowSums(!is.na(across(starts_with("score")))),
    row_mean = rowMeans(across(starts_with("score")), na.rm = TRUE),
    result = ifelse(row_not_na >= 4, row_mean, NA)
  )
#   name score1 score2 score3 score4 score5 score6 row_not_na row_mean   result
# 1  Bill      1      4      3      4      4      1          6 2.833333 2.833333
# 2 Sally      2      2      4      3      4     NA          5 3.000000 3.000000
# 3  John      1      3      1      4     NA     NA          4 2.250000 2.250000
# 4  Lucy      4      2      4     NA     NA     NA          3 3.333333       NA
# 5  Jake      5      1     NA     NA     NA     NA          2 3.000000       NA
# 6   Rob      3      3     NA     NA     NA     NA          2 3.000000       NA
# 7 Sarah      4     NA     NA     NA     NA     NA          1 4.000000       NA

You could, of course, drop the row_not_na and row_mean columns, or use a long single ifelse() line to not name them in the first place.

  • Related