R remove rows with NA in groups of columns containing the same string-CodePudding

I have a dataframe that contains multiple variables each measured with multiple items at two different time points. What I want to remove all rows with NA entries in groups of columns containing the same part of a string. Some of these groups contain multiple columns (e.g., grep("learn"), some only one (e.g., T1_age. This is my original dataframe (a part of it):

data <- data.frame(
      T1_age = c(39, 30, 20, 48, 27, 55, 37, 50, 50, 37),
      T1_sex = c(2, 1, 1, 2, 2, 1, 1, 2, 1, 1),
   T2_learn1 = c(2, NA, 3, 4, 1, NA, NA, 2, 4, 4),
   T2_learn2 = c(1, NA, 4, 4, 1, NA, NA, 2, 4, 4),
   T2_learn3 = c(2, NA, 4, 4, 1, NA, NA, 3, 4, 4),
   T2_learn4 = c(2, NA, 2, 5, 5, NA, NA, 5, 5, 5),
   T2_learn5 = c(4, NA, 3, 4, 3, NA, NA, 3, 4, 3),
     T2_aut1 = c(NA, NA, 4, 4, 4, NA, NA, 3, 5, 4),
     T2_aut2 = c(NA, NA, 4, 4, 4, NA, NA, 3, 5, 5),
     T2_aut3 = c(NA, NA, 4, 4, 3, NA, NA, 3, 5, 5),
    T2_ssup1 = c(1, NA, 4, 5, 4, NA, NA, 2, 4, 3),
    T2_ssup2 = c(3, NA, 4, 5, 5, NA, NA, 3, 4, 4),
    T2_ssup3 = c(4, NA, 4, 5, 5, NA, NA, 4, 4, 4),
    T2_ssup4 = c(2, NA, 3, 5, 5, NA, NA, 3, 4, 4),
   T3_learn1 = c(3, NA, NA, 4, 4, NA, NA, 3, 3, 4),
   T3_learn2 = c(1, NA, NA, 4, 3, NA, NA, 3, 3, 4),
   T3_learn3 = c(3, NA, NA, 4, 4, NA, NA, 3, 3, 5),
   T3_learn4 = c(4, NA, NA, 5, 4, NA, NA, 4, 5, 5),
   T3_learn5 = c(4, NA, NA, 3, 4, NA, NA, 3, 3, 4),
     T3_aut1 = c(NA, NA, NA, 4, 4, NA, NA, 3, 5, 5),
     T3_aut2 = c(NA, NA, NA, 3, 4, NA, NA, 3, 5, 5),
     T3_aut3 = c(NA, NA, NA, 3, 2, NA, NA, 3, 5, 5),
    T3_ssup1 = c(3, NA, NA, 5, 4, NA, NA, 2, 4, 1),
    T3_ssup2 = c(3, NA, NA, 5, 5, NA, NA, 4, 5, 5),
    T3_ssup3 = c(4, NA, NA, 5, 5, NA, NA, 4, 5, 3),
    T3_ssup4 = c(3, NA, NA, 5, 5, NA, NA, 4, 5, 4)
)

Now I already found a very horrible solution and I believe that could be improved. So this code basically does what I want:

library(dplyr)
library(tidyr)

data <- data %>% filter(rowSums(is.na(.[ , grep("learn", colnames(.))])) != ncol(.[ , grep("learn", colnames(.))]))
data <- data %>% filter(rowSums(is.na(.[ , grep("aut", colnames(.))])) != ncol(.[ , grep("aut", colnames(.))]))
data <- data %>% filter(rowSums(is.na(.[ , grep("ssup", colnames(.))])) != ncol(.[ , grep("ssup", colnames(.))]))
data <- data %>% drop_na(T1_age)
data <- data %>% drop_na(T1_sex)

So the new data frame (and what I want to achieve) looks like this:


data2 <- data.frame(
      T1_age = c(20, 48, 27, 50, 50, 37),
      T1_sex = c(1, 2, 2, 2, 1, 1),
   T2_learn1 = c(3, 4, 1, 2, 4, 4),
   T2_learn2 = c(4, 4, 1, 2, 4, 4),
   T2_learn3 = c(4, 4, 1, 3, 4, 4),
   T2_learn4 = c(2, 5, 5, 5, 5, 5),
   T2_learn5 = c(3, 4, 3, 3, 4, 3),
     T2_aut1 = c(4, 4, 4, 3, 5, 4),
     T2_aut2 = c(4, 4, 4, 3, 5, 5),
     T2_aut3 = c(4, 4, 3, 3, 5, 5),
    T2_ssup1 = c(4, 5, 4, 2, 4, 3),
    T2_ssup2 = c(4, 5, 5, 3, 4, 4),
    T2_ssup3 = c(4, 5, 5, 4, 4, 4),
    T2_ssup4 = c(3, 5, 5, 3, 4, 4),
   T3_learn1 = c(NA, 4, 4, 3, 3, 4),
   T3_learn2 = c(NA, 4, 3, 3, 3, 4),
   T3_learn3 = c(NA, 4, 4, 3, 3, 5),
   T3_learn4 = c(NA, 5, 4, 4, 5, 5),
   T3_learn5 = c(NA, 3, 4, 3, 3, 4),
     T3_aut1 = c(NA, 4, 4, 3, 5, 5),
     T3_aut2 = c(NA, 3, 4, 3, 5, 5),
     T3_aut3 = c(NA, 3, 2, 3, 5, 5),
    T3_ssup1 = c(NA, 5, 4, 2, 4, 1),
    T3_ssup2 = c(NA, 5, 5, 4, 5, 5),
    T3_ssup3 = c(NA, 5, 5, 4, 5, 3),
    T3_ssup4 = c(NA, 5, 5, 4, 5, 4)
        )

Could you help me improve this a bit? Thank you!!!

CodePudding user response：

You may iterate over grep in an sapply and check if the rowSums in the slices reach their number of columns.

V <- c('learn', 'aut', 'ssup')

res <- data[!rowSums(sapply(V, \(v) {
  X <- data[grep(v, names(data))]
  rowSums(is.na(X)) == dim(X)[2]
})), ]

stopifnot(all.equal(res, data2, check.attributes=FALSE))

Or probably just checking if the sums of NA's in the "hot" columns reach the number of columns (without the demographics) is enough.

res1 <- data[rowSums(is.na(data[grep(paste(V, collapse='|'), names(data))])) != 
               dim(data[-(1:2)])[2], ]

stopifnot(all.equal(res1, data2, check.attributes=FALSE))

data2 is the result data frame you provide in OP. dim(data)[2] gives the same as ncol(data).

Note: R version 4.1.2 (2021-11-01)

CodePudding user response：

The code below should work as I have tested through changing a random element to NA. What I did was subsetting the original data into T2 and T3 data (as the respective column numbers are identical), then using vectorization through is.na().

data_T2 <- data %>% subset(select = c(3:14))
data_T3 <- data %>% subset(select = c(15:26))
data[!(is.na(data_T2) & is.na(data_T3))[,1],]