I have a dataframe that contains multiple variables each measured with multiple items at two different time points. What I want to remove all rows with NA entries in groups of columns containing the same part of a string. Some of these groups contain multiple columns (e.g., grep("learn")
, some only one (e.g., T1_age
. This is my original dataframe (a part of it):
data <- data.frame(
T1_age = c(39, 30, 20, 48, 27, 55, 37, 50, 50, 37),
T1_sex = c(2, 1, 1, 2, 2, 1, 1, 2, 1, 1),
T2_learn1 = c(2, NA, 3, 4, 1, NA, NA, 2, 4, 4),
T2_learn2 = c(1, NA, 4, 4, 1, NA, NA, 2, 4, 4),
T2_learn3 = c(2, NA, 4, 4, 1, NA, NA, 3, 4, 4),
T2_learn4 = c(2, NA, 2, 5, 5, NA, NA, 5, 5, 5),
T2_learn5 = c(4, NA, 3, 4, 3, NA, NA, 3, 4, 3),
T2_aut1 = c(NA, NA, 4, 4, 4, NA, NA, 3, 5, 4),
T2_aut2 = c(NA, NA, 4, 4, 4, NA, NA, 3, 5, 5),
T2_aut3 = c(NA, NA, 4, 4, 3, NA, NA, 3, 5, 5),
T2_ssup1 = c(1, NA, 4, 5, 4, NA, NA, 2, 4, 3),
T2_ssup2 = c(3, NA, 4, 5, 5, NA, NA, 3, 4, 4),
T2_ssup3 = c(4, NA, 4, 5, 5, NA, NA, 4, 4, 4),
T2_ssup4 = c(2, NA, 3, 5, 5, NA, NA, 3, 4, 4),
T3_learn1 = c(3, NA, NA, 4, 4, NA, NA, 3, 3, 4),
T3_learn2 = c(1, NA, NA, 4, 3, NA, NA, 3, 3, 4),
T3_learn3 = c(3, NA, NA, 4, 4, NA, NA, 3, 3, 5),
T3_learn4 = c(4, NA, NA, 5, 4, NA, NA, 4, 5, 5),
T3_learn5 = c(4, NA, NA, 3, 4, NA, NA, 3, 3, 4),
T3_aut1 = c(NA, NA, NA, 4, 4, NA, NA, 3, 5, 5),
T3_aut2 = c(NA, NA, NA, 3, 4, NA, NA, 3, 5, 5),
T3_aut3 = c(NA, NA, NA, 3, 2, NA, NA, 3, 5, 5),
T3_ssup1 = c(3, NA, NA, 5, 4, NA, NA, 2, 4, 1),
T3_ssup2 = c(3, NA, NA, 5, 5, NA, NA, 4, 5, 5),
T3_ssup3 = c(4, NA, NA, 5, 5, NA, NA, 4, 5, 3),
T3_ssup4 = c(3, NA, NA, 5, 5, NA, NA, 4, 5, 4)
)
Now I already found a very horrible solution and I believe that could be improved. So this code basically does what I want:
library(dplyr)
library(tidyr)
data <- data %>% filter(rowSums(is.na(.[ , grep("learn", colnames(.))])) != ncol(.[ , grep("learn", colnames(.))]))
data <- data %>% filter(rowSums(is.na(.[ , grep("aut", colnames(.))])) != ncol(.[ , grep("aut", colnames(.))]))
data <- data %>% filter(rowSums(is.na(.[ , grep("ssup", colnames(.))])) != ncol(.[ , grep("ssup", colnames(.))]))
data <- data %>% drop_na(T1_age)
data <- data %>% drop_na(T1_sex)
So the new data frame (and what I want to achieve) looks like this:
data2 <- data.frame(
T1_age = c(20, 48, 27, 50, 50, 37),
T1_sex = c(1, 2, 2, 2, 1, 1),
T2_learn1 = c(3, 4, 1, 2, 4, 4),
T2_learn2 = c(4, 4, 1, 2, 4, 4),
T2_learn3 = c(4, 4, 1, 3, 4, 4),
T2_learn4 = c(2, 5, 5, 5, 5, 5),
T2_learn5 = c(3, 4, 3, 3, 4, 3),
T2_aut1 = c(4, 4, 4, 3, 5, 4),
T2_aut2 = c(4, 4, 4, 3, 5, 5),
T2_aut3 = c(4, 4, 3, 3, 5, 5),
T2_ssup1 = c(4, 5, 4, 2, 4, 3),
T2_ssup2 = c(4, 5, 5, 3, 4, 4),
T2_ssup3 = c(4, 5, 5, 4, 4, 4),
T2_ssup4 = c(3, 5, 5, 3, 4, 4),
T3_learn1 = c(NA, 4, 4, 3, 3, 4),
T3_learn2 = c(NA, 4, 3, 3, 3, 4),
T3_learn3 = c(NA, 4, 4, 3, 3, 5),
T3_learn4 = c(NA, 5, 4, 4, 5, 5),
T3_learn5 = c(NA, 3, 4, 3, 3, 4),
T3_aut1 = c(NA, 4, 4, 3, 5, 5),
T3_aut2 = c(NA, 3, 4, 3, 5, 5),
T3_aut3 = c(NA, 3, 2, 3, 5, 5),
T3_ssup1 = c(NA, 5, 4, 2, 4, 1),
T3_ssup2 = c(NA, 5, 5, 4, 5, 5),
T3_ssup3 = c(NA, 5, 5, 4, 5, 3),
T3_ssup4 = c(NA, 5, 5, 4, 5, 4)
)
Could you help me improve this a bit? Thank you!!!
CodePudding user response:
You may iterate over grep
in an sapply
and check if the rowSums
in the slices reach their number of columns.
V <- c('learn', 'aut', 'ssup')
res <- data[!rowSums(sapply(V, \(v) {
X <- data[grep(v, names(data))]
rowSums(is.na(X)) == dim(X)[2]
})), ]
stopifnot(all.equal(res, data2, check.attributes=FALSE))
Or probably just checking if the sums of NA
's in the "hot" columns reach the number of columns (without the demographics) is enough.
res1 <- data[rowSums(is.na(data[grep(paste(V, collapse='|'), names(data))])) !=
dim(data[-(1:2)])[2], ]
stopifnot(all.equal(res1, data2, check.attributes=FALSE))
data2
is the result data frame you provide in OP. dim(data)[2]
gives the same as ncol(data)
.
Note: R version 4.1.2 (2021-11-01)
CodePudding user response:
The code below should work as I have tested through changing a random element to NA
. What I did was subsetting the original data into T2 and T3 data (as the respective column numbers are identical), then using vectorization through is.na()
.
data_T2 <- data %>% subset(select = c(3:14))
data_T3 <- data %>% subset(select = c(15:26))
data[!(is.na(data_T2) & is.na(data_T3))[,1],]