I have a dateframe with three columns, one of which has a variable for participant ID with no NA values and the other two (target variables) have some scattered throughout. I'm trying to use the solution explained here (remove rows where all columns are NA except 2 columns) to remove rows where both of the target variables have NAs, but for some reason my implementation of it seems to indiscriminately remove all NAs.
Here is a sample of what the unprocessed df looks like:
ID | a | b |
---|---|---|
1 | ab | NA |
1 | NA | ab |
1 | NA | NA |
Here is what I want the processed df to look like:
ID | a | b |
---|---|---|
1 | ab | NA |
1 | NA | ab |
And here is the code I'm using to try to accomplish this:
na_rows = df %>%
select(-"ID") %>%
is.na() %>%
rowSums() > 0
processeddf <- df %>%
filter(!na_rows)
However, this code returns a df which has removed any row containing NA at all. So for the above sample, it would return an empty df. Where am I going wrong here? I can't figure out where my logical error is occurring.
CodePudding user response:
We may use if_all
in filter
- select the columns a to b in if_all
, apply the is.na
(check for NA), the output will be TRUE for a row if both a and b have NA, negate (!
) to convert TRUE-> FALSE and FALSE->TRUE
library(dplyr)
df %>%
filter(!if_all(a:b, is.na))
-output
ID a b
1 1 ab <NA>
2 1 <NA> ab
Or instead of negating (!
), we may use complete.cases
with if_any
df %>%
filter(if_any(a:b, complete.cases))
ID a b
1 1 ab <NA>
2 1 <NA> ab
Regarding the issue in OP's code, the logic is created by looking whether there is atleast one NA (> 0
) which is true for all the rows. Instead, it should be all NA and then negate
na_rows <- df %>%
select(-"ID") %>%
is.na() %>%
{rowSums(.) == ncol(.)}
data
df <- structure(list(ID = c(1L, 1L, 1L), a = c("ab", NA, NA), b = c(NA,
"ab", NA)), class = "data.frame", row.names = c(NA, -3L))
CodePudding user response:
And additionally a data.table solution (thanks for the data akrun).
library(data.table)
dt = structure(list(ID = c(1L, 1L, 1L), a = c("ab", NA, NA),
b = c(NA, "ab", NA)), class = "data.table", row.names = c(NA, -3L))
dt[!(is.na(a) & is.na(b))]
Output:
ID a b
1: 1 ab <NA>
2: 1 <NA> ab