Home > Software engineering >  Selecting Rows with Missing Data in a Range of Columns
Selecting Rows with Missing Data in a Range of Columns

Time:09-05

There are several ways to identify and manipulate individual cells with missing data in R, e.g., with complete.cases or even rowSums.

However, I've not been able to find---or figure out myself---an expedient way to select rows that have missing data within a subsetted range of columns.

For example, in dataframe df:

df <- data.frame(D1 = c('A', 'B', 'C', 'D'),
                 D2 = c(NA, 0, 1, 1),
                 V1 = c(11, NA, 33, NA),
                 V2 = c(111, 222, NA, NA)
                 )
df

# D1  D2  V1  V2
#  A  NA  11 111    
#  B   0  NA 222    
#  C   1  33  NA    
#  D   1  NA  NA    

I would like to select all rows that have missing data in both columns V1 and V2, thus selecting row D but not rows B or C (or A).

I have a larger range of columns than given in that toy example, so selecting a set of columns with, e.g., && could make for a long command.

N.B., a similar SO question addresses selecting rows where none are NSs.

CodePudding user response:

You can try this:

df %>% filter(is.na(V1) & is.na(V2))

OUTPUT

  D1 D2 V1 V2
1  D  1 NA NA

CodePudding user response:

You can use dplyr::if_all. You can select the columns very flexibly with tidyselect, for instance using :, c, starts_with...

library(dplyr)
df %>% 
  filter(if_all(V1:V2, is.na))

#  D1 D2 V1 V2
#1  D  1 NA NA

Also works (this shows the flexibility of tidyselect):

filter(df, if_all(3:4, is.na))
filter(df, if_all(starts_with("V"), is.na))
filter(df, if_all(c(V1, V2), is.na))
filter(df, if_all((last_col()-1):last_col(), is.na))
filter(df, if_all(num_range("V", 1:2), is.na))
  • Related