Filter rows for specific columns in R-CodePudding

I need to filter rows if some of the variables have missing values. Here is an example dataset.

df <- data.frame(id = c(1,2,3,4,5),
                 v1 = c(5,6,7,8,1),
                 v2 = c(5,9,34,2,1),
                 a1 = c(1,NA,NA,2,3),
                 a2 = c(NA,1,NA,8,9))

> df
  id v1 v2 a1 a2
1  1  5  5  1 NA
2  2  6  9 NA  1
3  3  7 34 NA NA
4  4  8  2  2  8
5  5  1  1  3  9

From columns 4 and 5, if there is any missingness, I need to filter them. How can I code by specifying the column number (4th column) to the end of the columns? Because I have multiple and differentiating in the number of columns for different datasets.

How can I get this filtered dataset below?

> df1
  id v1 v2 a1 a2
1  1  5  5  1 NA
2  2  6  9 NA  1
3  3  7 34 NA NA

CodePudding user response：

in base R you could do:

df[!complete.cases(df[4:ncol(df)]),]
  id v1 v2 a1 a2
1  1  5  5  1 NA
2  2  6  9 NA  1
3  3  7 34 NA NA

subset(df, !complete.cases(df[4:ncol(df)]))
  id v1 v2 a1 a2
1  1  5  5  1 NA
2  2  6  9 NA  1
3  3  7 34 NA NA

CodePudding user response：

Base R Approach

Approach 1

Using complete.cases

df[!complete.cases(df[, 4:ncol(df)]),]

Approach 2

Using apply function

df[apply(df[, 4:ncol(df)], 1, FUN = function(x) any(is.na(x))), ]

Approach 3

Using data.table library (Faster Approach)

library(data.table)
setDT(df)
df[!complete.cases(df[, 4:ncol(df)])]

Dataset used

df <- data.frame(id = c(1,2,3,4,5),
                 v1 = c(5,6,7,8,1),
                 v2 = c(5,9,34,2,1),
                 a1 = c(1,NA,NA,2,3),
                 a2 = c(NA,1,NA,8,9))

CodePudding user response：

Another base option:

df[rowSums(is.na(df[, 4:ncol(df)])) > 0, ]

And a faster and less verbose any(is.na(x)) alternative:

df[apply(df[, 4:ncol(df)], 1, anyNA), ]

Output:

  id v1 v2 a1 a2
1  1  5  5  1 NA
2  2  6  9 NA  1
3  3  7 34 NA NA

CodePudding user response：

My contribution is this simple line of code from dplyr

df <- data.frame(id = c(1,2,3,4,5),
                   v1 = c(5,6,7,8,1),
                   v2 = c(5,9,34,2,1),
                   a1 = c(1,NA,NA,2,3),
                   a2 = c(NA,1,NA,8,9))
                   
    new_df <- df %>% 
    filter_all(any_vars(is.na(.)))

print(new_df)

id v1 v2 a1 a2
1  1  5  5  1 NA
2  2  6  9 NA  1
3  3  7 34 NA NA

CodePudding user response：

Use if_any in filter - specify the column index range (:) from 4 to last_col()

library(dplyr)
df %>% 
  filter(if_any(4:last_col(), is.na))
  id v1 v2 a1 a2
1  1  5  5  1 NA
2  2  6  9 NA  1
3  3  7 34 NA NA