I need to filter rows if some of the variables have missing values. Here is an example dataset.
df <- data.frame(id = c(1,2,3,4,5),
v1 = c(5,6,7,8,1),
v2 = c(5,9,34,2,1),
a1 = c(1,NA,NA,2,3),
a2 = c(NA,1,NA,8,9))
> df
id v1 v2 a1 a2
1 1 5 5 1 NA
2 2 6 9 NA 1
3 3 7 34 NA NA
4 4 8 2 2 8
5 5 1 1 3 9
From columns 4 and 5, if there is any missingness, I need to filter them. How can I code by specifying the column number (4th column) to the end of the columns? Because I have multiple and differentiating in the number of columns for different datasets.
How can I get this filtered dataset below?
> df1
id v1 v2 a1 a2
1 1 5 5 1 NA
2 2 6 9 NA 1
3 3 7 34 NA NA
CodePudding user response:
in base R you could do:
df[!complete.cases(df[4:ncol(df)]),]
id v1 v2 a1 a2
1 1 5 5 1 NA
2 2 6 9 NA 1
3 3 7 34 NA NA
subset(df, !complete.cases(df[4:ncol(df)]))
id v1 v2 a1 a2
1 1 5 5 1 NA
2 2 6 9 NA 1
3 3 7 34 NA NA
CodePudding user response:
Base R Approach
Approach 1
Using complete.cases
df[!complete.cases(df[, 4:ncol(df)]),]
Approach 2
Using apply
function
df[apply(df[, 4:ncol(df)], 1, FUN = function(x) any(is.na(x))), ]
Approach 3
Using data.table
library (Faster Approach)
library(data.table)
setDT(df)
df[!complete.cases(df[, 4:ncol(df)])]
Dataset used
df <- data.frame(id = c(1,2,3,4,5),
v1 = c(5,6,7,8,1),
v2 = c(5,9,34,2,1),
a1 = c(1,NA,NA,2,3),
a2 = c(NA,1,NA,8,9))
CodePudding user response:
Another base
option:
df[rowSums(is.na(df[, 4:ncol(df)])) > 0, ]
And a faster and less verbose any(is.na(x))
alternative:
df[apply(df[, 4:ncol(df)], 1, anyNA), ]
Output:
id v1 v2 a1 a2
1 1 5 5 1 NA
2 2 6 9 NA 1
3 3 7 34 NA NA
CodePudding user response:
My contribution is this simple line of code from dplyr
df <- data.frame(id = c(1,2,3,4,5),
v1 = c(5,6,7,8,1),
v2 = c(5,9,34,2,1),
a1 = c(1,NA,NA,2,3),
a2 = c(NA,1,NA,8,9))
new_df <- df %>%
filter_all(any_vars(is.na(.)))
print(new_df)
id v1 v2 a1 a2
1 1 5 5 1 NA
2 2 6 9 NA 1
3 3 7 34 NA NA
CodePudding user response:
Use if_any
in filter
- specify the column index range (:
) from 4 to last_col()
library(dplyr)
df %>%
filter(if_any(4:last_col(), is.na))
id v1 v2 a1 a2
1 1 5 5 1 NA
2 2 6 9 NA 1
3 3 7 34 NA NA