I am trying to drop the columns that have fewer than 5 valid (non-NA) values. Here is an example dataset:
df <- data.frame(id = c(1,2,3,4,5,6,7,8,9,10),
                 i1 = c(0,1,1,1,1,0,0,1,NA,1),
                 i2 = c(1,0,0,1,0,1,1,0,0,NA),
                 i3 = c(NA,NA,NA,NA,NA,NA,NA,NA,NA,0),
                 i4 = c(NA,1,NA,NA,NA,NA,NA,NA,1,NA))
> df
   id i1 i2 i3 i4
1   1  0  1 NA NA
2   2  1  0 NA  1
3   3  1  0 NA NA
4   4  1  1 NA NA
5   5  1  0 NA NA
6   6  0  1 NA NA
7   7  0  1 NA NA
8   8  1  0 NA NA
9   9 NA  0 NA  1
10 10  1 NA  0 NA
In this case, columns i3 and i4 need to be dropped from the data frame.
How can I get the desired dataset below:
> df
   id i1 i2
1   1  0  1
2   2  1  0
3   3  1  0
4   4  1  1
5   5  1  0
6   6  0  1
7   7  0  1
8   8  1  0
9   9 NA  0
10 10  1 NA
CodePudding user response:
You can keep columns with at least 5 non-missing values with:
df[colSums(!is.na(df)) >= 5]
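To see why this works: colSums(!is.na(df)) counts the non-missing values in each column, and indexing a data frame with a single logical vector selects columns. For the example data:

> colSums(!is.na(df))
id i1 i2 i3 i4 
10  9  9  1  2 

so only id, i1, and i2 pass the >= 5 test.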
CodePudding user response:
You can use discard() from the purrr package:
library(purrr)

df <- data.frame(id = c(1,2,3,4,5,6,7,8,9,10),
                 i1 = c(0,1,1,1,1,0,0,1,NA,1),
                 i2 = c(1,0,0,1,0,1,1,0,0,NA),
                 i3 = c(NA,NA,NA,NA,NA,NA,NA,NA,NA,0),
                 i4 = c(NA,1,NA,NA,NA,NA,NA,NA,1,NA))

df %>%
  discard(~ sum(!is.na(.)) < 5)
#>    id i1 i2
#> 1   1  0  1
#> 2   2  1  0
#> 3   3  1  0
#> 4   4  1  1
#> 5   5  1  0
#> 6   6  0  1
#> 7   7  0  1
#> 8   8  1  0
#> 9   9 NA  0
#> 10 10  1 NA
Created on 2022-11-10 with reprex v2.0.2
While this is likely slower than the base R methods (noticeable only for datasets with very many columns, e.g. more than 1000), I find the readability of the code far superior. It also makes more complicated conditions easy to express.
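If the positive phrasing is clearer, keep() from the same package is the complement of discard(), so this sketch should give the same result:

df %>%
  keep(~ sum(!is.na(.)) >= 5)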
CodePudding user response:
Using base R, another approach:
> df[, sapply(df, function(x) sum(!is.na(x))) >= 5]
   id i1 i2
1   1  0  1
2   2  1  0
3   3  1  0
4   4  1  1
5   5  1  0
6   6  0  1
7   7  0  1
8   8  1  0
9   9 NA  0
10 10  1 NA
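A closely related sketch: vapply() works like sapply() here but declares the expected return type, which is safer in programmatic use:

df[vapply(df, function(x) sum(!is.na(x)), integer(1)) >= 5]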
CodePudding user response:
A performance comparison of the different answers given in this post:
library(dplyr)
library(purrr)

ncol = 10000
nrow = 100
df = replicate(ncol, sample(c(1:9, NA), nrow, TRUE)) %>% as_tibble()

funs = list(
  colSums = function(df) {df[colSums(!is.na(df)) >= nrow/10]},
  sapply  = function(df) {df[, sapply(df, function(x) sum(!is.na(x))) >= nrow/10]},
  discard = function(df) {df %>% discard(~ sum(!is.na(.)) < nrow/10)},
  mutate  = function(df) {df %>% mutate(across(where(~ sum(!is.na(.)) < nrow/10), ~ NULL))},
  select  = function(df) {df %>% select(where(~ sum(!is.na(.)) >= nrow/10))})

avrtime = map_dbl(funs, function(f) {
  duration = c()
  for (i in 1:10) {
    t1 = Sys.time()
    f(df)
    t2 = Sys.time()
    duration[i] = as.numeric(t2 - t1)
  }
  return(mean(duration))
})

avrtime[order(avrtime)]
The average time taken by each (in seconds):
colSums sapply discard select mutate
0.04510500 0.04692972 0.29207475 0.29451160 0.31755514
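As an aside, a more robust sketch could use the bench package (an assumption; it was not part of the original comparison), which repeats each expression and accounts for garbage collection:

library(bench)

bench::mark(
  colSums = df[colSums(!is.na(df)) >= nrow/10],
  sapply  = df[, sapply(df, function(x) sum(!is.na(x))) >= nrow/10],
  check = FALSE  # compare timings only, not the returned objects
)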
CodePudding user response:
Using select()
library(dplyr)

df %>%
  select(where(~ sum(complete.cases(.x)) >= 5))
-output
   id i1 i2
1   1  0  1
2   2  1  0
3   3  1  0
4   4  1  1
5   5  1  0
6   6  0  1
7   7  0  1
8   8  1  0
9   9 NA  0
10 10  1 NA
Or in base R
Filter(\(x) sum(complete.cases(x)) >= 5, df)
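Note that the \(x) lambda shorthand requires R 4.1 or later; on older versions, write out the anonymous function:

Filter(function(x) sum(complete.cases(x)) >= 5, df)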