Keep only columns with a certain percentage of values

type_1 type_2 type_3
0      0      1
0      1      0
0      0      1
0      1      1
1      0      1
1      0      1

I want to keep only the columns in which more than 50% of the values are 1. In this example that is just type_3.

How can I do this in dplyr?

CodePudding user response:

You can do:

library(dplyr)

# Keep the columns whose share of 1s is greater than 50%.
dat |>
    select(all_of(
        names(dat)[sapply(dat, \(x) sum(x) / length(x) > 0.5)]
    ))

This works because the only values are 0 and 1, so sum(x) counts the 1s in each column. More generally, to match an arbitrary value, you can do:

VALUE_TO_MATCH <- 1
dat |>
    select(all_of(
        names(dat)[sapply(dat, \(x) sum(x == VALUE_TO_MATCH) / length(x) > 0.5)]
    ))
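If a column contains missing values, sum(x == VALUE_TO_MATCH) becomes NA and the selection breaks. Below is a minimal sketch of one way to deal with that, assuming you want to ignore missing entries when computing the share. The helper name share_exceeds and the NA placed in type_3 are illustrative and not part of the original question.

```r
library(dplyr)

# Hypothetical helper: TRUE when more than `threshold` of the non-missing
# entries of `x` equal `value`. isTRUE() turns an NA result (for example an
# all-NA column) into FALSE so the predicate always returns a single TRUE/FALSE.
share_exceeds <- function(x, value = 1, threshold = 0.5) {
  isTRUE(mean(x == value, na.rm = TRUE) > threshold)
}

# The question's data, with one value in type_3 replaced by NA for illustration.
dat_na <- data.frame(
  type_1 = c(0, 0, 0, 0, 1, 1),
  type_2 = c(0, 1, 0, 1, 0, 0),
  type_3 = c(1, 0, 1, 1, 1, NA)
)

kept <- dat_na |>
  select(where(\(x) share_exceeds(x, value = 1, threshold = 0.5)))

names(kept)
```

Because the comparison is strict, a column in which exactly half of the values match is dropped, which matches the "over 50%" requirement.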

Data

dat <- read.table(text = "type_1 type_2 type_3
0      0      1
0      1      0
0      0      1
0      1      1
1      0      1
1      0      1", header = TRUE)

CodePudding user response:

Another dplyr option using select with where:

df <- read.table(text = "type_1 type_2 type_3
0      0      1
0      1      0
0      0      1
0      1      1
1      0      1
1      0      1", header = TRUE)

library(dplyr)
df %>% 
  select(where(~mean(.) > 0.5))
#>   type_3
#> 1      1
#> 2      0
#> 3      1
#> 4      1
#> 5      1
#> 6      1

Created on 2022-07-25 by the reprex package (v2.0.1)
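The where() predicate above calls mean() on every column, which is a problem if the data frame also holds non-numeric columns. A minimal sketch of a guarded version follows, assuming a mixed data frame. The id column and the name df_mixed are hypothetical additions for illustration.

```r
library(dplyr)

# Assumed example data: the question's columns plus a hypothetical
# character column `id` that is not in the original data.
df_mixed <- data.frame(
  id     = c("a", "b", "c", "d", "e", "f"),
  type_1 = c(0, 0, 0, 0, 1, 1),
  type_2 = c(0, 1, 0, 1, 0, 0),
  type_3 = c(1, 0, 1, 1, 1, 1)
)

# Check the type first; && short-circuits, so mean() only runs on
# numeric columns.
result <- df_mixed %>%
  select(where(~ is.numeric(.x) && mean(.x) > 0.5))

names(result)
```

To keep the identifier column as well, it can be named explicitly alongside the predicate, for example select(id, where(...)).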

Base R option using colMeans:

df <- read.table(text = "type_1 type_2 type_3
0      0      1
0      1      0
0      0      1
0      1      1
1      0      1
1      0      1", header = TRUE)

df[which(colMeans(df) > 0.5)]
#>   type_3
#> 1      1
#> 2      0
#> 3      1
#> 4      1
#> 5      1
#> 6      1

Created on 2022-07-25 by the reprex package (v2.0.1)
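The colMeans approach can also be extended to match a value other than 1, since comparing a data frame with a scalar yields a logical matrix. This is a sketch under assumptions: value_to_match, share, and kept_zero are illustrative names, and na.rm = TRUE assumes missing entries should be ignored.

```r
df <- data.frame(
  type_1 = c(0, 0, 0, 0, 1, 1),
  type_2 = c(0, 1, 0, 1, 0, 0),
  type_3 = c(1, 0, 1, 1, 1, 1)
)

value_to_match <- 0  # illustrative choice; the question itself matches 1

# df == value_to_match is a logical matrix, so colMeans() gives the
# share of matching entries in each column.
share <- colMeans(df == value_to_match, na.rm = TRUE)

# which() converts the logical result to column positions and drops any NA.
kept_zero <- df[which(share > 0.5)]

names(kept_zero)
```

With value_to_match set to 0, the columns kept are type_1 and type_2, each of which is 0 in four of six rows.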