Remove columns in R by two conditions in two different columns

I am using the data.frame below. I need to apply a statistical test (wilcox.test) to each column, comparing the '0' group with the '1' group, but I can only do that if each group has at least 2 non-missing values. How can I remove all columns for which the '0' group or the '1' group has fewer than 2 values? Then I can run my code without errors. In this example the pear and cherry columns would be removed.

 df <- data.frame(group=c(rep(0,10),rep(1,10)),
      apple = as.numeric(c(runif(20, -1, 18))),
      pear = as.numeric(c(rep("NA",12), runif(8, 2, 7))),
      banana = as.numeric(c(runif(10, 1, 3), runif(10, 2.5, 6))),
      cherry = as.numeric(c(runif(9, 5, 12), rep("NA",10), 4.31)),
      kiwi = as.numeric(c(rep("NA",8), runif(12, -1, 6))))

CodePudding user response:

You can use select() together with where() to pick columns with a predicate function. I expected to be able to combine select() with group_by() for this, but dplyr does not seem to support that. A workaround is to do the grouping inside the predicate with tapply() (or ave()):

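library(dplyr)

# Keep a column only if every group has at least 2 non-NA values.
# The \(x) lambda shorthand needs R >= 4.1; use function(x) on older versions.
df %>%
  select(where(~ all(tapply(.x, df$group, \(x) sum(!is.na(x)) >= 2))))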

   group      apple   banana        kiwi
1      0  7.9768511 1.183422          NA
2      0 -0.6611309 1.948172          NA
3      0  0.6690410 1.556230          NA
4      0  1.3582682 1.063583          NA
5      0  4.5359535 2.972903          NA
6      0  8.8755979 2.074685          NA
7      0  2.9280202 1.734720          NA
8      0  7.4065231 1.460041          NA
9      0  0.8837726 1.109268  1.54898128
10     0 -0.9704649 2.447073  4.27753379
11     1  3.2403002 4.839462 -0.88546624
12     1  0.4561026 4.703763  2.50467817
13     1 10.2888012 3.920268  2.62292534
14     1  3.4619229 3.010228  4.67953823
15     1  0.2207555 5.582971  3.71465882
16     1 -0.3694006 3.326906  4.17280678
17     1 13.1442999 3.018943  3.39256613
18     1  6.7433707 2.989773  0.04379258
19     1 16.0372570 2.839262  4.41795547
20     1 15.7012046 2.982483  3.13632483
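
If you would rather avoid the dplyr dependency, the same per-group check can probably be done in base R, and the filtered columns can then be passed straight to wilcox.test. This is only a sketch: the data below is a small deterministic stand-in for the question's data.frame (the values are made up so that the result is reproducible), and has_enough is a hypothetical helper name, not a function from any package.

```r
# Deterministic stand-in for the question's data (made-up values, same NA pattern).
df <- data.frame(
  group  = c(rep(0, 10), rep(1, 10)),
  apple  = seq(1, 20),
  pear   = c(rep(NA, 12), seq(2, 9)),
  banana = seq(0.5, 10, by = 0.5),
  cherry = c(seq(5, 13), rep(NA, 10), 4.31),
  kiwi   = c(rep(NA, 8), seq(1, 12))
)

# Hypothetical helper: TRUE when every level of g has at least
# min_n non-missing values in x.
has_enough <- function(x, g, min_n = 2) {
  all(tapply(!is.na(x), g, sum) >= min_n)
}

# Check every column except the grouping column itself.
value_cols <- setdiff(names(df), "group")
keep <- vapply(df[value_cols], has_enough, logical(1), g = df$group)
df_ok <- df[c("group", value_cols[keep])]

# Compare group 0 with group 1 for each remaining column.
# wilcox.test drops missing values from each sample itself.
p_values <- vapply(
  value_cols[keep],
  function(col) {
    x <- df_ok[[col]]
    wilcox.test(x[df_ok$group == 0], x[df_ok$group == 1])$p.value
  },
  numeric(1)
)

names(df_ok)
```

With this stand-in data, names(df_ok) should list group, apple, banana and kiwi, the same columns the dplyr version keeps. Making min_n an argument means the threshold can be raised later without touching the rest of the code.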