I am using the data.frame shown below. I need to apply a statistical test (wilcox.test) to each column, comparing the '0' group to the '1' group, but I can only do that if each group has at least 2 non-missing values. How can I remove all columns for which the group size of '0' or the group size of '1' is smaller than 2, so that my code runs without errors? In this example the pear and cherry columns would be removed.
df <- data.frame(group  = c(rep(0, 10), rep(1, 10)),
                 apple  = runif(20, -1, 18),
                 pear   = c(rep(NA, 12), runif(8, 2, 7)),
                 banana = c(runif(10, 1, 3), runif(10, 2.5, 6)),
                 cherry = c(runif(9, 5, 12), rep(NA, 10), 4.31),
                 kiwi   = c(rep(NA, 8), runif(12, -1, 6)))
Answer:
You can use select with where to select variables with a function. I expected to be able to combine select with group_by to deal with this issue, but dplyr does not seem to support that. A workaround is to use tapply (or ave) for the grouping:
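library(dplyr)

# Keep a column only if every group has at least 2 non-missing values.
# The \(x) lambda shorthand requires R >= 4.1.
df %>%
  select(where(~ all(tapply(.x, df$group, \(x) sum(!is.na(x)) >= 2))))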
   group      apple   banana        kiwi
1      0  7.9768511 1.183422          NA
2      0 -0.6611309 1.948172          NA
3      0  0.6690410 1.556230          NA
4      0  1.3582682 1.063583          NA
5      0  4.5359535 2.972903          NA
6      0  8.8755979 2.074685          NA
7      0  2.9280202 1.734720          NA
8      0  7.4065231 1.460041          NA
9      0  0.8837726 1.109268  1.54898128
10     0 -0.9704649 2.447073  4.27753379
11     1  3.2403002 4.839462 -0.88546624
12     1  0.4561026 4.703763  2.50467817
13     1 10.2888012 3.920268  2.62292534
14     1  3.4619229 3.010228  4.67953823
15     1  0.2207555 5.582971  3.71465882
16     1 -0.3694006 3.326906  4.17280678
17     1 13.1442999 3.018943  3.39256613
18     1  6.7433707 2.989773  0.04379258
19     1 16.0372570 2.839262  4.41795547
20     1 15.7012046 2.982483  3.13632483
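
If you would rather avoid the dplyr dependency, the same filter can be written in base R. This is only a sketch: the helper name has_enough and its min_n argument are made up for illustration, and the toy data uses fixed values (not the random ones from the question) so that the result is reproducible. It also shows one possible way to run wilcox.test on the columns that remain.

```r
# Deterministic toy data (hypothetical values) with the same NA pattern as the question
df <- data.frame(
  group  = rep(c(0, 1), each = 10),
  apple  = seq(1, 20),
  pear   = c(rep(NA, 12), seq(2, 9)),         # group 0 has no values
  banana = c(seq(1, 10), seq(21, 30)),
  cherry = c(seq(5, 13), rep(NA, 10), 4.31),  # group 1 has a single value
  kiwi   = c(rep(NA, 8), seq(1, 12))          # group 0 has exactly two values
)

# Hypothetical helper: TRUE if every group has at least `min_n` non-missing values
has_enough <- function(col, group, min_n = 2) {
  all(tapply(!is.na(col), group, sum) >= min_n)
}

# Logical vector with one entry per column, then subset the columns
keep <- vapply(df, has_enough, logical(1), group = df$group)
df_kept <- df[, keep, drop = FALSE]
names(df_kept)
# "group" "apple" "banana" "kiwi"

# Run the test on each remaining value column (NAs are dropped by wilcox.test)
value_cols <- setdiff(names(df_kept), "group")
p_values <- vapply(
  value_cols,
  function(nm) {
    x <- df_kept[[nm]]
    wilcox.test(x[df_kept$group == 0], x[df_kept$group == 1])$p.value
  },
  numeric(1)
)
p_values
```

The group column always passes the check against itself, so it is kept automatically and has to be excluded by name before testing.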