I have allele frequencies (column V8) of SNPs (column V1) for different populations (column V9). Each SNP is identified by its ID in column V1. I want to remove every ID whose allele frequency (V8) is 0 or 1 in at least one group (the dataframe is grouped by V9). Specifically, I want to remove that ID from the whole dataframe (all the groups), not only from the group where the condition is met.
V1 V8 V9
1: rs10002235 0.324468 CARIBBEAN
2: rs10002235 0.176471 ADYGEI
3: rs10002235 0.305402 EUR
4: rs10002235 0.240384 AFR
5: rs10002235 0.495604 AMISH
6: rs10002235 0.096153 LATINO
1: rs33333235 0.5 CARIBBEAN
2: rs33333235 0.4 ADYGEI
3: rs33333235 0.3 EUR
4: rs33333235 0.001 AFR
5: rs33333235 0.4 AMISH
6: rs33333235 0.09 LATINO
If the frequency (V8) of any SNP is <0.01 or >0.99 in any (at least one) of the groups specified in V9, that SNP should be dropped from the dataframe. Here, rs33333235 is dropped because its frequency in AFR is 0.001.
Output would be like so:
V1 V8 V9
1: rs10002235 0.324468 CARIBBEAN
2: rs10002235 0.176471 ADYGEI
3: rs10002235 0.305402 EUR
4: rs10002235 0.240384 AFR
5: rs10002235 0.495604 AMISH
6: rs10002235 0.096153 LATINO
CodePudding user response:
If the object is a data.table, group by 'V1', check whether any value in 'V8' fails the frequency thresholds or whether any of the values in the vector (c(0, 1)) are %in% 'V8', negate (!), extract the row sequence (.I), extract the column ($tmp) and use that to subset the rows (assuming floating-point precision is taken care of).
library(data.table)
# keep a V1 group only if no V8 value is < 0.01 or > 0.99 and none is exactly 0 or 1
dt1[dt1[, .(tmp = .I[(!any(V8 > 0.99 | V8 < 0.01)) &
                       !any(c(0, 1) %in% V8)]), by = V1]$tmp]
-output
V1 V8 V9
<char> <num> <char>
1: rs10002235 0.324468 CARIBBEAN
2: rs10002235 0.176471 ADYGEI
3: rs10002235 0.305402 EUR
4: rs10002235 0.240384 AFR
5: rs10002235 0.495604 AMISH
6: rs10002235 0.096153 LATINO
data
dt1 <- structure(list(V1 = c("rs10002235", "rs10002235", "rs10002235",
"rs10002235", "rs10002235", "rs10002235", "rs33333235", "rs33333235",
"rs33333235", "rs33333235", "rs33333235", "rs33333235"), V8 = c(0.324468,
0.176471, 0.305402, 0.240384, 0.495604, 0.096153, 0.5, 0.4, 0.3,
0.001, 0.4, 0.09), V9 = c("CARIBBEAN", "ADYGEI", "EUR", "AFR",
"AMISH", "LATINO", "CARIBBEAN", "ADYGEI", "EUR", "AFR", "AMISH",
"LATINO")), class = c("data.table", "data.frame"), row.names = c(NA,
-12L))
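As a hedged alternative sketch (not part of the answer above), the same per-SNP filter can be written by returning .SD only for the groups that pass. The object dt_demo and its values are a small stand-in made up for illustration, using the same column names as dt1.

```r
library(data.table)

# Hypothetical stand-in for dt1: same columns, fewer rows
dt_demo <- data.table(
  V1 = rep(c("rs10002235", "rs33333235"), each = 3),
  V8 = c(0.324468, 0.176471, 0.305402, 0.5, 0.4, 0.001),
  V9 = rep(c("CARIBBEAN", "ADYGEI", "AFR"), times = 2)
)

# For each SNP (V1), return its rows (.SD) only when no frequency is
# below 0.01 or above 0.99; groups that fail return NULL and are dropped
kept <- dt_demo[, if (!any(V8 < 0.01 | V8 > 0.99)) .SD, by = V1]
```

With this stand-in data, only the rows of rs10002235 should remain, because rs33333235 has a frequency of 0.001 in AFR.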
CodePudding user response:
Consider base R's ave to calculate groupwise max of condition. Then filter on this calculation.
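# flag each row with 1 if any row of the same SNP (V1) is out of range
cond <- ave(
  as.integer(dt1$V8 < 0.01 | dt1$V8 > 0.99),
  dt1$V1,
  FUN = max
)
# keep only the SNPs where no group was flagged
dt1[cond == 0, ]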
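A different base R technique, sketched here under the assumption that the data is a plain data.frame with the same column names, is to collect the offending IDs first and then drop every row carrying one of them. This makes the "remove from all groups" step explicit. The names df_demo, bad_ids and kept are hypothetical, and the values are a small made-up subset.

```r
# Hypothetical stand-in data with the same columns as the question
df_demo <- data.frame(
  V1 = rep(c("rs10002235", "rs33333235"), each = 3),
  V8 = c(0.324468, 0.176471, 0.305402, 0.5, 0.4, 0.001),
  V9 = rep(c("CARIBBEAN", "ADYGEI", "AFR"), times = 2)
)

# Step 1: IDs with at least one frequency below 0.01 or above 0.99
bad_ids <- unique(df_demo$V1[df_demo$V8 < 0.01 | df_demo$V8 > 0.99])

# Step 2: drop every row of those IDs, whatever the population
kept <- df_demo[!df_demo$V1 %in% bad_ids, ]
```

The two-step form also makes it easy to inspect bad_ids before filtering, for example to report which SNPs were excluded.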