Remove ID from all groups if ID meets a condition in at least one group in R

I have SNP allele frequencies (column V8) for different populations (column V9). Each SNP has an ID (column V1). I want to remove the IDs whose allele frequency in V8 is 0 or 1 (in practice, below 0.01 or above 0.99) in at least one group (the dataframe is grouped by V9). Specifically, I want to remove the ID from the whole dataframe (all groups), not only from the group where the condition is met.

            V1       V8        V9
 1: rs10002235 0.324468 CARIBBEAN
 2: rs10002235 0.176471    ADYGEI
 3: rs10002235 0.305402       EUR
 4: rs10002235 0.240384       AFR
 5: rs10002235 0.495604     AMISH
 6: rs10002235 0.096153    LATINO
 7: rs33333235 0.5      CARIBBEAN
 8: rs33333235 0.4         ADYGEI
 9: rs33333235 0.3            EUR
10: rs33333235 0.001          AFR
11: rs33333235 0.4          AMISH
12: rs33333235 0.09        LATINO

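If the frequency (V8) of an ID is < 0.01 or > 0.99 in any (at least one) of the groups specified in V9, that ID should be dropped from the dataframe. Here, rs33333235 has a frequency of 0.001 in AFR, so all of its rows should be removed, while rs10002235 stays.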

Output would be like so:

           V1       V8        V9
1: rs10002235 0.324468 CARIBBEAN
2: rs10002235 0.176471    ADYGEI
3: rs10002235 0.305402       EUR
4: rs10002235 0.240384       AFR
5: rs10002235 0.495604     AMISH
6: rs10002235 0.096153    LATINO

CodePudding user response：

If the object is a data.table, group by 'V1' and check whether any 'V8' value in the group falls outside the 0.01 to 0.99 range or is exactly 0 or 1 (c(0, 1) %in% V8). Negate (!) that check, return the row indices (.I) of the groups that pass, extract them ($tmp), and use them to subset the rows (assuming floating-point precision is taken care of).

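library(data.table)

# For each ID (V1), return its row indices (.I) only when no V8 value is
# out of range or exactly 0/1; then subset dt1 by those indices
dt1[dt1[, .(tmp = .I[(!any(V8 > 0.99 | V8 < 0.01)) &
    !any(c(0, 1) %in% V8)]), by = V1]$tmp]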

-output

        V1       V8        V9
       <char>    <num>    <char>
1: rs10002235 0.324468 CARIBBEAN
2: rs10002235 0.176471    ADYGEI
3: rs10002235 0.305402       EUR
4: rs10002235 0.240384       AFR
5: rs10002235 0.495604     AMISH
6: rs10002235 0.096153    LATINO
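
The same filter can also be written in two steps: collect the offending IDs first, then drop every row that carries one of them. This is only a minimal sketch on toy data; the names dt_demo, bad_ids and kept are hypothetical, and the values are illustrative rather than taken from a real dataset.

```r
library(data.table)

# Toy data with the same column names as the question (illustrative values)
dt_demo <- data.table(
  V1 = rep(c("rs10002235", "rs33333235"), each = 3),
  V8 = c(0.32, 0.18, 0.31, 0.5, 0.001, 0.4),
  V9 = rep(c("CARIBBEAN", "ADYGEI", "EUR"), times = 2)
)

# IDs whose frequency is out of range in at least one population
bad_ids <- dt_demo[V8 < 0.01 | V8 > 0.99, unique(V1)]

# Keep only the rows whose ID never went out of range
kept <- dt_demo[!V1 %in% bad_ids]
```

Splitting the steps makes it easy to inspect bad_ids before anything is removed.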

data

dt1 <- structure(list(V1 = c("rs10002235", "rs10002235", "rs10002235", 
"rs10002235", "rs10002235", "rs10002235", "rs33333235", "rs33333235", 
"rs33333235", "rs33333235", "rs33333235", "rs33333235"), V8 = c(0.324468, 
0.176471, 0.305402, 0.240384, 0.495604, 0.096153, 0.5, 0.4, 0.3, 
0.001, 0.4, 0.09), V9 = c("CARIBBEAN", "ADYGEI", "EUR", "AFR", 
"AMISH", "LATINO", "CARIBBEAN", "ADYGEI", "EUR", "AFR", "AMISH", 
"LATINO")), class = c("data.table", "data.frame"), row.names = c(NA, 
-12L))

CodePudding user response：

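Consider base R's ave to calculate the groupwise maximum of the condition for each ID. The maximum is 1 if any row of that ID meets the condition and 0 otherwise, so keep the rows where it is 0.

# 1 for every row of an ID that has at least one out-of-range frequency
cond <- ave(
     as.integer(dt1$V8 < 0.01 | dt1$V8 > 0.99),
     dt1$V1, 
     FUN = max
)

# keep only IDs where no group met the condition
dt1[cond == 0, ]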
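
To see what ave returns, here is a minimal base R sketch on a toy data frame. The names df_demo, flag, id_flag and kept_df are hypothetical and the values are illustrative. The rows to keep are those whose ID has a group maximum of 0, meaning no population met the condition.

```r
# Toy data frame with the same column names as the question (illustrative values)
df_demo <- data.frame(
  V1 = rep(c("rs10002235", "rs33333235"), each = 3),
  V8 = c(0.32, 0.18, 0.31, 0.5, 0.001, 0.4),
  V9 = rep(c("CARIBBEAN", "ADYGEI", "EUR"), times = 2)
)

# 1 where the row itself is out of range, 0 otherwise
flag <- as.integer(df_demo$V8 < 0.01 | df_demo$V8 > 0.99)

# Spread each ID's maximum flag over all of its rows
id_flag <- ave(flag, df_demo$V1, FUN = max)

# Rows of IDs that were never out of range in any population
kept_df <- df_demo[id_flag == 0, ]
```

Only the fifth row is out of range by itself, but id_flag marks all three rows of rs33333235, which is what removes the ID from every group.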