faster way to replace values in R data.table

Time:06-30

It's been a while since I wrote R code, and I'm trying to get to grips with data.table. I have a data.table (from a variant call) and I'd like to replace some values with words. I think fcase() would be a good fit here, but I can't get it to work. This is my working code:

rawdata[rawdata == "0/0" | rawdata == "0|0"] <- "REF"
rawdata[rawdata == "0/1" | rawdata == "0|1"] <- "HET"
rawdata[rawdata == "1/0" | rawdata == "1|0"] <- "HET"
rawdata[rawdata == "1/1" | rawdata == "1|1"] <- "ALT"
rawdata[rawdata == "./." | rawdata == ".|."] <- NA
for (i in 1:nrow(rawdata)) {
  for (j in 6:ncol(rawdata)) {
    if ((rawdata[i,..j] != "REF") & (rawdata[i,..j] != "HET") & (rawdata[i,..j] != "ALT") & !is.na(rawdata[i,..j])) {
      rawdata[i,j] <- NA
    }
  }
}

So it replaces all 0/0 and 0|0 with "REF", all 0/1, 0|1, 1/0 and 1|0 with "HET", all 1/1 and 1|1 with "ALT", and all ./. and .|. with NA. Afterwards, every entry that is not "REF", "HET" or "ALT" should be set to NA, except in the first 5 columns. The code works, but it's not very elegant, and the nested for loop in particular takes ages. rawdata has 8 columns and about 26000 rows. I am open to suggestions.

Thanks :)

structure(list(`# [1]CHROM` = c("manually removed"),
`[2]POS` = c("manually removed"),
`[3]ID` = c("manually removed"),
`[4]REF` = c("manually removed"),
`[5]ALT` = c("manually removed"),
Sample1 = c("manually removed"),
Sample2 = c("manually removed"),
Sample3 = c("manually removed")),
row.names = c(NA, -6L),
class = c("data.table", "data.frame"))

CodePudding user response:

with fcase:

library(data.table)
rawdata <- structure(list(`# [1]CHROM` = c("manually removed"),
                          `[2]POS` = c("0/0"),
                          `[3]ID` = c("0/1"),
                          `[4]REF` = c("1/0"),
                          `[5]ALT` = c("1/1"),
                          Sample1 = c("manually removed"),
                          Sample2 = c("manually removed"),
                          Sample3 = c("manually removed")),
                     row.names = c(NA, -6L),
                     class = c("data.table", "data.frame"))


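# default = NA_character_ replaces the catch-all `T, NA_character_` pair, which can
# fail on data.table versions that require every condition to be as long as x.
# Note: this recodes every column; pass .SDcols to restrict it to the sample columns.
rawdata[,(colnames(rawdata)):=lapply(.SD,function(x) fcase(x == "0/0" | x == "0|0", "REF",
                                                           x == "0/1" | x == "0|1", "HET", 
                                                           x == "1/0" | x == "1|0", "HET", 
                                                           x == "1/1" | x == "1|1", "ALT",
                                                           default = NA_character_))][]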
#>    # [1]CHROM [2]POS  [3]ID [4]REF [5]ALT Sample1 Sample2 Sample3
#>        <char> <char> <char> <char> <char>  <char>  <char>  <char>
#> 1:       <NA>    REF    HET    HET    ALT    <NA>    <NA>    <NA>
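
Since the question asks to leave the first 5 columns alone, here is a minimal sketch of the same fcase() idea restricted to the sample columns via .SDcols. The table `toy`, its values and the helper name `recode_gt` are made up for illustration and are not from the question's data; the sketch also assumes the genotype columns are character and start at column 6.

```r
library(data.table)

# Toy data (invented for illustration): five annotation columns, then samples.
toy <- data.table(
  CHROM   = c("chr1", "chr1", "chr2"),
  POS     = c("100", "200", "300"),
  ID      = c(".", ".", "."),
  REF     = c("A", "C", "G"),
  ALT     = c("T", "G", "A"),
  Sample1 = c("0/0", "0|1", "./."),
  Sample2 = c("1/1", "1|0", "0/2")
)

# Hypothetical helper: recode one genotype vector.
# Anything that matches no condition falls through to the NA default.
recode_gt <- function(x) {
  fcase(
    x %chin% c("0/0", "0|0"), "REF",
    x %chin% c("0/1", "0|1", "1/0", "1|0"), "HET",
    x %chin% c("1/1", "1|1"), "ALT",
    default = NA_character_
  )
}

# Only columns 6 onwards are passed in .SD and overwritten by reference.
sample_cols <- names(toy)[6:ncol(toy)]
toy[, (sample_cols) := lapply(.SD, recode_gt), .SDcols = sample_cols]
```

Because unmatched values already become NA through the default, the separate ./. replacement and the nested loop from the question should not be needed with this approach.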

CodePudding user response:

Here is one possible way to solve your problem. Note that any value not listed in cases (such as .|.) becomes NA.

cases = c(REF = "0/0", REF = "0|0", 
          HET = "0/1", HET = "0|1", HET = "1/0", HET = "1|0", 
          ALT = "1/1", ALT = "1|1")

cols = c("Sample1", "Sample2", "Sample3")  # names of the columns from 6 to 8 

rawdata[, (cols) := lapply(.SD, function(x) names(cases)[chmatch(x, cases)]), .SDcols=cols]
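
To see why the lookup works, here is a small sketch on a plain character vector. The vector `gt` is made up for illustration. chmatch() returns the position of each value in cases, or NA when there is no match, and indexing names(cases) with NA yields NA.

```r
library(data.table)

cases <- c(REF = "0/0", REF = "0|0",
           HET = "0/1", HET = "0|1", HET = "1/0", HET = "1|0",
           ALT = "1/1", ALT = "1|1")

# Invented genotype values, including two that are not in cases and one NA.
gt <- c("0/0", "1|0", "./.", "0/2", NA)

idx <- chmatch(gt, cases)      # positions in cases: 1, 6, NA, NA, NA
recoded <- names(cases)[idx]   # "REF", "HET", NA, NA, NA
```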