It's been a while since I wrote R code, and I'm trying to get familiar with data.table right now. I have a data.table (from a variant call) and I'd like to replace some values with words. I think fcase() would be a good fit here, but I just can't get it to work. This is my working code:
rawdata[rawdata == "0/0" | rawdata == "0|0"] <- "REF"
rawdata[rawdata == "0/1" | rawdata == "0|1"] <- "HET"
rawdata[rawdata == "1/0" | rawdata == "1|0"] <- "HET"
rawdata[rawdata == "1/1" | rawdata == "1|1"] <- "ALT"
rawdata[rawdata == "./." | rawdata == ".|."] <- NA
for (i in 1:nrow(rawdata)) {
for (j in 6:ncol(rawdata)) {
if ((rawdata[i,..j] != "REF") & (rawdata[i,..j] != "HET") & (rawdata[i,..j] != "ALT") & !is.na(rawdata[i,..j])) {
rawdata[i,j] <- NA
}
}
}
So, what it is doing is replacing all 0/0, 0|0 with "REF", all 0/1, 0|1, 1/0, 1|0 with "HET", all 1/1, 1|1 with "ALT" and all ./., .|. with NA. Afterwards every entry that is not "REF", "HET" or "ALT" shall be assigned NA, but not for the first 5 columns. The code works, it's just not very elegant and especially the for/for loop is taking ages. rawdata has 8 columns and about 26000 rows. I am open to suggestions.
Thanks :)
structure(list(`# [1]CHROM` = c("manually removed"),
`[2]POS` = c("manually removed"),
`[3]ID` = c("manually removed"),
`[4]REF` = c("manually removed"),
`[5]ALT` = c("manually removed"),
Sample1 = c("manually removed"),
Sample2 = c("manually removed"),
Sample3 = c("manually removed")),
row.names = c(NA, -6L),
.internal.selfref = <pointer: 0x55a3227c91e0>,
class = c("data.table", "data.frame"))
CodePudding user response:
With fcase():
library(data.table)
rawdata <- structure(list(`# [1]CHROM` = c("manually removed"),
`[2]POS` = c("0/0"),
`[3]ID` = c("0/1"),
`[4]REF` = c("1/0"),
`[5]ALT` = c("1/1"),
Sample1 = c("manually removed"),
Sample2 = c("manually removed"),
Sample3 = c("manually removed")),
row.names = c(NA, -6L),
class = c("data.table", "data.frame"))
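# Recode every column in place; anything that matches no case becomes NA.
# default= is used instead of a trailing `T` condition, which is only safe for length-1 columns.
# To leave the first five columns untouched, assign to the sample columns only and pass them as .SDcols.
rawdata[, (colnames(rawdata)) := lapply(.SD, function(x) fcase(x == "0/0" | x == "0|0", "REF",
x == "0/1" | x == "0|1", "HET",
x == "1/0" | x == "1|0", "HET",
x == "1/1" | x == "1|1", "ALT",
default = NA_character_))][]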
#> # [1]CHROM [2]POS [3]ID [4]REF [5]ALT Sample1 Sample2 Sample3
#> <char> <char> <char> <char> <char> <char> <char> <char>
#> 1: <NA> REF HET HET ALT <NA> <NA> <NA>
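Since the question asks that the first five columns stay untouched, the same idea can be restricted to the genotype columns with .SDcols. The following is only a sketch on made-up data: the table `toy`, its values, and the name `geno_cols` are hypothetical, and it assumes the genotypes sit in column 6 onward as described in the question.

```r
library(data.table)

# Hypothetical stand-in for rawdata: five annotation columns, then three sample columns.
toy <- data.table(
  CHROM   = c("chr1", "chr1", "chr2"),
  POS     = c(101L, 202L, 303L),
  ID      = c("a", "b", "c"),
  REF     = c("A", "C", "G"),
  ALT     = c("T", "G", "A"),
  Sample1 = c("0/0", "0|1", "./."),
  Sample2 = c("1/1", "1|0", "0/2"),
  Sample3 = c("0/1", ".|.", "1|1")
)

# Assumption from the question: genotype calls start at column 6.
geno_cols <- names(toy)[6:ncol(toy)]

# Recode only the genotype columns; everything not listed falls through to NA.
toy[, (geno_cols) := lapply(.SD, function(x) fcase(
  x %chin% c("0/0", "0|0"), "REF",
  x %chin% c("0/1", "0|1", "1/0", "1|0"), "HET",
  x %chin% c("1/1", "1|1"), "ALT",
  default = NA_character_
)), .SDcols = geno_cols]

print(toy)
```

Because the whole column is recoded in one vectorised call, this replaces both the five assignment lines and the nested for loop from the question.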
CodePudding user response:
Here is one possible way to solve your problem. Note that values not listed in cases (such as .|.) will become NA.
cases = c(REF = "0/0", REF = "0|0",
HET = "0/1", HET = "0|1", HET = "1/0", HET = "1|0",
ALT = "1/1", ALT = "1|1")
cols = c("Sample1", "Sample2", "Sample3") # names of the columns from 6 to 8
rawdata[, (cols) := lapply(.SD, function(x) names(cases)[chmatch(x, cases)]), .SDcols=cols]
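To see why the lookup above works, it may help to run it on a plain character vector first. This is a small illustration only; the vector `x` and its values are made up, while `cases` is copied from the answer.

```r
library(data.table)

cases <- c(REF = "0/0", REF = "0|0",
           HET = "0/1", HET = "0|1", HET = "1/0", HET = "1|0",
           ALT = "1/1", ALT = "1|1")

# Hypothetical genotype calls, including values that are not in cases.
x <- c("0|1", "1/1", "./.", "0/2", NA)

# chmatch() gives the position of each element of x in cases, or NA if absent.
idx <- chmatch(x, cases)

# Indexing the names by position turns each call into its label;
# an NA position yields NA, which is how unlisted values are dropped.
recoded <- names(cases)[idx]
print(recoded)
```

Adding or changing a mapping then only means editing `cases`, not the recoding expression itself.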