How to replace specific string values for several files in R?

I have 50 files (each with 1 million to 2 million rows), all with a variant_id column that I want to change. The files all have a layout like this:

variant_id                    ...
chr1_665098_G_A_b38           ...
chr2_665097_C_T_b38           ...
chr3_665094_A_GG_b38          ...
chr10_23458_TTTCAAG_C_b38     ...

I want to edit the variant_id column to become:

variant_id
1:665098
2:665097
3:665094
10:23458

I am trying to make this change to all my files at the same time by:

#Read in all files
temp = list.files(pattern="*.txt")
for (i in 1:length(temp)) assign(temp[i], fread(temp[i]))

#Edit variant_id strings for every dataset in environment
my_func <- function(x) {
  x <- x %>%
    select(variant_id, pval_nominal) %>%
    mutate(variant_id = sub("^([^-]*-[^-]*).*", "\\1", variant_id))
}

e <- .GlobalEnv
nms <- ls(pattern = ".txt$", envir = e)
for(nm in nms) e[[nm]] <- my_func(e[[nm]])

I am stuck on the mutate(variant_id = sub("^([^-]*-[^-]*).*", "\\1", variant_id)) step: I do not know how best to use sub to make all the changes I need, namely removing chr, turning the first _ into a :, and dropping every character after the second numeric value. How can I get this working? Is there a better function to try? Any help is appreciated.

Input example data:

df <- structure(list(variant_id = c("chr1_665098_G_A_b38", "chr2_665097_C_T_b38", 
"chr3_665094_A_GG_b38", "chr10_23458_TTTCAAG_C_b38\xca")), row.names = c(NA, 
-4L), class = c("data.table", "data.frame"))

CodePudding user response：

We can use sub to capture the chromosome number and the position as groups, then build the replacement from backreferences to those groups:

library(data.table)
df[, variant_id := sub("chr(\\d+)_(\\d+)_.*", "\\1:\\2", variant_id)]

-output

> df
   variant_id
1:   1:665098
2:   2:665097
3:   3:665094
4:   10:23458
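
To see what each part of the pattern does, it can help to try it on a plain character vector first. The sketch below assumes IDs in the format shown in the question; the names ids, pattern, new_ids and unmatched are only illustrative. Because sub returns a string unchanged when the pattern does not match, the last step lists any IDs that were not converted (for example an ID such as chrX_..., which is not in the question's data, would not match \\d+).

```r
# Example IDs in the format shown in the question
ids <- c("chr1_665098_G_A_b38", "chr2_665097_C_T_b38",
         "chr3_665094_A_GG_b38", "chr10_23458_TTTCAAG_C_b38")

# ^chr     literal prefix at the start of the string, dropped
# (\\d+)   group 1: one or more digits (chromosome number)
# _        first underscore, replaced by ":" in the output
# (\\d+)   group 2: one or more digits (position)
# _.*      everything from the second underscore onwards, discarded
pattern <- "^chr(\\d+)_(\\d+)_.*"

new_ids <- sub(pattern, "\\1:\\2", ids)
print(new_ids)

# sub() leaves non-matching strings untouched, so list them explicitly
unmatched <- ids[!grepl(pattern, ids)]
print(unmatched)
```

If unmatched is not empty, the pattern probably needs widening before it is applied to all files.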

If there is more than one file, read the files into a list and keep the results in that list:

lst1 <- lapply(temp, function(x) fread(x)[,
    variant_id := sub("chr(\\d+)_(\\d+)_.*", "\\1:\\2", variant_id)][])
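
With 50 files of 1 to 2 million rows each, holding every table in one list may use a lot of memory. As an alternative, here is a minimal sketch that processes one file at a time and writes each result with fwrite. The demo folder, the file names a.txt and b.txt, the tab separator and the "edited" output folder are all assumptions made for illustration, not details from the question.

```r
library(data.table)

# Hypothetical setup: two small files in a temporary folder stand in for the real ones
folder <- file.path(tempdir(), "variant_demo")
dir.create(folder, showWarnings = FALSE)
demo <- data.table(variant_id = c("chr1_665098_G_A_b38", "chr10_23458_TTTCAAG_C_b38"),
                   pval_nominal = c(0.005, 0.01))
fwrite(demo, file.path(folder, "a.txt"), sep = "\t")
fwrite(demo, file.path(folder, "b.txt"), sep = "\t")

# Hypothetical output folder, kept separate so the inputs are not overwritten
out_dir <- file.path(folder, "edited")
dir.create(out_dir, showWarnings = FALSE)

files <- list.files(folder, pattern = "\\.txt$", full.names = TRUE)

# Process one file at a time so that only one table is in memory
for (f in files) {
  # Read only the two columns used in the question
  dt <- fread(f, select = c("variant_id", "pval_nominal"))
  dt[, variant_id := sub("^chr(\\d+)_(\\d+)_.*", "\\1:\\2", variant_id)]
  fwrite(dt, file.path(out_dir, basename(f)), sep = "\t")
}

# Read one result back to check it
result <- fread(file.path(out_dir, "a.txt"))
print(result$variant_id)
```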

CodePudding user response：

Here is a fully reproducible example of your situation.

The goal here is to show you not only another possible solution for your regex, but also an alternative way to set up your code.

I noticed that in your function you are selecting 2 specific columns, so I added that option in my code.

# reproducible example
df <- data.frame(variant_id = c("chr1_665098_G_A_b38", "chr2_665097_C_T_b38", 
                                "chr3_665094_A_GG_b38", "chr10_23458_TTTCAAG_C_b38\xca"),
                 pval_nominal = c(0.005,0.01),
                 filler = letters[1:2])
folder <- tempdir()
write.csv(df, file.path(folder, "test1.txt"))
write.csv(df, file.path(folder, "test2.txt"))

# library
library(data.table)

# read all files: use full paths! you'll avoid a lot of issues
temp <- list.files(folder, pattern = "\\.txt$", full.names = TRUE)

# read files with lapply and make a list of them!
l <- lapply(temp, fread, sep = ",")

# select columns and modify variant_id
# if you use data.table you generally want to stick with it rather than mixing it with dplyr, and vice versa (but that is up to you)
l <- lapply(l, function(d) d[,.(variant_id = sub("^\\D+(\\d+)_(\\d+).*", "\\1:\\2", variant_id), pval_nominal)])
l
#> [[1]]
#>    variant_id pval_nominal
#> 1:   1:665098        0.005
#> 2:   2:665097        0.010
#> 3:   3:665094        0.005
#> 4:   10:23458        0.010
#> 
#> [[2]]
#>    variant_id pval_nominal
#> 1:   1:665098        0.005
#> 2:   2:665097        0.010
#> 3:   3:665094        0.005
#> 4:   10:23458        0.010

Created on 2021-11-18 by the reprex package (v2.0.0)
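
Once the tables sit in a list, it is often useful to know which file each one came from. A possible follow-up, sketched here with a small stand-in list rather than the real data, is to name the list elements after the files and stack them with rbindlist, using idcol to record the source. The column name source_file is an assumption chosen for this example.

```r
library(data.table)

# Stand-in for the list `l` of processed tables
l <- list(
  data.table(variant_id = c("1:665098", "2:665097"), pval_nominal = c(0.005, 0.01)),
  data.table(variant_id = c("3:665094", "10:23458"), pval_nominal = c(0.005, 0.01))
)

# Name each element after its file; with real files this could be basename(temp)
names(l) <- c("test1.txt", "test2.txt")

# Stack into one table; idcol stores the element name for every row
combined <- rbindlist(l, idcol = "source_file")
print(combined)
```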