How do I use regular expressions to match a substring?

Time:04-30

I want to change the row names of cov_stats so that they contain a substring of the FileName column values. I only want to keep the part that begins with "SRR" followed by 8 digits (e.g., SRR18826803).

library(data.table)  # provides fread() and rbindlist()
cov_list <- list.files(path = "./stats/", full.names = TRUE)
cov_stats <- rbindlist(sapply(cov_list, fread, simplify = FALSE), use.names = TRUE, idcol = "FileName")
rownames(cov_stats) <- gsub("^\.\/\SRR*_\stats.\txt", "SRR*", cov_stats[["FileName"]])

Second attempt

rownames(cov_stats) <- gsub("^SRR[:digit:]*", "", cov_stats[["FileName"]])

Original strings

> cov_stats[["FileName"]]
 [1] "./stats/SRR18826803_stats.txt" "./stats/SRR18826804_stats.txt"
 [3] "./stats/SRR18826805_stats.txt" "./stats/SRR18826806_stats.txt"
 [5] "./stats/SRR18826807_stats.txt" "./stats/SRR18826808_stats.txt"

Desired substring output

 [1] "SRR18826803" "SRR18826804"
 [3] "SRR18826805" "SRR18826806"
 [5] "SRR18826807" "SRR18826808"

CodePudding user response:

Would this work for you?

library(stringr)

stringr::str_extract(cov_stats[["FileName"]], "SRR.{0,8}")
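
One caveat with `.{0,8}`: the dot matches any character, so the pattern also accepts non-digits after "SRR". The sketch below compares it with a digit-only pattern in base R. The second file name and the helper `extract_first` are invented for illustration and are not part of the original question.

```r
# Hypothetical inputs: the second name is made up to expose the difference
x <- c("./stats/SRR18826803_stats.txt", "./stats/SRR_summary.txt")

# Hypothetical helper: first match per string, NA where nothing matches
# (roughly what stringr::str_extract() returns)
extract_first <- function(pattern, text) {
  m <- regexpr(pattern, text)
  out <- rep(NA_character_, length(text))
  out[m != -1] <- regmatches(text, m)
  out
}

loose  <- extract_first("SRR.{0,8}", x)  # dot accepts any character
strict <- extract_first("SRR\\d{8}", x)  # requires exactly eight digits
```

With these inputs, `loose` picks up "SRR_summary" from the second name, while `strict` yields NA for it. For the file names in the question both patterns give the same result.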

CodePudding user response:

You can use

rownames(cov_stats) <- sub("^\\./stats/(SRR\\d{8}).*", "\\1", cov_stats[["FileName"]])

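Details (inside an R string literal, each backslash is doubled):

  • ^ - start of string
  • \./stats/ - the literal ./stats/ prefix (the dot is escaped so that it matches only a period)
  • (SRR\d{8}) - Group 1 (\1): SRR followed by exactly eight digits
  • .* - the rest of the string up to its end.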

Note that sub is used rather than gsub: the regex matches the whole string, so at most one replacement is made per input string.
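
To illustrate the point about sub, here is a small sketch with one invented non-matching name ("./stats/README.txt" is an assumption, not from the question). Because the pattern is anchored and consumes the whole string, sub and gsub give the same result, and a string that does not match is returned unchanged rather than as NA.

```r
# Hypothetical inputs; the second path is invented as a non-matching case
paths <- c("./stats/SRR18826803_stats.txt", "./stats/README.txt")
pattern <- "^\\./stats/(SRR\\d{8}).*"

with_sub  <- sub(pattern, "\\1", paths)
with_gsub <- gsub(pattern, "\\1", paths)

# sub() leaves non-matching strings untouched, so flag them explicitly
matched <- grepl(pattern, paths)
ids <- ifelse(matched, with_sub, NA_character_)
```

If unmatched file names are possible, checking `matched` before assigning row names avoids ending up with full paths mixed in among the IDs.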

See the R demo:

cov_stats <- c("./stats/SRR18826803_stats.txt", "./stats/SRR18826804_stats.txt", "./stats/SRR18826805_stats.txt", "./stats/SRR18826806_stats.txt", "./stats/SRR18826807_stats.txt")
sub("^\\./stats/(SRR\\d{8}).*", "\\1", cov_stats)
## => [1] "SRR18826803" "SRR18826804" "SRR18826805" "SRR18826806" "SRR18826807"

An equivalent extraction approach with stringr:

library(stringr)
rownames(cov_stats) <- str_extract(cov_stats[["FileName"]], "SRR\\d{8}")
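
If stringr is not available, the same extraction can be sketched in base R with regexpr and regmatches. The example below assumes a plain data.frame standing in for cov_stats; the name `cov_stats_df` and the coverage values are made up for illustration. As far as I know, data.table objects (which rbindlist returns) do not keep row names, so there a regular column may be a better home for the IDs.

```r
# Hypothetical stand-in for cov_stats: a plain data.frame with a FileName column
cov_stats_df <- data.frame(
  FileName = c("./stats/SRR18826803_stats.txt", "./stats/SRR18826804_stats.txt"),
  coverage = c(30.1, 28.7),  # made-up values, for illustration only
  stringsAsFactors = FALSE
)

# Extract the first "SRR" + eight digits from each file name (NA if absent)
m <- regexpr("SRR\\d{8}", cov_stats_df[["FileName"]])
ids <- rep(NA_character_, nrow(cov_stats_df))
ids[m != -1] <- regmatches(cov_stats_df[["FileName"]], m)

# Row names must be unique and non-missing, so check before assigning
stopifnot(!anyNA(ids), !anyDuplicated(ids))
rownames(cov_stats_df) <- ids
```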