How do I use regex to match a substring before/after the colon?

I want to split info into id and aa, where id represents the substring from the start to the first colon :, and aa is the remaining substring.

library(stringr)

df=read.csv("variant_calls.txt", sep="\t")
info=df["AAChange.refGene"]
id=stringr::str_extract(info, "SRR.(\\d{6})")
aa=info[!id]

> dput(info)
structure(list(AAChange.refGene = c("NM_002725:c.C301T:p.P101S", 
"NM_001024940:c.T1054A:p.Y352N", "NM_001098209:c.T109C:p.S37P", 
"NM_152539:c.G955A:p.E319K", "NM_032421:c.A2422G:p.T808A", "NM_003141:c.G431A:p.G144E", 
"NM_006645:c.C749T:p.S250L", "NM_206927:c.C778A:p.P260T", "NM_012240:c.G209A:p.G70E", 
"NM_152336:c.A382C:p.K128Q", "NM_002773:c.G750C:p.W250C", "NM_001797:c.C2125T:p.R709W", 
"NM_058216:c.C797A:p.A266D", "NM_198977:c.C1543T:p.R515W", "NM_000307:c.C356T:p.A119V"
)), row.names = c(NA, -15L), class = "data.frame")

Expected output:

id = NM_001024940, NM_001098209, NM_152539, NM_032421...
aa = c.T1054A:p.Y352N, c.T109C:p.S37P, c.G955A:p.E319K, c.A2422G:p.T808A

CodePudding user response：

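In this case, you can use tidyr's separate() instead of writing the regex yourself. With extra = "merge", the string is split only at the first colon and everything after it stays together in aa.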

library(tidyr)

df %>% 
  separate(AAChange.refGene, c("id", "aa"), sep = ":", extra = "merge")
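
As a quick check of the separate() call, here is a minimal sketch on two values copied from the question's dput() output. The names demo and out are only illustrative, and the tidyr package is assumed to be installed.

```r
library(tidyr)

# Two sample values taken from the question's dput() output.
demo <- data.frame(
  AAChange.refGene = c("NM_002725:c.C301T:p.P101S",
                       "NM_001024940:c.T1054A:p.Y352N")
)

# Split into two columns; extra = "merge" keeps everything after
# the first colon together in the last column.
out <- separate(demo, AAChange.refGene, into = c("id", "aa"),
                sep = ":", extra = "merge")
print(out)
```

The original AAChange.refGene column is dropped by default; passing remove = FALSE should keep it alongside id and aa.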

CodePudding user response：

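You could use sub() for a base R option. The first pattern deletes everything from the first colon onward, which leaves the id. The second uses a lazy match to delete everything up to and including the first colon, which leaves aa: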

df$id <- sub(":.*", "", df$AAChange.refGene)
df$aa <- sub(".*?:", "", df$AAChange.refGene)
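
To see what the two sub() calls return, here is a small self-contained sketch on three values from the question's dput() output. It also shows a regex-free variant based on regexpr() and substr(), which is an alternative suggestion and not part of the answer above. The variable names x, pos, id2 and aa2 are illustrative, and the variant assumes every value contains at least one colon.

```r
# Sample values copied from the question's dput() output.
x <- c("NM_002725:c.C301T:p.P101S",
       "NM_001024940:c.T1054A:p.Y352N",
       "NM_001098209:c.T109C:p.S37P")

# The sub() approach from the answer above.
id <- sub(":.*", "", x)   # drop the first colon and everything after it
aa <- sub(".*?:", "", x)  # drop everything up to and including the first colon

# Alternative without regex: locate the first colon, then cut around it.
# Assumes each string contains a colon (regexpr() returns -1 otherwise).
pos <- as.integer(regexpr(":", x, fixed = TRUE))
id2 <- substr(x, 1, pos - 1)
aa2 <- substr(x, pos + 1, nchar(x))

print(data.frame(id = id, aa = aa))
```

Both variants should agree on this input; the substr() version may be easier to read if the delimiter ever becomes a character that has a special meaning in regular expressions.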