I want to split info into id and aa, where id is the substring from the start up to the first colon (:), and aa is the remaining substring.
library(stringr)
df=read.csv("variant_calls.txt", sep="\t")
info=df["AAChange.refGene"]
id=stringr::str_extract(info, "SRR.(\\d{6})")
aa=info[!id]
> dput(info)
structure(list(AAChange.refGene = c("NM_002725:c.C301T:p.P101S",
"NM_001024940:c.T1054A:p.Y352N", "NM_001098209:c.T109C:p.S37P",
"NM_152539:c.G955A:p.E319K", "NM_032421:c.A2422G:p.T808A", "NM_003141:c.G431A:p.G144E",
"NM_006645:c.C749T:p.S250L", "NM_206927:c.C778A:p.P260T", "NM_012240:c.G209A:p.G70E",
"NM_152336:c.A382C:p.K128Q", "NM_002773:c.G750C:p.W250C", "NM_001797:c.C2125T:p.R709W",
"NM_058216:c.C797A:p.A266D", "NM_198977:c.C1543T:p.R515W", "NM_000307:c.C356T:p.A119V"
)), row.names = c(NA, -15L), class = "data.frame")
Expected output:
id = NM_001024940, NM_001098209, NM_152539, NM_032421...
aa = c.T1054A:p.Y352N, c.T109C:p.S37P, c.G955A:p.E319K, c.A2422G:p.T808A
CodePudding user response:
In this case, you can try using separate() from tidyr instead of a regex.
library(tidyr)
# Two target columns plus extra = "merge": split at the first ":" only
df %>%
  separate(AAChange.refGene, c("id", "aa"), sep = ":", extra = "merge")
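Since the question already loads stringr, the same first-colon split can probably be done with str_split_fixed() as well. This is only a minimal sketch: the vector x is a hypothetical stand-in holding two values copied from the dput() output, not the full data frame.

```r
library(stringr)

# Hypothetical sample: two values copied from the dput() output in the question
x <- c("NM_002725:c.C301T:p.P101S", "NM_001024940:c.T1054A:p.Y352N")

# n = 2 caps the result at two pieces, so only the first ":" is used
# and the second piece keeps its own colons
parts <- str_split_fixed(x, ":", 2)

id <- parts[, 1]  # text before the first colon
aa <- parts[, 2]  # everything after the first colon
```

str_split_fixed() returns a character matrix with one row per input string, so the two columns can be assigned back to the data frame directly if that is what you need.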
CodePudding user response:
You could use sub() for a base R option:
df$id <- sub(":.*", "", df$AAChange.refGene)
df$aa <- sub(".*?:", "", df$AAChange.refGene)
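To see why the two patterns behave differently, here is a small sketch on a hypothetical vector x holding two values from the question. The first pattern deletes from the first colon to the end of the string. The second is lazy (.*?), so it deletes only up to and including the first colon. The aa_alt line is an assumption on my part, not part of the answer above: it is an equivalent pattern that avoids the lazy quantifier, in case you prefer that.

```r
# Hypothetical sample: two values copied from the dput() output in the question
x <- c("NM_002725:c.C301T:p.P101S", "NM_001024940:c.T1054A:p.Y352N")

# Drop the first ":" and everything after it, leaving the accession
id <- sub(":.*", "", x)

# Lazy match: drop only the text up to and including the first ":"
aa <- sub(".*?:", "", x)

# Equivalent without a lazy quantifier: anchor at the start and
# consume non-colon characters followed by a single ":"
aa_alt <- sub("^[^:]*:", "", x)

print(id)
print(aa)
```

Because sub() replaces only the first match, each call touches each string at most once, which is what keeps the later colons in aa intact.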