Extract substring from a long string in R

Time:12-28

I have a string that looks like this:

C|3_prime_UTR_variant|MODIFIER|SRY|ENSG00000184895|Transcript|ENST00000383070|protein_coding|1/1||ENST00000383070.1:c.*51C>G||762|||||rs781744002|1||-1||SNV|1|HGNC|11311|YES|||CCDS14772.1|ENSP00000372547|Q05066|Q6J4J1&A7WPU8|UPI0000135F78|1||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||,C|upstream_gene_variant|MODIFIER|RNASEH2CP1|ENSG00000237659|Transcript|ENST00000454281|processed_pseudogene||||||||||rs781744002|1|2889|1||SNV|1|HGNC|24117|YES||||||||||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||,C|downstream_gene_variant|MODIFIER|RNU6-1334P|ENSG00000251841|Transcript|ENST00000516032|snRNA||||||||||rs781744002|1|2085|1||SNV|1|HGNC|48297|YES||||||||||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||,C|downstream_gene_variant|MODIFIER|SRY|ENSG00000184895|Transcript|ENST00000525526|protein_coding||||||||||rs781744002|1|70|-1||SNV|1|HGNC|11311|||||ENSP00000437575||F5H6J8|UPI0002064E1A|1||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||,C|downstream_gene_variant|MODIFIER|SRY|ENSG00000184895|Transcript|ENST00000534739|protein_coding||||||||||rs781744002|1|166|-1||SNV|1|HGNC|11311|||||ENSP00000438917||F5H3H1|UPI0002064E1B|1||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||"

What I want to do is extract the parts of the string that were marked in bold, namely 3_prime_UTR_variant, SRY and 24117:

C|3_prime_UTR_variant|MODIFIER|SRY|ENSG00000184895|Transcript|ENST00000383070|protein_coding|1/1||ENST00000383070.1:c.*51C>G||762|||||rs781744002|1||-1||SNV|1|HGNC|11311|YES|||CCDS14772.1|ENSP00000372547|Q05066|Q6J4J1&A7WPU8|UPI0000135F78|1||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||,C|upstream_gene_variant|MODIFIER|RNASEH2CP1|ENSG00000237659|Transcript|ENST00000454281|processed_pseudogene||||||||||rs781744002|1|2889|1||SNV|1|HGNC|24117|YES||||||||||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||,C|downstream_gene_variant|MODIFIER|RNU6-1334P|ENSG00000251841|Transcript|ENST00000516032|snRNA||||||||||rs781744002|1|2085|1||SNV|1|HGNC|48297|YES||||||||||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||,C|downstream_gene_variant|MODIFIER|SRY|ENSG00000184895|Transcript|ENST00000525526|protein_coding||||||||||rs781744002|1|70|-1||SNV|1|HGNC|11311|||||ENSP00000437575||F5H6J8|UPI0002064E1A|1||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||,C|downstream_gene_variant|MODIFIER|SRY|ENSG00000184895|Transcript|ENST00000534739|protein_coding||||||||||rs781744002|1|166|-1||SNV|1|HGNC|11311|||||ENSP00000438917||F5H3H1|UPI0002064E1B|1||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||"

Usually, I use this kind of code:

str_extract(data_snp$vep, "(?<=xxx=)[^|]+")

But this time it didn't work. Is there any way to do this in R? Thank you :)

CodePudding user response:

We can use read.delim for this:

txt <- "C|3_prime_UTR_variant|MODIFIER|SRY|ENSG00000184895|Transcript|ENST00000383070|protein_coding|1/1||ENST00000383070.1:c.*51C>G||762|||||rs781744002|1||-1||SNV|1|HGNC|11311|YES|||CCDS14772.1|ENSP00000372547|Q05066|Q6J4J1&A7WPU8|UPI0000135F78|1||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||,C|upstream_gene_variant|MODIFIER|RNASEH2CP1|ENSG00000237659|Transcript|ENST00000454281|processed_pseudogene||||||||||rs781744002|1|2889|1||SNV|1|HGNC|24117|YES||||||||||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||,C|downstream_gene_variant|MODIFIER|RNU6-1334P|ENSG00000251841|Transcript|ENST00000516032|snRNA||||||||||rs781744002|1|2085|1||SNV|1|HGNC|48297|YES||||||||||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||,C|downstream_gene_variant|MODIFIER|SRY|ENSG00000184895|Transcript|ENST00000525526|protein_coding||||||||||rs781744002|1|70|-1||SNV|1|HGNC|11311|||||ENSP00000437575||F5H6J8|UPI0002064E1A|1||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||,C|downstream_gene_variant|MODIFIER|SRY|ENSG00000184895|Transcript|ENST00000534739|protein_coding||||||||||rs781744002|1|166|-1||SNV|1|HGNC|11311|||||ENSP00000438917||F5H3H1|UPI0002064E1B|1||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||"
unlist(read.delim(text = txt, sep = "|", header = FALSE)[,c(2,4,93)], use.names = FALSE)
# [1] "3_prime_UTR_variant" "SRY"                 "24117"              

If you call unlist(.) without use.names = FALSE, the result carries the column names (V2, V4 and V93) as element names, but they are harmless.
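One thing to keep in mind with this approach is that read.delim guesses a type for every column, so numeric-looking fields are converted before unlist() turns them back into strings. The sketch below uses a shortened, made-up input (txt_small is hypothetical, not the original string) and assumes you want every field kept as text:

```r
# Hypothetical shortened input, not the original annotation string.
txt_small <- "C|3_prime_UTR_variant|MODIFIER|SRY|0042"

# colClasses = "character" stops read.delim from guessing column types;
# quote = "" treats quote characters as ordinary text.
df <- read.delim(text = txt_small, sep = "|", header = FALSE,
                 colClasses = "character", quote = "")

with_names <- unlist(df[, c(2, 4)])                     # named V2 and V4
no_names   <- unlist(df[, c(2, 4)], use.names = FALSE)  # plain character vector
df$V5                                                   # "0042", leading zeros kept
```

Whether the extra arguments are needed depends on the data; for the three fields asked about here the default type guessing gives the same result.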

CodePudding user response:

One possible way to solve your problem:

lapply(strsplit(data_snp$vep, "\\|"), \(x) intersect(x, c("3_prime_UTR_variant", "SRY", "24117")))

[[1]]
[1] "3_prime_UTR_variant" "SRY"                 "24117" 
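Note that intersect() matches by value rather than by position, so it only helps when the wanted values are known in advance. A small sketch with made-up fields (the vectors below are hypothetical) shows how it behaves: duplicates are dropped, values that are absent are silently skipped, and the result follows the order of the first argument.

```r
# Hypothetical fields, as they might look after splitting on "|".
fields <- c("C", "SRY", "MODIFIER", "SRY", "24117")

# "TP53" is not among the fields, and "SRY" occurs twice.
found <- intersect(fields, c("24117", "SRY", "TP53"))
found
# [1] "SRY"   "24117"
```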

CodePudding user response:

You can use strsplit():

strsplit(txt, '\\|')[[1]][c(2, 4, 93)]

# [1] "3_prime_UTR_variant" "SRY"                 "24117"

or stringr::word():

stringr::word(txt, c(2, 4, 93), sep = '\\|')

# [1] "3_prime_UTR_variant" "SRY"                 "24117"
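
The position 93 works because the string holds several comma-separated records that each have the same number of pipe-delimited fields, so the index counts straight through the first record into the second. If that structure holds for your data, it may be easier to split on the commas first and then index by record and field. This is only a sketch on a shortened, made-up string (vep below is hypothetical), and it assumes that commas never occur inside a field:

```r
# Hypothetical shortened annotation: two records of four fields each.
vep <- "C|3_prime_UTR_variant|MODIFIER|SRY,C|upstream_gene_variant|MODIFIER|RNASEH2CP1"

# Split into records first, then split each record into its fields.
records <- strsplit(vep, ",", fixed = TRUE)[[1]]
fields  <- strsplit(records, "|", fixed = TRUE)

# One row per record; assumes every record has the same number of fields.
tab <- do.call(rbind, fields)

tab[, 2]   # second field of every record
tab[2, 4]  # fourth field of the second record
```

With this layout the wanted values are addressed as (record, field) pairs instead of one long running index, which is easier to keep correct if the number of records changes.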