I have a string that looks like this:
C|3_prime_UTR_variant|MODIFIER|SRY|ENSG00000184895|Transcript|ENST00000383070|protein_coding|1/1||ENST00000383070.1:c.*51C>G||762|||||rs781744002|1||-1||SNV|1|HGNC|11311|YES|||CCDS14772.1|ENSP00000372547|Q05066|Q6J4J1&A7WPU8|UPI0000135F78|1||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||,C|upstream_gene_variant|MODIFIER|RNASEH2CP1|ENSG00000237659|Transcript|ENST00000454281|processed_pseudogene||||||||||rs781744002|1|2889|1||SNV|1|HGNC|24117|YES||||||||||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||,C|downstream_gene_variant|MODIFIER|RNU6-1334P|ENSG00000251841|Transcript|ENST00000516032|snRNA||||||||||rs781744002|1|2085|1||SNV|1|HGNC|48297|YES||||||||||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||,C|downstream_gene_variant|MODIFIER|SRY|ENSG00000184895|Transcript|ENST00000525526|protein_coding||||||||||rs781744002|1|70|-1||SNV|1|HGNC|11311|||||ENSP00000437575||F5H6J8|UPI0002064E1A|1||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||,C|downstream_gene_variant|MODIFIER|SRY|ENSG00000184895|Transcript|ENST00000534739|protein_coding||||||||||rs781744002|1|166|-1||SNV|1|HGNC|11311|||||ENSP00000438917||F5H3H1|UPI0002064E1B|1||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||"
What I want to do is extract the values that were shown in bold in that string: 3_prime_UTR_variant, SRY and 24117.
Usually, I use this type of code:
str_extract(data_snp$vep, "(?<=xxx=)[^|]+")
but this time it didn't work. Is there any way to do this in R? Thank you :)
CodePudding user response:
We can use read.delim for this:
txt <- "C|3_prime_UTR_variant|MODIFIER|SRY|ENSG00000184895|Transcript|ENST00000383070|protein_coding|1/1||ENST00000383070.1:c.*51C>G||762|||||rs781744002|1||-1||SNV|1|HGNC|11311|YES|||CCDS14772.1|ENSP00000372547|Q05066|Q6J4J1&A7WPU8|UPI0000135F78|1||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||,C|upstream_gene_variant|MODIFIER|RNASEH2CP1|ENSG00000237659|Transcript|ENST00000454281|processed_pseudogene||||||||||rs781744002|1|2889|1||SNV|1|HGNC|24117|YES||||||||||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||,C|downstream_gene_variant|MODIFIER|RNU6-1334P|ENSG00000251841|Transcript|ENST00000516032|snRNA||||||||||rs781744002|1|2085|1||SNV|1|HGNC|48297|YES||||||||||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||,C|downstream_gene_variant|MODIFIER|SRY|ENSG00000184895|Transcript|ENST00000525526|protein_coding||||||||||rs781744002|1|70|-1||SNV|1|HGNC|11311|||||ENSP00000437575||F5H6J8|UPI0002064E1A|1||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||,C|downstream_gene_variant|MODIFIER|SRY|ENSG00000184895|Transcript|ENST00000534739|protein_coding||||||||||rs781744002|1|166|-1||SNV|1|HGNC|11311|||||ENSP00000438917||F5H3H1|UPI0002064E1B|1||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||"
unlist(read.delim(text = txt, sep = "|", header = FALSE)[,c(2,4,93)], use.names = FALSE)
# [1] "3_prime_UTR_variant" "SRY" "24117"
If you use unlist(.) without use.names=FALSE, you get names such as V2, V4 and V93 on the result, but they are harmless.
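If the hard-coded position 93 feels fragile, a possible alternative is to split on the commas first, so that each record gets its own row, and only then on the pipes. This is only a sketch: txt_small is a shortened, made-up stand-in for the real string, and the code assumes every record has the same number of fields.

```r
# Shortened, made-up stand-in for the real annotation string:
# two comma-separated records, each with pipe-separated fields.
txt_small <- paste0(
  "C|3_prime_UTR_variant|MODIFIER|SRY|ENSG00000184895||HGNC|11311,",
  "C|upstream_gene_variant|MODIFIER|RNASEH2CP1|ENSG00000237659||HGNC|24117"
)

# Step 1: split the string into records at the commas.
records <- strsplit(txt_small, ",", fixed = TRUE)[[1]]

# Step 2: split each record into its fields at the pipes.
fields <- strsplit(records, "|", fixed = TRUE)

# Step 3: stack the records into a character matrix, one row per record.
# This assumes every record has the same number of fields.
m <- do.call(rbind, fields)

m[1, c(2, 4)]  # consequence and gene symbol of the first record
m[2, 8]        # HGNC ID of the second record
```

With the real string the column positions will differ from this toy example, but they then count from the start of each record instead of from the start of the whole string.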
CodePudding user response:
One possible way to solve your problem:
lapply(strsplit(data_snp$vep, "\\|+"), \(x) intersect(x, c("3_prime_UTR_variant", "SRY", "24117")))
[[1]]
[1] "3_prime_UTR_variant" "SRY" "24117"
CodePudding user response:
You can use strsplit():
strsplit(txt, '\\|')[[1]][c(2, 4, 93)]
# [1] "3_prime_UTR_variant" "SRY" "24117"
or stringr::word():
stringr::word(txt, c(2, 4, 93), sep = '\\|')
# [1] "3_prime_UTR_variant" "SRY" "24117"
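To apply the same idea to a whole column rather than a single string, something like the following might work. It is a minimal sketch: the two-row data_snp below is a made-up stand-in (the second row is invented), and the new column names consequence and symbol are hypothetical.

```r
# Made-up stand-in for data_snp; real vep strings are much longer.
data_snp <- data.frame(
  vep = c("C|3_prime_UTR_variant|MODIFIER|SRY",
          "T|missense_variant|MODERATE|TP53"),
  stringsAsFactors = FALSE
)

# Split every row of the column at the pipes; the result is a list
# with one character vector per row.
parts <- strsplit(data_snp$vep, "|", fixed = TRUE)

# Pick fields 2 and 4 from every row. vapply() checks that each row
# yields exactly two strings and returns a 2 x n character matrix.
picked <- vapply(parts, function(x) x[c(2, 4)], character(2))

# Hypothetical column names for the extracted fields.
data_snp$consequence <- picked[1, ]
data_snp$symbol <- picked[2, ]
```

Rows with fewer than four fields would give NA for the missing positions rather than an error, so it may be worth checking for NA afterwards.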