I ask this question because I don't really know how to do it.
I have a genome in a gb format (YJ016_I.gb) so I want to import in R and then export all the genes in nucleotide format, or just take one of the sequence using the name of the gene.
library(genbankr)
library(stringr)
library(purrr)
gb <- genbankr::readGenBank("YJ016_I.gb")
GENES <- GenomicFeatures::genes(gb)
GenesDF <- data.frame(GENES)
seqnames start end width strand type gene locus_tag old_locus_tag loctype pseudo gene_synonym gene_id
1 I 353 787 435 - gene mioC BJE04_RS00275 VV0001, .... normal FALSE NA BJE04_RS00275
2 I 835 2196 1362 - gene mnmE BJE04_RS00280 VV0002, .... normal FALSE NA BJE04_RS00280
3 I 2314 3933 1620 - gene yidC BJE04_RS00285 VV0003, .... normal FALSE NA BJE04_RS00285
4 I 3936 4193 258 - gene yidD BJE04_RS23545 normal FALSE NA BJE04_RS23545
5 I 4160 4516 357 - gene rnpA BJE04_RS00290 VV0004, .... normal FALSE NA BJE04_RS00290
6 I 4530 4670 141 - gene rpmH BJE04_RS00295 VV0005, .... normal FALSE NA BJE04_RS00295
if I want to extract the sequence of BJE04_RS00275 in locus_tag, or in the other way export all genes of the genbank file
I mean
>BJE04_RS00275
aatgc
>BJE04_RS00280
ggcta
>BJE04_5000
atggcaa
how can I do it with R or if you have any solution in any language o program !!!
Thanks
CodePudding user response:
I do not have access to your specific file, so I am using the example file provided by the genbankr
package. You will need Biostrings
to write the sequence as fasta file. To only write the sequence(s) of a particular locus_tag, just subset GENES to that particular one (e.g. subset(GENES, locus_tag == "BJE04_RS00275")
in your example).
library(genbankr)
library(GenomicFeatures)
library(Biostrings)
gb <- readGenBank(system.file("sample.gbk", package="genbankr"))
GENES <- genes(gb)
res <- getSeq(gb@sequence, setNames(GENES, GENES$gene_id))
writeXStringSet(res, "YJ016_I_genes.fa")