how to extract genes from genbank file in R-CodePudding

I ask this question because I don't really know how to do it.

I have a genome in a gb format (YJ016_I.gb) so I want to import in R and then export all the genes in nucleotide format, or just take one of the sequence using the name of the gene.

library(genbankr)
library(stringr)
library(purrr)




gb <- genbankr::readGenBank("YJ016_I.gb")
GENES <- GenomicFeatures::genes(gb)
GenesDF <- data.frame(GENES)

  seqnames start  end width strand type gene     locus_tag old_locus_tag loctype pseudo gene_synonym       gene_id
1        I   353  787   435      - gene mioC BJE04_RS00275  VV0001, ....  normal  FALSE           NA BJE04_RS00275
2        I   835 2196  1362      - gene mnmE BJE04_RS00280  VV0002, ....  normal  FALSE           NA BJE04_RS00280
3        I  2314 3933  1620      - gene yidC BJE04_RS00285  VV0003, ....  normal  FALSE           NA BJE04_RS00285
4        I  3936 4193   258      - gene yidD BJE04_RS23545                normal  FALSE           NA BJE04_RS23545     
5        I  4160 4516   357      - gene rnpA BJE04_RS00290  VV0004, ....  normal  FALSE           NA BJE04_RS00290
6        I  4530 4670   141      - gene rpmH BJE04_RS00295  VV0005, ....  normal  FALSE           NA BJE04_RS00295

if I want to extract the sequence of BJE04_RS00275 in locus_tag, or in the other way export all genes of the genbank file

I mean

>BJE04_RS00275
aatgc
>BJE04_RS00280
ggcta
>BJE04_5000
atggcaa

how can I do it with R or if you have any solution in any language o program !!!

Thanks

CodePudding user response：

I do not have access to your specific file, so I am using the example file provided by the genbankr package. You will need Biostrings to write the sequence as fasta file. To only write the sequence(s) of a particular locus_tag, just subset GENES to that particular one (e.g. subset(GENES, locus_tag == "BJE04_RS00275") in your example).

library(genbankr)
library(GenomicFeatures)
library(Biostrings)

gb <- readGenBank(system.file("sample.gbk", package="genbankr"))
GENES <- genes(gb)
res <- getSeq(gb@sequence, setNames(GENES, GENES$gene_id))
writeXStringSet(res, "YJ016_I_genes.fa")