Home > Software engineering >  Rentrez is pulling the wrong data from NCBI in R?
Rentrez is pulling the wrong data from NCBI in R?

Time:01-28

I am trying to download sequence data from E. coli samples within the state of Washington - it's about 1283 sequences, which I know is a lot. The problem that I am running into is that entrez_search and/or entrez_fetch seem to be pulling the wrong data. For example, the following R code does pull 1283 IDs, but when I use entrez_fetch on those IDs, the sequence data I get is from chickens and corn and things that are not E. coli:

search <- entrez_search(db = "biosample", 
                        term = "Escherichia coli[Organism] AND geo_loc_name=USA:WA[attr]",
                        retmax = 9999, use_history = T)

Similarly, I tried pulling the sequence from one sample manually as a test. When I search for the accession number SAMN30954130 on the NCBI website, I see metadata for an E. coli sample. When I use this code, I see metadata for a chicken:

search <- entrez_search(db = "biosample", 
                        term = "SAMN30954130[ACCN]",
                        retmax = 9999, use_history = T)
fetch_test <- entrez_fetch(db = "nucleotide",
                           id = search$ids,
                           rettype = "xml")
fetch_list <- xmlToList(fetch_test)

CodePudding user response:

The issue here is that you are using a Biosample UID to query the Nucleotide database. However, the UID is then interpreted as a Nucleotide UID, so you get a sequence record unrelated to your original Biosample query.

What you need to use in this situation is entrez_link, which uses a UID to link records between two databases.

For example, your Biosample accession SAMN30954130 has the Biosample UID 30954130. You link that to Nucleotide like this:

nuc_links <- entrez_link(dbfrom='biosample', id=30954130, db='nuccore')

And you can get the corresponding Nucleotide UID(s) like this:

nuc_links$links$biosample_nuccore

[1] "2307876014"

And then:

fetch_test <- entrez_fetch(db = "nucleotide",
                           id = 2307876014,
                           rettype = "xml")

This is covered in the section "Finding cross-references" of the rentrez tutorial.

  • Related