Fetching multiple PubMed abstracts using R (httr)

This is a continuation of the task from this question: Working with pubmed api in R (httr) to retrieve abstracts

Using the answer there, I am able to find the PubMed IDs for the abstracts of interest.

Now I am trying to obtain the title and full text of these abstracts as a two-column data frame, one column for the title and one for the abstract text.

From the API documentation, I understood that passing multiple IDs is possible, so I tried the code below.


library(XML)
library(httr)
library(glue)
library(dplyr)

# Search term, inserted into the esearch URL with glue
query = 'asthma[mesh] AND leukotrienes[mesh] AND 2009[pdat]'

reqq = glue('https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term={query}')

# Note: this assignment overwrites the glue-built URL above with a hard-coded query
reqq = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=science[journal] AND breast cancer AND 2008[pdat]&usehistory=y"

op = GET(reqq)

content(op)


df_op <- op %>% xml2::read_xml() %>% xml2::as_list()

pmids <- df_op$eSearchResult$IdList %>% unlist(use.names = FALSE)

The code above returns the PMIDs as a character vector. I then try to pass them to efetch:

reqq1 = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=pmids&rettype=abstract&retmode=xml"

op1 = GET(reqq1)

content(op1)

I get an error saying: ID list is empty! Possibly it has no correct IDs.

I then tried reformatting the IDs into a single comma-separated string, without quotation marks or spaces in between:

idc = paste(shQuote(pmids, type = "cmd"), collapse = ", ")
 
idc = gsub('"', '', idc)
idc = gsub(' ', '', idc)

# Then pass them to the same code: 

reqq1 = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=idc&rettype=abstract&retmode=xml"

op1 = GET(reqq1)

content(op1)

But I still get the same error. What I am trying to achieve is to get an XML file with the data for these abstracts (and eventually extract the title and abstract body into a data frame). Is it possible to pass all the IDs in one efetch call, or do they have to be sent one by one in a loop? Any guidance, or a short script that works, would be much appreciated.

Thank you.

Answer:

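reqq1 is still just a literal string: the text pmids (or idc) inside the quotes is never replaced by the variable's value, so efetch receives id=pmids and finds no valid IDs. You can use glue to insert the actual values of pmids. I think you can query multiple IDs together, in which case paste0(..., collapse = ',') collapses the IDs into one comma-separated string.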

reqq1 = glue("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id={paste0(pmids, collapse = ',')}&rettype=abstract&retmode=xml")
op1 = GET(reqq1)
content(op1)

{xml_document}
<PubmedArticleSet>
[1] <PubmedArticle>\n  <MedlineCitation Status="MEDLINE" Owner="NLM">\n    <PMID Version="1">19008416</PMID> ...
[2] <PubmedArticle>\n  <MedlineCitation Status="MEDLINE" Owner="NLM">\n    <PMID Version="1">18927361</PMID> ...
[3] <PubmedArticle>\n  <MedlineCitation Status="MEDLINE" Owner="NLM">\n    <PMID Version="1">18787170</PMID> ...
[4] <PubmedArticle>\n  <MedlineCitation Status="MEDLINE" Owner="NLM">\n    <PMID Version="1">18487186</PMID> ...
[5] <PubmedArticle>\n  <MedlineCitation Status="MEDLINE" Owner="NLM">\n    <PMID Version="1">18239126</PMID> ...
[6] <PubmedArticle>\n  <MedlineCitation Status="MEDLINE" Owner="NLM">\n    <PMID Version="1">18239125</PMID> ...
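
To get from this XML to the title/abstract data frame the question asks for, something along the following lines might work. It is only a sketch: fetch_pubmed_xml and pubmed_to_df are hypothetical helper names, the XPath expressions assume the usual PubmedArticle layout (ArticleTitle, Abstract/AbstractText), and the sample XML is made up so the parsing step can be tried without a network call. Passing the parameters through GET's query argument is an alternative to building the URL with glue.

```r
library(httr)
library(xml2)

# Hypothetical helper: fetch all PMIDs in one efetch request.
# httr URL-encodes the values given in `query`.
fetch_pubmed_xml <- function(pmids) {
  resp <- GET(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi",
    query = list(
      db      = "pubmed",
      id      = paste(pmids, collapse = ","),
      rettype = "abstract",
      retmode = "xml"
    )
  )
  stop_for_status(resp)  # fail early on HTTP errors
  read_xml(content(resp, as = "text", encoding = "UTF-8"))
}

# Hypothetical helper: one row per <PubmedArticle>.
# Assumes the usual PubMed XML element names.
pubmed_to_df <- function(doc) {
  articles <- xml_find_all(doc, ".//PubmedArticle")
  abstracts <- vapply(seq_along(articles), function(i) {
    # Structured abstracts have several <AbstractText> parts; join them.
    parts <- xml_text(xml_find_all(articles[[i]], ".//Abstract/AbstractText"))
    if (length(parts) == 0) NA_character_ else paste(parts, collapse = " ")
  }, character(1))
  data.frame(
    pmid     = xml_text(xml_find_first(articles, ".//MedlineCitation/PMID")),
    title    = xml_text(xml_find_first(articles, ".//ArticleTitle")),
    abstract = abstracts,
    stringsAsFactors = FALSE
  )
}

# Made-up sample in the same shape as the efetch output, for an offline check.
sample_xml <- read_xml('<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">1</PMID>
      <Article>
        <ArticleTitle>First made-up title</ArticleTitle>
        <Abstract>
          <AbstractText Label="BACKGROUND">Part one.</AbstractText>
          <AbstractText Label="RESULTS">Part two.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">2</PMID>
      <Article>
        <ArticleTitle>Second made-up title</ArticleTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>')

demo_df <- pubmed_to_df(sample_xml)
```

With the real data, the call would be abstracts_df <- pubmed_to_df(fetch_pubmed_xml(pmids)). Articles without an abstract get NA in the abstract column. For very long ID lists, the E-utilities documentation suggests sending the request with HTTP POST rather than GET, so batching or POST may be needed there.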