Scrape a NIH webpage with rvest

Time:12-12

I am learning rvest.

I want to scrape my search results. Here is the webpage:

https://pubmed.ncbi.nlm.nih.gov/?term=eliminat matrix effect HPLC-ms/ms&filter=years.2013-2022&size=200

I tried html_nodes(), but the output does not contain what I see on the webpage.

What can I do?

Here is the 'body'.

webpage %>% html_node('body')
{html_node}
<body>
 [1] <noscript>\n  <div  id="no-script-banner">\n    <div >\n      <div no-session-banner" id="no-session-banner" hidden>\n  <div >\n    <div usa-skipnav" href="#search-results">\n    Skip to main page content\n  </a>
 [4] <div role="complementary" id="ncov-alert-from-server"  style="display: block;" dat ...
 [5] <div ></div>
 [6] <header  role="banner" data-section="Header"><div >\n\t\t<div > ...
 [7] <div role="navigation" aria-label="access keys">\n<a id="nws_header_accesskey_0" href="https://www.ncbi.nlm.nih.gov/guide/bro ...
 [8] <section data-section="Alerts"><div ></div>\n</section>
 [9] <a id="maincontent" aria-label="Main page content below" role="navigation"></a>
[10] <main  id="search-page"><h1 >Search Page</h1>\n    \n    \n\n\n\n<input type="hidden" n ...
[11] <div id="ncbi-footer">\n      <div  role="complementary" title="Links to NCBI Literature Resources"> ...
[12] <script src="https://cdn.ncbi.nlm.nih.gov/pubmed/0399d7a0-471a-4f7d-84af-66091af9d657/CACHE/js/output.293fbf76aa18.js"></script>
[13] <script src="https://cdn.ncbi.nlm.nih.gov/pubmed/0399d7a0-471a-4f7d-84af-66091af9d657/CACHE/js/output.29588445dbd9.js"></script>
[14] <script>\n    ncbi.awesome.basePage.init({\n      userInfo: {\n        isLoggedIn: false,\n        username: "",\n        log ...
[15] <script type="text/javascript">\n    jQuery.getScript("https://www.ncbi.nlm.nih.gov/core/alerts/alerts.js", function () {\n   ...
[16] <script defer type="text/javascript" src="https://cdn.ncbi.nlm.nih.gov/core/pinger/pinger.js"> </script>
[17] <svg  xmlns="http://www.w3.org/2000/svg"><defs><lineargradient id="timeline-filter-selected-g ...
[18] <script src="https://cdn.ncbi.nlm.nih.gov/pubmed/0399d7a0-471a-4f7d-84af-66091af9d657/CACHE/js/output.714a700656e1.js"></script>
[19] <script>\n    ncbi.awesome.searchPage.init({\n      searchQuery: "eliminat matrix effect HPLC\\u002Dms/ms",\n      searchCons ...

CodePudding user response:

We can get the titles of the search results with

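library(rvest)
library(dplyr)
library(stringr)

# Search URL from the question; URLencode() escapes the spaces in the query
url <- URLencode('https://pubmed.ncbi.nlm.nih.gov/?term=eliminat matrix effect HPLC-ms/ms&filter=years.2013-2022&size=200')

# Each result title is an element with the class "docsum-title"
url %>% read_html() %>% html_nodes('.docsum-title') %>% html_text() %>% str_remove_all('\\n')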

  [1] "                HPLC-MS/MS analysis of peramivir in rat plasma: Elimination of matrix effect using the phospholipid-removal solid-phase extraction method.              "                                                                                                                                             
  [2] "                Development of matrix effect-free MISPE-UHPLC-MS/MS method for determination of lovastatin in Pu-erh tea, oyster mushroom, and red yeast rice.

And the links to the articles with

df = url %>% read_html() %>% html_nodes('.docsum-title') %>% html_attr('href') 

paste0('https://pubmed.ncbi.nlm.nih.gov', df)

  [1] "https://pubmed.ncbi.nlm.nih.gov/28976569/" "https://pubmed.ncbi.nlm.nih.gov/28410522/" "https://pubmed.ncbi.nlm.nih.gov/27491846/"
  [4] "https://pubmed.ncbi.nlm.nih.gov/31532096/" "https://pubmed.ncbi.nlm.nih.gov/31288535/" "https://pubmed.ncbi.nlm.nih.gov/29433096/"
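
If the titles and links are needed side by side, both can be read from the same set of nodes in one pass. The following is only a minimal sketch: it parses a small inline HTML snippet as a stand-in for the search page, and assumes each result title is an <a> element with the class docsum-title and a relative href, as the output above suggests.

```r
library(rvest)

# Stand-in for the downloaded search page (assumption: the real page marks
# each title as <a class="docsum-title" href="/<PubMed ID>/">)
page <- minimal_html('
  <a class="docsum-title" href="/28976569/">
    HPLC-MS/MS analysis of peramivir in rat plasma: Elimination of matrix effect using the phospholipid-removal solid-phase extraction method.
  </a>
  <a class="docsum-title" href="/28410522/">
    Development of matrix effect-free MISPE-UHPLC-MS/MS method for determination of lovastatin in Pu-erh tea, oyster mushroom, and red yeast rice.
  </a>')

# Select the title nodes once and reuse them for text and href
nodes <- html_elements(page, ".docsum-title")

results <- data.frame(
  title = html_text(nodes, trim = TRUE),   # trim = TRUE drops the padding
  link  = paste0("https://pubmed.ncbi.nlm.nih.gov", html_attr(nodes, "href")),
  stringsAsFactors = FALSE
)
```

Selecting the nodes once also means the page is downloaded and parsed a single time rather than once per column.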

CodePudding user response:

First, I would check that your search terms are spelled correctly (for example, "eliminat" in your URL) and decide whether you want AND or OR between the terms, so that the request matches what you intend. Once that is settled, you could use the public APIs to run your query, extract the PubMed IDs, and then request the associated documents.

API guidance: https://www.ncbi.nlm.nih.gov/home/develop/api/

library(jsonlite)
library(rvest)
library(tidyverse)

# Scrape the ID, title, authors, and abstract from one PubMed article page
get_data <- function(link) {
  page <- read_html(link)
  data.frame(
    link = link,
    id = page %>% html_element('[title="PubMed ID"]') %>% html_text(trim = TRUE),
    title = page %>% html_element(".heading-title") %>% html_text(trim = TRUE),
    # An article has several authors, so collapse them into one string
    authors = page %>% html_elements(".full-name") %>% html_text(trim = TRUE) %>% paste(., collapse = ', '),
    abstract = page %>% html_element("#enc-abstract") %>% html_text2()
  )
}

# URLencode() escapes the spaces in the query so the request URL is valid
r <- jsonlite::read_json(URLencode("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmax=200&retmode=json&term=eliminate AND matrix AND effect AND hplc ms/ms&mindate=2013&maxdate=2022"))
ids <- r$esearchresult$idlist

if(length(ids)>0){
  
  links <- sprintf("https://pubmed.ncbi.nlm.nih.gov/%s", ids)
  results <- map_dfr(links, get_data)
  
}
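
To try AND versus OR without editing the URL by hand, the query string could be assembled from a vector of terms. This is a rough sketch only: build_esearch_url() is a hypothetical helper, not part of any package, and it reuses the same esearch parameters (db, retmode, retmax, term, mindate, maxdate) as the request above.

```r
# Hypothetical helper: joins the search terms with a boolean operator and
# builds an esearch URL with the same parameters as the request above.
build_esearch_url <- function(terms, operator = "AND",
                              mindate = 2013, maxdate = 2022, retmax = 200) {
  # e.g. "eliminate AND matrix AND effect AND hplc ms/ms"
  query <- paste(terms, collapse = paste0(" ", operator, " "))
  paste0(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
    "?db=pubmed&retmode=json",
    "&retmax=", retmax,
    # reserved = TRUE also escapes "/" so the term stays a single parameter
    "&term=", URLencode(query, reserved = TRUE),
    "&mindate=", mindate,
    "&maxdate=", maxdate
  )
}

esearch_url <- build_esearch_url(c("eliminate", "matrix", "effect", "hplc ms/ms"))
```

The resulting string can be passed to jsonlite::read_json() in place of the hard-coded URL, and operator = "OR" switches to the broader search.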