I am learning rvest.
I intend to scrape my search results. Here is the webpage.
I looked at the output of html_nodes(), but it does not contain what I see on the webpage.
What could I do?
Here is the 'body' of the page:
webpage %>% html_node('body')
{html_node}
<body>
[1] <noscript>\n <div id="no-script-banner">\n <div >\n <div no-session-banner" id="no-session-banner" hidden>\n <div >\n <div usa-skipnav" href="#search-results">\n Skip to main page content\n </a>
[4] <div role="complementary" id="ncov-alert-from-server" style="display: block;" dat ...
[5] <div ></div>
[6] <header role="banner" data-section="Header"><div >\n\t\t<div > ...
[7] <div role="navigation" aria-label="access keys">\n<a id="nws_header_accesskey_0" href="https://www.ncbi.nlm.nih.gov/guide/bro ...
[8] <section data-section="Alerts"><div ></div>\n</section>
[9] <a id="maincontent" aria-label="Main page content below" role="navigation"></a>
[10] <main id="search-page"><h1 >Search Page</h1>\n \n \n\n\n\n<input type="hidden" n ...
[11] <div id="ncbi-footer">\n <div role="complementary" title="Links to NCBI Literature Resources"> ...
[12] <script src="https://cdn.ncbi.nlm.nih.gov/pubmed/0399d7a0-471a-4f7d-84af-66091af9d657/CACHE/js/output.293fbf76aa18.js"></script>
[13] <script src="https://cdn.ncbi.nlm.nih.gov/pubmed/0399d7a0-471a-4f7d-84af-66091af9d657/CACHE/js/output.29588445dbd9.js"></script>
[14] <script>\n ncbi.awesome.basePage.init({\n userInfo: {\n isLoggedIn: false,\n username: "",\n log ...
[15] <script type="text/javascript">\n jQuery.getScript("https://www.ncbi.nlm.nih.gov/core/alerts/alerts.js", function () {\n ...
[16] <script defer type="text/javascript" src="https://cdn.ncbi.nlm.nih.gov/core/pinger/pinger.js"> </script>
[17] <svg xmlns="http://www.w3.org/2000/svg"><defs><lineargradient id="timeline-filter-selected-g ...
[18] <script src="https://cdn.ncbi.nlm.nih.gov/pubmed/0399d7a0-471a-4f7d-84af-66091af9d657/CACHE/js/output.714a700656e1.js"></script>
[19] <script>\n ncbi.awesome.searchPage.init({\n searchQuery: "eliminat matrix effect HPLC\\u002Dms/ms",\n searchCons ...
CodePudding user response:
We can get the titles of the search results with:
library(rvest)
library(dplyr)
library(stringr)
# `url` is assumed to hold the address of the search results page
url %>% read_html() %>% html_nodes('.docsum-title') %>% html_text() %>% str_remove_all('\\n')
[1] " HPLC-MS/MS analysis of peramivir in rat plasma: Elimination of matrix effect using the phospholipid-removal solid-phase extraction method. "
[2] " Development of matrix effect-free MISPE-UHPLC-MS/MS method for determination of lovastatin in Pu-erh tea, oyster mushroom, and red yeast rice.
And the links to the articles with:
# The href attributes are relative paths, so prepend the site root
hrefs <- url %>% read_html() %>% html_nodes('.docsum-title') %>% html_attr('href')
paste0('https://pubmed.ncbi.nlm.nih.gov', hrefs)
[1] "https://pubmed.ncbi.nlm.nih.gov/28976569/" "https://pubmed.ncbi.nlm.nih.gov/28410522/" "https://pubmed.ncbi.nlm.nih.gov/27491846/"
[4] "https://pubmed.ncbi.nlm.nih.gov/31532096/" "https://pubmed.ncbi.nlm.nih.gov/31288535/" "https://pubmed.ncbi.nlm.nih.gov/29433096/"
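If it helps, the two steps above can be combined so the page is parsed once and the titles stay paired with their links. This is only a minimal sketch: the inline HTML and its IDs are made-up stand-ins for the real search page, and the only thing carried over from the answer is the `.docsum-title` selector.

```r
library(rvest)

# Hypothetical stand-in for the search results page (IDs and titles are made up)
page <- minimal_html('
  <a class="docsum-title" href="/11111111/"> First title </a>
  <a class="docsum-title" href="/22222222/"> Second title </a>
')

# Select the result nodes once, then read both the text and the href from them
nodes <- html_elements(page, ".docsum-title")
results <- data.frame(
  title = html_text(nodes, trim = TRUE),
  link = paste0("https://pubmed.ncbi.nlm.nih.gov", html_attr(nodes, "href")),
  stringsAsFactors = FALSE
)
```

With the real page, `page` would be `read_html(url)` instead of the inline snippet.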
CodePudding user response:
I would first check that your search terms are spelt correctly (the query in your output contains "eliminat") and decide whether you want AND or OR between the terms, so that the request matches what you intend. Once that is settled, you could use the public APIs to run your query, extract the PubMed IDs, and then request the associated documents.
API guidance: https://www.ncbi.nlm.nih.gov/home/develop/api/
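To make the AND/OR choice concrete, here is a rough sketch of assembling the search request from a vector of terms. `build_esearch_url()` is a hypothetical helper, not part of any package, and only the `db`, `retmode`, `retmax` and `term` parameters are set; it uses base R only and does not contact the server.

```r
# Hypothetical helper: join search terms with a boolean operator and
# build an ESearch request URL (no request is sent here)
build_esearch_url <- function(terms, operator = "AND", retmax = 200) {
  operator <- match.arg(operator, c("AND", "OR"))
  query <- paste(terms, collapse = paste0(" ", operator, " "))
  paste0(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
    "?db=pubmed&retmode=json",
    "&retmax=", retmax,
    # Percent-encode spaces and slashes so the term is a valid URL component
    "&term=", utils::URLencode(query, reserved = TRUE)
  )
}

terms <- c("eliminate", "matrix", "effect", "hplc ms/ms")
url_and <- build_esearch_url(terms)
url_or <- build_esearch_url(terms, operator = "OR")
```

Either URL could then be passed to `jsonlite::read_json()` to retrieve the list of IDs.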
library(jsonlite)
library(rvest)
library(tidyverse)

# Scrape the ID, title, authors and abstract from a single PubMed article page
get_data <- function(link) {
  page <- read_html(link)
  data.frame(
    link = link,
    id = page %>% html_element('[title="PubMed ID"]') %>% html_text(trim = TRUE),
    title = page %>% html_element(".heading-title") %>% html_text(trim = TRUE),
    authors = page %>% html_elements(".full-name") %>% html_text(trim = TRUE) %>% paste(., collapse = ', '),
    abstract = page %>% html_element("#enc-abstract") %>% html_text2()
  )
}

# Query the ESearch API; URLencode() escapes the spaces in the search term
query <- URLencode("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmax=200&retmode=json&term=eliminate AND matrix AND effect AND hplc ms/ms&mindate=2013&maxdate=2022")
r <- jsonlite::read_json(query)
ids <- unlist(r$esearchresult$idlist)

# Fetch the article pages only if the search returned any IDs
if (length(ids) > 0) {
  links <- sprintf("https://pubmed.ncbi.nlm.nih.gov/%s", ids)
  results <- map_dfr(links, get_data)
}