I am trying to scrape specific portions of HTML-based journal articles. For example, if I only wanted to scrape the "Statistical analyses" section of an article in a Frontiers publication, how could I do that? Since the number of paragraphs and the location of the section change from article to article, SelectorGadget isn't helping.
https://www.frontiersin.org/articles/10.3389/fnagi.2010.00032/full
I've tried using rvest with html_nodes and an XPath expression, but I'm not having any luck. The best I can do is start scraping at the section I want; I can't get it to stop at the end of that section. Any suggestions?
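library(rvest)  # provides read_html(), html_nodes(), html_text() and %>%

example_page <- "https://www.frontiersin.org/articles/10.3389/fnagi.2010.00032/full"
# Selects every <p> sibling after the heading, so it runs past the end of the section
example_stats_section <- read_html(example_page) %>%
  html_nodes(xpath = "//h3[contains(., 'Statistical Analyses')]/following-sibling::p") %>%
  html_text()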
CodePudding user response:
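Since a "Results" section follows each "Statistical analyses" section, try
//h3[.='Statistical Analyses']/following-sibling::p[following::h2[.='Results']]
to get the required section. The predicate keeps only those paragraphs that still have a "Results" heading somewhere after them, so the selection stops where "Results" begins.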
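Here is a minimal sketch of how that XPath could be used with rvest. To keep it runnable offline it parses a small made-up HTML fragment instead of the real article; the fragment (toy_html) and the variable names are my own assumptions about the page layout, not something taken from the Frontiers markup.

```r
library(rvest)

# Hypothetical miniature of the article layout (an assumption, not the real page):
# an <h3> subsection with its paragraphs, followed by an <h2> "Results" heading.
toy_html <- "
<html><body><div>
  <h2>Materials and Methods</h2>
  <h3>Participants</h3>
  <p>Participants paragraph.</p>
  <h3>Statistical Analyses</h3>
  <p>First stats paragraph.</p>
  <p>Second stats paragraph.</p>
  <h2>Results</h2>
  <p>Results paragraph.</p>
</div></body></html>"

# Sibling paragraphs after the heading, limited to those that still have
# a 'Results' <h2> somewhere after them in the document.
stats_xpath <- "//h3[.='Statistical Analyses']/following-sibling::p[following::h2[.='Results']]"

stats_section <- read_html(toy_html) %>%
  html_nodes(xpath = stats_xpath) %>%
  html_text()

print(stats_section)
```

On the toy fragment this returns the two paragraphs under "Statistical Analyses" and nothing from "Results". For the real article you would pass example_page to read_html() instead. Note that this relies on "Statistical Analyses" being the last subsection before "Results": sibling paragraphs belonging to any later h3 subsection would be matched as well.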