How do I scrape only one section of text from a webpage in R?


I am trying to scrape specific portions of HTML-based journal articles. For example, if I only wanted to scrape the "Statistical analyses" section of an article in a Frontiers publication, how could I do that? Since the number of paragraphs and the location of the section change from article to article, SelectorGadget isn't helping.

https://www.frontiersin.org/articles/10.3389/fnagi.2010.00032/full

I've tried using rvest with html_nodes and an XPath expression, but I'm not having any luck. The best I can do is start scraping at the section I want, but I can't get it to stop afterwards. Any suggestions?

library(rvest)

example_page <- "https://www.frontiersin.org/articles/10.3389/fnagi.2010.00032/full"
# Every <p> following the "Statistical Analyses" <h3> -- but this also picks up paragraphs from the sections that come after it
example_stats_section <- read_html(example_page) %>% 
  html_nodes(xpath = "//h3[contains(., 'Statistical Analyses')]/following-sibling::p") %>%
  html_text()

CodePudding user response:

Since there is a "Results" section (an h2 heading) after each "Statistical analyses" section, try

//h3[.='Statistical Analyses']/following-sibling::p[following::h2[.="Results"]]

to get the required section.
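
Plugged into the pipeline from the question, that would look something like the sketch below (the inner double quotes around "Results" are switched to single quotes so the XPath can sit inside an R string; the heading levels assumed here are the h3/h2 from the answer and may need adjusting for a given article):

library(rvest)

example_page <- "https://www.frontiersin.org/articles/10.3389/fnagi.2010.00032/full"

# Keep only the <p> nodes that come after the "Statistical Analyses" <h3>
# AND before the next "Results" <h2>, so scraping stops at the section boundary
example_stats_section <- read_html(example_page) %>%
  html_nodes(xpath = "//h3[.='Statistical Analyses']/following-sibling::p[following::h2[.='Results']]") %>%
  html_text()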
