Scraping information from a webpage using the rvest library in R

Time:10-28

Example: Scrape the first poster title from the page at the URL in the code below.

I have:

  • selected the title
  • right-clicked and inspected it in the Developer Tools
  • copied the Xpath

Here is my code:

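    library(rvest)  # also re-exports the %>% pipe

    url   <- "https://www.aiche.org/academy/conferences/synthetic-biology-engineering-evolution-design-seed/2021/proceeding/session/poster-presenters-accepted"
    # Absolute XPath copied from the browser's Developer Tools
    xpath <- "/html/body/div[1]/div[5]/section/div[2]/div/div[2]/div/div[3]/div/div/article/div/div/div[2]/div[2]/div[1]/div[1]/div[2]/span/a"

    url %>%
      read_html() %>%                  # download and parse the page
      html_element(xpath = xpath) %>%  # first node matching the XPath
      html_text()                      # extract its text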

Question: Why don't I always extract the first title?


Answer:

The simple answer is that the response body of the page you are requesting changes between requests. When I load that URL in a browser and force the page to reload several times (Command+Shift+R in Chrome on Mac, Ctrl+F5 on Windows), a different version of the page is displayed.
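The same check can be run from R rather than the browser. The following is only a rough sketch: `first_match()` and `tally_results()` are hypothetical helper names, and the request count and pause are arbitrary assumptions. It repeats the scrape a few times and counts how often each distinct result comes back.

```r
library(rvest)

# Return the text of the first node matching `xpath` in an already-parsed page.
# (Hypothetical helper, not part of rvest.)
first_match <- function(page, xpath) {
  html_text(html_element(page, xpath = xpath), trim = TRUE)
}

# Fetch `url` n times and count how often each distinct result appears.
# (Hypothetical helper; n and pause are arbitrary choices.)
tally_results <- function(url, xpath, n = 5, pause = 2) {
  results <- vapply(seq_len(n), function(i) {
    Sys.sleep(pause)  # space the requests out to stay polite
    first_match(read_html(url), xpath)
  }, character(1))
  table(results, useNA = "ifany")
}
```

Calling `tally_results(url, xpath)` with the `url` and `xpath` from the question should show more than one distinct value (or an `NA`) if the server really is returning different bodies.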

First: Version you think you are requesting

Second: Alternative version of the same page

The longer answer is that it appears these two variations of the page are being returned due to the site having caching misconfigured, load balancing misconfigured, or a combination of both.

I arrived at this conclusion by looking at the response headers of several requests. The Via header's value is varnish. Varnish is an HTTP caching reverse proxy. I also noticed that the X-Cache header value was HIT for both versions of the page, but the X-Cache-Hits and Content-Length values varied. Out of the box, when Varnish sets the X-Cache header to HIT, it means that it is returning a cached copy from memory. The X-Cache-Hits header is basically a counter for the number of times a particular cached page has been returned.
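The header comparison can also be done from R. This is a minimal sketch that assumes the httr package is installed (it is not used in the question); `cache_fields()` and `compare_requests()` are hypothetical helper names, and the request count and pause are arbitrary.

```r
# Pull the cache-related fields out of a header list. Names are lowercased
# first because HTTP header names are case-insensitive.
# (Hypothetical helper, not part of httr.)
cache_fields <- function(hdrs) {
  hdrs <- unclass(hdrs)
  names(hdrs) <- tolower(names(hdrs))
  wanted <- c("via", "x-cache", "x-cache-hits", "content-length")
  vapply(wanted, function(h) {
    value <- hdrs[[h]]
    if (is.null(value)) NA_character_ else as.character(value)[1]
  }, character(1))
}

# Request the page n times and return one row of header values per request.
# (Hypothetical helper; assumes the httr package is available.)
compare_requests <- function(url, n = 5, pause = 2) {
  rows <- lapply(seq_len(n), function(i) {
    Sys.sleep(pause)  # space the requests out to stay polite
    cache_fields(httr::headers(httr::GET(url)))
  })
  do.call(rbind, rows)
}
```

`compare_requests(url)` returns a character matrix with one row per request, so differing `x-cache-hits` or `content-length` values across rows are easy to spot.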

There isn't much you can do about caching issues without performing cache-busting requests, which the site's owner might consider abusive.
