Scraping information from a webpage using the rvest library in R-CodePudding

Example: Scrape the first poster title from this

I have:

selected the title
right-clicked and inspected it in the Developer Tools
copied the Xpath

Here is my code:

    url   <- "https://www.aiche.org/academy/conferences/synthetic-biology-engineering-evolution-design-seed/2021/proceeding/session/poster-presenters-accepted"
    xpath <- "/html/body/div[1]/div[5]/section/div[2]/div/div[2]/div/div[3]/div/div/article/div/div/div[2]/div[2]/div[1]/div[1]/div[2]/span/a"

    url %>%
     read_html() %>%
     html_element(xpath = xpath) %>%
     html_text()

Question: Why don't I always extract the first title?

CodePudding user response：

The simple answer is that the response body of the page that you are evaluating is changing between requests. When I load that URL into a browser and force the page to reload (Command Shift R for Chrome on Mac, Control F5 for Windows) several times, a different version of the page is displayed.

First:

Second:

The longer answer is that it appears these two variations of the page are being returned due to the site having caching misconfigured, load balancing misconfigured, or a combination of both.

I arrived at this conclusion by looking at the response headers of several requests. The Via header's value is varnish. Varnish is an HTTP caching reverse proxy. I also noticed that the X-Cache header value was HIT and that for both versions of the page, but the X-Cache-Hits and Content-Length values varied. Out of the box, when Varnish sets the X-Cache header to HIT, it means that it is returning a cached copy from memory. The X-Cache-Hits header is basically a counter for the number of times a particular cached page has been returned.

There isn't much that you can do about caching issues without performing cache-busting requests which might be considered abusive by the site's owner.