Example: Scrape the first poster title from this
I have:
- selected the title
- right-clicked and inspected it in the Developer Tools
- copied the Xpath
Here is my code:
url <- "https://www.aiche.org/academy/conferences/synthetic-biology-engineering-evolution-design-seed/2021/proceeding/session/poster-presenters-accepted"
xpath <- "/html/body/div[1]/div[5]/section/div[2]/div/div[2]/div/div[3]/div/div/article/div/div/div[2]/div[2]/div[1]/div[1]/div[2]/span/a"
url %>%
read_html() %>%
html_element(xpath = xpath) %>%
html_text()
Question: Why don't I always extract the first title?
CodePudding user response:
The simple answer is that the response body of the page that you are evaluating is changing between requests. When I load that URL into a browser and force the page to reload (Command Shift R
for Chrome on Mac, Control F5
for Windows) several times, a different version of the page is displayed.
The longer answer is that it appears these two variations of the page are being returned due to the site having caching misconfigured, load balancing misconfigured, or a combination of both.
I arrived at this conclusion by looking at the response headers of several requests. The Via
header's value is varnish
. Varnish is an HTTP caching reverse proxy. I also noticed that the X-Cache
header value was HIT
and that for both versions of the page, but the X-Cache-Hits
and Content-Length
values varied. Out of the box, when Varnish sets the X-Cache
header to HIT
, it means that it is returning a cached copy from memory. The X-Cache-Hits
header is basically a counter for the number of times a particular cached page has been returned.
There isn't much that you can do about caching issues without performing cache-busting requests which might be considered abusive by the site's owner.