I am trying to use rvest to scrape the photo links and download all the images from postings like the one in the code below into separate folders. However, I am stuck finding the links to ALL the images in the post (including the thumbnails on the left): when I look for the links using html_nodes and html_attr, only the active image (the one currently clicked on, not the thumbnails) is returned.
My code is as below:
library(tidyverse)
library(rvest)

url <- "https://clasipar.paraguay.com/inmuebles/propiedades-rurales/feedlot-de-107-hectareas-a-90-km-de-asuncion-71526"

# Collect the src attribute of every <img> element on the page
photos <- url %>%
  read_html() %>%
  html_nodes("img") %>%
  html_attr("src")
The output contains only the one active image from the posting (element [9] of the resulting vector). How can I get the links for all the images in the post?
Answer:
In this case, a regex search over the raw HTML is convenient because the image paths follow a strict pattern (str_extract_all comes from stringr, which library(tidyverse) loads):
url %>%
  read_html() %>%
  toString() %>%
  # Literal dots are escaped; "2016" is specific to this posting's upload date
  str_extract_all("clasicdn\\.paraguay\\.com/pictures/2016.*?\\.jpg") %>%
  unlist() %>%
  unique()
[1] "clasicdn.paraguay.com/pictures/2016/12/27/71526/1634181S.jpg" "clasicdn.paraguay.com/pictures/2016/12/27/71526/1634181L.jpg"
[3] "clasicdn.paraguay.com/pictures/2016/12/27/71526/1634243S.jpg" "clasicdn.paraguay.com/pictures/2016/12/27/71526/1634243L.jpg"
[5] "clasicdn.paraguay.com/pictures/2016/12/27/71526/1634376S.jpg" "clasicdn.paraguay.com/pictures/2016/12/27/71526/1634376L.jpg"
[7] "clasicdn.paraguay.com/pictures/2016/12/27/71526/1634456S.jpg" "clasicdn.paraguay.com/pictures/2016/12/27/71526/1634456L.jpg"
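
To get from these matches to files in separate folders, something along the following lines might work. This is only a sketch under a few assumptions that are not confirmed by the site: that the "S"/"L" suffix means thumbnail/large, that the CDN serves the files over https, and that the directory just before the file name (71526 here) is the posting ID. The helpers posting_id and download_images are hypothetical names made up for this example, and the actual download call is left commented out.

```r
# Example matches in the shape returned by the regex above (no URL scheme).
matches <- c(
  "clasicdn.paraguay.com/pictures/2016/12/27/71526/1634181S.jpg",
  "clasicdn.paraguay.com/pictures/2016/12/27/71526/1634181L.jpg",
  "clasicdn.paraguay.com/pictures/2016/12/27/71526/1634243S.jpg",
  "clasicdn.paraguay.com/pictures/2016/12/27/71526/1634243L.jpg"
)

# Keep only the "L" variants (assumed to be the large images)
# and prepend a scheme (assumed to be https).
large_urls <- paste0("https://", grep("L\\.jpg$", matches, value = TRUE))

# Hypothetical helper: take the directory just before the file name
# as the posting ID.
posting_id <- function(u) basename(dirname(u))

# Hypothetical helper: save each image as <root>/<posting id>/<file name>.
download_images <- function(urls, root = "images") {
  for (u in urls) {
    dir <- file.path(root, posting_id(u))
    dir.create(dir, recursive = TRUE, showWarnings = FALSE)
    # mode = "wb" writes the file in binary mode (matters on Windows)
    download.file(u, file.path(dir, basename(u)), mode = "wb")
  }
}

# Uncomment to actually download (requires network access):
# download_images(large_urls)
```

Because the folder name is derived from each URL, the same function can be fed the matches from several postings at once and will sort the files into one folder per posting.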