Locating photo links and scraping from website posting using rvest

Time:09-09

I am trying to use rvest to scrape the photo links and download all the images from postings like this one into separate folders. However, I cannot find the links to ALL the images in the post (including the thumbnails on the left). When I look for the links using html_nodes and html_attr, only the active image (the one displayed after a thumbnail is clicked) is returned.

My code is as below:

library(tidyverse)
library(rvest)
    
url <- "https://clasipar.paraguay.com/inmuebles/propiedades-rurales/feedlot-de-107-hectareas-a-90-km-de-asuncion-71526"

photos <- url %>%
  read_html() %>%
  html_nodes("img") %>%
  html_attr("src")

My output contains only the one active image from the posting, as element [9] of the resulting vector. How can I get the links for all the images in the post?

Answer:

In this case, searching the raw HTML with a regex can be quite convenient, as the image paths follow a strict pattern (str_extract_all comes from stringr, which tidyverse loads):

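url %>%
  read_html() %>%
  toString() %>%   # turn the parsed page back into a single HTML string
  # literal dots are escaped; .*? lazily matches up to the first ".jpg"
  str_extract_all("clasicdn\\.paraguay\\.com/pictures/2016.*?\\.jpg") %>%
  unlist() %>%
  unique()         # each path can appear several times in the page source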

[1] "clasicdn.paraguay.com/pictures/2016/12/27/71526/1634181S.jpg" "clasicdn.paraguay.com/pictures/2016/12/27/71526/1634181L.jpg"
[3] "clasicdn.paraguay.com/pictures/2016/12/27/71526/1634243S.jpg" "clasicdn.paraguay.com/pictures/2016/12/27/71526/1634243L.jpg"
[5] "clasicdn.paraguay.com/pictures/2016/12/27/71526/1634376S.jpg" "clasicdn.paraguay.com/pictures/2016/12/27/71526/1634376L.jpg"
[7] "clasicdn.paraguay.com/pictures/2016/12/27/71526/1634456S.jpg" "clasicdn.paraguay.com/pictures/2016/12/27/71526/1634456L.jpg"
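The question also asks about downloading the images into separate folders, which the snippet above does not cover. Here is a minimal sketch of that step in base R, under a few assumptions that are not confirmed by the post: that the paths ending in "L.jpg" are the full-size images and those ending in "S.jpg" the thumbnails, that the CDN serves the files over https, and that the number at the end of the posting URL is a suitable folder name. The helper download_photos is a hypothetical name introduced here for illustration.

```r
# Posting URL from the question.
url <- "https://clasipar.paraguay.com/inmuebles/propiedades-rurales/feedlot-de-107-hectareas-a-90-km-de-asuncion-71526"

# Matches copied from the output above (first two pairs only, for brevity).
matches <- c(
  "clasicdn.paraguay.com/pictures/2016/12/27/71526/1634181S.jpg",
  "clasicdn.paraguay.com/pictures/2016/12/27/71526/1634181L.jpg",
  "clasicdn.paraguay.com/pictures/2016/12/27/71526/1634243S.jpg",
  "clasicdn.paraguay.com/pictures/2016/12/27/71526/1634243L.jpg"
)

# Assumption: "L.jpg" marks the large version of each photo.
large <- grep("L\\.jpg$", matches, value = TRUE)

# Assumption: the CDN is reachable over https; the regex matches lack a scheme.
photo_urls <- paste0("https://", large)

# Use the trailing number of the posting URL as the folder name.
posting_id <- sub(".*-([0-9]+)$", "\\1", url)

# Hypothetical helper: download every URL into `folder`, creating it if needed.
download_photos <- function(photo_urls, folder) {
  dir.create(folder, recursive = TRUE, showWarnings = FALSE)
  dest <- file.path(folder, basename(photo_urls))
  for (i in seq_along(photo_urls)) {
    # mode = "wb" keeps binary files intact on Windows;
    # try() lets the loop continue if a single download fails.
    try(download.file(photo_urls[i], dest[i], mode = "wb"))
  }
  invisible(dest)
}

# Only hit the network when run by hand.
if (interactive()) {
  download_photos(photo_urls, folder = posting_id)
}
```

To process several postings, the same steps could be wrapped in a function of url and applied over a vector of posting URLs, so that each posting ends up in its own folder.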