Webscraping of links from a web page in R


I would also like to get the links to the properties, but for some reason I am not getting all the links from each page. This code works, but only for the first page. What am I missing regarding the link extraction?

# To get $rooms, $m2, $price, $link
library(rvest)
library(dplyr)

flat_I = data.frame()

for (i in 7:100) {
  link <- paste0("https://www.immobilienscout24.at/regional/wien/wien/immobilie-kaufen/seite-", i)
  page <- read_html(link)
  
  #parse out the parent nodes
  results <- page %>% html_elements(".DHILY")
  
  #retrieve the rooms, m2 and price from each parent
  rooms <- results %>% html_element(".ufaLY:nth-child(1)") %>%
    html_text()
  
  m2 <- results %>% html_element(".ufaLY:nth-child(2)") %>%
    html_text()
  
  price <- results %>% html_element(".tSnnN") %>%
    html_text()
  
  link <- page %>% 
    html_nodes("a._aOSG") %>% 
    html_attr("href") %>% 
    paste0("https://www.immobilienscout24.at", ., sep="")
  
  flat_I = rbind(flat_I, data.frame(rooms, m2, price, link, stringsAsFactors = FALSE))
  print(paste("Page:", i))
  
}

CodePudding user response:

The links are located in two classes, s5PQF and YXjuW. We can either extract the links from them individually, or get all the links from the page and filter them to retain only the desired ones.

Furthermore, you have defined link twice in your loop; avoid such repetitions.

library(stringr)

page %>% html_nodes('a') %>% 
    html_attr('href') %>% unique() %>% 
    str_subset('expose')
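
Putting the two fixes together, the loop could look like the sketch below. The CSS selectors (.DHILY, .ufaLY, .tSnnN) and the "expose" URL pattern are taken from the posts above and may change as the site is updated, so verify them against the live page. Note also that rbind() requires rooms, m2, price, and links to have the same length; if the counts ever differ, extract the link from each parent node instead.

```r
library(rvest)
library(dplyr)
library(stringr)

flat_I <- data.frame()

for (i in 7:100) {
  # use a different name than the result column to avoid redefining `link`
  page_url <- paste0("https://www.immobilienscout24.at/regional/wien/wien/immobilie-kaufen/seite-", i)
  page <- read_html(page_url)

  # parse out the parent nodes
  results <- page %>% html_elements(".DHILY")

  # retrieve the rooms, m2 and price from each parent
  rooms <- results %>% html_element(".ufaLY:nth-child(1)") %>% html_text()
  m2    <- results %>% html_element(".ufaLY:nth-child(2)") %>% html_text()
  price <- results %>% html_element(".tSnnN") %>% html_text()

  # collect every href once, keep only listing URLs, and build absolute links
  links <- page %>%
    html_elements("a") %>%
    html_attr("href") %>%
    unique() %>%
    str_subset("expose") %>%
    paste0("https://www.immobilienscout24.at", .)

  flat_I <- rbind(flat_I,
                  data.frame(rooms, m2, price, link = links,
                             stringsAsFactors = FALSE))
  print(paste("Page:", i))
}
```

The filter-and-absolutize step can be checked offline with a hypothetical vector of extracted hrefs: duplicates are dropped by unique(), non-listing paths are dropped by str_subset("expose"), and paste0() prepends the domain.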