Scraping links from a web page at a specific position

Time:04-12

I am trying to collect a number of links from a website.

For example, I have the code below. My idea was to collect the link behind the text "Leer más" ("Read more"), which is the element I took the XPath from.

library(httr)
library(rvest)

# desktop_agents is assumed to be a character vector of at least 10 user-agent strings defined elsewhere
url = "https://www.fotocasa.es/es/alquiler/viviendas/madrid-capital/todas-las-zonas/l/181"
x <- GET(url, add_headers('user-agent' = desktop_agents[sample(1:10, 1)]))
x %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="App"]/div[2]/div[1]/main/div/div[3]/section/article[1]/div/a/p/span[2]')

This gives me the following but not the link:

{xml_nodeset (1)}
[1] <span >Leer más</span>

Additionally, I thought about collecting all links:

x %>% 
  read_html() %>% 
  html_nodes("a") %>% 
  html_attr("href")

This gives me a lot of links but not the links to the individual webpages I want.

I would like to have a list of links such as:

https://www.fotocasa.es/es/alquiler/vivienda/madrid-capital/aire-acondicionado-calefaccion-terraza-trastero-ascensor-amueblado-internet/162262978/d

https://www.fotocasa.es/es/alquiler/vivienda/madrid-capital/aire-acondicionado-calefaccion-trastero-ascensor-amueblado/159750574/d

https://www.fotocasa.es/es/alquiler/vivienda/madrid-capital/aire-acondicionado-calefaccion-jardin-zona-comunitaria-ascensor-patio-amueblado-parking-television-internet-piscina/162259162/d

Answer:

Those links are stored in a JavaScript object inside a script tag. You can use a regex to extract the string that defines the object, unescape it so that jsonlite can parse it, and then apply a function to the parsed JSON that pulls out just the URLs of interest.

library(rvest)
library(jsonlite)
library(magrittr)
library(stringr)
library(purrr)

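link <- 'https://www.fotocasa.es/es/alquiler/viviendas/madrid-capital/todas-las-zonas/l/181'
# Read the page and flatten it to text, including the contents of script tags
p <- read_html(link) %>% html_text()
# Capture the string literal passed to JSON.parse in the window.__INITIAL_PROPS__ assignment
s <- str_match(p, 'window\\.__INITIAL_PROPS__ = JSON\\.parse\\("(.*)".*?;')[,2]
# Unescape the embedded quotes so the string is valid JSON, then parse it
data <- jsonlite::parse_json(gsub('\\\\\\"', '\\\"', gsub('\\\\"', '"', s)))
# Take the es-ES detail path of each listing and resolve it against the page URL
links <- purrr::map(data$initialSearch$result$realEstates, ~ .x$detail$`es-ES` %>% url_absolute(link))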
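
To try the extract, unescape and parse steps without hitting the live site, here is a minimal offline sketch. The page text, the example.com base URL and the /listing/... paths are made up for illustration; only the window.__INITIAL_PROPS__ pattern and the initialSearch > result > realEstates structure are taken from the code above. It assumes R 4.0 or later for the raw string literal.

```r
library(stringr)
library(jsonlite)
library(xml2)

# Hypothetical base URL and page text (not real fotocasa data).
base <- "https://www.example.com/search/l/1"
page_text <- r"---(var a = 1; window.__INITIAL_PROPS__ = JSON.parse("{\"initialSearch\":{\"result\":{\"realEstates\":[{\"detail\":{\"es-ES\":\"/listing/1/d\"}},{\"detail\":{\"es-ES\":\"/listing/2/d\"}}]}}}"); var b = 2;)---"

# Step 1: capture the string literal passed to JSON.parse.
s <- str_match(page_text, 'window\\.__INITIAL_PROPS__ = JSON\\.parse\\("(.*)".*?;')[, 2]

# Step 2: the same two substitutions as above; on this toy input the
# inner call removes the backslashes before the quotes and the outer
# call finds nothing left to change.
json <- gsub('\\\\\\"', '\\\"', gsub('\\\\"', '"', s))

# Step 3: parse the JSON into nested lists and walk down to the listings.
data <- parse_json(json)
estates <- data$initialSearch$result$realEstates

# Step 4: resolve each relative path against the base URL.
links <- vapply(
  estates,
  function(e) url_absolute(e$detail[["es-ES"]], base),
  character(1)
)
print(links)
```

vapply is used here so that the result is a plain character vector; purrr::map, as in the code above, returns a list instead.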