Scroll down a page and load all items before using read_html()

Time:04-12

I am trying to scrape 30 items from a website, but the most I can retrieve is 16 to 20 of them. The website only loads more items as you scroll down.

I thought about sending varying scroll keys such as key = "up_arrow", key = "down_arrow", key = "home" and key = "end" to try to trigger loading of all the items, while also adding some random Sys.sleep() pauses to make the scrolling look more human.

I also cannot find an option to scroll over a set duration, e.g. take 5 seconds to scroll the page.

How can I load the full page correctly before calling read_html()?

Code/Data

library(RSelenium)
library(rvest)  # provides read_html(), html_nodes() and the %>% pipe

link = "https://www.fotocasa.es/es/comprar/viviendas/barcelona-capital/sagrada-familia/l/"

####################################################
openAndScrollPage <- function(link){
  
  driver = rsDriver(browser = c("firefox"))
  remDr <- driver[["client"]]
  remDr$navigate(link)
  
  #accept cookie
  remDr$findElement(using = "xpath",'/html/body/div[1]/div[4]/div/div/div/footer/div/button[2]')$clickElement()
  
  Sys.sleep(1)
  #scroll to the end of page
  webElem <- remDr$findElement("css", "html")
  webElem$sendKeysToElement(list(key="end"))
  
  #jump between the end and the top of the page, with a random pause,
  #to try to trigger loading of the remaining items
  WAIT <- function(){
    
    webElem$sendKeysToElement(list(key="end"))
    
    webElem$sendKeysToElement(list(key = "up_arrow"))
    
    webElem$sendKeysToElement(list(key = "home"))
    
    Sys.sleep(floor(runif(1, 5, 10)))
    
    webElem$sendKeysToElement(list(key="end"))
  }
  WAIT()
  
  Sys.sleep(1)
  
  html_full_page = remDr$getPageSource()[[1]] %>% 
    read_html()
  
  return(html_full_page)
  
}

####################################################
html_full_page = openAndScrollPage(link)

x <- html_full_page %>% 
  html_nodes('.re-CardPackPremium-carousel') 

Answer:

To load the whole page, scroll down bit by bit instead of jumping straight to the end of the page, so that the items have time to load as you go.

#after navigating and accepting the cookie, scroll bit by bit

for(i in 1:30){
  print(i)
  remDr$executeScript("window.scrollBy(0,500);")
  Sys.sleep(1)
}

#get nodes of all houses
html_full_page = remDr$getPageSource()[[1]] %>% 
  read_html()
x <- html_full_page %>% 
  html_nodes('.re-CardPackPremium-carousel') 
{xml_nodeset (30)}
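
The fixed count of 30 iterations above is a guess that happens to fit this page. As a minimal sketch, the loop can instead stop once the bottom of the page is reached. The helper name scroll_to_bottom and its arguments are hypothetical, not part of RSelenium. It takes two caller-supplied functions, one that scrolls a step and one that reports whether the bottom has been reached, so the loop itself can be tried without a browser. The RSelenium wiring in the trailing comments assumes that executeScript() returns the script's value as the first element of a list.

```r
# Hypothetical helper: scroll step by step until the bottom is reached,
# or until max_steps is hit (a safety cap for endlessly growing pages).
scroll_to_bottom <- function(scroll_step, at_bottom,
                             max_steps = 100, pause = 1) {
  steps <- 0
  while (steps < max_steps && !isTRUE(at_bottom())) {
    scroll_step()
    Sys.sleep(pause)  # give newly revealed items time to load
    steps <- steps + 1
  }
  steps  # number of scroll steps actually taken
}

# Simulated page (no browser needed): 3000 px tall, 800 px viewport.
page <- new.env()
page$y <- 0
page$height <- 3000
page$viewport <- 800

fake_step <- function() {
  page$y <- min(page$y + 500, page$height - page$viewport)
}
fake_bottom <- function() page$y + page$viewport >= page$height

n <- scroll_to_bottom(fake_step, fake_bottom, pause = 0)

# With a live session (assumption: remDr is the client from the question):
# scroll_to_bottom(
#   scroll_step = function() remDr$executeScript("window.scrollBy(0,500);"),
#   at_bottom   = function() unlist(remDr$executeScript(
#     "return window.innerHeight + window.scrollY >= document.body.scrollHeight - 2;"
#   ))[1]
# )
```

Because the bottom check runs after each pause, a page that grows while loading simply keeps the loop going. A smaller scroll step with a shorter pause gives a slower, smoother scroll: for example, 50 steps with pause = 0.1 spread the scroll over roughly 5 seconds, plus the time the browser calls take.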