R - extract links; web scraping site that asks for consent (accept cookies) RSelenium


I am using rvest to scrape news articles from the article listings at

https://www.derstandard.at/international/2011/12/01

(and from roughly 1,000 other links like it).

For other websites, I have used html_nodes() to extract the links and then looped over them to scrape the text of each article. Here is a short version of what I am trying to do:

library(rvest)
library(tidyverse)
library(RSelenium)

# scrape the text of a single article from its URL
get_text <- function(headline_links) {
  article_page <- read_html(headline_links)
  text_article <- article_page %>%
    html_nodes('.article-body p') %>%
    html_text() %>%
    paste(collapse = " ")
  return(text_article)
}

# one listing page per day from 2011-12-01 to 2012-06-30
date <- seq.Date(as.Date("2011/12/01"), as.Date("2012/06/30"), by = 1)
dat2 <- format(date, "%Y/%m/%d")   # date format used in the URLs, e.g. "2011/12/01"

Newspaper <- "der standard"
articles_standard <- data.frame()

for (page_date in dat2) {

      link <- paste0("https://www.derstandard.at/international/", page_date)
      page <- read_html(link)
      headline_links <- page %>%
            html_nodes('.teaser-inner') %>%
            html_attr("href")

      # scrape the text of every article found on that date page
      text_all <- sapply(headline_links, FUN = get_text, USE.NAMES = FALSE)
      articles_standard <- rbind(articles_standard,
                                 data.frame(Newspaper, text_all, headline_links,
                                            stringsAsFactors = FALSE))
}

However, when I try to extract the links, I get no output. I think the problem is the pop-up that appears when the page is opened, where I have to accept cookies and other settings.

I found a similar issue here: Scrape site that asks for cookies consent with rvest, where it was suggested to use the Network tab of the browser's developer tools to find a non-hidden API. However, I could not find one.
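For what it's worth, the idea behind that suggestion would have looked roughly like the sketch below; the endpoint URL is a made-up placeholder, since I never found a real one:

# Purely illustrative: IF the Network tab had revealed a JSON endpoint, the
# idea would be to call it directly instead of scraping the rendered HTML.
# The URL below is a placeholder, not a real derstandard.at API.
library(httr)
library(jsonlite)

resp <- GET("https://www.derstandard.at/some-hidden-api?date=2011-12-01")
dat  <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))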

I also installed the PhantomJS binary and tried to use it to render the HTML and then scrape it with rvest, following the code provided here: Scraping javascript website in R. However, I got no results from that either (perhaps because PhantomJS is deprecated?).
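For reference, the pattern from that answer looked roughly like the following (the file names are just placeholders; this is the version that gave me no results):

# Render the page with a small PhantomJS script, save the HTML to disk,
# then parse the saved file with rvest. Assumes the phantomjs binary is on
# the PATH; 'scrape.js' and 'rendered.html' are placeholder file names.
writeLines("
var page = require('webpage').create();
page.open('https://www.derstandard.at/international/2011/12/01', function () {
  var fs = require('fs');
  fs.write('rendered.html', page.content, 'w');
  phantom.exit();
});", "scrape.js")

system("phantomjs scrape.js")

read_html("rendered.html") %>%
  html_nodes('.teaser-inner') %>%
  html_attr("href")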

I also read about RSelenium several times, but from what I read it is very slow. I tried it anyway; however, calling $findElement() always gives me an error. I tried to get at the consent button by first switching to the iframe, as suggested here: RSelenium can't find element with given parameters, but that only leads to the same error. Here is what I did:

driver <- rsDriver(port = 1333L, browser = c("firefox"))
remote_driver <- driver[["client"]]
remote_driver$navigate(link)

# the consent dialog is inside an iframe
frames <- remote_driver$findElements("css", "iframe")
# switch to the first iframe
remote_driver$switchToFrame(frames[[1]])
webElem <- remote_driver$findElement(using = "xpath",
                                     value = "/html/body/div/div[2]/div[3]/div[1]/button")
webElem$clickElement()

And the error I get:

Selenium message:Unable to locate element: #notice.message.type-modal div.message-component.message-row.dst-columns div.message-component.message-column button
For documentation on this error, please visit: https://www.seleniumhq.org/exceptions/no_such_element.html
Build info: version: '4.0.0-alpha-2', revision: 'f148142cf8', time: '2019-07-01T21:30:10'
System info: host: 'DESKTOP-2SMICP6', ip: '137.208.131.247', os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '1.8.0_311'
Driver info: driver.version: unknown

Error:   Summary: NoSuchElement
     Detail: An element could not be located on the page using the given search parameters.
     class: org.openqa.selenium.NoSuchElementException
     Further Details: run errorDetails method

This is my first time doing web scraping; I only have some experience in R and am not familiar with HTML, JavaScript, and related technologies. So either I have the wrong CSS selector or XPath (I also tried several others, not shown in the code), or there is another reason it does not work.

I am a little bit lost now, so thanks for any help!

CodePudding user response:

There are many pop-ups on the website. You are right that you have to accept the cookies at the beginning.

Here is the code to get the links for one date, 2011/12/01:

library(rvest)       # needed later for read_html()/html_nodes()
library(RSelenium)

url <- 'https://www.derstandard.at/international/2011/12/01'

# start the browser
driver <- rsDriver(browser = c("firefox"))
remDr  <- driver[["client"]]
remDr$navigate(url)

Now you have to accept the cookies; the consent dialog is inside an iframe.

# the consent dialog sits in the second iframe on the page
webElem <- remDr$findElements("css", "iframe")
remDr$switchToFrame(webElem[[2]])
# click the accept button inside it
remDr$findElement(using = "xpath", '//*[@id="notice"]/div[3]/div[1]/button')$clickElement()

After this step, I would suggest refreshing with remDr$refresh() a couple of times, as a few more pop-ups may disrupt the scraping.
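Something along these lines should do (the number of refreshes and the pause are arbitrary):

for (i in 1:3) {
  remDr$refresh()
  Sys.sleep(2)   # give the page a moment to reload before the next refresh
}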

Then just extract the links to each article:

remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_nodes(xpath = '/html/body/main/div/div[2]/section[1]') %>%
  html_nodes("a") %>%
  html_attr("href")

 [1] "/story/1322531678074/deutsche-bundesanwaltschaft-keine-indizien-fuer-anschlagsplaene-teherans"
 [2] "/story/1322531677068/regierung-uebersteht-vertrauensabstimmung"
 [3] "/story/1322531676520/moskau-lieferte-anti-schiff-raketen"
 [4] "/story/1322531676296/vergewaltigungsopfer-soll-taeter-heiraten"
 [5] "/story/1322531672482/erneut-zahlreiche-tote-bei-anschlaegen"
 [6] "/story/1322531670023/gewerkschaftsbund-von-verfassungsschutz-bespitzelt"

Alternatively, you can use:

remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_nodes('.teaser-inner') %>%
  html_nodes("a") %>%
  html_attr("href")

 [1] "/story/1322531678074/deutsche-bundesanwaltschaft-keine-indizien-fuer-anschlagsplaene-teherans"
 [2] "/story/1322531677068/regierung-uebersteht-vertrauensabstimmung"
 [3] "/story/1322531676520/moskau-lieferte-anti-schiff-raketen"