R - web scraping site that asks for consent


I am using rvest to scrape news articles from the results listed at https://www.derstandard.at/international/2011/12/01 (and about 1000 other links on that site). For other websites, I used html_nodes to extract the links and wrote a loop that opens each one and scrapes the article text. Here is a short version of what I'm trying to do:


library(rvest)

# download one article and return its body text as a single string
get_text <- function(headline_link) {
  article_page <- read_html(headline_link)
  article_page %>%
    html_nodes('.article-body p') %>%
    html_text() %>%
    paste(collapse = " ")
}

# one overview page per day, formatted the way the site's URLs expect
date <- seq.Date(as.Date("2011/12/01"), as.Date("2012/06/30"), by = 1)
dat2 <- format(date, "%Y/%m/%d")

Newspaper <- "der standard"
articles_standard <- data.frame()

for (page_date in dat2) {
  link <- paste0("https://www.derstandard.at/international/", page_date)
  page <- read_html(link)
  # collect the article links from the teaser boxes
  headline_links <- page %>%
    html_nodes('.teaser-inner') %>%
    html_nodes("a") %>%
    html_attr("href")
  text_all <- sapply(headline_links, FUN = get_text, USE.NAMES = FALSE)
  text_all <- data.frame(text_all, headline_links, stringsAsFactors = FALSE)
  articles_standard <- rbind(articles_standard, data.frame(Newspaper, text_all, stringsAsFactors = FALSE))
}

However, when I try to extract the links, I get no output. I think that the problem is the pop-up that appears when opening the webpage where I have to accept cookies and other stuff.

I found a similar issue in "Scrape site that asks for cookies consent with rvest", where it was suggested to use the network analysis tool in my browser to find a non-hidden API. However, I could not find one.

I installed the PhantomJS binary and tried to use it to render the HTML and then scrape it with rvest, using the code provided in "Scraping javascript website in R". However, I got no results from that (because PhantomJS is deprecated?).

I also read about RSelenium several times, but from what I read it is very slow. I tried it anyway, but the $findElement function always gives me an error. I wanted to extract the information by first switching to the iframe, as suggested in "RSelenium can't find element with given parameters", but that only leads to the same error. Here's what I did:

library(RSelenium)

driver <- rsDriver(port = 1333L, browser = "firefox")
remote_driver <- driver[["client"]]
remote_driver$navigate("https://www.derstandard.at/international/2011/12/01")
# content is in iframe
frames <- remote_driver$findElements("css", "iframe")
# switch to first iframe
remote_driver$switchToFrame(frames[[1]])
webElem <- remote_driver$findElement(using = "xpath", value = "/html/body/div/div[2]/div[3]/div[1]/button")

And the error I get:

Selenium message:Unable to locate element: #notice.message.type-modal div.message-component.message-row.dst-columns div.message-component.message-column button
For documentation on this error, please visit: https://www.seleniumhq.org/exceptions/no_such_element.html
Build info: version: '4.0.0-alpha-2', revision: 'f148142cf8', time: '2019-07-01T21:30:10'
System info: host: 'DESKTOP-2SMICP6', ip: '', os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '1.8.0_311'
Driver info: driver.version: unknown

Error:   Summary: NoSuchElement
     Detail: An element could not be located on the page using the given search parameters.
     class: org.openqa.selenium.NoSuchElementException
     Further Details: run errorDetails method

This is my first time doing web scraping. I only have some experience in R and am not familiar with HTML, JavaScript, or similar tools. So either I have the wrong CSS selector or XPath (I also tried several others, not shown in the code), or there's another reason it does not work.

I am a little bit lost now, so thanks for any help!

Answer:

There are many pop-ups on the website. You are right that you have to accept the cookies at the beginning.

Here is the code to get the links for one date, 2011/12/01:

library(RSelenium)
library(rvest)

url <- 'https://www.derstandard.at/international/2011/12/01'
# start the browser
driver <- rsDriver(browser = "firefox")
remDr <- driver[["client"]]
# open the page
remDr$navigate(url)

Now you have to accept the cookies. The consent button is inside an iframe, so the driver has to switch into that frame before it can find the button.

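webElem <- remDr$findElements("css", "iframe")
# assumption: the consent notice is in the first iframe; adjust the index if it is not
remDr$switchToFrame(webElem[[1]])
remDr$findElement(using = "xpath", '//*[@id="notice"]/div[3]/div[1]/button')$clickElement()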

After this step, I would suggest refreshing with remDr$refresh() a couple of times, because a few more pop-ups may otherwise disrupt the scraping.

Then just extract the link for each article:

remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_nodes(xpath = '/html/body/main/div/div[2]/section[1]') %>%
  html_nodes("a") %>%
  html_attr("href")

[1] "/story/1322531678074/deutsche-bundesanwaltschaft-keine-indizien-fuer-anschlagsplaene-teherans"
[2] "/story/1322531677068/regierung-uebersteht-vertrauensabstimmung"
[3] "/story/1322531676520/moskau-lieferte-anti-schiff-raketen"
[4] "/story/1322531676296/vergewaltigungsopfer-soll-taeter-heiraten"
[5] "/story/1322531672482/erneut-zahlreiche-tote-bei-anschlaegen"
[6] "/story/1322531670023/gewerkschaftsbund-von-verfassungsschutz-bespitzelt"

Alternatively, you can select the teaser boxes by their CSS class:

remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_nodes('.teaser-inner') %>%
  html_nodes("a") %>%
  html_attr("href")

[1] "/story/1322531678074/deutsche-bundesanwaltschaft-keine-indizien-fuer-anschlagsplaene-teherans"
[2] "/story/1322531677068/regierung-uebersteht-vertrauensabstimmung"
[3] "/story/1322531676520/moskau-lieferte-anti-schiff-raketen"
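
The links returned above are relative paths, while read_html() needs a full URL. Here is a minimal sketch of how the paths could be turned into absolute article URLs, and how the overview-page URL for each day could be generated. It assumes every path is relative to https://www.derstandard.at (the host from the question's URL). The helper names to_absolute and date_pages are made up for illustration and are not part of rvest or RSelenium.

```r
# Assumption: all article paths are relative to this host (taken from the question's URL).
base_url <- "https://www.derstandard.at"

# Hypothetical helper: prefix relative paths, leave already-absolute URLs untouched.
to_absolute <- function(paths, base = base_url) {
  is_relative <- !grepl("^https?://", paths)
  paths[is_relative] <- paste0(base, paths[is_relative])
  paths
}

# Hypothetical helper: one overview-page URL per day in the requested range.
date_pages <- function(from, to, section = "international") {
  days <- seq.Date(as.Date(from), as.Date(to), by = "day")
  paste0(base_url, "/", section, "/", format(days, "%Y/%m/%d"))
}

# Two of the relative links from the output above
links <- c("/story/1322531678074/deutsche-bundesanwaltschaft-keine-indizien-fuer-anschlagsplaene-teherans",
           "/story/1322531677068/regierung-uebersteht-vertrauensabstimmung")
article_urls <- to_absolute(links)

# Overview pages for the first three days of the range
pages <- date_pages("2011-12-01", "2011-12-03")
```

Each element of article_urls could then be passed to the get_text() function from the question, and each element of pages could be opened with remDr$navigate() in a loop.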