I am using rvest to scrape news articles from the results listed at https://www.derstandard.at/international/2011/12/01 (and from about 1,000 other links like it). For other webpages, I used html_nodes to extract the article links and wrote a loop that opens each link and scrapes the text of each article. Here is a short version of what I'm trying to do:
library(rvest)
library(tidyverse)
library(RSelenium)

get_text <- function(headline_links) {
  article_page <- read_html(headline_links)
  text_article <- article_page %>%
    html_nodes('.article-body p') %>%
    html_text() %>%
    paste(collapse = " ")
  return(text_article)
}
date <- seq.Date(as.Date("2011/12/01"), as.Date("2012/06/30"), by = 1)
dat2 <- format(date, "%Y/%m/%d")   # dates in the form the URLs use, e.g. "2011/12/01"
Newspaper <- as.list("der standard")
articles_standard <- data.frame()

for (page_date in dat2) {
  link <- paste0("https://www.derstandard.at/international/", page_date)
  page <- read_html(link)
  headline_links <- page %>%
    html_nodes('.teaser-inner') %>%
    html_attr("href")
  text_all <- sapply(headline_links, FUN = get_text, USE.NAMES = FALSE)
  text_all <- as.data.frame(cbind(text_all, headline_links))
  articles_standard <- rbind(articles_standard, data.frame(Newspaper, text_all, stringsAsFactors = FALSE))
}
However, when I try to extract the links, I get no output. I think the problem is the pop-up that appears when the page is opened, where I have to accept cookies and other things.
I found a similar issue here, Scrape site that asks for cookies consent with rvest, where it was suggested to use the Network Analysis tool in the browser to find a non-hidden API. However, I could not find one.
I installed the PhantomJS binary and tried to use it to render the HTML and scrape the result with rvest, following the code provided here: Scraping javascript website in R. However, I got no results from that either (because PhantomJS is deprecated?).
I also read about RSelenium several times, but from what I read it is very slow. I tried it anyway; however, $findElement always gives me an error. I wanted to extract the information by first switching to the iframe, as suggested here: RSelenium can't find element with given parameters, but that only leads me to the same error. Here's what I did:
driver <- rsDriver(port = 1333L, browser = c("firefox"))
remote_driver <- driver[["client"]]
remote_driver$navigate(link)

# the consent dialog is inside an iframe
frames <- remote_driver$findElements("css", "iframe")

# switch to the first iframe
remote_driver$switchToFrame(frames[[1]])
webElem <- remote_driver$findElement(using = "xpath", value = "/html/body/div/div[2]/div[3]/div[1]/button")
webElem$clickElement()
And the error I get:
Selenium message:Unable to locate element: #notice.message.type-modal div.message-component.message-row.dst-columns div.message-component.message-column button
For documentation on this error, please visit: https://www.seleniumhq.org/exceptions/no_such_element.html
Build info: version: '4.0.0-alpha-2', revision: 'f148142cf8', time: '2019-07-01T21:30:10'
System info: host: 'DESKTOP-2SMICP6', ip: '137.208.131.247', os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '1.8.0_311'
Driver info: driver.version: unknown
Error: Summary: NoSuchElement
Detail: An element could not be located on the page using the given search parameters.
class: org.openqa.selenium.NoSuchElementException
Further Details: run errorDetails method
This is my first time doing web scraping; I only have some experience with R and am not familiar with HTML, JavaScript, or other web technologies. So either I have the wrong CSS selector or XPath (I also tried several others, not shown in the code), or there is another reason it does not work.
I am a bit lost now, so thanks for any help!
CodePudding user response:
There are many pop-ups on the website. You are right: you have to accept the cookie consent at the beginning.
Here is the code to get the links for one date, 2011/12/01:
url <- 'https://www.derstandard.at/international/2011/12/01'

# start the browser
library(RSelenium)
driver <- rsDriver(browser = c("firefox"))
remDr <- driver[["client"]]
remDr$navigate(url)
Now you have to accept the cookie consent, which sits inside an iframe:
webElem <- remDr$findElements("css", "iframe")
remDr$switchToFrame(webElem[[2]])
remDr$findElement(using = "xpath", '//*[@id="notice"]/div[3]/div[1]/button')$clickElement()
After this step I would suggest refreshing with remDr$refresh() a couple of times, as a few pop-ups may otherwise disrupt the scraping.
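A minimal sketch of that step (the number of refreshes and the pause length are guesses, adjust as needed):
for (i in 1:3) {
  remDr$refresh()   # reload the page so lingering pop-ups disappear
  Sys.sleep(2)      # give the page a moment to finish loading
}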
Then just extract the links for each article:
remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_nodes(xpath = '/html/body/main/div/div[2]/section[1]') %>%
  html_nodes("a") %>%
  html_attr("href")
[1] "/story/1322531678074/deutsche-bundesanwaltschaft-keine-indizien-fuer-anschlagsplaene-teherans"
[2] "/story/1322531677068/regierung-uebersteht-vertrauensabstimmung"
[3] "/story/1322531676520/moskau-lieferte-anti-schiff-raketen"
[4] "/story/1322531676296/vergewaltigungsopfer-soll-taeter-heiraten"
[5] "/story/1322531672482/erneut-zahlreiche-tote-bei-anschlaegen"
[6] "/story/1322531670023/gewerkschaftsbund-von-verfassungsschutz-bespitzelt"
Alternatively, you can use:
remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_nodes('.teaser-inner') %>%
  html_nodes("a") %>%
  html_attr("href")
[1] "/story/1322531678074/deutsche-bundesanwaltschaft-keine-indizien-fuer-anschlagsplaene-teherans"
[2] "/story/1322531677068/regierung-uebersteht-vertrauensabstimmung"
[3] "/story/1322531676520/moskau-lieferte-anti-schiff-raketen"