How to select "href" of a web page of a specific "target"?

<a class="image teaser-image ng-star-inserted" target="_self" href="/politik/inland/neuwahlen-2022-welche-szenarien-jetzt-realistisch-sind/401773131">

I just want to extract the "href" (for example, from the HTML tag above) so that I can concatenate it with the domain name of the website, "https://kurier.at", and scrape all articles on the home page.

I tried the following code:

library(rvest)
library(lubridate)


kurier_wbpg <- read_html("https://kurier.at")

# I just want the "a" tags whose "target" attribute is "_self"

articleLinks <- kurier_wbpg %>% html_elements("a")%>%
html_elements(css = "tag[attribute=_self]")  %>% 
html_attr("href")%>% 
paste("https://kurier.at",.,sep = "")

When I execute the above code block up to the html_attr("href") step, the result I get is:

character(0)

I think something is wrong with the way I select the HTML element. Can anyone help with this?

CodePudding user response：

You need to narrow your CSS selector down to the image link in the second teaser block, which you can do by using the class names on the page. You can use url_absolute() to add the domain.

library(rvest)
library(magrittr)

url <- 'https://kurier.at/'
result <- read_html(url) %>% 
  html_element('.teasers-2 .image') %>% 
  html_attr('href') %>% 
  url_absolute(url)
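As a small offline illustration of why url_absolute() can be preferable to pasting the domain in front, here is a sketch with made-up href values (the two sample links are assumptions, not taken from the live page). It resolves each href against the base URL, so a link that is already absolute is left untouched, whereas paste0() would prepend the domain regardless.

```r
library(rvest)

base <- "https://kurier.at/"

# Hypothetical sample values: one relative href, one already absolute href
hrefs <- c("/politik/inland", "https://example.com/story")

# paste0() prepends the domain blindly, so the absolute link gets mangled
pasted <- paste0("https://kurier.at", hrefs)

# url_absolute() resolves each href against the base URL instead
resolved <- url_absolute(hrefs, base)

print(pasted)
print(resolved)
```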

The same principle of selecting by class gets all teasers:

results <- read_html(url) %>% 
  html_elements('.teaser .image') %>% 
  html_attr('href') %>% 
  url_absolute(url)

I'm not sure whether you want the bottom block of five included. If so, you can again use classes:

articles <- read_html(url) %>% 
  html_elements('.teaser-title') %>% 
  html_attr('href') %>% 
  url_absolute(url)
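If you would rather filter on the target attribute itself, as the question attempted, a CSS attribute selector should also work. The selector in the question, tag[attribute=_self], is read literally: it looks for an element named "tag" with an attribute named "attribute", so nothing matches. The sketch below runs on a small hand-written snippet built with minimal_html(); the markup and its hrefs are assumptions modelled on the tag in the question, not the live page.

```r
library(rvest)

# Hypothetical markup modelled on the tag in the question (not the live page)
page <- minimal_html('
  <a class="image teaser-image" target="_self" href="/politik/inland/artikel-1">A</a>
  <a class="image teaser-image" target="_blank" href="/extern">B</a>
  <a class="teaser-title" href="/sport/artikel-2">C</a>
')

# Element name followed by [attribute="value"]
links <- page %>%
  html_elements('a[target="_self"]') %>%
  html_attr("href") %>%
  url_absolute("https://kurier.at/")

# The selector from the question asks for <tag attribute="_self">,
# which does not exist in the document
none <- page %>% html_elements("tag[attribute=_self]")

print(links)
print(length(none))
```

On the real page the same selector, a[target="_self"], would presumably also pick up navigation links, so combining it with a class (for example .teaser a[target="_self"]) may be needed.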

CodePudding user response：

It works with XPath:

library(rvest)

kurier_wbpg <- read_html("https://kurier.at")

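# The leading // searches the whole document, so one XPath call is enough;
# naming the element (a) restricts the match to links.
articleLinks <- kurier_wbpg %>%
  html_elements(xpath = '//a[@target="_self"]') %>%
  html_attr('href') %>%
  paste0("https://kurier.at", .)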

articleLinks

# [1] "https://kurier.at/plus"
# [2] "https://kurier.at/coronavirus"
# [3] "https://kurier.at/politik"
# [4] "https://kurier.at/politik/inland"
# [5] "https://kurier.at/politik/ausland"
#...
#...
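One detail worth knowing about the XPath approach: a leading // searches from the document root, and * matches any element name, so //*[@target="_self"] also returns non-link elements that carry that attribute. The sketch below shows the difference on hand-written markup (the snippet, including the form element, is an assumption for illustration only).

```r
library(rvest)

# Hypothetical markup: a link and a form that both carry target="_self"
page <- minimal_html('
  <nav><a target="_self" href="/politik">Politik</a></nav>
  <form target="_self" action="/search"></form>
  <a target="_blank" href="/extern">Extern</a>
')

# * matches every element name, so the <form> is returned as well
any_self <- page %>% html_elements(xpath = '//*[@target="_self"]')
any_names <- html_name(any_self)

# Naming the element keeps only links
a_self <- page %>% html_elements(xpath = '//a[@target="_self"]')
a_hrefs <- html_attr(a_self, "href")

print(any_names)
print(a_hrefs)
```

For the form element, html_attr("href") would give NA, which paste0() would turn into the string "https://kurier.atNA", so restricting the match to a elements seems the safer choice.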