I'm very new to coding and am trying to scrape all the article URLs from a news website. I've successfully scraped the article titles, authors, dates, and summaries and placed them into a data frame, but the same process doesn't work for the URLs. I'm using SelectorGadget but can't seem to pick the right element.
library(rvest)
library(tidyverse)
link <- "https://www.theroot.com/news/criminal-justice"
webpage <- read_html(link)
articlelinks <- webpage %>% html_nodes(".diJdnO") %>% html_attr("href")
This returns a vector of 20 NAs. I would love any help correcting this code!
CodePudding user response:
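A suggestion for scraping multiple pages: the function below takes the startIndex offset of a listing page, and map_dfr() row-binds the results for offsets 0, 20, ..., 200.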
library(tidyverse)
library(rvest)
# Scrape one listing page; n_articles is the startIndex offset (0, 20, 40, ...)
get_articles <- function(n_articles) {
  page <- paste0("https://www.theroot.com/news/criminal-justice",
                 "?startIndex=",
                 n_articles) %>%
    read_html()

  # One row per article; the href sits on the same .js_link element as the title
  tibble(
    title = page %>%
      html_elements(".aoiLP .js_link") %>%
      html_text2(),
    author = page %>%
      html_elements(".llHfhX .js_link , .permalink-bylineprop") %>%
      html_text2(),
    date = page %>%
      html_elements(".js_meta-time") %>%
      html_text2(),
    url = page %>%
      html_elements(".aoiLP .js_link") %>%
      html_attr("href")
  )
}

# Offsets 0, 20, ..., 200 give 11 pages, row-bound into one data frame
df <- map_dfr(seq(0, 200, by = 20), get_articles)
# A tibble: 220 x 4
   title                                                 author date  url
   <chr>                                                 <chr>  <chr> <chr>
 1 Brooklyn Bishop Gets Robbed at Gunpoint During Servi~ Kalyn~ Mond~ http~
 2 Georgia Gov. Brian Kemp To Testify On Trump Probe To~ Murja~ Mond~ http~
 3 Florida To Allow Military Veterans Teach In Schools ~ Murja~ Satu~ http~
 4 One of George Floyd’s Killers Gets Sentenced to Only~ Kalyn~ 7/21~ http~
 5 Judge Finds Enough Evidence to Pursue Criminal Charg~ Kalyn~ 7/20~ http~
 6 Indiana Man Arrested in Connection to Black Girl’s D~ Kalyn~ 7/19~ http~
 7 “This is Not a George Floyd Situation!” Says Woman w~ Kalyn~ 7/19~ http~
 8 Three Black Men Exonerated in Horrible 1995 Subway K~ Kalyn~ 7/18~ http~
 9 NAACP Calls On Department of Justice To Investigate ~ Murja~ 7/15~ http~
10 Autopsy: Jayland Walker Suffered 46 Bullet Wounds     Kalyn~ 7/15~ http~
# ... with 210 more rows
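To see which pages the map_dfr() call requests, the offsets can be expanded without touching the network. This is only a sketch that builds the URLs; base_url, start_indexes, and page_urls are illustrative names that do not appear in the answer's code.

```r
# Same base URL and query parameter as in get_articles()
base_url <- "https://www.theroot.com/news/criminal-justice"

# Offsets passed to get_articles(): 0, 20, 40, ..., 200
start_indexes <- seq(0, 200, by = 20)

# One listing URL per offset
page_urls <- paste0(base_url, "?startIndex=", start_indexes)

length(start_indexes)  # number of pages requested
head(page_urls, 3)     # first three listing URLs
```

With 20 articles per page, the 11 offsets account for the 220 rows in the tibble above.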
CodePudding user response:
library(tidyverse)
library(rvest)
page <- "https://www.theroot.com/news/criminal-justice" %>%
  read_html()

tibble(
  url = page %>%
    html_elements(".aoiLP") %>%
    html_elements(".js_link") %>%
    html_attr("href"),
  title = page %>%
    html_elements(".aoiLP") %>%
    html_elements(".js_link") %>%
    html_text2()
)
# A tibble: 20 x 2
   url                                                                title
   <chr>                                                              <chr>
 1 https://www.theroot.com/brooklyn-bishop-gets-robbed-at-gunpoint-d~ Broo~
 2 https://www.theroot.com/georgia-gov-brian-kemp-to-testify-on-trum~ Geor~
 3 https://www.theroot.com/florida-to-allow-military-veterans-teach-~ Flor~
 4 https://www.theroot.com/one-of-george-floyd-s-killers-gets-senten~ One ~
 5 https://www.theroot.com/judge-finds-enough-evidence-to-pursue-cri~ Judg~
 6 https://www.theroot.com/indiana-man-arrested-in-connection-to-bla~ Indi~
 7 https://www.theroot.com/this-is-not-a-george-floyd-situation-says~ “Thi~
 8 https://www.theroot.com/three-men-exonerated-in-horrible-1995-sub~ Thre~
 9 https://www.theroot.com/naacp-calls-on-department-of-justice-to-i~ NAAC~
10 https://www.theroot.com/autopsy-jayland-walker-suffered-46-bullet~ Auto~
11 https://www.theroot.com/detroit-to-pay-7-5m-to-black-man-who-clai~ Detr~
12 https://www.theroot.com/pro-trump-man-charged-for-staging-arson-a~ Pro-~
13 https://www.theroot.com/footage-of-uvalde-school-shooting-stirs-a~ Foot~
14 https://www.theroot.com/akron-recognizes-jayland-walker-s-funeral~ Akro~
15 https://www.theroot.com/jayland-walker-family-and-legal-team-addr~ Jayl~
16 https://www.theroot.com/white-man-makes-over-100-racist-threats-a~ Whit~
17 https://www.theroot.com/wisconsin-supreme-court-allows-chrystul-k~ Wisc~
18 https://www.theroot.com/kamala-harris-calls-for-assault-weapons-b~ Kama~
19 https://www.theroot.com/jayland-walker-s-sister-speaks-out-follow~ Jayl~
20 https://www.theroot.com/mississippi-judges-block-new-dna-tests-in~ Miss~
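As for why the original attempt returned NAs: html_attr() returns NA for every matched element that lacks the requested attribute, so 20 NAs suggests the selector matched 20 elements that have no href of their own, presumably wrappers rather than the a tags. A minimal offline sketch of that behavior follows; the markup and the card class are made up for illustration and are not the site's actual HTML.

```r
library(rvest)

# Hypothetical markup: the class "card" is on a wrapper <div>,
# while the href is on the <a> nested inside it.
page <- minimal_html('
  <div class="card"><a class="js_link" href="https://example.com/a">First</a></div>
  <div class="card"><a class="js_link" href="https://example.com/b">Second</a></div>
')

# Selecting the wrapper: a <div> has no href attribute, so each result is NA
wrapper_href <- page %>%
  html_elements(".card") %>%
  html_attr("href")

# Selecting the anchor inside the wrapper returns the links themselves
anchor_href <- page %>%
  html_elements(".card .js_link") %>%
  html_attr("href")
```

The fix in the code above works the same way: it narrows the selection to the .js_link anchors before asking for href.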