I'm attempting to scrape news articles and place them into a data frame so I can analyze the text using quanteda. So far, I've been able to scrape the title, author, date, and URL of each article across several listing pages and place them into a data frame. How can I "go into" each article and "get" its body text so that it also ends up in the data frame?
library(rvest)
library(tidyverse)

get_articles <- function(n_articles) {
  page <- paste0("https://www.theroot.com/news/criminal-justice",
                 "?startIndex=",
                 n_articles) %>%
    read_html()

  tibble(
    title = page %>%
      html_elements(".aoiLP .js_link") %>%
      html_text2(),
    author = page %>%
      html_elements(".llHfhX .js_link , .permalink-bylineprop") %>%
      html_text2(),
    date = page %>%
      html_elements(".js_meta-time") %>%
      html_text2(),
    url = page %>%
      html_elements(".aoiLP .js_link") %>%
      html_attr("href")
  )
}

df <- map_dfr(seq(0, 200, by = 20), get_articles)
I've written some code to do this for one article, but I'm unsure how to apply it to every URL using the function I already have.
get_article <- function(article_link) {
  # Read the article page, then collapse its body paragraphs into one string
  article_page <- read_html(article_link)
  article_page %>%
    html_nodes(".bOfvBY") %>%
    html_text() %>%
    paste(collapse = ",")
}

article_body <- get_article("https://www.theroot.com/mississippi-man-arrested-for-attempting-to-hit-black-ch-1849342160")
CodePudding user response:
df %>%
  slice(1:10) %>%
  mutate(content = map(url, ~ read_html(.x) %>%
                         html_elements(".bOfvBY") %>%
                         html_text2() %>%
                         paste(collapse = ","))) %>%
  unnest(content)
# A tibble: 10 × 5
title author date url content
<chr> <chr> <chr> <chr> <chr>
1 Man Charged in Ahmaud Arbery Murder Asks for Leniency Ahead of Sentencing Kalyn… Toda… http… "Greg …
2 Mississippi Man Arrested for Attempting to Hit Black Children with Car Kalyn… 7/28… http… "White…
3 2 Blacks Girls Charged With Hate Crimes for Attacking Woman on MTA Bus Kalyn… 7/27… http… "Two B…
4 [Updated] Flashy Bishop Whitehead of Brooklyn Reenacts Getting Robbed at Gunpoint Kalyn… 7/25… http… "Bisho…
5 Georgia Gov. Brian Kemp To Testify On Trump Probe To Overturn 2020 Election Murja… 7/25… http… "Profe…
6 Florida To Allow Military Veterans Teach In Schools With No Degree Murja… 7/23… http… "Flori…
7 One of George Floyd’s Killers Gets Sentenced to Only 2 Years In Prison Kalyn… 7/21… http… "Forme…
8 Judge Finds Enough Evidence to Pursue Criminal Charges Against Elijah McClain’s K… Kalyn… 7/20… http… "A jud…
9 Indiana Man Arrested in Connection to Black Girl’s Disappearance. Kalyn… 7/19… http… "Karen…
10 “This is Not a George Floyd Situation!” Says Woman who Called Cops on Andrew Tekl… Kalyn… 7/19… http… "The t…
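If you would rather keep the per-article step in its own function, as in the question, here is a minimal sketch of one way to do it. The names get_article_body, safe_get_article_body and add_article_bodies are hypothetical helpers, not part of rvest or purrr, and the ".bOfvBY" selector is taken from the question and may stop matching if the site changes its markup. Wrapping the function in purrr::possibly() is my own addition so that one failed page yields NA instead of aborting the whole run.

```r
library(rvest)
library(purrr)
library(dplyr)

# Hypothetical helper: read one article and return its body as a single string.
# The default selector comes from the question and is an assumption about the
# site's current markup.
get_article_body <- function(article_link, selector = ".bOfvBY") {
  read_html(article_link) %>%
    html_elements(selector) %>%
    html_text2() %>%
    paste(collapse = " ")
}

# Return NA instead of raising an error when a page cannot be read
safe_get_article_body <- possibly(get_article_body, otherwise = NA_character_)

# Add a character column `content` with one body string per row of `url`
add_article_bodies <- function(df) {
  df %>%
    mutate(content = map_chr(url, safe_get_article_body))
}
```

With the data frame from the question this would be called as add_article_bodies(df). Because map_chr() returns a character vector directly, no unnest() is needed afterwards.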