Scrape Body of News Articles and Place into Data Frame

Time:08-05

I'm attempting to scrape news articles and place them into a data frame so I can analyze the text using quanteda. So far, I've been able to scrape the titles, authors, dates, and URLs and place them into a data frame, and I can scrape across several pages. How can I "go into" each article to "get" the article body text and add it to the data frame as well?

library(rvest)
library(tidyverse)

get_articles <- function(n_articles) {
  page <- paste0("https://www.theroot.com/news/criminal-justice",
                 "?startIndex=",
                 n_articles) %>%
    read_html()
  
  tibble(
    title = page %>%
      html_elements(".aoiLP .js_link") %>%
      html_text2(),
    author = page %>%
      html_elements(".llHfhX .js_link , .permalink-bylineprop") %>%
      html_text2(),
    date = page %>%
      html_elements(".js_meta-time") %>%
      html_text2(),
    url = page %>%
      html_elements(".aoiLP .js_link") %>%
      html_attr("href")
  )
}

df <- map_dfr(seq(0, 200, by = 20), get_articles)

I've written some code to do this with one article, but I'm unsure how to replicate it using the function I already have.

get_article <- function(article_link) {
  # Download the page, then join its body paragraphs into one string
  read_html(article_link) %>%
    html_nodes(".bOfvBY") %>%
    html_text() %>%
    paste(collapse = ",")
}

get_article("https://www.theroot.com/mississippi-man-arrested-for-attempting-to-hit-black-ch-1849342160")

Answer:

df %>%
  slice(1:10) %>%
  mutate(content = map(url, ~ read_html(.x) %>%
                         html_elements(".bOfvBY") %>%
                         html_text2() %>%
                         paste(collapse = ","))) %>% 
  unnest(content)

# A tibble: 10 × 5
   title                                                                              author date  url   content
   <chr>                                                                              <chr>  <chr> <chr> <chr>  
 1 Man Charged in Ahmaud Arbery Murder Asks for Leniency Ahead of Sentencing          Kalyn… Toda… http… "Greg …
 2 Mississippi Man Arrested for Attempting to Hit Black Children with Car             Kalyn… 7/28… http… "White…
 3 2 Blacks Girls Charged With Hate Crimes for Attacking Woman on MTA Bus             Kalyn… 7/27… http… "Two B…
 4 [Updated] Flashy Bishop Whitehead of Brooklyn Reenacts Getting Robbed at Gunpoint  Kalyn… 7/25… http… "Bisho…
 5 Georgia Gov. Brian Kemp To Testify On Trump Probe To Overturn 2020 Election        Murja… 7/25… http… "Profe…
 6 Florida To Allow Military Veterans Teach In Schools With No Degree                 Murja… 7/23… http… "Flori…
 7 One of George Floyd’s Killers Gets Sentenced to Only 2 Years In Prison             Kalyn… 7/21… http… "Forme…
 8 Judge Finds Enough Evidence to Pursue Criminal Charges Against Elijah McClain’s K… Kalyn… 7/20… http… "A jud…
 9 Indiana Man Arrested in Connection to Black Girl’s Disappearance.                  Kalyn… 7/19… http… "Karen…
10 “This is Not a George Floyd Situation!” Says Woman who Called Cops on Andrew Tekl… Kalyn… 7/19… http… "The t…
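
The same idea can be wrapped in reusable functions, which may be more convenient when scraping all rows rather than the first ten. The following is a minimal sketch, not a definitive implementation. The names `extract_body`, `get_article_body`, `safe_get_article_body`, and `add_article_bodies` are hypothetical, the one-second pause is an assumed politeness delay, and `.bOfvBY` is the selector from the question (a generated class name that may change on the site). Paragraphs are joined with a space instead of a comma, on the assumption that this suits later text analysis better.

```r
library(rvest)
library(purrr)
library(dplyr)

# Pull the body text out of an already-parsed page.
# ".bOfvBY" is the selector used in the question; it may change over time.
extract_body <- function(page, selector = ".bOfvBY") {
  page %>%
    html_elements(selector) %>%
    html_text2() %>%
    paste(collapse = " ")  # one string per article
}

# Download one article and return its body as a single string.
get_article_body <- function(article_link, pause = 1) {
  Sys.sleep(pause)  # assumed delay between requests, adjust as needed
  extract_body(read_html(article_link))
}

# possibly() returns NA_character_ instead of stopping when a request fails.
safe_get_article_body <- possibly(get_article_body, otherwise = NA_character_)

# Add the body text as a character column to a data frame with a `url` column.
add_article_bodies <- function(df) {
  df %>% mutate(content = map_chr(url, safe_get_article_body))
}
```

With the `df` built by `get_articles()` above, `df_with_text <- add_article_bodies(df)` would add a `content` column. Because `map_chr()` returns a character vector directly, no `unnest()` step is needed.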