I am trying to scrape multiple URLs (from hundreds of news outlets) and create a new column containing the text of each news article. Previous questions related to my issue seem to cover scraping multiple URLs from a single source (e.g., one news outlet).
I was able to read one URL, just as an example, by using gettxt().
#read one url for example
url1 <- 'https://www.pilotonline.com/news/vp-nw-rally-roe-v-wade-20220504-xs6i4unvhfgn7i4ghbvlni3ysu-story.html#ed=rss_www.pilotonline.com/arcio/rss/category/government/virginia/'
text <- gettxt(url1)
text
But I have over 10,000 URLs, so I want to see if there's a way to scrape them without creating a separate object like "url1" for each individual URL.
The gettxt function returned the whole webpage, including advertisements, so I was wondering if there's any way I could scrape only the main body of each news story.
I also tried read_html for each outlet:
library(rvest)

simple <- read_html("https://www.pilotonline.com/news/vp-nw-rally-roe-v-wade-20220504-xs6i4unvhfgn7i4ghbvlni3ysu-story.html#ed=rss_www.pilotonline.com/arcio/rss/category/government/virginia/")
simple %>%
  html_nodes(".body-paragraph") %>%
  html_text()

simple2 <- read_html("https://www.washingtonpost.com/politics/2022/05/05/democrats-pressure-biden-administration-come-up-with-plan-counter-roe-demise/")
simple2 %>%
  html_nodes(".article-body") %>%
  html_text()
The results look like this, which isn't ideal because the paragraphs aren't combined into one row. Still, this is better than gettxt because it returns only the main text I want to focus on (no advertisements or other unnecessary text):
Results for simple:
[1] "NORFOLK — More than 75 people gathered in front of the federal courthouse Tuesday to “Rally for Roe” in protest of a draft Supreme Court opinion that would throw out the landmark Roe v. Wade abortion rights ruling that has stood for nearly a half-century."
[2] "Chants for the government to get its bans off women’s bodies could be heard down Granby Street as the crowd amassed."
[3] "One protestor said she can’t believe she is still fighting for women’s reproductive rights more than 50 years after she made her first sign and marched her first march."
[4] "“I have been doing this since before most of them were born,” said Bobbie Fisher, of Norfolk, as she motioned to the chanting crowd."
(results [5] through [24] omitted)
But I have hundreds of news outlets, so it's difficult to swap in the right html_nodes selector for each one. Is there a better way to do this (and perhaps to combine the texts into one row per article rather than separate entries like [1] [2] [3] [4] ...)?
CodePudding user response:
You could arrange your scraping information into a dataframe ...
## example:
df <- data.frame(
  source = c("pilotonline", "wp"),
  url = c(
    "https://www.pilotonline.com/news/vp-nw-rally-roe-v-wade-20220504-xs6i4unvhfgn7i4ghbvlni3ysu-story.html#ed=rss_www.pilotonline.com/arcio/rss/category/government/virginia/",
    "https://www.washingtonpost.com/politics/2022/05/05/democrats-pressure-biden-administration-come-up-with-plan-counter-roe-demise/"
  ),
  selector = c(".body-paragraph", ".article-body")
)
... and use {dplyr} and {tidyr} to gather and tabulate the content pieces:
library(rvest)
library(dplyr)
library(tidyr)
results <-
  df |>
  rowwise() |>
  mutate(content = read_html(url) |>
           html_nodes(selector) |>
           html_text() |>
           list()
  ) |>
  ungroup() |>
  ## put the content items ([1] ... [2] ... etc.) into separate rows:
  unnest_longer(content)
Example output:
> results |> select(source, content)
# A tibble: 72 x 2
source content
<chr> <chr>
1 wp "Good morning, Early Birds. A fox fled the scene at the National Zoo ~
2 wp "WpGet the full experience.Choose your planArrowRight"
3 wp "In today’s edition … All the latest coverage from The Post on the Ro~
4 wp "On the Hill"
5 wp "Democrats pressure Biden administration to come up with plan to coun~
6 wp ""
7 wp "Faced with the reality they don't have the votes to codify Roe v. Wa~
8 wp "Advertisement"
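To get one row per article instead of one row per paragraph (which the question asked about), you can skip the unnest_longer() step and instead collapse each article's paragraphs with paste(). A sketch on a toy tibble standing in for `results` (the column names match the code above; the data are made up):

```r
library(dplyr)

## toy stand-in for the `results` tibble above
results <- tibble::tibble(
  source  = c("pilotonline", "pilotonline", "wp"),
  content = c("First paragraph.", "Second paragraph.", "Lede paragraph.")
)

## one row per source, paragraphs joined with newlines
articles <- results |>
  group_by(source) |>
  summarise(text = paste(content, collapse = "\n"), .groups = "drop")

articles$text[articles$source == "pilotonline"]
#> [1] "First paragraph.\nSecond paragraph."
```

In your pipeline you would group by both source and url, so each of the 10,000 articles collapses to its own single row.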
Be careful not to run into copyright trouble with the media outlets, though.
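One practical note: with 10,000+ URLs, some requests will inevitably fail (dead links, timeouts, paywalls), and a single error inside mutate() aborts the whole run. A sketch of one hedge, wrapping the scrape in tryCatch() so a failed URL yields NA instead of an error (safe_scrape is a hypothetical helper name):

```r
library(rvest)

## returns the extracted paragraphs, or NA if the request/parse fails
safe_scrape <- function(url, selector) {
  tryCatch(
    read_html(url) |>
      html_nodes(selector) |>
      html_text(),
    error = function(e) NA_character_
  )
}

## usage inside the mutate() above:
##   mutate(content = list(safe_scrape(url, selector)))
```

Rows with NA content can then be filtered out or retried later, and adding a short Sys.sleep() between requests is also a good idea at this volume.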