The webpage https://polititweet.org/ stores the complete tweet history of certain politicians, CEOs and so on. Importantly, it also keeps the deleted tweets I am interested in. Now I would like to write a web scraper in R to retrieve the texts of Elon Musk's deleted tweets, but I fail because my selector only gives me an href attribute.
Here is my attempt (after an edit suggested by @Bensstats):
library(rvest)

url_page1 <- read_html("https://polititweet.org/tweets?page=1&deleted=True&account=44196397&search=")

# Selecting the whole card and asking for href only returns the tweet IDs, not the text
tweets_deleted <- html_nodes(url_page1, ".tweet-card") |> html_attr("href")
tweets_deleted
With this, I get the IDs of the deleted tweets on page 1. However, what I want is the text of the deleted tweets itself.
Moreover, there are currently 9 pages of deleted tweets for Musk. As this number is likely to grow in the future, I would like to extract the number of pages automatically and then automate the process for each page (via a loop or something similar).
I would really appreciate it if any of you had an idea how to solve these problems!
Thanks a lot!
CodePudding user response:
Get all of Elon's deleted tweets, pages 1 to 9:
library(tidyverse)
library(rvest)

get_tweets <- function(page) {
  # Build the URL for the given page number and collect all tweet cards
  tweets <-
    str_c(
      "https://polititweet.org/tweets?page=",
      page,
      "&deleted=True&account=44196397&search="
    ) %>%
    read_html() %>%
    html_elements(".tweet-card")

  # Extract the handle, the tweet text, and the timestamp from each card
  tibble(
    tweeter = tweets %>%
      html_element("strong .has-text-grey") %>%
      html_text2(),
    tweet = tweets %>%
      html_element(".small-top-margin") %>%
      html_text2(),
    time = tweets %>%
      html_element(".is-paddingless") %>%
      html_text2() %>%
      str_remove_all("Posted ")
  )
}

map_dfr(1:9, get_tweets)
# A tibble: 244 × 3
tweeter tweet time
<chr> <chr> <chr>
1 @elonmusk "@BBCScienceNews Tesla Megapacks are extremely e… Nov.…
2 @elonmusk "\u2014 PolitiTweet.org" Nov.…
3 @elonmusk "@BuzzPatterson They could help it \U0001f923 \u… Nov.…
4 @elonmusk "\U0001f37f \u2014 PolitiTweet.org" Nov.…
5 @elonmusk "Let\u2019s call the fact-checkers \u2026 \u2014… Nov.…
6 @elonmusk "#SharkJumping \u2014 Po… Nov.…
7 @elonmusk "Can you believe this app only costs $8!? https:… Nov.…
8 @elonmusk "@langdon @EricFrohnhoefer @pokemoniku He\u2019s… Nov.…
9 @elonmusk "@EricFrohnhoefer @MEAInd I\u2019ve been at Twit… Nov.…
10 @elonmusk "@ashleevance @mtaibbi @joerogan Twitter drives … Nov.…
# … with 234 more rows
# ℹ Use `print(n = ...)` to see more rows
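If you also want the tweet IDs (the href values your original code returned) alongside the text, you can pull both from the same node set. This is just a minimal sketch, assuming the .tweet-card elements carry the href attribute as in your snippet; get_tweets_with_link is an illustrative name:

# Sketch: return the href (tweet link/ID) together with the tweet text
get_tweets_with_link <- function(page) {
  tweets <-
    str_c(
      "https://polititweet.org/tweets?page=",
      page,
      "&deleted=True&account=44196397&search="
    ) %>%
    read_html() %>%
    html_elements(".tweet-card")

  tibble(
    link  = tweets %>% html_attr("href"),
    tweet = tweets %>% html_element(".small-top-margin") %>% html_text2()
  )
}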
Since you wanted the number of pages to be detected automatically before scraping, here's a possible solution where you just supply a link to the function:
get_tweets <- function(link) {
  # Read the first page and take the last pagination link to get the page count
  page <- link %>%
    read_html()

  pages <- page %>%
    html_elements(".pagination-link") %>%
    last() %>%
    html_text2() %>%
    as.numeric()

  # Scrape one page: swap the page number into the URL and parse the tweet cards
  twitter <- function(link, page) {
    tweets <-
      link %>%
      str_replace(pattern = "page=1", str_c("page=", page)) %>%
      read_html() %>%
      html_elements(".tweet-card")

    tibble(
      tweeter = tweets %>%
        html_element("strong .has-text-grey") %>%
        html_text2(),
      tweet = tweets %>%
        html_element(".small-top-margin") %>%
        html_text2(),
      time = tweets %>%
        html_element(".is-paddingless") %>%
        html_text2() %>%
        str_remove_all("Posted ")
    )
  }

  # Apply the scraper to every page and bind the results into one tibble
  map2_dfr(link, 1:pages, twitter)
}
get_tweets("https://polititweet.org/tweets?page=1&deleted=True&account=44196397&search=")
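One small addition you might consider (not part of the answer above): pausing briefly between page requests so the site isn't hit too quickly. A sketch of the inner per-page function with a delay; slow_twitter is an illustrative name, and you would use it in place of twitter in the map2_dfr() call:

# Sketch: same per-page scraper, but waits one second before each request
slow_twitter <- function(link, page) {
  Sys.sleep(1)  # pause between page requests
  tweets <- link %>%
    str_replace("page=1", str_c("page=", page)) %>%
    read_html() %>%
    html_elements(".tweet-card")

  tibble(
    tweet = tweets %>% html_element(".small-top-margin") %>% html_text2()
  )
}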
I highly recommend this tool to help you select CSS elements.
CodePudding user response:
Edit
You are going to want to change the CSS selector:
library(rvest)

url <- read_html("https://polititweet.org/tweets?account=44196397&deleted=True")

# Select the paragraph holding the tweet text instead of the card's href
tweets_deleted <- html_nodes(url, "p.small-top-margin") |> html_text()

# Strip the newlines left over from the HTML
tweets_deleted %>%
  gsub('\\n', '', .)
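If you also need the other pages with this simpler approach, you could loop over the page numbers using the URL pattern from your question. A minimal sketch, assuming the page=N query parameter works as in your original URL (the page range is hard-coded here):

# Sketch: scrape pages 1 to 9 with the same selector and flatten the result
pages <- 1:9
all_deleted <- lapply(pages, function(p) {
  url <- read_html(paste0(
    "https://polititweet.org/tweets?page=", p,
    "&deleted=True&account=44196397&search="
  ))
  html_nodes(url, "p.small-top-margin") |> html_text()
})
all_deleted <- gsub('\\n', '', unlist(all_deleted))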