The webpage https://polititweet.org/ stores the complete tweet history of certain politicians, CEOs and so on. Importantly, it also keeps the deleted tweets I am interested in. Now I would like to write a web scraper in R to retrieve the texts of Elon Musk's deleted tweets, but I fail because my selector only gives me an href attribute.
Here is my attempt (after an edit suggested by @Bensstats):
library(rvest)

url_page1 <- read_html("https://polititweet.org/tweets?page=1&deleted=True&account=44196397&search=")

# Selecting the whole card and asking for href only returns the tweet IDs, not the text
tweets_deleted <- html_nodes(url_page1, ".tweet-card") |> html_attr("href")
tweets_deleted
With this, I get the IDs of the deleted tweets on page 1. However, what I want is the text of the deleted tweets itself.
Moreover, there are currently 9 pages of deleted tweets for Musk. As this number is likely to grow in the future, I would like to extract the number of pages automatically and then automate the process for each page (via a loop or something similar).
I would really appreciate it if any of you had an idea how to solve these problems!
Thanks a lot!
CodePudding user response:
Get all of Elon's deleted tweets, pages 1 to 9:
library(tidyverse)
library(rvest)

get_tweets <- function(page) {
  # Build the URL for the given page number and collect all tweet cards
  tweets <-
    str_c(
      "https://polititweet.org/tweets?page=",
      page,
      "&deleted=True&account=44196397&search="
    ) %>%
    read_html() %>%
    html_elements(".tweet-card")

  # Extract the handle, the tweet text, and the timestamp from each card
  tibble(
    tweeter = tweets %>%
      html_element("strong .has-text-grey") %>%
      html_text2(),
    tweet = tweets %>%
      html_element(".small-top-margin") %>%
      html_text2(),
    time = tweets %>%
      html_element(".is-paddingless") %>%
      html_text2() %>%
      str_remove_all("Posted ")
  )
}

map_dfr(1:9, get_tweets)
# A tibble: 244 × 3
tweeter tweet time
<chr> <chr> <chr>
1 @elonmusk "@BBCScienceNews Tesla Megapacks are extremely e… Nov.…
2 @elonmusk "\u2014 PolitiTweet.org" Nov.…
3 @elonmusk "@BuzzPatterson They could help it \U0001f923 \u… Nov.…
4 @elonmusk "\U0001f37f \u2014 PolitiTweet.org" Nov.…
5 @elonmusk "Let\u2019s call the fact-checkers \u2026 \u2014… Nov.…
6 @elonmusk "#SharkJumping \u2014 Po… Nov.…
7 @elonmusk "Can you believe this app only costs $8!? https:… Nov.…
8 @elonmusk "@langdon @EricFrohnhoefer @pokemoniku He\u2019s… Nov.…
9 @elonmusk "@EricFrohnhoefer @MEAInd I\u2019ve been at Twit… Nov.…
10 @elonmusk "@ashleevance @mtaibbi @joerogan Twitter drives … Nov.…
# … with 234 more rows
# ℹ Use `print(n = ...)` to see more rows
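If you also want the tweet IDs (the href values your original code returned) alongside the text, you can pull both from the same node set. This is just a minimal sketch, assuming the .tweet-card elements carry the href attribute as in your snippet; get_tweets_with_link is an illustrative name:

# Sketch: return the href (tweet link/ID) together with the tweet text
get_tweets_with_link <- function(page) {
  tweets <-
    str_c(
      "https://polititweet.org/tweets?page=",
      page,
      "&deleted=True&account=44196397&search="
    ) %>%
    read_html() %>%
    html_elements(".tweet-card")

  tibble(
    link  = tweets %>% html_attr("href"),
    tweet = tweets %>% html_element(".small-top-margin") %>% html_text2()
  )
}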
Since you wanted the number of pages to be detected automatically before scraping, here's a possible solution where you just supply a link to the function:
get_tweets <- function(link) {
  # Read the first page and take the last pagination link to get the page count
  page <- link %>%
    read_html()

  pages <- page %>%
    html_elements(".pagination-link") %>%
    last() %>%
    html_text2() %>%
    as.numeric()

  # Scrape one page: swap the page number into the URL and parse the tweet cards
  twitter <- function(link, page) {
    tweets <-
      link %>%
      str_replace(pattern = "page=1", str_c("page=", page)) %>%
      read_html() %>%
      html_elements(".tweet-card")

    tibble(
      tweeter = tweets %>%
        html_element("strong .has-text-grey") %>%
        html_text2(),
      tweet = tweets %>%
        html_element(".small-top-margin") %>%
        html_text2(),
      time = tweets %>%
        html_element(".is-paddingless") %>%
        html_text2() %>%
        str_remove_all("Posted ")
    )
  }

  # Apply the scraper to every page and bind the results into one tibble
  map2_dfr(link, 1:pages, twitter)
}
get_tweets("https://polititweet.org/tweets?page=1&deleted=True&account=44196397&search=")
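One small addition you might consider (not part of the answer above): pausing briefly between page requests so the site isn't hit too quickly. A sketch of the inner per-page function with a delay; slow_twitter is an illustrative name, and you would use it in place of twitter in the map2_dfr() call:

# Sketch: same per-page scraper, but waits one second before each request
slow_twitter <- function(link, page) {
  Sys.sleep(1)  # pause between page requests
  tweets <- link %>%
    str_replace("page=1", str_c("page=", page)) %>%
    read_html() %>%
    html_elements(".tweet-card")

  tibble(
    tweet = tweets %>% html_element(".small-top-margin") %>% html_text2()
  )
}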
I highly recommend this tool to help you select CSS elements.
CodePudding user response:
Edit
You are going to want to change the CSS selector:
library(rvest)

url <- read_html("https://polititweet.org/tweets?account=44196397&deleted=True")

# Select the paragraph holding the tweet text instead of the card's href
tweets_deleted <- html_nodes(url, "p.small-top-margin") |> html_text()

# Strip the newlines left over from the HTML
tweets_deleted %>%
  gsub('\\n', '', .)
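If you also need the other pages with this simpler approach, you could loop over the page numbers using the URL pattern from your question. A minimal sketch, assuming the page=N query parameter works as in your original URL (the page range is hard-coded here):

# Sketch: scrape pages 1 to 9 with the same selector and flatten the result
pages <- 1:9
all_deleted <- lapply(pages, function(p) {
  url <- read_html(paste0(
    "https://polititweet.org/tweets?page=", p,
    "&deleted=True&account=44196397&search="
  ))
  html_nodes(url, "p.small-top-margin") |> html_text()
})
all_deleted <- gsub('\\n', '', unlist(all_deleted))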