I am downloading all the tweets (using the rtweet package, version 0.7.0) whose text mentions the user @sernac (a Chilean government entity), and then extracting all the usernames (screen names) from the body of each tweet with the following code.
library(rtweet)
library(stringr)
Tweets <- search_tweets("@sernac", n = 50000, include_rts = FALSE)
Names <- str_extract_all(Tweets$text, "(?<=^|\\s)@[^\\s]+")
This gives me a list object with every screen name found in the text of each tweet.
The first question is: how do I get a data frame with the following structure?
X1 | X2 | X3 | X4 | X5 | ... | Xn |
---|---|---|---|---|---|---|
@sernac | @vtrchile | NA | NA | NA | NA | NA |
@username | @playstation | @taylorswitft | @elonmusk | NA | NA | NA |
@username2 | @username5 | @selenagomez | @username2 | @username3 | @FIFA | @xbox |
@username4 | @ebay | NA | NA | NA | NA | NA |
Here the number of columns is equal to the maximum number of elements in any single element of the list.
I tried the following code, but it only returns 4 columns, whereas the largest list element has 9 entries.
df <- data.frame(matrix(unlist(Names), nrow=length(Names), byrow = T))
After this, I need to left join this table with a cluster table that I created. The join should first use the first column of the newly created data frame. If there is no match, it should try again using the second column, and so on through the remaining columns until a match is found or the columns are exhausted.
This is an example of the cluster data frame I created and the final desired result:
CLUSTER DATA FRAME
screen_name | cluster |
---|---|
@sernac | Gov |
@playstation | Videogames |
@walmart | Supermarket |
@SelenaGomez | Celebrity |
@elonmusk | Celebrity |
@xbox | Videogames |
@ebay | Ecommerce |
FINAL RESULT
X1 | X2 | X3 | X4 | X5 | ... | Xn | cluster |
---|---|---|---|---|---|---|---|
@sernac | @vtrchile | NA | NA | NA | NA | NA | Gov |
@username | @playstation | @taylorswitft | @elonmusk | NA | NA | NA | Videogames |
@username2 | @username5 | @selenagomez | @username2 | @username3 | @FIFA | @xbox | Celebrity |
@username4 | @ebay | NA | NA | NA | NA | NA | Ecommerce |
I have tried to explain this as clearly as I can. English is not my first language, so I am happy to give more detail in the comments.
CodePudding user response:
I would approach this differently.
First, if you are trying to download as many tweets as possible, set n = Inf and retryonratelimit = TRUE:
Tweets <- search_tweets("@sernac",
                        n = Inf,
                        include_rts = FALSE,
                        retryonratelimit = TRUE)
Second, there is no need to extract screen names from the tweet text, as this information can be found in the entities column.
One way to extract mentions is to use lapply. You can then create a data frame with just the useful columns, and convert screen names to lower case for matching.
library(dplyr)
mentions <- lapply(Tweets$entities, function(x) x$user_mentions) %>%
  bind_rows(.id = "tweet_number") %>%
  select(tweet_number, screen_name) %>%
  mutate(screen_name_lc = tolower(screen_name))
head(mentions)
tweet_number screen_name screen_name_lc
1 1 mundo_pacifico mundo_pacifico
2 1 OIMChile oimchile
3 1 subtel_chile subtel_chile
4 1 ReclamosSubtel reclamossubtel
5 1 SERNAC sernac
6 2 mundo_pacifico mundo_pacifico
Next, add a column with the lower-case screen names to your cluster data:
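library(stringr)
# Drop the leading "@" and lower-case the name so it matches screen_name_lc in mentions
cluster_df <- cluster_df %>%
  mutate(screen_name_lc = str_replace(screen_name, "@", "") %>%
           tolower())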
Now we can join the data frames, just on the screen_name_lc column:
mentions_clusters <- mentions %>%
  left_join(cluster_df,
            by = "screen_name_lc") %>%
  select(tweet_number, screen_name = screen_name.x, cluster)
head(mentions_clusters)
tweet_number screen_name cluster
1 1 mundo_pacifico <NA>
2 1 OIMChile <NA>
3 1 subtel_chile <NA>
4 1 ReclamosSubtel <NA>
5 1 SERNAC Gov
6 2 mundo_pacifico <NA>
This "long" format is much easier to work with for subsequent analysis than the "wide" format, and can still be grouped by tweet using the tweet_number column.
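The "first column that matches" rule from the question can be expressed on the long format by taking, for each tweet, the first mention that has a non-missing cluster. The following is a minimal sketch using a small made-up data frame (mentions_clusters_toy and its values are hypothetical, with the same columns as mentions_clusters). It assumes the rows for each tweet are in the order the mentions appear.

```r
library(dplyr)

# Hypothetical toy data with the same columns as mentions_clusters
mentions_clusters_toy <- data.frame(
  tweet_number = c("1", "1", "2", "2", "2", "3"),
  screen_name  = c("vtrchile", "SERNAC", "username",
                   "playstation", "elonmusk", "username4"),
  cluster      = c(NA, "Gov", NA, "Videogames", "Celebrity", NA),
  stringsAsFactors = FALSE
)

# For each tweet, keep the cluster of the first mention that has one.
# Indexing an empty vector with [1] yields NA, so tweets with no
# matching mention end up with an NA cluster.
first_cluster <- mentions_clusters_toy %>%
  group_by(tweet_number) %>%
  summarise(cluster = cluster[!is.na(cluster)][1], .groups = "drop")

print(first_cluster)
```

The result has one row per tweet, so it can be joined back to the tweets by tweet_number if a per-tweet cluster label is needed.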
Data for cluster_df:
cluster_df <- structure(list(screen_name = c("@sernac", "@playstation", "@walmart",
"@SelenaGomez", "@elonmusk", "@xbox", "@ebay"), cluster = c("Gov",
"Videogames", "Supermarket", "Celebrity", "Celebrity", "Videogames",
"Ecommerce"), screen_name_lc = c("sernac", "playstation", "walmart",
"selenagomez", "elonmusk", "xbox", "ebay")), class = "data.frame", row.names = c(NA,
-7L))
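If the wide layout from the question is still needed, one possible approach is to pad every list element with NA up to the longest length before binding the rows. The attempt with matrix(unlist(Names), ...) cannot work in general, because unlist() discards the boundaries between tweets. The sketch below uses base R only; the Names list here is a hypothetical stand-in for the output of str_extract_all().

```r
# Hypothetical list shaped like the output of str_extract_all()
Names <- list(
  c("@sernac", "@vtrchile"),
  c("@username", "@playstation", "@taylorswitft", "@elonmusk"),
  character(0),
  c("@username4", "@ebay")
)

# Length of the longest element decides the number of columns
max_len <- max(lengths(Names))

# Extending a vector's length pads it with NA
padded <- lapply(Names, function(x) {
  length(x) <- max_len
  x
})

# Bind the padded vectors row by row and name the columns X1..Xn
wide <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(wide) <- paste0("X", seq_len(max_len))

print(wide)
```

Tweets without any mention become rows that contain only NA.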