I have a data frame with two columns: cnn_handle contains Twitter handles, and tweet contains tweets in which the Twitter handle in the corresponding row is mentioned. However, most tweets mention at least one other user/handle, indicated by @. I want to remove all rows where a tweet contains more than one @.
df
cnn_handle tweet
1 @DanaBashCNN @JohnKingCNN @DanaBashCNN @kaitlancollins @eliehonig @thelauracoates @KristenhCNN CNN you are still FAKE NEWS !!!
2 @DanaBashCNN @DanaBashCNN He could have made the same calls here, from SC.
3 @DanaBashCNN @DanaBashCNN GRAMMER ALERT: THAT'S FORMER PRESIDENT TRUMP Please don't forget this important point. Also please refrain from showing a pic of him till you have one in his casket. thank you
4 @brianstelter @eliehonig @brianstelter My apologies to you sir. Just seems like that story disappeared. Imo the nursing home scandal is just as bad.
5 @brianstelter @DrAndrewBaer1 @JGreenblattADL @brianstelter @CNN @TuckerCarlson @FoxNews Anti-Semite are you, Herr Doktor? How very Mengele of you.
6 @brianstelter @ma_makosh @Shortguy1 @brianstelter @ChrisCuomo Liberals, their feelings before facts and their crucifixion of people before due process. Never a presumption of innocence when it concerns the rival party. So un-American.
7 @andersoncooper @BrendonLeslie And Biden was a staunch opponent of “forced busing”. He also said that integrating schools will cause a “racial jungle”. But u won’t hear this on @ChrisCuomo @jaketapper @Acosta @andersoncooper bc they continue to cover up the truth about Biden & his family.
8 @andersoncooper Anderson Cooper revealed that he "wanted a change" when reflecting on his break from news as #TheMole arrives on Netflix.
9 @andersoncooper @johnnydollar01 @newsbusters @drsanjaygupta @andersoncooper He was terrible as a host
I suspect some type of regular expression is needed. However, I am not sure how to combine it with a greater-than sign.
The desired result, i.e. tweets mentioning only the corresponding cnn_handle:
cnn_handle tweet
2 @DanaBashCNN @DanaBashCNN He could have made the same calls here, from SC.
3 @DanaBashCNN @DanaBashCNN GRAMMER ALERT: THAT'S FORMER PRESIDENT TRUMP Please don't forget this important point. Also please refrain from showing a pic of him till you have one in his casket. thank you
8 @andersoncooper Anderson Cooper revealed that he "wanted a change" when reflecting on his break from news as #TheMole arrives on Netflix.
CodePudding user response:
A straightforward solution using str_count from stringr, which presupposes that @ occurs only in Twitter handles:
base R subsetting:
library(stringr)
# keep only rows whose tweet contains at most one "@"
df[str_count(df$tweet, "@") <= 1, ]
dplyr:
library(dplyr)
library(stringr)
df %>%
  # keep only rows whose tweet contains at most one "@"
  filter(str_count(tweet, "@") <= 1)
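For readers who prefer not to load stringr, the same count can be sketched in base R alone. The toy data frame below is hypothetical and only mirrors the structure of df from the question; the names n_at and result are likewise made up for illustration. As above, this assumes that @ appears only in Twitter handles.

```r
# Hypothetical toy data mirroring the structure of the question's df
df <- data.frame(
  cnn_handle = c("@DanaBashCNN", "@DanaBashCNN", "@andersoncooper"),
  tweet = c(
    "@JohnKingCNN @DanaBashCNN CNN you are still FAKE NEWS !!!",
    "@DanaBashCNN He could have made the same calls here, from SC.",
    "Anderson Cooper revealed that he \"wanted a change\"."
  ),
  stringsAsFactors = FALSE
)

# Count literal "@" characters per tweet. regmatches() yields character(0)
# for a tweet without any match, so lengths() reports 0 in that case.
n_at <- lengths(regmatches(df$tweet, gregexpr("@", df$tweet, fixed = TRUE)))

# Keep rows whose tweet contains at most one "@"
result <- df[n_at <= 1, ]
print(result)
```

Going through regmatches() rather than taking the length of the gregexpr() result directly avoids the special case where a tweet with no match is reported as a single -1.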
CodePudding user response:
Assuming your data frame is called tweets, just check whether there is more than one match for @ followed by text:
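pattern <- "@[a-zA-Z. ]"
# gregexpr() returns -1 (a vector of length 1) when nothing matches,
# so a length above 1 means the tweet has at least two matches
multiple_ats <- vapply(tweets$tweet, function(x) length(gregexpr(pattern, x)[[1]]) > 1, logical(1), USE.NAMES = FALSE)
tweets[!multiple_ats, ]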
Output:
# A tibble: 3 x 2
cnn_handle tweet
<chr> <chr>
1 @DanaBashCNN "@DanaBashCNN He could have made the same calls here, from SC."
2 @DanaBashCNN "@DanaBashCNN GRAMMER ALERT: THAT'S FORMER PRESIDENT TRUMP Please don't forget this important point.,Also please refrain from showing a pic of him till you have one in his casket.,thank you"
3 @andersoncooper "Anderson Cooper revealed that he \"wanted a change\" when reflecting on his break from news as #TheMole arrives on Netflix."
Edit: You will have to change the pattern if Twitter user names are allowed to start with numbers or special characters. I don't know what the rules are.
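As a possible variation, here is a minimal sketch that assumes handles consist of letters, digits and underscores, and that handle comparison should ignore case. Instead of counting matches, it keeps a row when every handle mentioned in the tweet equals that row's cnn_handle, which is closer to the desired result stated in the question. The toy data and the names handle_pattern, mentions, only_own and kept are hypothetical, and the fourth row is invented to show a case where counting alone would differ.

```r
# Hypothetical toy data; the fourth row is invented for illustration
tweets <- data.frame(
  cnn_handle = c("@DanaBashCNN", "@DanaBashCNN", "@andersoncooper", "@andersoncooper"),
  tweet = c(
    "@JohnKingCNN @DanaBashCNN CNN you are still FAKE NEWS !!!",
    "@DanaBashCNN He could have made the same calls here, from SC.",
    "Anderson Cooper revealed that he \"wanted a change\".",
    "@johnnydollar01 He was terrible as a host"
  ),
  stringsAsFactors = FALSE
)

# Assumption: a handle is "@" followed by letters, digits or underscores
handle_pattern <- "@[A-Za-z0-9_]+"

# List with one character vector of mentions per tweet
mentions <- regmatches(tweets$tweet, gregexpr(handle_pattern, tweets$tweet))

# TRUE when every mention equals the row's cnn_handle, ignoring case;
# all() on an empty vector is TRUE, so tweets without mentions are kept
only_own <- unname(mapply(
  function(m, h) all(tolower(m) == tolower(h)),
  mentions, tweets$cnn_handle
))

kept <- tweets[only_own, ]
print(kept)
```

Unlike the counting approach, this drops a tweet that mentions a single handle other than cnn_handle, and keeps one that repeats the row's own handle.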