Remove row if string contains more than one "@" using regular expression

Time:04-04

I have a data frame with two columns. cnn_handle contains Twitter handles and tweet contains tweets where the Twitter handle in the corresponding row is mentioned. However, most tweets mention at least one other user/handle indicated by @. I want to remove all rows where a tweet contains more than one @.

df
    cnn_handle      tweet
1   @DanaBashCNN    @JohnKingCNN @DanaBashCNN @kaitlancollins @eliehonig @thelauracoates @KristenhCNN CNN you are still FAKE NEWS !!!
2   @DanaBashCNN    @DanaBashCNN He could have made the same calls here, from SC.
3   @DanaBashCNN    @DanaBashCNN GRAMMER ALERT:  THAT'S FORMER PRESIDENT TRUMP Please don't forget this important point.   Also please refrain from showing a pic of him till you have one in his casket.   thank you
4   @brianstelter   @eliehonig @brianstelter My apologies to you sir. Just seems like that story disappeared. Imo the nursing home scandal is just as bad.
5   @brianstelter   @DrAndrewBaer1 @JGreenblattADL @brianstelter @CNN @TuckerCarlson @FoxNews Anti-Semite are you,  Herr Doktor? How very Mengele of you.
6   @brianstelter   @ma_makosh @Shortguy1 @brianstelter @ChrisCuomo Liberals, their feelings before facts and their crucifixion of people before due process. Never a presumption of innocence when it concerns the rival party. So un-American.
7   @andersoncooper @BrendonLeslie And Biden was a staunch opponent of “forced busing”. He also said that integrating schools will cause a “racial jungle”. But u won’t hear this on @ChrisCuomo @jaketapper @Acosta @andersoncooper bc they continue to cover up the truth about Biden & his family.
8   @andersoncooper Anderson Cooper revealed that he "wanted a change" when reflecting on his break from news as #TheMole arrives on Netflix.
9   @andersoncooper @johnnydollar01 @newsbusters @drsanjaygupta @andersoncooper He was terrible as a host

I suspect some type of regular expression is needed. However, I am not sure how to combine it with a greater-than sign.

The desired result, i.e. tweets mentioning only the corresponding cnn_handle:

cnn_handle      tweet
2   @DanaBashCNN    @DanaBashCNN He could have made the same calls here, from SC.
3   @DanaBashCNN    @DanaBashCNN GRAMMER ALERT:  THAT'S FORMER PRESIDENT TRUMP Please don't forget this important point.   Also please refrain from showing a pic of him till you have one in his casket.   thank you
8   @andersoncooper Anderson Cooper revealed that he "wanted a change" when reflecting on his break from news as #TheMole arrives on Netflix.

CodePudding user response:

A straightforward solution using str_count from stringr, which presupposes that @ occurs only in Twitter handles:

base R subsetting:

library(stringr)
# keep only rows whose tweet contains at most one "@"
df[str_count(df$tweet, "@") <= 1, ]

dplyr:

library(dplyr)
library(stringr)
df %>%
  filter(str_count(tweet, "@") <= 1)
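
If stringr is not available, the same count can be sketched in base R alone. This is a minimal illustration, not part of the answer above: the small df below is a made-up stand-in for the question's data, and at_count and result are names chosen here for the example.

```r
# Toy data frame standing in for the question's df (rows shortened for illustration)
df <- data.frame(
  cnn_handle = c("@DanaBashCNN", "@DanaBashCNN", "@andersoncooper"),
  tweet = c(
    "@JohnKingCNN @DanaBashCNN CNN you are still FAKE NEWS !!!",
    "@DanaBashCNN He could have made the same calls here, from SC.",
    "Anderson Cooper revealed that he \"wanted a change\"."
  ),
  stringsAsFactors = FALSE
)

# Count literal "@" characters per tweet. regmatches() returns character(0)
# for a tweet without any match, so lengths() gives 0 for that row.
at_count <- lengths(regmatches(df$tweet, gregexpr("@", df$tweet, fixed = TRUE)))

# Keep rows with at most one "@"
result <- df[at_count <= 1, ]
print(result)
```

As with str_count, this counts every @ character, so it rests on the same assumption that @ appears only in handles.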

CodePudding user response:

Assuming your dataframe is called tweets, just check to see if there is more than one match for @ followed by text:

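pattern <- "@[a-zA-Z. ]"
# gregexpr() returns a length-1 result both for no match (-1) and for one match,
# so a length greater than 1 means the tweet has at least two matches
multiple_ats <- unlist(lapply(tweets$tweet, function(x) length(gregexpr(pattern, x)[[1]]) > 1))
tweets[!multiple_ats, ]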

Output:

# A tibble: 3 x 2
  cnn_handle      tweet
  <chr>           <chr>
1 @DanaBashCNN    "@DanaBashCNN He could have made the same calls here, from SC."
2 @DanaBashCNN    "@DanaBashCNN GRAMMER ALERT:  THAT'S FORMER PRESIDENT TRUMP Please don't forget this important point.,Also please refrain from showing a pic of him till you have one in his casket.,thank you"
3 @andersoncooper "Anderson Cooper revealed that he \"wanted a change\" when reflecting on his break from news as #TheMole arrives on Netflix."

Edit: You will have to change the pattern if Twitter user names are allowed to start with numbers or special characters. I don't know what the rules are.
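
One way such a pattern change might look, as a sketch only: it assumes handles are made up of letters, digits, and underscores, which is an assumption made here and not something confirmed in this thread. The tweets data frame is a shortened, partly made-up sample, and n_mentions and kept are illustrative names.

```r
# Assumed handle rule: "@" followed by one or more letters, digits, or underscores
pattern <- "@[A-Za-z0-9_]+"

# Small sample for illustration (second tweet is invented to show a bare "@")
tweets <- data.frame(
  cnn_handle = c("@brianstelter", "@andersoncooper", "@andersoncooper"),
  tweet = c(
    "@eliehonig @brianstelter My apologies to you sir.",
    "@andersoncooper see you @ 5pm",
    "@johnnydollar01 @andersoncooper He was terrible as a host"
  ),
  stringsAsFactors = FALSE
)

# Number of handle-like matches per tweet (0 when a tweet has none)
n_mentions <- lengths(regmatches(tweets$tweet, gregexpr(pattern, tweets$tweet)))

# Keep tweets with at most one handle-like mention
kept <- tweets[n_mentions <= 1, ]
print(kept)
```

Unlike counting bare @ characters, this ignores an @ that is followed by a space, but it would still count the domain part of an email address as a mention.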
