How to extract key phrases following specific characters using regex in R?

I have a data frame that looks like this:

ID | Tweet_ID | Tweet
1    12345      @sprintcare I did.
2    SPRINT     @12345 Please send us a Private Message.
3    45678      @apple My information is incorrect.
4    APPLE      @45678 What information is incorrect.

What I would like to do is use something like a case_when statement to create a new field that holds the handle from each tweet addressed to a company name, ignoring the numerical handles.

Here is the code I'm currently trying, without success:

tweet_pattern <- " @[^0-9.-]\\w+"

Customer <- Customer %>% 
           mutate(Response_To_Comp = ifelse(str_detect(Tweet, tweet_pattern), 
                                str_extract(Tweet, tweet_pattern), 
                                NA_character_))

Desired output:

ID | Tweet_ID | Tweet                                    | Response_To_Comp
1    12345      @sprintcare I did.                         sprintcare
2    SPRINT     @12345 Please send us a Private Message.   NA
3    45678      @apple My information is incorrect.        apple
4    APPLE      @45678 What information is incorrect.      NA

CodePudding user response：

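You can use a lookbehind regex to extract the text that comes immediately after '@' and consists of one or more A-Za-z characters. Because the lookbehind only checks that '@' precedes the match, the '@' itself is not part of the extracted text.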

library(dplyr)
library(stringr)

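# One or more letters that are immediately preceded by "@"
tweet_pattern <- "(?<=@)[A-Za-z]+"

# df is the data frame from the question; str_extract returns NA where nothing matches
df %>% mutate(Response_To_Comp = str_extract(Tweet, tweet_pattern))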

#  ID Tweet_ID                                    Tweet Response_To_Comp
#1  1    12345                       @sprintcare I did.       sprintcare
#2  2   SPRINT @12345 Please send us a Private Message.             <NA>
#3  3    45678      @apple My information is incorrect.            apple
#4  4    APPLE    @45678 What information is incorrect.             <NA>
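
To see what the lookbehind contributes, here is a minimal sketch on a plain character vector rather than the data frame. The vector name `tweets` and the result names are only illustrative; the two strings are taken from the question.

```r
library(stringr)

tweets <- c("@sprintcare I did.",
            "@12345 Please send us a Private Message.")

# Without the lookbehind, the "@" is part of the match.
with_at <- str_extract(tweets, "@[A-Za-z]+")

# With the lookbehind, "@" must precede the match but is not returned.
without_at <- str_extract(tweets, "(?<=@)[A-Za-z]+")

print(with_at)
print(without_at)
```

Note that `[A-Za-z]+` stops at the first non-letter, so a handle that mixes letters and digits would be cut short; whether that matters depends on the handles in your data.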

CodePudding user response：

Using str_detect and str_replace

library(stringr)
library(dplyr)
Customer %>%
    # case_when returns NA for rows where no condition is TRUE
    mutate(Response_to_Comp = case_when(str_detect(Tweet, "@[^0-9-]+") ~ 
      str_replace(Tweet, "@([A-Za-z]+)\\s+.*", "\\1")))
  ID Tweet_ID                                    Tweet Response_to_Comp
1  1    12345                       @sprintcare I did.       sprintcare
2  2   SPRINT @12345 Please send us a Private Message.             <NA>
3  3    45678      @apple My information is incorrect.            apple
4  4    APPLE    @45678 What information is incorrect.             <NA>
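
As a possible variation (not part of the answer above), a single capture group with str_match can do the detect and extract steps in one pass. This sketch assumes the handle sits at the start of the tweet and that a company handle starts with a letter and may then contain letters, digits, or underscores; the third tweet is a made-up example to show that case.

```r
library(stringr)

tweets <- c("@sprintcare I did.",
            "@12345 Please send us a Private Message.",
            "@sprint_care123 Thanks!")  # hypothetical tweet, not from the question

# str_match returns a matrix: column 1 is the full match, column 2 the first group.
# Rows with no match are filled with NA.
handles <- str_match(tweets, "^@([A-Za-z]\\w*)")[, 2]

print(handles)
```

Inside mutate this would be used as `Response_to_Comp = str_match(Tweet, "^@([A-Za-z]\\w*)")[, 2]`.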

data

Customer <- structure(list(ID = 1:4, Tweet_ID = c("12345", "SPRINT", "45678", 
"APPLE"), Tweet = c("@sprintcare I did.", "@12345 Please send us a Private Message.", 
"@apple My information is incorrect.", "@45678 What information is incorrect."
)), class = "data.frame", row.names = c(NA, -4L))