Remove pattern that occurs outside of words

Time:05-07

I am trying to remove the pattern 'SO', and everything after it, from the end of each string in a character vector. The problem with the code below is that it matches any 'so' sequence case-insensitively and removes the whole string instead of only the last match. One workaround I considered was to clean the data manually, forcing everything to lower case except the final 'SO', and then match case-sensitively.

x <- data.frame(y = c("Solutions are welcomed, please SO # 12345"))

x <- x %>% mutate(y = stri_replace_last_regex(x$y,"SO.*","",case_insensitive = TRUE)) # This will remove the string entirely - I'm not really sure why.  

The desired output is:

'Solutions are welcomed, please'

I have tried variations of regex such as \\b\\SO{2}\\b and \\b\\D{2}*\\b|[[:punct:]]. I believe the answer may lie in setting word boundaries, but I am not sure. The second one gets rid of the 'SO', but I suspect that any other 'SO' standing apart from a word elsewhere in the string would be removed as well. I only need the last occurrence of 'SO' and everything after it, including punctuation, to be removed.

Any guidance on this would be much appreciated.

CodePudding user response:

You can use gsub to remove the pattern you don't want.

gsub("\\sSO.+$", "", x$y)

[1] "Solutions are welcomed, please"

Use [[:upper:]]{2} if you want to generalise to any two consecutive upper case letters.

gsub("\\s[[:upper:]]{2}.+$", "", x$y)

[1] "Solutions are welcomed, please"

UPDATE: the code above may remove too much if the string contains more than one "SO", because it cuts from the first match rather than the last.

To demonstrate, here is another string with multiple occurrences of "SO". The pattern captures everything from the start of the string (^) up to the last occurrence of "SO" (SO.+$). Because .* is greedy, the first capture group, (.*), extends as far as it can while still leaving a final "SO" to match. gsub then replaces the entire string with the first capture group (\\1), which drops the last "SO" and everything after it.

x <- data.frame(y = "Solutions are SO welcomed, SO please SO # 12345")

gsub('^(.*)SO.+$', '\\1', x$y)

[1] "Solutions are SO welcomed, SO please "
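Note that the result above still ends with a space, and the pattern would also match "SO" inside a longer upper-case word. A minimal sketch of one way to handle both, assuming base R only: the helper name remove_after_last_so is made up for illustration, and \\b word boundaries with perl = TRUE are an addition to the pattern above, not part of it.

```r
# Hypothetical helper (not from the answer above): drop the last
# standalone "SO" and everything after it, then trim the right edge.
remove_after_last_so <- function(text) {
  # The greedy (.*) gives back characters only until the last
  # whole-word "SO" can match; \\b keeps words like "SOLO" intact.
  kept <- sub("^(.*)\\bSO\\b.*$", "\\1", text, perl = TRUE)
  # Remove the space left in front of the deleted "SO".
  trimws(kept, which = "right")
}

multi  <- remove_after_last_so("Solutions are SO welcomed, SO please SO # 12345")
single <- remove_after_last_so("Solutions are welcomed, please SO # 12345")
solo   <- remove_after_last_so("SOLO work is fine SO # 1")
```

Strings without a standalone "SO" pass through unchanged apart from the trailing-whitespace trim, so the helper should be safe to apply to a whole column, for example x$y <- remove_after_last_so(x$y).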

CodePudding user response:

library(dplyr)
library(stringr)

x %>% 
  mutate(y = str_replace_all(y, 'SO.*', ''))

or

library(dplyr)
library(stringr)

x %>% 
  mutate(y = str_replace_all(y, 'SO\\s\\#\\s\\d*', ''))

output:

                                y
1 Solutions are welcomed, please 
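The patterns in these answers are case-sensitive, which is likely why they avoid the problem described in the question. A rough way to see the difference is to check where the match starts in each mode. This sketch uses base R's regexpr as a stand-in for the stringi call in the question, on the assumption that case-insensitive matching behaves the same way there.

```r
y <- "Solutions are welcomed, please SO # 12345"

# Case-sensitive: the only "SO" is the one near the end of the string.
start_cs <- as.integer(regexpr("SO.*", y))

# Case-insensitive: the "So" of "Solutions" already matches, and ".*"
# then consumes the rest, so the single match covers the whole string.
start_ci <- as.integer(regexpr("SO.*", y, ignore.case = TRUE))

start_cs  # 32
start_ci  # 1
```

Since that one match spans the entire string, it is also the last match, so replacing the last match removes everything.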