How to remove all wording before a word using regex in r?

Time:12-15

I would like to remove the words before 'not'. When I try the code snippet below, I don't get the expected result.

test <- c("this will not work.", "'' is not one of ['A', 'B', 'C'].", "This one does not use period ending!")
gsub(".*(not .*)\\.", "\\1", test)

But if I replace \\. with [[:punct:]], it works fine. Can anyone tell me why the first one is not working? I may need to keep other punctuation marks besides the period.

expected output:

> not work
> not one of ['A', 'B', 'C']
> not use period ending!

Thank you!

CodePudding user response:

sub('.*(not.*?)\\.?$', '\\1', test)

[1] "not work"                   "not one of ['A', 'B', 'C']"
[3] "not use period ending!"   

CodePudding user response:

You may use a lookahead regex to drop everything before "not" and also drop the period at the end.

gsub('.*(?=not)|\\.$', '', test, perl = TRUE)
#[1] "not work"     "not one of ['A', 'B', 'C']" "not use period ending!"
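
To see what each half of the alternation contributes, the same result can be reached in two separate steps. This is only a sketch for illustration; the names step1 and step2 are made up here and are not part of the answer above.

```r
test <- c("this will not work.",
          "'' is not one of ['A', 'B', 'C'].",
          "This one does not use period ending!")

# First alternative: delete everything up to (but not including) "not".
# The lookahead (?=not) is PCRE syntax, so perl = TRUE is required.
step1 <- sub(".*(?=not)", "", test, perl = TRUE)

# Second alternative: delete a period only if it is the last character.
step2 <- sub("\\.$", "", step1)

print(step1)
print(step2)
```

After the first step the trailing periods are still there; the second step removes them and leaves the "!" alone.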

CodePudding user response:

Here is a translation of your original code:

  1. Match any character zero or more times.
  2. Capture the word "not" followed by one space, then any character zero or more times.
  3. Match one period.

If the string doesn't match this whole pattern, including that one period, you won't get a match and gsub() returns the string unchanged. So using [[:punct:]] makes sense, because then you're saying "match everything in that pattern and then one punctuation mark of any kind" instead of just one period.
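
A quick way to check this is to ask grepl() which strings the original pattern matches at all. This is a minimal sketch; the names matched and result are just for illustration.

```r
test <- c("this will not work.",
          "'' is not one of ['A', 'B', 'C'].",
          "This one does not use period ending!")

# TRUE where the original pattern, which requires a literal period, matches.
matched <- grepl(".*(not .*)\\.", test)
print(matched)

# Where the pattern does not match, gsub() hands back the input as it was.
result <- gsub(".*(not .*)\\.", "\\1", test)
print(result)
```

Only the third string, which ends in "!" and contains no period, fails to match and comes back untouched.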

If you don't want to use the [[:punct:]] you can use this

(?:.*(not\\s+.*)\\.?).+?$

which says:

  1. The following is a non-capturing group.
  2. Match any character zero or more times.
  3. Capture "not", one or more whitespace characters, then zero or more of any character.
  4. Next, optionally match a period.
  5. Match any character one or more times, as few as possible (this is what consumes the final character).
  6. Match the end of the line.

This regex gives an output like this:

[1] "not work"                   "not one of ['A', 'B', 'C']"
[3] "not use period ending" 
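
The answer does not show the call that produced this output. One way to reproduce it, assuming sub() with perl = TRUE and assuming both quantifiers in the pattern are + ("one or more", as the step list says), would be:

```r
test <- c("this will not work.",
          "'' is not one of ['A', 'B', 'C'].",
          "This one does not use period ending!")

# Assumption: the pattern is run through the PCRE engine (perl = TRUE)
# and the whole match is replaced by the captured group \\1.
pattern <- "(?:.*(not\\s+.*)\\.?).+?$"
out <- sub(pattern, "\\1", test, perl = TRUE)
print(out)
```

Because the final .+? must consume at least one character, whatever the last character is (a period or the "!") always ends up outside the capture group.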

The example above does get rid of the "!" though. [[:punct:]] drops it as well, because "!" is itself a punctuation mark, so to keep it you would have to list only the marks you want removed, picking them from the full set that [[:punct:]] covers:

[!"\#$%&'()*+,\-./:;<=>?@\[\\\]^_`{|}~]

but that is super annoying. This website should help give you an even better understanding. Hope I helped!
