I would like to remove the words before 'not'. When I try the code snippet below, I don't get the expected result.
test <- c("this will not work.", "'' is not one of ['A', 'B', 'C'].", "This one does not use period ending!")
gsub(".*(not .*)\\.", "\\1", test)
But if I replace \\. with [[:punct:]], it works fine. Can anyone tell me why the first one is not working? I may need to keep other punctuation marks, other than the period.
expected output:
> not work
> not one of ['A', 'B', 'C']
> not use period ending!
Thank you!
CodePudding user response:
sub('.*(not.*?)\\.?$', '\\1', test)
[1] "not work" "not one of ['A', 'B', 'C']"
[3] "not use period ending!"
CodePudding user response:
You may use a lookahead regex to drop everything before "not" and also drop the period at the end.
gsub('.*(?=not)|\\.$', '', test, perl = TRUE)
#[1] "not work" "not one of ['A', 'B', 'C']" "not use period ending!"
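To see what each half of that alternation does, the two pieces can be run separately. This is only a sketch: the names step1 and step2 are made up for illustration, and sub() is used because each piece needs to match at most once per string.

```r
test <- c("this will not work.",
          "'' is not one of ['A', 'B', 'C'].",
          "This one does not use period ending!")

# Piece 1: consume everything up to (but not including) the last "not".
# The lookahead (?=not) needs perl = TRUE.
step1 <- sub(".*(?=not)", "", test, perl = TRUE)
print(step1)
# "not work."  "not one of ['A', 'B', 'C']."  "not use period ending!"

# Piece 2: drop a period only when it is the final character.
step2 <- sub("\\.$", "", step1)
print(step2)
# "not work"  "not one of ['A', 'B', 'C']"  "not use period ending!"
```

Because the lookahead does not consume "not", nothing has to be captured and put back, which is why the replacement can be an empty string.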
CodePudding user response:
Here is a translation of your original code:
- Match any character zero or more times.
- Capture the word "not" followed by one space, then any character zero or more times.
- Match one period.
If the string doesn't match this whole pattern, including that one period, you won't get a match and gsub() returns the string unchanged. So adding [[:punct:]] makes sense, because then you're saying: "match everything in that pattern and then one punctuation mark of any kind", instead of just one period.
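A small check makes this visible. The sketch below uses the vector from the question; with_period and with_punct are just illustrative names.

```r
test <- c("this will not work.",
          "'' is not one of ['A', 'B', 'C'].",
          "This one does not use period ending!")

# TRUE where the whole pattern, literal period included, can match.
matched <- grepl(".*(not .*)\\.", test)
print(matched)
# TRUE TRUE FALSE

# Strings without a match come back unchanged, so the third one is untouched.
with_period <- gsub(".*(not .*)\\.", "\\1", test)
print(with_period)

# [[:punct:]] matches all three strings, but it consumes the last
# punctuation mark, so the "!" in the third string is dropped as well.
with_punct <- gsub(".*(not .*)[[:punct:]]", "\\1", test)
print(with_punct)
```

Note that the [[:punct:]] version trims the third string but does not keep its "!", so it does not quite match the expected output in the question.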
If you don't want to use [[:punct:]], you can use this:
(?:.*(not\\s+.*)\\.?).+?$
which says:
- The following is a non-capturing group.
- Match any character zero or more times.
- Capture "not", one or more whitespace characters, then zero or more of any character.
- Next, optionally match a period.
- Match any character one or more times, as few as possible.
- Match the end of the line.
This regex gives an output like this:
[1] "not work" "not one of ['A', 'B', 'C']"
[3] "not use period ending"
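For anyone who wants to reproduce that output, here is a hedged sketch. It assumes the quantifiers are \\s+ and .+? (as the bullet list describes) and that the pattern is run with perl = TRUE; pat and res are illustrative names.

```r
test <- c("this will not work.",
          "'' is not one of ['A', 'B', 'C'].",
          "This one does not use period ending!")

# Assumed form of the pattern: \\s+ after "not", and a lazy .+? before $.
pat <- "(?:.*(not\\s+.*)\\.?).+?$"

# Keep only the captured group.
res <- sub(pat, "\\1", test, perl = TRUE)
print(res)
```

The trailing .+? must match at least one character, so it always takes the last character of the string, whether that is "." or "!".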
The example above does get rid of the "!", though. To match a trailing punctuation mark of any kind I would just use [[:punct:]],
or you could list the punctuation marks yourself in a character class like this:
[!"\#$%&'()*+,\-./:;<=>?@\[\\\]^_`{|}~]
but that is super annoying. This website should help give you an even better understanding. Hope I helped!