> mysentence <- "UK is Beautiful. UK is not the part of EU since 2016"
> gsub("[0-9]*", "", mysentence)
[1] "UK is Beautiful. UK is not the part of EU since "
> mysentence <- "UK is Beautiful. UK is not the part of EU since 2016"
> sub("[0-9]*", "", mysentence)
[1] "UK is Beautiful. UK is not the part of EU since 2016"
> mysentence <- "UK is Beautiful. UK is not the part of EU since 2016"
> sub("[0-9]+", "", mysentence)
[1] "UK is Beautiful. UK is not the part of EU since "
Here, gsub gives the expected output, but when I replace it with sub, the output still contains 2016, which should have been removed. Running the same command with + instead of *, the output is as expected. Why does the second example, i.e.
sub("[0-9]*", "", mysentence)
not give the same output as the other examples?
CodePudding user response:
The issue is that the * quantifier means 0 or more. So [0-9]* will match nothing (the empty string) as well as 1 or more digits. sub only replaces the first match, so sub("[0-9]*", "", mysentence) matches one empty string right at the beginning of the sentence, where the next character is not a digit, replaces it with "" (also nothing), and is done.
We can see this more easily if we put a non-nothing replacement:
sub("[0-9]*", "HI", mysentence)
# [1] "HIUK is Beautiful. UK is not the part of EU since 2016"
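One way to check where that first match lands is base R's regexpr, which reports only the first match, the same one sub acts on. This is a small sketch for illustration; the variable names m_star and m_plus are made up here and are not part of the original answer.

```r
mysentence <- "UK is Beautiful. UK is not the part of EU since 2016"

# regexpr() returns the start of the first match and, as an attribute,
# its length; sub() replaces exactly this match.
m_star <- regexpr("[0-9]*", mysentence)
m_plus <- regexpr("[0-9]+", mysentence)

as.integer(m_star)              # 1: the match starts at the first character
attr(m_star, "match.length")    # 0: the match is empty
regmatches(mysentence, m_plus)  # "2016": with +, the first match is the digits
```

A match length of 0 at position 1 is why the sub call with * leaves the sentence unchanged.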
gsub replaces every occurrence, and with a non-nothing replacement the result gets pretty absurd, as it matches the empty string at every position:
gsub("[0-9]*", "HI", mysentence)
# [1] "HIUHIKHI HIiHIsHI HIBHIeHIaHIuHItHIiHIfHIuHIlHI.HI HIUHIKHI
# HIiHIsHI HInHIoHItHI HItHIhHIeHI HIpHIaHIrHItHI HIoHIfHI HIEHIUHI
# HIsHIiHInHIcHIeHI HI"
Using the + quantifier, which means 1 or more, ensures that the empty string is not matched, and in this 1-match case sub and gsub behave identically:
gsub("[0-9]+", "HI", mysentence)
# [1] "UK is Beautiful. UK is not the part of EU since HI"
sub("[0-9]+", "HI", mysentence)
# [1] "UK is Beautiful. UK is not the part of EU since HI"
# [1] "UK is Beautiful. UK is not the part of EU since HI"
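The two functions agree above only because the sentence contains a single run of digits. As a hedged illustration, using a made-up string x that is not from the original question, they diverge again as soon as there is more than one run:

```r
# Hypothetical input with two separate runs of digits
x <- "since 2016 and 2020"

first_only <- sub("[0-9]+", "", x)   # removes only the first run
all_runs   <- gsub("[0-9]+", "", x)  # removes every run

first_only  # "since  and 2020"
all_runs    # "since  and "
```

So + fixes the empty-match problem, but gsub is still the one to use when every number should be removed.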