I have the following vector with strings:
strings <- c("ABC0001", "ABC02", "ABC10", "ABC01010", "ABC11", "ABC011", "ABC0120")
"ABC0001" "ABC02" "ABC10" "ABC01010" "ABC11" "ABC011" "ABC0120"
Desired output:
[1] "ABC1" "ABC2" "ABC10" "ABC1010" "ABC11" "ABC11" "ABC120"
My question: What regular expression pattern matches the zeros that come before the first nonzero digit in a string?
So far I have tried:
library(stringr)
str_replace(strings, '0+', "")
which gives:
[1] "ABC1" "ABC2" "ABC1" "ABC1010" "ABC11" "ABC11" "ABC120"
Note the undesired ABC1 in position 3. It should be ABC10.
I suspect this might be easy, but I can't get it.
I want to learn the regular expression pattern!
CodePudding user response:
Here is a base R option using sub, with lookarounds:
strings <- c("ABC0001", "ABC02", "ABC10", "ABC01010", "ABC11", "ABC011", "ABC0120")
output <- sub("(?<=[A-Z])0+(?=.)", "", strings, perl=TRUE)
output
[1] "ABC1" "ABC2" "ABC10" "ABC1010" "ABC11" "ABC11" "ABC120"
Here is an explanation of the regex pattern being used:
(?<=[A-Z]) assert that an uppercase letter precedes
0+ match one or more zeroes
(?=.) assert that some character follows
The (?<=[A-Z]) and (?=.) are called lookarounds. They check a position without consuming any characters, so only the zeroes themselves are replaced. In this case, they make sure that the zeroes we target directly follow a letter and are followed by at least one more character. For an input like ABC110, we want the output to be ABC110, i.e. the final zero should not be removed; it is left alone because it is not preceded by a letter.
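To see what the two lookarounds do at the edges, here is a small sketch that runs the same pattern on a few extra inputs. The `edge` vector is made up for illustration and is not part of the question.

```r
# Same lookaround pattern as above, applied to hypothetical edge cases
pattern <- "(?<=[A-Z])0+(?=.)"
edge <- c("ABC110", "ABC0", "ABC000", "ABC0120")

result <- sub(pattern, "", edge, perl = TRUE)
print(result)
# [1] "ABC110" "ABC0"   "ABC0"   "ABC120"
```

ABC110 is untouched because its zero does not follow a letter. ABC0 is untouched because nothing follows the zero, and ABC000 keeps one zero because the lookahead forces the match to give back its last character.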
CodePudding user response:
I think I'd go with 'sub' and use capture groups:
strings <- c("ABC0001", "ABC02", "ABC10", "ABC01010", "ABC11", "ABC011", "ABC0120", "ABC1001")
output <- sub("([A-Z])0+([1-9])", "\\1\\2", strings, perl=TRUE)
output
[1] "ABC1" "ABC2" "ABC10" "ABC1010" "ABC11" "ABC11" "ABC120" "ABC1001"
Tested on Jdoodle and an online regex demo
([A-Z]) - Capture any uppercase letter in the 1st group;
0+ - One or more (greedy) consecutive zeros;
([1-9]) - A single digit ranging 1-9 in the 2nd group. Note that you can remove this group if you are OK with 'ABC0000' becoming 'ABC'!
Replace with \1\2.
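As a rough illustration of that last note, the sketch below compares the pattern with and without the second group. The `edge` inputs and the variable names are assumptions chosen for the demo.

```r
# Hypothetical inputs: all zeros, leading zero plus trailing zeros, inner zeros
edge <- c("ABC0000", "ABC0100", "ABC1001")

# With ([1-9]): zeros are only dropped when a nonzero digit follows them
with_digit <- sub("([A-Z])0+([1-9])", "\\1\\2", edge)

# Without the second group: an all-zero suffix is removed entirely
without_digit <- sub("([A-Z])0+", "\\1", edge)

print(with_digit)
# [1] "ABC0000" "ABC100"  "ABC1001"
print(without_digit)
# [1] "ABC"     "ABC100"  "ABC1001"
```

Which variant is right depends on whether an all-zero suffix should survive.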
CodePudding user response:
We can try the code below
> sub("(\\D+)0+(.*)", "\\1\\2", strings)
[1] "ABC1" "ABC2" "ABC10" "ABC1010" "ABC11" "ABC11" "ABC120"
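Reading this pattern piece by piece: (\\D+) captures a run of non-digit characters, 0+ matches the zeros right after it, and (.*) captures the rest. The sketch below, using made-up inputs, shows how that differs from the [A-Z] based patterns in the other answers.

```r
# Hypothetical inputs: lowercase prefix, all-zero suffix, no leading zero
edge <- c("abc007", "ABC0000", "ABC10")

# \\D matches any non-digit, so lowercase prefixes are handled too
nondigit <- sub("(\\D+)0+(.*)", "\\1\\2", edge)

# The [A-Z] version only matches zeros that follow an uppercase letter
upper_only <- sub("([A-Z])0+([1-9])", "\\1\\2", edge)

print(nondigit)
# [1] "abc7"  "ABC"   "ABC10"
print(upper_only)
# [1] "abc007"  "ABC0000" "ABC10"
```

Note that this pattern removes an all-zero suffix completely, since (.*) is allowed to match nothing.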