How to remove all zeros before first non-zero number WITHIN a string

Time:12-29

I have the following vector with strings:

strings <- c("ABC0001", "ABC02", "ABC10", "ABC01010", "ABC11", "ABC011", "ABC0120")

"ABC0001"  "ABC02"    "ABC10"    "ABC01010" "ABC11"    "ABC011"   "ABC0120" 

Desired output:

[1] "ABC1"    "ABC2"    "ABC10"   "ABC1010" "ABC11"   "ABC11"   "ABC120"

My question: What is the regular expression pattern for zeros before first integer in a string?

So far I have tried:

library(stringr)
str_replace(strings, '0+', "")

which gives:

[1] "ABC1"    "ABC2"    "ABC1"    "ABC1010" "ABC11"   "ABC11"   "ABC120" 

Note: ABC1 in position 3 is not desired; it should be ABC10. I suspect this might be easy, but I can't get it.

I want to learn the regular expression pattern!

CodePudding user response:

Here is a base R option using sub, with lookarounds:

strings <- c("ABC0001", "ABC02", "ABC10", "ABC01010", "ABC11", "ABC011", "ABC0120")
output <- sub("(?<=[A-Z])0+(?=.)", "", strings, perl=TRUE)
output

[1] "ABC1"    "ABC2"    "ABC10"   "ABC1010" "ABC11"   "ABC11"   "ABC120"

Here is an explanation of the regex pattern being used:

(?<=[A-Z])  assert that an uppercase letter precedes
0+          match one or more zeroes
(?=.)       assert that some character follows

The (?<=[A-Z]) and (?=.) constructs are called lookarounds. Here they make sure that the zeros we target directly follow a letter and are not at the very end of the input value. For an input like ABC110, the output stays ABC110, i.e. the final zero is not removed, because it follows a digit rather than a letter.
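To see what each lookaround contributes, here is a small sketch that runs the same pattern on a few edge cases. The inputs in `edge_cases` are hypothetical and not part of the question; they are chosen only to exercise the lookbehind and the lookahead.

```r
# Hypothetical edge-case inputs (assumptions, not from the question).
edge_cases <- c("ABC110", "ABC0", "ABC000", "0ABC01")

# Same pattern as above: zeros directly after an uppercase letter,
# with at least one character left after them.
pattern <- "(?<=[A-Z])0+(?=.)"

result <- sub(pattern, "", edge_cases, perl = TRUE)
print(result)
# "ABC110": the zero follows a digit, so the lookbehind rejects it.
# "ABC0":   the zero is the last character, so the lookahead rejects it.
# "ABC000": the match backtracks to two zeros so that one character remains.
# "0ABC01": the leading zero has no letter before it; only the inner zero goes.
```

With these inputs the call should return "ABC110", "ABC0", "ABC0" and "0ABC1".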

CodePudding user response:

I think I'd go with 'sub' and use capture groups:

strings <- c("ABC0001", "ABC02", "ABC10", "ABC01010", "ABC11", "ABC011", "ABC0120", "ABC1001")
output <- sub("([A-Z])0+([1-9])", "\\1\\2", strings, perl=TRUE)
output

[1] "ABC1"    "ABC2"    "ABC10"   "ABC1010" "ABC11"   "ABC11"   "ABC120"   "ABC1001"

Tested on Jdoodle and an online regex demo


  • ([A-Z]) - Capture any uppercase letter in 1st group;
  • 0+ - One or more (greedy) consecutive zeros;
  • ([1-9]) - A single digit in the range 1-9, captured in a 2nd group. Note that you can remove this group if you are OK with 'ABC0000' becoming 'ABC'!

Replace with \1\2.
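As a rough illustration of the note about the second group, the sketch below compares the pattern with and without `([1-9])`. The vector `inputs` is a hypothetical test set, and `with_group` / `without_group` are names made up for this example.

```r
# Hypothetical inputs (assumption): one normal case, one all-zero suffix,
# and one case that must stay untouched.
inputs <- c("ABC0001", "ABC0000", "ABC10")

# With the second group, the zeros must be followed by a digit 1-9,
# so an all-zero suffix is left alone.
with_group <- sub("([A-Z])0+([1-9])", "\\1\\2", inputs, perl = TRUE)

# Without the second group, every run of zeros right after a letter is removed.
without_group <- sub("([A-Z])0+", "\\1", inputs, perl = TRUE)

print(with_group)
print(without_group)
```

Here `with_group` should keep "ABC0000" unchanged, while `without_group` reduces it to "ABC"; both leave "ABC10" as it is.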

CodePudding user response:

We can try the code below

> sub("(\\D+)0+(.*)", "\\1\\2", strings)
[1] "ABC1"    "ABC2"    "ABC10"   "ABC1010" "ABC11"   "ABC11"   "ABC120"
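One practical difference is that `\\D+` accepts any non-digit prefix, whereas the `[A-Z]` patterns in the other answers only accept an uppercase letter before the zeros. A minimal sketch, assuming hypothetical inputs with a lowercase or punctuated prefix (`other_prefixes` is not from the question):

```r
# Hypothetical inputs (assumption): prefixes that do not end in an uppercase letter.
other_prefixes <- c("abc007", "A-0012", "ABC10")

# \\D+ captures the run of non-digits, 0+ consumes the zeros that follow it,
# and (.*) keeps the rest of the string.
non_digit_prefix <- sub("(\\D+)0+(.*)", "\\1\\2", other_prefixes)

# The lookaround pattern requires an uppercase letter directly before the zeros.
upper_only <- sub("(?<=[A-Z])0+(?=.)", "", other_prefixes, perl = TRUE)

print(non_digit_prefix)
print(upper_only)
```

On these inputs the `\\D+` version should strip the zeros in "abc007" and "A-0012", while the `[A-Z]` version leaves all three strings unchanged.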