Complex splitting of a String using REGEX, only discarding spaces

Time:10-14

In Java (JDK 11), consider the following string:

String hello = "333+444 5qwerty5 006 -7";

I am trying to come up with a regex that splits on anything that isn't a digit, while keeping the separators except for spaces. So in the above example, I would like to end up with the following array:

["333" , "+" , "444" , "5" , "q" , "w" , "e" , "r" , "t" , "y" , "5" , "006" , "-7"]

Note that the leading zeroes in 006 and the minus sign in -7 are kept. The code I am using is the following:

String[] splited = hello.split("((?<=[^0-9]+)|(?=[^0-9]+)|(\\s+))");

However, I can see that my array is keeping spaces. I can't for the life of me figure out my mistake. Any thoughts?

EDIT: It turns out the requirement is even more complex than I thought. The expected result is actually:

["333+444" , "5" , "q" , "w" , "e" , "r" , "t" , "y" , "5" , "006" , "-7"]

So if there is no space between an integer and one of the operators + - * / % ^, they should not be split. I am having trouble implementing this rule together with the requirement that leading zeroes and negative numbers stay intact.

CodePudding user response:

Instead of using split, you could also match all the parts:

-?\d+|\S

The pattern matches:

  • -? Optionally match a hyphen
  • \d+ Match 1 or more digits
  • | Or
  • \S Match a single non-whitespace char


Example

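import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// An optionally signed run of digits, or any single non-whitespace character
String regex = "-?\\d+|\\S";
String string = "333+444 5qwerty5 006 -7";

List<String> allMatches = new ArrayList<>();

// Collect every match in order; whitespace is never matched, so it is dropped
Matcher m = Pattern.compile(regex).matcher(string);
while (m.find()) {
    allMatches.add(m.group());
}

System.out.println(allMatches);

Output

[333, +, 444, 5, q, w, e, r, t, y, 5, 006, -7]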
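The same match-based idea could plausibly be stretched to cover the rule from the EDIT, where an integer joined to another integer by an operator with no spaces stays in one piece. The following is only a sketch under assumptions: the class name OperatorAwareTokenizer, the method tokenize, and the extended pattern are all hypothetical, and the sample input is assumed to be "333+444 5qwerty5 006 -7". It relies on Matcher.results(), which is available from Java 9.

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class OperatorAwareTokenizer {
    // Hypothetical pattern: an optionally signed integer, followed by zero or
    // more "operator + optionally signed integer" groups with no spaces in
    // between; otherwise any single non-whitespace character.
    private static final Pattern TOKEN =
            Pattern.compile("-?\\d+(?:[-+*/%^]-?\\d+)*|\\S");

    static List<String> tokenize(String input) {
        // Whitespace matches neither alternative, so it is skipped.
        return TOKEN.matcher(input)
                .results()
                .map(MatchResult::group)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // prints [333+444, 5, q, w, e, r, t, y, 5, 006, -7]
        System.out.println(tokenize("333+444 5qwerty5 006 -7"));
    }
}
```

Note that this sketch only keeps operator chains together when both sides are integers; an operator next to a letter is still emitted as its own element.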

CodePudding user response:

This works for your example:

String[] split = hello.split("(?<=\\d)(?=\\D) *|(?<=[^\\d -])(?=[\\d-])|(?<=[\\d-])(?=[^\\d -])|(?<=[^\\d -])(?=[^\\d -])");

The important parts are:

  • Using [\d-] instead of \d so that minus signs are treated as "digits"
  • Using [^\d -] instead of \D in most places to prevent empty split elements at word ends
  • Splitting after a digit, but only if a non-digit follows
  • Adding " *" (a space followed by *) so that spaces at a split point are consumed and therefore discarded
  • Splitting between consecutive non-digit characters
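To make the points above easier to follow, the same regex can be assembled from named pieces, one per alternative. This is just a sketch: the names SplitRegexParts, OTHER, NUMERIC and split are hypothetical, and the sample input is assumed to be "333+444 5qwerty5 006 -7". The joined string is character-for-character the regex used in this answer.

```java
import java.util.Arrays;

class SplitRegexParts {
    // Any character that is not a digit, a space, or a minus sign.
    static final String OTHER = "[^\\d -]";
    // Digits plus the minus sign, so that "-7" is never split apart.
    static final String NUMERIC = "[\\d-]";

    static final String REGEX = String.join("|",
            // After a digit and before a non-digit; also consumes any spaces there.
            "(?<=\\d)(?=\\D) *",
            // Between an "other" character and the start of a number.
            "(?<=" + OTHER + ")(?=" + NUMERIC + ")",
            // Between a number (or minus sign) and an "other" character.
            "(?<=" + NUMERIC + ")(?=" + OTHER + ")",
            // Between two consecutive "other" characters.
            "(?<=" + OTHER + ")(?=" + OTHER + ")");

    static String[] split(String input) {
        return input.split(REGEX);
    }

    public static void main(String[] args) {
        // prints [333, +, 444, 5, q, w, e, r, t, y, 5, 006, -7]
        System.out.println(Arrays.toString(split("333+444 5qwerty5 006 -7")));
    }
}
```

Only the first alternative consumes characters; the other three are zero-width, which is why nothing except spaces is lost.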

Test code:

String hello = "333+444 5qwerty5 006 -7";
String[] split = hello.split("(?<=\\d)(?=\\D) *|(?<=[^\\d -])(?=[\\d-])|(?<=[\\d-])(?=[^\\d -])|(?<=[^\\d -])(?=[^\\d -])");
System.out.println(Arrays.toString(split));

Output:

[333, +, 444, 5, q, w, e, r, t, y, 5, 006, -7]