I'm trying to write a regex (Java 8 flavor) that matches all non-word characters in several different strings, so I can split on them. The only exception is a ":" between digits, such as in "8:00AM", which should not be matched.
So far, I've come up with this: "\W(?:(?<!\d)(?!\d))|[-/](?=\d)"
I tested it against the following strings:
M-F: 10AM - 6PM
M-D: 9am / 6pm F: 9am / 4pm
Seg-Qui: 08h às 17h Sex: 08h às 16h
L-V: 8:00AM - 6:00PM CST
M, F, 10AM-5PM
Lun-Jeu: 9/18h Ven:9/17h
However, there are the following issues:
In the string Lun-Jeu: 9/18h Ven:9/17h, it's not selecting the ":" in Ven:9.
In the string Seg-Qui: 08h às 17h Sex: 08h às 16h, I also would like to select the whole word "às" if possible.
Could anyone help to fix the regex or provide a better solution to achieve this?
CodePudding user response:
You can use
(?U)\W(?<!\d:(?=\d))
In Java:
String regex = "(?U)\\W(?<!\\d:(?=\\d))";
See the regex demo.
Details:
(?U) - the embedded flag option for Pattern.UNICODE_CHARACTER_CLASS; it makes \d, \W and the other shorthand classes Unicode-aware
\W - any non-word char
(?<!\d:(?=\d)) - a negative lookbehind that fails the match if the char just matched is a ":" that is immediately preceded by a digit and immediately followed by a digit.
To also keep a dot between digits from being matched, use (?U)\W(?<!\d[:.](?=\d)). You may add more chars to that character class if you wish.
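As a rough sketch of how this pattern could be used for the actual splitting: the class name HoursSplitter and the helper tokens are made up for illustration, and dropping empty tokens is my own assumption about what you want. The pattern matches one separator char at a time, so a sequence like ": " would otherwise leave an empty string in the result.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class HoursSplitter {
    // The pattern from this answer: any non-word char, except a ':' that sits between two digits.
    private static final Pattern SEPARATOR =
            Pattern.compile("(?U)\\W(?<!\\d:(?=\\d))");

    // Hypothetical helper: splits on the pattern and drops the empty strings
    // produced by adjacent separators (an assumption, not part of the question).
    public static List<String> tokens(String input) {
        return Arrays.stream(SEPARATOR.split(input))
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(tokens("L-V: 8:00AM - 6:00PM CST"));
        System.out.println(tokens("Lun-Jeu: 9/18h Ven:9/17h"));
        // \u00e0 is 'à'; written as an escape to avoid source-encoding issues.
        System.out.println(tokens("Seg-Qui: 08h \u00e0s 17h"));
    }
}
```

With the sample strings this should print [L, V, 8:00AM, 6:00PM, CST], then [Lun, Jeu, 9, 18h, Ven, 9, 17h], then [Seg, Qui, 08h, às, 17h]. The "às" survives as one token only because of (?U); without it, "à" counts as a non-word char.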
CodePudding user response:
Try this:
(?<!\d)[^\p{L}\d]|[^\p{L}\d](?!\d)
It matches any char that is neither a Unicode letter (which includes à) nor a digit, but only if it is either not preceded by a digit or not followed by a digit.
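A minimal sketch of splitting with this pattern, under the same kind of assumptions: the class name LetterDigitSplitter and the helper tokens are hypothetical, and filtering out empty tokens is my own addition.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class LetterDigitSplitter {
    // The pattern from this answer: a char that is neither a letter nor a digit,
    // unless it has a digit on both sides.
    private static final Pattern SEPARATOR =
            Pattern.compile("(?<!\\d)[^\\p{L}\\d]|[^\\p{L}\\d](?!\\d)");

    // Hypothetical helper: splits on the pattern and drops empty tokens.
    public static List<String> tokens(String input) {
        return Arrays.stream(SEPARATOR.split(input))
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(tokens("L-V: 8:00AM - 6:00PM CST"));
        System.out.println(tokens("Lun-Jeu: 9/18h Ven:9/17h"));
    }
}
```

This should print [L, V, 8:00AM, 6:00PM, CST] and [Lun, Jeu, 9/18h, Ven, 9/17h]. Note that the rule applies to any separator between two digits, not just ":", so "9/18h" stays in one piece here; whether that is what you want depends on how you plan to parse the hours.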