Regular expression to match all non-word characters unless between numbers

I'm trying to write a regex (Java 8 flavor) that matches all non-word characters in several different strings, so that I can split on them. The only exception is a ":" between digits, as in "8:00AM".

So far, I've come up with this: "\W(?:(?<!\d)(?!\d))|[-/](?=\d)"
I tested it against the strings below:

M-F: 10AM - 6PM
M-D: 9am / 6pm F: 9am / 4pm
Seg-Qui: 08h às 17h Sex: 08h às 16h
L-V: 8:00AM - 6:00PM CST
M, F, 10AM-5PM
Lun-Jeu: 9/18h Ven:9/17h

However, there are the following issues:

In the string Lun-Jeu: 9/18h Ven:9/17h, it's not selecting the ":" in Ven:9.
In the string Seg-Qui: 08h às 17h Sex: 08h às 16h, I would also like to select the whole word "às", if possible.

Could anyone help to fix the regex or provide a better solution to achieve this?

CodePudding user response:

You can use

(?U)\W(?<!\d:(?=\d))

In Java:

String regex = "(?U)\\W(?<!\\d:(?=\\d))";

Details:

(?U) - the embedded flag for Pattern.UNICODE_CHARACTER_CLASS; it makes \d, \W and the other shorthand classes Unicode-aware
\W - any non-word char
(?<!\d:(?=\d)) - a negative lookbehind that fails the match if the char just matched is a : that is immediately preceded by a digit and immediately followed by a digit.

To also leave a dot between digits unmatched, use (?U)\W(?<!\d[:.](?=\d)). You may add more chars to the [:.] character class if you wish.
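
As a rough sketch of how the two patterns above could be used for splitting: the class name SplitHours and the helper tokens are made up for illustration, and dropping the empty strings that appear between adjacent separators is an assumption about what the asker wants. The "à" is written as \u00E0 so the example does not depend on the source file encoding.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class SplitHours {
    // Pattern from the answer: any non-word char, except ":" between two digits.
    static final Pattern COLON_SAFE =
            Pattern.compile("(?U)\\W(?<!\\d:(?=\\d))");
    // Variant from the answer: also leaves "." between two digits alone.
    static final Pattern COLON_DOT_SAFE =
            Pattern.compile("(?U)\\W(?<!\\d[:.](?=\\d))");

    // Hypothetical helper: split on every separator, then drop the empty
    // strings produced by adjacent separators such as ": " or " - ".
    static List<String> tokens(Pattern separator, String input) {
        return Arrays.stream(separator.split(input))
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(tokens(COLON_SAFE, "L-V: 8:00AM - 6:00PM CST"));
        System.out.println(tokens(COLON_SAFE, "Lun-Jeu: 9/18h Ven:9/17h"));
        // "\u00E0" is "à"; with (?U) it counts as a word character.
        System.out.println(tokens(COLON_SAFE, "Seg-Qui: 08h \u00E0s 17h"));
        System.out.println(tokens(COLON_DOT_SAFE, "8.30-17.00"));
    }
}
```

With this pattern the "/" in 9/18h is still a separator, so 9 and 18h end up as separate tokens, while 8:00AM stays in one piece.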

CodePudding user response:

Try this:

(?<!\d)[^\p{L}\d]|[^\p{L}\d](?!\d)

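It matches any character that is neither a Unicode letter (à counts as a letter, so it is not matched) nor a digit, but only if that character is not preceded by a digit or not followed by a digit. A separator that sits directly between two digits, such as the ":" in 8:00, is therefore left alone.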
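
A minimal sketch of splitting with this pattern in Java follows. The class name LetterDigitSplit and the helper tokens are hypothetical, and filtering out empty strings is an assumption about the desired output.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class LetterDigitSplit {
    // Pattern from the answer: a char that is neither a letter nor a digit,
    // matched unless it has a digit on both sides.
    static final Pattern SEPARATOR =
            Pattern.compile("(?<!\\d)[^\\p{L}\\d]|[^\\p{L}\\d](?!\\d)");

    // Hypothetical helper: split, then drop empty strings left between
    // adjacent separators.
    static List<String> tokens(String input) {
        return Arrays.stream(SEPARATOR.split(input))
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(tokens("L-V: 8:00AM - 6:00PM CST"));
        System.out.println(tokens("Lun-Jeu: 9/18h Ven:9/17h"));
        // "\u00E0" is "à", which \p{L} treats as a letter.
        System.out.println(tokens("Seg-Qui: 08h \u00E0s 17h"));
    }
}
```

One difference worth noting: because any separator between two digits is skipped, 9/18h stays in one piece here, whereas the question's own pattern treats a "/" before a digit as a split point. An underscore is also treated as a separator by this pattern, unlike with \W.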