Ignore specific characters before a delimiter with RegEx

I am trying to create two regular expressions to capture the needed characters of European license plates.

It's important to mention that:

the delimiter separating the country (the first letters) from the rest is always an * (asterisk) or a - (hyphen)
the delimiter separating the district from the characters to its right is always a - (hyphen)
the district can also contain letters such as ä, ö, ü

The license plates look like this:

A*S-XXXPA
A*SL-XXXPC
A*SL-XXXSD
A*HA-XXXHV
D*R-XXXXX
D*TS-XXXXX
A*VB-1XXXXX

The RegExs I use for capturing the countries and the district are the following.

String country = "^([A-Z]{1,3})";
String district = "\\h*(\\p{L}{1,3})[-*]";

Once I get the needed information out of my Strings, I remove the parts I don't need with Java code. Here is the relevant piece:

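if (matcher.find()) {
    // Group 1 holds the captured country or district
    country_region = matcher.group(1);
    // Strip any delimiter characters and trailing whitespace
    country_region = country_region.replace("*", "");
    country_region = country_region.replace("-", "");
    country_region = country_region.replaceAll("\\s+$", "");
}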

My regex capturing the countries works fine, here's an example:

https://regex101.com/r/jfhSJN/1

The one I'm having trouble with is the RegEx I use to capture districts. At the moment it also catches the countries...

https://regex101.com/r/57ZE9O/1

I guess I could just remove the asterisk at the end of my RegEx, but I do not think it's the cleanest way to do it.

Thank you!

CodePudding user response：

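The main issue with the district regex is that \h* matches zero or more horizontal whitespace characters, so it can also match nothing at all. The match can therefore begin at the very start of the string, where the country letter followed by * or - satisfies the rest of the pattern.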

Since you want to get a match after a horizontal whitespace, * or -, you can use

[*\h-](\p{L}{1,3})[-*]

See the regex demo. Here, [*\h-] matches a *, a horizontal whitespace or a - char.
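
As a rough illustration of how this pattern might be used from Java, here is a minimal sketch. The class name DistrictDemo and the helper extractDistrict are made up for this example; only the pattern itself comes from the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative, not from the question.
class DistrictDemo {
    // A leading *, - or horizontal whitespace, then 1-3 letters (captured), then - or *
    private static final Pattern DISTRICT =
            Pattern.compile("[*\\h-](\\p{L}{1,3})[-*]");

    // Returns the district code, or null when the plate does not match.
    static String extractDistrict(String plate) {
        Matcher m = DISTRICT.matcher(plate);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extractDistrict("A*SL-XXXPC")); // SL
        System.out.println(extractDistrict("D*R-XXXXX"));  // R
    }
}
```

Because both delimiters sit outside the capturing group, group 1 already contains only the district letters, so the replace calls from the question should not be needed for this value.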

However, it makes sense to use a single regex that matches the whole string while capturing all parts into groups:

^([A-Z]{1,3})[\h*-](\p{L}{1,3})[-*](.+)

See this regex demo. Details:

^ - start of string
([A-Z]{1,3}) - Group 1: one, two or three uppercase letters
[\h*-] - a horizontal whitespace, * or -
(\p{L}{1,3}) - Group 2: one to three Unicode letters
[-*] - a - or * char
(.+) - Group 3: one or more characters up to the end of the string/line.
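
The breakdown above can be sketched in Java roughly as follows. This assumes the last group is (.+), i.e. one or more characters up to the end of the line; the class name PlateParser and the method parse are hypothetical names chosen for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative, not from the question.
class PlateParser {
    // Group 1: country, Group 2: district, Group 3: the rest of the plate
    private static final Pattern PLATE =
            Pattern.compile("^([A-Z]{1,3})[\\h*-](\\p{L}{1,3})[-*](.+)");

    // Returns {country, district, rest}, or null when the plate does not match.
    static String[] parse(String plate) {
        Matcher m = PLATE.matcher(plate);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] parts = parse("A*VB-1XXXXX");
        System.out.println(String.join(" | ", parts)); // A | VB | 1XXXXX
    }
}
```

With this approach one matcher pass yields the country, the district and the remainder together, so two separate patterns and the follow-up string cleanup may not be necessary.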