The strings I parse with a regular expression contain a region of fixed length N where there can be either numbers or dashes. However, if a dash occurs, only dashes are allowed to follow for the rest of the region. After this region, numbers, dashes, and letters are allowed to occur.
Examples (N=5, starting at the beginning):
12345ABC
12345123
1234-1
1234--1
1----1AB
How can I correctly match this? I am currently stuck at something like (?:\d|-(?!\d)){5}[A-Z0-9\-] (for N=5), but I cannot make numbers work directly following my region if a dash is present, as the negative lookahead blocks the match.
Update
Strings that should not be matched (N=5):
1-2-3-A
----1AB
--1--1A
CodePudding user response:
You could assert that the first 5 characters are either digits or -, and make sure that there is no - before a digit in the first 5 chars.
^(?![\d-]{0,3}-\d)(?=[\d-]{5})[A-Z\d-]+$
^  Start of string
(?![\d-]{0,3}-\d)  Make sure that in the first 5 chars there is no - before a digit
(?=[\d-]{5})  Assert that the first 5 chars are digits or -
[A-Z\d-]+  Match 1+ times any of the listed characters
$  End of string
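To check the lookahead pattern against the examples from the question, a small Python script can help. This is only a sketch: `build_pattern` is a hypothetical helper, and the generalisation to other N (replacing `{0,3}` with `{0,N-2}` and `{5}` with `{N}`) is an assumption that holds for N >= 2.

```python
import re


def build_pattern(n: int) -> re.Pattern:
    """Hypothetical helper: lookahead-based pattern for a region of length n."""
    if n < 2:
        raise ValueError("n must be at least 2")
    # A dash at index 0..n-2 followed by a digit would mean that a digit
    # follows a dash inside the region, so the negative lookahead rejects it.
    return re.compile(rf"^(?![\d-]{{0,{n - 2}}}-\d)(?=[\d-]{{{n}}})[A-Z\d-]+$")


pattern = build_pattern(5)

# Examples taken from the question (N=5).
should_match = ["12345ABC", "12345123", "1234-1", "1234--1", "1----1AB"]
should_not_match = ["1-2-3-A", "----1AB", "--1--1A"]

for s in should_match + should_not_match:
    print(f"{s!r}: {bool(pattern.match(s))}")
```

For N=5 the helper produces exactly the pattern shown above.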
If atomic groups are available:
^(?=[\d-]{5})(?>\d+-*|-{5})[A-Z\d_]*$
^  Start of string
(?=[\d-]{5})  Assert that the first 5 chars are - or a digit
(?>  Atomic group
\d+-*  Match 1+ digits and optional -
|  or
-{5}  match 5 times -
)  Close atomic group
[A-Z\d_]*  Match optional chars A-Z, digit or _
$  End of string
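Python's `re` module only accepts `(?>...)` from version 3.11 on. As a sketch for older versions, the atomic group can be emulated with the common lookahead-plus-backreference idiom `(?=(...))\1`. This is a swapped-in technique and not part of the answer itself. Note also that the trailing class `[A-Z\d_]`, as written in the answer, does not accept a dash once the leading run of digits and dashes has ended.

```python
import re

# (?=(\d+-*|-{5}))\1 stands in for (?>\d+-*|-{5}): once the lookahead has
# succeeded, the engine does not backtrack into it, so the captured text
# behaves like the content of an atomic group.
pattern = re.compile(r"^(?=[\d-]{5})(?=(\d+-*|-{5}))\1[A-Z\d_]*$")

# Examples taken from the question (N=5).
should_match = ["12345ABC", "12345123", "1234-1", "1234--1", "1----1AB"]
should_not_match = ["1-2-3-A", "----1AB", "--1--1A"]

for s in should_match + should_not_match:
    print(f"{s!r}: {bool(pattern.match(s))}")

# A dash after a letter is not accepted by the trailing class [A-Z\d_].
print(bool(pattern.match("12345A-1")))
```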
CodePudding user response:
Use a non-word-boundary assertion \B:
^[-\d](?:-|\B\d){4}[A-Z\d-]*$
A non-word boundary succeeds at a position between two word characters (from \w, i.e. [A-Za-z0-9_]) or between two non-word characters (from \W, i.e. [^A-Za-z0-9_]), and also between a non-word character and the start or end of the string.
With it, each \B\d always follows a digit and can't follow a dash, because the preceding character in the region is either a digit or a dash.
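The behaviour of `\B\d` can be tried out in isolation before testing the full pattern. The following Python sketch assumes plain ASCII input, as in the question's examples.

```python
import re

# \B\d needs a word character directly before the digit.
after_dash = re.search(r"\B\d", "-1")   # dash then digit: a word boundary, so no match
after_digit = re.search(r"\B\d", "21")  # digit then digit: no boundary, so "1" matches
print(after_dash, after_digit)

pattern = re.compile(r"^[-\d](?:-|\B\d){4}[A-Z\d-]*$")

# Examples taken from the question (N=5).
should_match = ["12345ABC", "12345123", "1234-1", "1234--1", "1----1AB"]
should_not_match = ["1-2-3-A", "----1AB", "--1--1A"]

for s in should_match + should_not_match:
    print(f"{s!r}: {bool(pattern.match(s))}")
```

For a different N, the `{4}` would presumably become `{N-1}`, since the first character of the region is matched separately by `[-\d]`.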
Another way (if lookbehinds are allowed):
^\d*-*(?<=^.{5})[A-Z\d-]*$
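Here `\d*-*` consumes the digits and then the dashes, and the lookbehind `(?<=^.{5})` only succeeds when that run stops exactly 5 characters from the start. A minimal Python sketch follows; `region_pattern` is a hypothetical helper, and swapping `{5}` for `{N}` is an assumption about how the pattern generalises. Python's `re` requires fixed-width lookbehinds, which `^.{5}` satisfies.

```python
import re


def region_pattern(n: int) -> re.Pattern:
    """Hypothetical helper: lookbehind-based pattern for a region of length n."""
    # After the digits and dashes, the current position must be exactly
    # n characters from the start of the string.
    return re.compile(rf"^\d*-*(?<=^.{{{n}}})[A-Z\d-]*$")


pattern = region_pattern(5)

# Examples taken from the question (N=5).
should_match = ["12345ABC", "12345123", "1234-1", "1234--1", "1----1AB"]
should_not_match = ["1-2-3-A", "----1AB", "--1--1A"]

for s in should_match + should_not_match:
    print(f"{s!r}: {bool(pattern.match(s))}")
```

With longer runs such as 1234--1, the engine backtracks `-*` until the lookbehind position is reached, and the remaining dash is then matched by `[A-Z\d-]*`.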