I'm trying to implement a regex for the following task. A string contains a State name. At the end of the State name, optional parentheses may contain additional information. Examples of valid strings:
- New York, US
- California, United States of America (USA)
- Massachusetts, United States of America(USA)
Between the State name and the first parenthesis, a space may be present. The regex should extract the State name, dropping the optional content, as well as the space separating the State name and the optional content. At the moment I am using the following regex:
(?P<country>[A-Za-z ,]+)(?: {0,1})(?=[(])?(?:[(]\w*[)])?
Unfortunately, due to greediness, the (?: {0,1})(?=[(])? part never gets to capture the whitespace separating the State name and the optional content, as shown in this regex101.
The desired results would be "New York, US", "California, United States of America", and "Massachusetts, United States of America".
Any suggestions?
CodePudding user response:
In your pattern
(?P<country>[A-Za-z ,]+)(?: {0,1})(?=[(])?(?:[(]\w*[)])?
you can omit (?=[(])? as it is optional and will always be true, and (?: {0,1}) can be written as just a single space followed by ?
As you don't want the optional part between parentheses at the end in the final match, you could also choose not to match it and make the pattern a bit more specific:
\b(?P<country>[A-Za-z]+(?:,? [A-Za-z]+)+)\b
The pattern matches:
- \b A word boundary
- (?P<country> Named group country
- [A-Za-z]+ Match 1+ chars A-Z or a-z
- (?:,? [A-Za-z]+)+ Repeat 1+ times matching an optional comma and a space followed by 1+ chars A-Z or a-z
- ) Close the named group
- \b A word boundary
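As a quick check of the more specific pattern, here is a minimal sketch in Python (an assumption, chosen because the (?P<name>...) group syntax is the one Python's re module uses). The pattern is written with its + quantifiers spelled out, and the names STATE_RE and extract_state are only illustrative.

```python
import re

# The more specific pattern, with the "+" quantifiers written out.
STATE_RE = re.compile(r"\b(?P<country>[A-Za-z]+(?:,? [A-Za-z]+)+)\b")


def extract_state(text):
    """Return the named group 'country' from the first match, or None."""
    match = STATE_RE.search(text)
    return match.group("country") if match else None


samples = [
    "New York, US",
    "California, United States of America (USA)",
    "Massachusetts, United States of America(USA)",
]
for sample in samples:
    print(repr(extract_state(sample)))
```

Read this way, the repeated group needs at least one "space plus word" part, so a single-word name would not match; changing the final + to * would probably allow that case.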
If the part with the parentheses is optional at the end of the string and you want to match the whole string, you can introduce anchors to assert the start and the end of the string.
Then you can use the non-greedy approach with the character class, [A-Za-z ,]+?
^(?P<country>[A-Za-z ,]+?) ?(?:[(]\w*[)])?$
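The anchored, non-greedy variant can be tried the same way. This is a small sketch assuming Python's re module, with the + quantifier written out; ANCHORED_RE and extract_state_anchored are hypothetical names used only for illustration.

```python
import re

# Anchored pattern: lazy character class, optional space, optional "(...)" suffix.
ANCHORED_RE = re.compile(r"^(?P<country>[A-Za-z ,]+?) ?(?:[(]\w*[)])?$")


def extract_state_anchored(text):
    """Return the 'country' group if the whole string matches, else None."""
    match = ANCHORED_RE.match(text)
    return match.group("country") if match else None


samples = [
    "New York, US",
    "California, United States of America (USA)",
    "Massachusetts, United States of America(USA)",
]
for sample in samples:
    print(repr(extract_state_anchored(sample)))
```

Because the character class is lazy, the group stops growing as soon as the rest of the pattern (optional space, optional parenthesized suffix, end of string) can match, which is what leaves the separating space out of the group.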