Java 11 Generic Regex to parse a given String value


I'm working on a side project in which I need to parse a String to obtain substrings.

I have a REST API whose payload contains a String parameter. The value of this String can follow any of the patterns listed below:

  1. [Name]
  2. [Name 1], [Name 2]
  3. [Name 1] and [Name 2]
  4. [Name 1], [Name 2] and [Name 3]
  5. [Name 1], [Name 2] and [Name 3], [Role]

Options I tried:

  • Including another parameter in the request payload that describes the format of the String value. For example, to pass a value of pattern #4 as input, this is the payload I would expect:

    "Value" : "Name 1, Name 2 and Name 3",
    "Format": 4

This puts the burden on the client to determine the format and set the format value accordingly, which is not a good approach.

  • Somehow determine the format (for example, by counting the commas and the occurrences of the AND keyword) and then use a regex dedicated to that format. For example, if the string contains at least one comma, an occurrence of the AND keyword, and a comma after the AND keyword, it could be pattern #5 from the list above, so use the regex: ([a-zA-Z] ( [a-zA-Z] ) ),([a-zA-Z] ( [a-zA-Z] ) ),[a-zA-Z]
    This approach does work, but it is too rigid to be practical. For example, if the value contains 4 names rather than 3, that pattern no longer matches.

Is there a more generic regex that could match every one of the patterns above?

CodePudding user response:

Here is a generic regex pattern which covers all 5 types of inputs:

^\[.*?\](?:(?:,|\s+and\s+)\s*\[.*?\](?:\s+and\s+\[.*?\])*)*$


Explanation of regex:

^                    start of string
\[.*?\]              match [Name]
(?:                  start of outer group
    (?:,|\s+and\s+)  match either comma or "and" separator
    \s*              optional whitespace
    \[.*?\]          another [Name 2]
    (?:              start of inner group
        \s+and\s+    "and" separator
        \[.*?\]      more [Name] terms
    )*               zero or more
)*                   zero or more
$                    end of string
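
A minimal Java sketch of how a pattern like this might be used to validate an input. It assumes that the names really are wrapped in literal square brackets and that the "and" separator is written as \s+and\s+ (one or more whitespace characters on each side). The class and method names (BracketListValidator, isValid) are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class BracketListValidator {

    // Assumed form of the pattern above, with \s+ on both sides of "and".
    private static final Pattern LIST = Pattern.compile(
            "^\\[.*?\\](?:(?:,|\\s+and\\s+)\\s*\\[.*?\\](?:\\s+and\\s+\\[.*?\\])*)*$");

    // Returns true when the whole value matches the list pattern.
    static boolean isValid(String value) {
        return LIST.matcher(value).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("[Name 1], [Name 2] and [Name 3], [Role]")); // true
        System.out.println(isValid("Name 1, Name 2"));                          // false
    }
}
```

Note that .*? can also run across brackets, so an input such as [Name 1] or [Name 2] is still accepted as one long bracketed term. A negated character class such as [^\]\[]* is stricter in that respect.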

CodePudding user response:

You could write a pattern that repeatedly matches everything between square brackets:

^\[[^\]\[]*](?:(?:,| and) \[[^\]\[]*])*$

In parts, the pattern matches:

  • ^ Start of string
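  • \[[^\]\[]*] Match from [ to ] with no other square brackets in between
  • (?: Non capture group
    • (?:,| and) Match either a comma or a space followed by and; the pattern then requires a single space after it
    • \[[^\]\[]*] Match from [ to ] with no other square brackets in between
  • )* Close the non capture group and optionally repeat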
  • $ End of string


In Java, with the backslashes doubled:

String regex = "^\\[[^\\]\\[]*](?:(?:,| and) \\[[^\\]\\[]*])*$";
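
Since the goal in the question is to obtain the substrings, here is a hedged sketch of one way to do that in Java: validate the whole value with the pattern above, then walk over the bracketed terms with a second, smaller pattern. The class name BracketListParser, the method parse, and the choice to throw IllegalArgumentException are illustrative assumptions, not part of the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class BracketListParser {

    // Whole-string check: the pattern from the answer above.
    private static final Pattern LIST =
            Pattern.compile("^\\[[^\\]\\[]*](?:(?:,| and) \\[[^\\]\\[]*])*$");

    // One bracketed term; group 1 is the text between the brackets.
    private static final Pattern TERM = Pattern.compile("\\[([^\\]\\[]*)]");

    // Returns the bracketed terms in order, or throws if the format is unexpected.
    static List<String> parse(String value) {
        if (!LIST.matcher(value).matches()) {
            throw new IllegalArgumentException("Unexpected format: " + value);
        }
        List<String> terms = new ArrayList<>();
        Matcher m = TERM.matcher(value);
        while (m.find()) {
            terms.add(m.group(1));
        }
        return terms;
    }

    public static void main(String[] args) {
        // Prints: [Name 1, Name 2, Name 3, Role]
        System.out.println(parse("[Name 1], [Name 2] and [Name 3], [Role]"));
    }
}
```

The second pass is used because a capturing group inside a repeated group only keeps its last match in Java, so a single match of the full pattern cannot return every name.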