not greedy and optional comma splits the whole string

Time:10-23

I have the following string: [Example] öäüß asdf 1234 (1aö) (not necessary),

Explanation:

[Example] optional, not needed

öäüß asdf 1234 the most important part which I need. Every character, number, special character as well as German characters like äÄöÖüÜß can be found here.
A greedy match might be the best way to avoid having to list characters such as the German ones explicitly, right?

(1aö) optional and needed

(not necessary) optional, not needed. If it appears it could be (not ...) or (unusual)

, the comma is optional too, and it is not needed either.

I use the following RegEx: /(?:\[.*\]\s)?(?<name>.*?)(?:\s\([not|unusual].*?\))?\,/g

The problems:

  1. When I make the comma optional with the ? quantifier, the match splits the whole string into separate characters.

  2. When I change the non-greedy quantifier in the name group to a greedy one, the optional comma is left out of the name, but the example string starting with ö is then selected all the way to the end.

  3. The string inside the first pair of round brackets () can start with an upper-case or a lower-case character. At the moment I can only recognize upper case.

Here's my attempt at regex101 with a bunch of examples: https://regex101.com/r/Lx2anw/1

Sorry for the quite specific question, but I'm at the end with my knowledge ...

Does anyone have suggestions what I can do here?

CodePudding user response:

It will work if you put each part that you don't want to capture into its own non-capturing group. Your expression will look like this:

/(?:\[.*\]\s)?(?<name>.+?)(?:\s\(not \w+\))?(?:\s\(unusual\))?,?$/gm

https://regex101.com/r/a7Qtvw/1
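
As a rough illustration, here is a minimal sketch of running a pattern like this over several lines at once. It assumes JavaScript (the question does not name a language) and assumes + quantifiers in the name and "not" groups. The sample lines are made up for the example.

```javascript
// Assumption: JavaScript regex flavour; the sample input is hypothetical.
const pattern = /(?:\[.*\]\s)?(?<name>.+?)(?:\s\(not \w+\))?(?:\s\(unusual\))?,?$/gm;

const text = [
  "[Example] öäüß asdf 1234 (1aö) (not necessary),",
  "foo bar (unusual)",
  "baz (Ab),",
].join("\n");

// With the m flag, $ matches at the end of every line, so each line yields one match.
const names = [...text.matchAll(pattern)].map((m) => m.groups.name);
console.log(names);
```

Because the name group uses .+? rather than .*?, it has to consume at least one character, which avoids empty matches.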

CodePudding user response:

You can use

^(?:\[.*?]\s)?(?<name>.*?)(?:\s\((?:not|unusual)[^()]*\))?,?\s*$


Details:

  • ^ - start of string
  • (?:\[.*?]\s)? - an optional sequence of [...] and a whitespace
  • (?<name>.*?) - Group "name": any zero or more chars, as few as possible
  • (?:\s\((?:not|unusual)[^()]*\))? - an optional sequence of a whitespace, (, not or unusual, and then zero or more chars other than ( and ) and then a ) char
  • ,? - an optional comma
  • \s* - zero or more whitespaces
  • $ - end of string
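
To see the pieces above working together, here is a small sketch that applies the pattern to one string at a time. It assumes JavaScript; the helper name extractName and the sample strings are only for illustration.

```javascript
// Assumption: JavaScript regex flavour. No m flag, so ^ and $ anchor the whole string.
const anchored = /^(?:\[.*?]\s)?(?<name>.*?)(?:\s\((?:not|unusual)[^()]*\))?,?\s*$/;

// Hypothetical helper: returns the "name" group, or null when nothing matches.
function extractName(line) {
  const m = anchored.exec(line);
  return m ? m.groups.name : null;
}

console.log(extractName("[Example] öäüß asdf 1234 (1aö) (not necessary),"));
console.log(extractName("[Example] abc (Xy) (unusual)"));
console.log(extractName("öäüß asdf 1234"));
```

The lazy name group only stops growing once the optional tail (the not/unusual part, the comma, trailing whitespace) reaches the end of the string, which is why the anchors matter here.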

CodePudding user response:

Your pattern matches the rest of the line in the name group because everything that follows that group in the pattern is optional.

Note that you use a character class [not|unusual], which matches a single character out of the listed set. If you want to match one of the alternatives, use a group instead, like (?:not|unusual)
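
Both points can be checked with a short sketch, assuming JavaScript. The first pattern is the one from the question with the comma made optional; the short input "ab," is made up for the example.

```javascript
// Assumption: JavaScript regex flavour. The question's pattern with ",?" instead of ",".
const allOptional = /(?:\[.*\]\s)?(?<name>.*?)(?:\s\([not|unusual].*?\))?,?/g;

// Every part may match nothing, so the lazy name group stays empty at each position.
const pieces = [..."ab,".matchAll(allOptional)].map((m) => m[0]);
console.log(pieces); // [ '', '', ',', '' ]

// [not|unusual] is a set of single characters, not a choice between two words.
const charClass = /^[not|unusual]$/;
const alternation = /^(?:not|unusual)$/;
console.log(charClass.test("s"), charClass.test("not"));     // true false
console.log(alternation.test("s"), alternation.test("not")); // false true
```

The empty matches at every position are what shows up as the string being "split into separate characters".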

Alternatively, you can match any character except parentheses, stopping before a comma at the end of the string.

Then optionally match one part between parentheses.

^(?:\[[^\][\n]*\]\s)?(?<name>(?:(?!,\s*$)[^\n()])+(?:\([^()\n]*\))?)

Explanation

  • ^ Start of string
  • (?:\[[^\][\n]*\]\s)? Optionally match [...]
  • (?<name> Group name
    • (?: Non capture group
      • (?!,\s*$)[^\n()] If we are not looking at a trailing comma, match any character except ( ) or a newline
    • ) Close the non capture group and repeat 1 or more times to not match an empty line
    • (?:\([^()\n]*\))? Optionally match a part from (...)
  • ) Close group name
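
A minimal sketch of this approach, assuming JavaScript and using a + after the inner group, as the explanation describes ("repeat 1 or more times"). The helper name pickName and the sample strings are only for illustration.

```javascript
// Assumption: JavaScript regex flavour; one string is tested at a time (no m flag).
const basePattern = /^(?:\[[^\][\n]*\]\s)?(?<name>(?:(?!,\s*$)[^\n()])+(?:\([^()\n]*\))?)/;

// Hypothetical helper: returns the "name" group, or null when nothing matches.
function pickName(line) {
  const m = basePattern.exec(line);
  return m ? m.groups.name : null;
}

console.log(pickName("[Example] öäüß asdf 1234 (1aö) (not necessary),"));
console.log(pickName("öäüß asdf 1234,"));
console.log(pickName("")); // null, because + needs at least one character
```

The pattern has no end anchor, so the match simply stops after the first (...) part and whatever follows is left unmatched.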


If the first part between parentheses should not start with the words not or unusual, you can assert that using a negative lookahead (?!not\b|unusual\b)

^(?:\[[^\][\n]*\]\s)?(?<name>(?:(?!,\s*$)[^\n()])+(?:\((?!not\b|unusual\b)[^()\n]*\))?)
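
The effect of the lookahead can be sketched as follows, assuming JavaScript and assuming the inner group is repeated with +. The helper name nameOf, the trim() call and the sample strings are additions for illustration only.

```javascript
// Assumption: JavaScript regex flavour; one string is tested at a time (no m flag).
const withLookahead = /^(?:\[[^\][\n]*\]\s)?(?<name>(?:(?!,\s*$)[^\n()])+(?:\((?!not\b|unusual\b)[^()\n]*\))?)/;

// Hypothetical helper: trims the space that is left when the (...) part is rejected.
function nameOf(line) {
  const m = withLookahead.exec(line);
  return m ? m.groups.name.trim() : null;
}

console.log(nameOf("öäüß asdf 1234 (not necessary),")); // (not ...) is excluded
console.log(nameOf("öäüß asdf 1234 (1aö) (unusual),")); // (1aö) is kept
console.log(nameOf("öäüß asdf 1234 (nothing),"));       // \b keeps "nothing" allowed
```

The word boundary \b matters: without it, a bracket starting with a longer word such as "nothing" would be rejected as well.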

