Regex group doesn't match with "?" even if it should

Time:02-03

input strings:

"| VLAN56                    | LAB06    | Labor 06                          | 56     | 172.16.56.0/24   | VLAN56_LAB06       | ✔️           |            |",
"| VLAN57                    | LAB07    | Labor 07                          | 57     | 172.16.57.0/24   | VLAN57_LAB07       | ✔️           | @#848484:  |"

regex:

'\|\s+(\d+).+(VLAN\d+_[0-9A-Za-z]+)\s+\|.+(#[0-9A-Fa-f]{6})?'

The goal is to get the VLAN number, hostname, and if there is one, the color code, but with a "?" it ignores the color code every time, even when it should match.

With the "?" the last capture group is always None.

CodePudding user response:

You may use this regex:

\|\s+(\d+).+(VLAN\d+_[0-9A-Za-z]+)\s+\|[^|]+\|[^#|]*(#[0-9A-Fa-f]{6})?

You have a demo here: https://regex101.com/r/SWe42v/1

The reason why it didn't work with your regex is that .+ is greedy: it matches as much as it can.

So, when you made the last group optional with ?, you gave the engine no reason to backtrack. The final .+ matches the rest of the line and the group captures nothing (which is still a valid match, because the group is optional).
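A minimal sketch that reproduces the behaviour described above, using the two input strings and the regex from the question. The names `ORIGINAL`, `rows` and `original_groups` are only illustrative.

```python
import re

# The regex from the question: the trailing .+ is greedy and the last group is optional.
ORIGINAL = re.compile(r"\|\s+(\d+).+(VLAN\d+_[0-9A-Za-z]+)\s+\|.+(#[0-9A-Fa-f]{6})?")

rows = [
    "| VLAN56                    | LAB06    | Labor 06                          | 56     | 172.16.56.0/24   | VLAN56_LAB06       | ✔️           |            |",
    "| VLAN57                    | LAB07    | Labor 07                          | 57     | 172.16.57.0/24   | VLAN57_LAB07       | ✔️           | @#848484:  |",
]

# The final .+ consumes everything up to the end of the line, so the
# optional colour group is left unmatched (None) even when a colour exists.
original_groups = [ORIGINAL.search(row).groups() for row in rows]

for groups in original_groups:
    print(groups)
# ('56', 'VLAN56_LAB06', None)
# ('57', 'VLAN57_LAB07', None)
```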

To fix it, you can match the column with the emoji explicitly. You don't care about its content, so you can simply use \|[^|]+ to skip the column.

This sort of construct is widely used in regexes: SEPARATOR[^SEPARATOR]*
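As a quick check of the suggested regex, here is a small sketch run against both input strings from the question. The names `FIXED`, `rows` and `fixed_groups` are assumptions made for the example.

```python
import re

# \|[^|]+ skips the emoji column, and [^#|]* stops right before a '#',
# so the optional group gets a chance to match the colour code.
FIXED = re.compile(
    r"\|\s+(\d+).+(VLAN\d+_[0-9A-Za-z]+)\s+\|[^|]+\|[^#|]*(#[0-9A-Fa-f]{6})?"
)

rows = [
    "| VLAN56                    | LAB06    | Labor 06                          | 56     | 172.16.56.0/24   | VLAN56_LAB06       | ✔️           |            |",
    "| VLAN57                    | LAB07    | Labor 07                          | 57     | 172.16.57.0/24   | VLAN57_LAB07       | ✔️           | @#848484:  |",
]

fixed_groups = [FIXED.search(row).groups() for row in rows]

for groups in fixed_groups:
    print(groups)
# ('56', 'VLAN56_LAB06', None)
# ('57', 'VLAN57_LAB07', '#848484')
```

Because the negated character classes cannot run past a pipe or a hash, the engine no longer has a way to swallow the colour code before the optional group is tried.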

CodePudding user response:

The reason why the last capture group is None is that the preceding .+ can consume the rest of the line.

I would, however, first use the fact that this is a pipe-separated format: split on the pipe symbol, then pick the elements of interest out of the result by their index:

import re

s = "| VLAN57                    | LAB07    | Labor 07                          | 57     | 172.16.57.0/24   | VLAN57_LAB07       | ✔️           | @#848484:  |"

vlan,name,color = re.split(r"\s*\|\s*", s)[4:9:2]
print(vlan, name, color)

This code is in my opinion easier to read and to maintain.
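The same idea could be wrapped in a helper that also handles rows without a colour and strips the extra characters around the hex code. This is only a sketch: `parse_row` is a hypothetical name, and the column positions are assumed to be the same in every row.

```python
import re

def parse_row(row):
    """Split a pipe-separated row and return (vlan, name, color or None)."""
    cells = re.split(r"\s*\|\s*", row)
    # cells[0] is the empty string before the leading pipe, so the
    # VLAN number, host name and colour cell sit at indexes 4, 6 and 8.
    vlan, name, color_cell = cells[4:9:2]
    match = re.search(r"#[0-9A-Fa-f]{6}", color_cell)
    return vlan, name, (match.group(0) if match else None)

rows = [
    "| VLAN56                    | LAB06    | Labor 06                          | 56     | 172.16.56.0/24   | VLAN56_LAB06       | ✔️           |            |",
    "| VLAN57                    | LAB07    | Labor 07                          | 57     | 172.16.57.0/24   | VLAN57_LAB07       | ✔️           | @#848484:  |",
]

for row in rows:
    print(parse_row(row))
# ('56', 'VLAN56_LAB06', None)
# ('57', 'VLAN57_LAB07', '#848484')
```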

CodePudding user response:

I think this is what you're after: Demo

^\|\s+(VLAN[0-9A-Za-z]+)\s+\|\s+([0-9A-Za-z]+)\s+\|.*((?<=\#)[0-9A-Fa-f]{6})?.*$

  • ^\|\s+ - the start of the line must be a pipe followed by some whitespace.
  • (VLAN[0-9A-Za-z]+) - What comes next is the VLAN, so we capture it: the literal VLAN and all (at least 1) following alphanumeric chars.
  • \s+\|\s+ - there's then another pipe delimiter, with whitespace either side.
  • ([0-9A-Za-z]+) - the column after the VLAN name is the device name, so we capture the alphanumeric value from that.
  • \s+\| - after our device there's more whitespace and then the delimiter
  • .* - following that there's a load of stuff that we're not interested in; could be anything.
  • ((?<=\#)[0-9A-Fa-f]{6})? - next there may be a 6 hex char value preceded by a hash; we want to capture only the hex value part.
    • (...) says this is another capture group
    • (?<=\#) is a positive look behind; i.e. checks that we're preceded by some value (in this case #) but doesn't include it within the surrounding capture
    • [0-9A-Fa-f]{6} is the 6 hex chars to capture
    • ? after the parenthesis says there's 0 or 1 of these (i.e. it's optional); so if it's there we capture it, but if it's not that's not an issue.
  • .*$ says we can have whatever else through to the end of the string.
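The lookbehind piece from the breakdown above can be tried on its own. The sketch below searches for it directly instead of placing it after a greedy .*, which sidesteps the backtracking issue discussed in the other answers. `HEX_AFTER_HASH` and `color_without_hash` are hypothetical names.

```python
import re

# (?<=#) asserts that a '#' comes immediately before the match
# without including it in the matched text.
HEX_AFTER_HASH = re.compile(r"(?<=#)[0-9A-Fa-f]{6}")

def color_without_hash(row):
    """Return the 6 hex chars that follow a '#', or None if there are none."""
    match = HEX_AFTER_HASH.search(row)
    return match.group(0) if match else None

with_color = "| VLAN57                    | LAB07    | Labor 07                          | 57     | 172.16.57.0/24   | VLAN57_LAB07       | ✔️           | @#848484:  |"
without_color = "| VLAN56                    | LAB06    | Labor 06                          | 56     | 172.16.56.0/24   | VLAN56_LAB06       | ✔️           |            |"

print(color_without_hash(with_color))     # 848484
print(color_without_hash(without_color))  # None
```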

We could strip a few of those bits out, or add more in. For example, if we know exactly what column everything will be in, we can massively simplify by just capturing content from those columns:

^\|\s*([^\|\s]+)\s*\|\s*([^\|\s]+)\s*\|\s*[^\|]*\s*\|\s*[^\|\s]*\s*\|\s*[^\|\s]*\s*\|\s*[^\|\s]*\s*\|\s*[^\|\s]*\s*\|\s*[^\|\d]*(\d{6})?[^\|]*\s*\|$

... But amend per your requirements / whatever feels most robust and suitable for your purposes.
