I need to match text between arbitrary \w: placeholders (so n: text, foo: text and n: text foo: more text; more examples are in the test script below).
To do this, I'm using Python's finditer and a regex, but I can't capture multiple words between placeholders. How can I adjust either the regex or the finditer call to do what I want?
import re

def test_query_parse_regex(query, expected_result):
    result = {}
    # perform the matching here, this needs to change
    r = r"([\w-]+):\s?([\w-]*)"
    matches = re.finditer(r, query)
    for match in matches:
        # eg 'n'
        operator = match.group(1).strip()
        # eg 'text'
        operator_value = match.group(2).strip()
        # build a dict for comparison
        result[operator] = operator_value
    if result == expected_result:
        print(f"PASS: {query}")
    else:
        print(f"FAIL: {query}")
        print(f"  Expected: {expected_result}")
        print(f"  Got     : {result}")

checks = [
    # Query, expected
    ("n: tom", {"n": "tom"}),
    ("n: tom preston", {"n": "tom preston"}),
    ("n: tom l: london", {"n": "tom", "l": "london"}),
    ("n: tom preston l: london derry", {"n": "tom preston", "l": "london derry"}),
]

for check in checks:
    test_query_parse_regex(*check)
Note: I've tried a positive lookahead, but can't make that work either: r"([\w-]+):\s?([\w-]*)(?=\w:)"
Answer:
You can use
r = r"([\w-]+):\s*(.*?)(?=[\w-]+:|$)"
or
r = r"([\w-]+):\s*(.*?)(?=[\w-]+:|\Z)"
Note that if your strings can contain line breaks, you will also need to change the re.finditer call to
re.finditer(r, query, re.DOTALL)
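To illustrate the effect of re.DOTALL, here is a minimal sketch. The two-line query string is a hypothetical input, not one of the checks from the question; the pattern is the \Z variant from above.

```python
import re

# Pattern from the answer: key, colon, then a lazy value that stops
# right before the next "key:" or at the end of the string.
PATTERN = r"([\w-]+):\s*(.*?)(?=[\w-]+:|\Z)"

# Hypothetical query whose first value spans two lines.
query = "n: tom\npreston l: london"

# Without re.DOTALL, "." cannot cross the line break, so the "n:" entry
# never reaches a point where the lookahead succeeds and is skipped.
without_dotall = [
    (m.group(1), m.group(2).strip()) for m in re.finditer(PATTERN, query)
]

# With re.DOTALL, "." also matches "\n", so the value may span lines.
with_dotall = [
    (m.group(1), m.group(2).strip())
    for m in re.finditer(PATTERN, query, re.DOTALL)
]

print(without_dotall)
print(with_dotall)
```

In this sketch the first list contains only the l entry, while the second contains both entries, with the line break kept inside the n value.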
Prefer the version with \Z if you use the re.M (re.MULTILINE) option, since \Z always matches only at the very end of the string, whereas $ then also matches before each line break.
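The difference can be seen in a short sketch. The names DOLLAR, END_Z and parse, as well as the two-line query, are made up for this illustration.

```python
import re

# The two variants from the answer; they differ only in the end anchor.
DOLLAR = r"([\w-]+):\s*(.*?)(?=[\w-]+:|$)"
END_Z = r"([\w-]+):\s*(.*?)(?=[\w-]+:|\Z)"

# Hypothetical query whose first value spans two lines.
query = "n: tom\npreston l: london"
flags = re.DOTALL | re.MULTILINE


def parse(pattern):
    """Collect key/value pairs found by the given pattern (illustrative helper)."""
    return {
        m.group(1): m.group(2).strip()
        for m in re.finditer(pattern, query, flags)
    }


# Under re.MULTILINE, "$" also matches before the line break, so the lazy
# value for "n" stops at the end of the first line.
dollar_result = parse(DOLLAR)

# "\Z" matches only at the very end of the string, so the value for "n"
# keeps growing until the next "key:" is found.
z_result = parse(END_Z)

print(dollar_result)
print(z_result)
```

With these inputs, the $ variant cuts the n value off at the first line break, while the \Z variant keeps both lines of it.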
Details:
([\w-]+) - Group 1: one or more word or hyphen chars
:\s* - a colon and zero or more whitespace chars
(.*?) - Group 2: zero or more chars other than line break chars (if re.DOTALL is not used), as few as possible
(?=[\w-]+:|\Z) - a positive lookahead that requires one or more word or hyphen chars followed by a colon, or the end of the string, immediately to the right of the current location.
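Putting the pieces together, the pattern can be dropped into a small parser and run against the checks from the question. This is only a sketch: parse_query is a hypothetical helper name, and precompiling the pattern with re.DOTALL is one possible choice, not a requirement.

```python
import re

# Key, colon, optional whitespace, then a lazy value that ends right
# before the next "key:" or at the very end of the string.
PATTERN = re.compile(r"([\w-]+):\s*(.*?)(?=[\w-]+:|\Z)", re.DOTALL)


def parse_query(query):
    """Return a dict mapping each operator to its stripped value."""
    return {m.group(1): m.group(2).strip() for m in PATTERN.finditer(query)}


# The checks from the question.
checks = [
    ("n: tom", {"n": "tom"}),
    ("n: tom preston", {"n": "tom preston"}),
    ("n: tom l: london", {"n": "tom", "l": "london"}),
    ("n: tom preston l: london derry", {"n": "tom preston", "l": "london derry"}),
]

for query, expected in checks:
    result = parse_query(query)
    status = "PASS" if result == expected else "FAIL"
    print(f"{status}: {query}")
```

The lazy group captures the whitespace that precedes the next key, which is why the value is still passed through strip(), as in the original test script.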