Say I have a group (or set) of words: {foo, bar, baz}, and I want to match and extract the group. The words can be in any order, but they need to be next to each other. For example,
hello foo bar baz wow yeah => foo bar baz
hello bar foo baz wow yeah => bar foo baz
wow yeah hello baz bar foo hello => baz bar foo
baz yeah bar foo hello => no match
What'd be a good regex, preferably Python, to accomplish this?
CodePudding user response:
If each word can only appear once (in the entire string), you may use:
(?:\b(foo|bar|baz)(?!.*\b\1\b) ){3}
Demo.
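A minimal sketch of how this first pattern could be used from Python. The helper name `extract` and the sample lines are only illustrative. Because every repetition ends with a literal space, the raw match carries a trailing space, which the sketch strips.

```python
import re

# Pattern from the answer: each repetition matches one keyword that does not
# occur again later on the same line, followed by a single space.
pattern = re.compile(r"(?:\b(foo|bar|baz)(?!.*\b\1\b) ){3}")

lines = [
    "hello foo bar baz wow yeah",
    "hello bar foo baz wow yeah",
    "wow yeah hello baz bar foo hello",
    "baz yeah bar foo hello",
]


def extract(line):
    """Return the matched words without the trailing space, or None."""
    match = pattern.search(line)
    return match.group().rstrip() if match else None


results = [extract(line) for line in lines]
for line, result in zip(lines, results):
    print(f"{line!r} -> {result!r}")
```

Since the pattern requires a space after the third word, a group that sits at the very end of a line (with nothing after it) is not matched by this version.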
If words might repeat, I don't think you can get any shorter than something like this:*
\b(foo|bar|baz) (?!\1)(foo|bar|baz) (?!\1|\2)(foo|bar|baz)\b
Demo.
Details:
\b
- Word boundary.
(foo|bar|baz)
- Match any of the specified words and capture it in group 1.
 (?!\1)
- A space character not immediately followed by the word captured in group 1.
(foo|bar|baz)
- Match any of the specified words and capture it in group 2.
 (?!\1|\2)
- A space char not immediately followed by any of the words previously captured.
(foo|bar|baz)
- Match any of the specified words and capture it in group 3.
\b
- Word boundary.
Note: The third occurrence of foo|bar|baz can be used without a capturing group (i.e., in a non-capturing group) but I left it there for consistency.
Python example:
import re
regex = r"\b(foo|bar|baz) (?!\1)(foo|bar|baz) (?!\1|\2)(foo|bar|baz)\b"
test_str = """hello foo bar baz wow yeah
hello bar foo baz wow yeah
wow yeah hello baz bar foo hello
baz yeah bar foo hello"""
matches = re.finditer(regex, test_str, re.MULTILINE)
for match in matches:
    print(match.group())
Output:
foo bar baz
bar foo baz
baz bar foo
* We can actually use a slightly shorter pattern for this specific case: \b(foo|ba[rz]) (?!\1)(foo|ba[rz]) (?!\1|\2)(foo|ba[rz])\b but that wouldn't work for any 3 words.
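For an arbitrary word list, the longer pattern could be generated rather than written by hand. The sketch below is one way to do that; the function name `build_adjacent_words_pattern` is hypothetical, and it assumes that the words are separated by single spaces and that no word is a prefix of another (the `(?!\1)` check only looks at what immediately follows the space).

```python
import re


def build_adjacent_words_pattern(words):
    """Build a regex matching all of `words`, adjacent, in any order."""
    alternation = "|".join(re.escape(word) for word in words)
    parts = [rf"\b({alternation})"]
    for i in range(1, len(words)):
        # Reject any word already captured by groups 1..i.
        backrefs = "|".join("\\" + str(j) for j in range(1, i + 1))
        parts.append(rf" (?!{backrefs})({alternation})")
    return "".join(parts) + r"\b"


pattern = build_adjacent_words_pattern(["foo", "bar", "baz"])
match = re.search(pattern, "wow yeah hello baz bar foo hello")

print(pattern)
print(match.group() if match else None)
```

For the three words in the question this produces the same pattern as the hand-written one above.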
CodePudding user response:
You can use a positive lookahead to capture three consecutive words (group 1) together with the rest of the line (group 2), and then assert with one more lookahead per keyword that the keyword occurs before the captured remainder:
(?=(\w+ \w+ \w+)(.*))(?=.*\bfoo\b.*\2)(?=.*\bbar\b.*\2)(?=.*\bbaz\b.*\2)
Each match can then be found in group #1.
Demo: https://regex101.com/r/0tHKN5/2
EDIT: Performance improved from 5490 to 1377 steps according to regex101 by adding a word boundary assertion at the start and allowing at most 2 words around each keyword instead of scanning to the end with .*:
(?=(\b\w+ \w+ \w+)(.*))(?=(?:\w+ ){,2}\bfoo\b(?: \w+){,2}\2)(?=(?:\w+ ){,2}\bbar\b(?: \w+){,2}\2)(?=(?:\w+ ){,2}\bbaz\b(?: \w+){,2}\2)
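A hedged Python sketch of the lookahead approach, covering both the basic and the bounded variant. It assumes each `\w` in the patterns carries a `+` quantifier so that it matches a whole word; the variable names `basic` and `bounded` and the helper `extract_all` are only illustrative. Both patterns are zero-width, so `match.group()` is empty and the words are read from group 1.

```python
import re

# Assumption: every \w is quantified with + so that it matches a whole word.
basic = (
    r"(?=(\w+ \w+ \w+)(.*))"
    r"(?=.*\bfoo\b.*\2)(?=.*\bbar\b.*\2)(?=.*\bbaz\b.*\2)"
)
bounded = (
    r"(?=(\b\w+ \w+ \w+)(.*))"
    r"(?=(?:\w+ ){,2}\bfoo\b(?: \w+){,2}\2)"
    r"(?=(?:\w+ ){,2}\bbar\b(?: \w+){,2}\2)"
    r"(?=(?:\w+ ){,2}\bbaz\b(?: \w+){,2}\2)"
)

test_str = """hello foo bar baz wow yeah
hello bar foo baz wow yeah
wow yeah hello baz bar foo hello
baz yeah bar foo hello"""


def extract_all(regex, text):
    """Collect group 1 of every (zero-width) match."""
    return [match.group(1) for match in re.finditer(regex, text)]


basic_results = extract_all(basic, test_str)
bounded_results = extract_all(bounded, test_str)
print(basic_results)
print(bounded_results)
```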