Regular expression to capture n words after pattern that do not contain that pattern

Time:12-01

I'm trying to write a regular expression that captures n words after a pattern, which was answered in this question, except I want the search to keep going for another n words if it encounters that pattern again. For example, if my main search pattern is 'x', and I want to capture a word that contains 'x' and n=3 words after it that don't contain 'x', the following string should result in three matches:

Lorem ipsum dolxor sit amet, consectetur adipiscing elit. Morxbi fringilla, dui axt tincidunt consectetur, libero arcu cursus arcxu, ut commodo lexctus magna vitxae venenatis neque.

Expected matches:

  1. dolxor sit amet, consectetur
  2. Morxbi fringilla, dui axt tincidunt consectetur, libero
  3. arcxu, ut commodo lexctus magna vitxae venenatis neque.

Matching n=3 words after is straightforward: [^ ]*x[^ ]*(?: [^ ]*){0,3}

How to keep going if another 'x' is encountered, I'm not sure. I've tried this -- [^ ]*x[^ ]*(?: (?![^ ]*x[^ ]*)[^ ]*){0,3} -- but it terminates the search instead of continuing on to check the next n words, which, given the example above, gives six results instead of the expected three:

  1. dolxor sit amet, consectetur
  2. Morxbi fringilla, dui
  3. axt tincidunt consectetur, libero
  4. arcxu, ut commodo
  5. lexctus magna
  6. vitxae venenatis neque.
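
The six results above can be reproduced with a short script. This is only a sketch for checking the behaviour; the variable names (text, attempt, results) are made up for illustration.

```python
import re

text = ("Lorem ipsum dolxor sit amet, consectetur adipiscing elit. "
        "Morxbi fringilla, dui axt tincidunt consectetur, libero arcu "
        "cursus arcxu, ut commodo lexctus magna vitxae venenatis neque.")

# The attempt from above: a word containing 'x', then up to 3 following
# words, none of which may contain 'x' (enforced by the negative lookahead).
attempt = re.compile(r"[^ ]*x[^ ]*(?: (?![^ ]*x[^ ]*)[^ ]*){0,3}")

results = attempt.findall(text)
for number, match in enumerate(results, start=1):
    print(number, match)
```

The lookahead makes the match stop as soon as the next word contains 'x', so that word starts a new match of its own rather than extending the current one.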

P.S. I'm working with python.


EDIT: For context, I'm trying to get sufficient information about the surroundings of each appearance of the given pattern. (For simplicity's sake, I'm only including the words after the pattern, but it's easy to generalize to the words before it.) The problem with the first regex is that a word containing the pattern may lack information about its own surroundings if it gets picked up as part of the surroundings of another match. For example, given the text above, the first regex would match 'Morxbi fringilla, dui axt', which gives us information about what comes after 'Morxbi' but not 'axt'. The second regex doesn't help because a match that has another match in its surroundings now loses that information, e.g., we won't know the third word that comes after 'Morxbi'.

CodePudding user response:

It's not clear that regex is the most natural way to solve your use case. Consider this hybrid approach.

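import re
from typing import Optional

pattern = re.compile(r"x")  # or whatever

def get_at_least_n(text: str, n=3) -> Optional[range]:
    """Return the word-index range of the longest chained match, or None."""
    words = text.split()
    matches = list(map(pattern.search, words))
    if not any(matches):
        return None
    # Tuples sort by length first, so the last one is the longest chain.
    last = sorted(_get_ranges(matches, n))[-1]
    _, i, j = last
    assert j > i
    return range(i, j)

def _get_ranges(matches, n):
    """Yield (length, start, stop) for the chain starting at each matching word."""
    for i in range(len(matches)):
        if matches[i]:
            j = i + n  # index of the last word in the current window
            k = i + 1
            while k < len(matches) and k <= j:
                if matches[k]:
                    j = k + n  # another match: extend the window by n words
                k += 1
            stop = min(j + 1, len(matches))  # exclusive end, clipped to the text
            yield stop - i, i, stop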

The regular expression engine loops over characters and handles regular languages (plus extensions such as lookaheads), but it is not Turing complete. For one thing, evaluation is guaranteed to always halt. (Cheating a bit, wrapping it in a loop, such as sed offers, would enable Turing completeness and even Conway's Game of Life: https://bitly.com/regexgol)

Here, the longest match you're looking for could appear anywhere in the input, for example in

(1.) A B x D E x G H x J K L M N O x Q R x T

or

(2.) A B x D E F G H I J K L M N O x Q R x T

With n=3, the correct answer for (1.) is "x D E x G H x J K L" and for (2.) is "x Q R x T".

Computing the longest match using just the regex engine is not straightforward, and the result would be illegible, unmaintainable code.
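
For comparison, here is a rough sketch of how the chaining alone (not the choice of the longest chain) might be written as a single pattern. The helper name build_chain_regex and its parameters are invented for this illustration, and it assumes words are separated by single spaces, as in the question. The idea: a word containing the needle, then any number of "at most n-1 clean words followed by another needle word" groups, then up to n clean words.

```python
import re

def build_chain_regex(needle: str = "x", n: int = 3) -> re.Pattern:
    """Hypothetical helper: one match per chain of needle words."""
    escaped = re.escape(needle)
    hit = rf"[^ ]*{escaped}[^ ]*"        # a word containing the needle
    miss = rf" (?![^ ]*{escaped})[^ ]*"  # a space plus a word without it
    # hit, then (up to n-1 misses, then another hit) repeated,
    # then a tail of up to n misses.
    return re.compile(
        rf"{hit}(?:(?:{miss}){{0,{n - 1}}} {hit})*(?:{miss}){{0,{n}}}"
    )

text = ("Lorem ipsum dolxor sit amet, consectetur adipiscing elit. "
        "Morxbi fringilla, dui axt tincidunt consectetur, libero arcu "
        "cursus arcxu, ut commodo lexctus magna vitxae venenatis neque.")

chains = build_chain_regex("x", 3).findall(text)
for chain in chains:
    print(chain)
```

On the question's text this yields the three expected chains, but the pattern is already hard to read, and selecting the longest chain would still have to happen in Python.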

Given that you have CPython's Turing machine available to you, hey, may as well use it, right?
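
If what you want is every chain rather than only the longest one, the same word-level idea can be sketched as a plain loop. This is a minimal illustration under my own assumptions, not a definitive implementation: the function name chained_contexts and its parameters are hypothetical, and words are split on whitespace.

```python
import re
from typing import List

def chained_contexts(text: str, pattern: re.Pattern, n: int = 3) -> List[str]:
    """Hypothetical helper: return each matching word plus the n words after it,
    extending by another n words whenever a further match falls in the window."""
    words = text.split()
    hits = [bool(pattern.search(word)) for word in words]
    chains = []
    i = 0
    while i < len(words):
        if not hits[i]:
            i += 1
            continue
        end = i + n  # index of the last word in the current window
        k = i + 1
        while k < len(words) and k <= end:
            if hits[k]:
                end = k + n  # another match: push the window out by n words
            k += 1
        chains.append(" ".join(words[i:k]))
        i = k  # resume scanning after this chain
    return chains

text = ("Lorem ipsum dolxor sit amet, consectetur adipiscing elit. "
        "Morxbi fringilla, dui axt tincidunt consectetur, libero arcu "
        "cursus arcxu, ut commodo lexctus magna vitxae venenatis neque.")

for chain in chained_contexts(text, re.compile(r"x"), n=3):
    print(chain)
```

The regex is only used to test individual words; the windowing logic lives in ordinary Python, which keeps it easy to adapt, for instance to include words before the match as well.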
