Filter similar matches from regex.finditer

I created the regex pattern ((\>|\s)I{0,1}(tem|TEM)(\s|)\w+((\s|)(\-|\–|\—|:|\<)|\.\s)). It is meant to find section headers inside a document. However, it sometimes also matches text in the body of a section rather than a header. As a first step toward solving this, I was thinking of keeping only the pattern variant that has the most matches in the re.finditer output, and excluding the variants that look like outliers.

For example, given the string "ITEM 1. and& ITEM 2. Traceback xd Item 1. ff942> in <mITEM 3. ITEM 4.", the most common pattern found is \sITEM\s\d+\.\s, whereas \sItem\s\d+\.\s would be excluded.

Is there a way to print the matched pattern in regex format from this output? Is there any other way that avoids creating 'n' different patterns and searching for each of them in a loop?

import re
string = "ITEM 1. and& ITEM 2. Traceback xd Item 1. ff942> in <mITEM 3. ITEM 4."
pattern = re.compile(
    r'((\>|\s)I{0,1}(tem|TEM)(\s|)\w+((\s|)(\-|\–|\—|:|\<)|\.\s))')
print([x for x in pattern.finditer(string)])

Update:

To be more specific about the problem: the regex pattern ((\>|\s)I(TEM|tem)(\s|)\d{1,2}((\s|)(\-|\–|\—|:|\<)|\.\s)) can produce matches such as the following:

matches = [
    ">Item 6.", "ITEM 3<", "Item 4.", ">Item 11.", "> Item 0.",
    ">ITEM 3.", ">ITEM 2.", ">ITEM 23.", " ITEM69.", ">Item8.",
    ">Item 6.", "ITEM 3<", "Item 4.", ">Item 11.", "> Item 0.",
    ">ITEM 3.", ">ITEM 2.", ">ITEM 23.", " ITEM69.", ">Item8.",
    ">Item 6.", "ITEM 3<", "Item 4.", ">Item 11."]

The ideal result keeps only the matches that follow the most common form. The final list of matches would therefore be: ['>ITEM 2.', '>ITEM 23.', '>ITEM 2.', '>ITEM 23.', '>ITEM 3.', '>ITEM 3.'].

CodePudding user response：

You might use a two-pass solution: 1) get the group values for all matches with all variations, 2) build a new pattern based on the most frequent group match and re-match the string.

See the following Python demo:

import re
text = "ITEM 1. and& ITEM 2. Traceback xd Item 1. ff942> in <mITEM 3. ITEM 4."
i = "tem|TEM"
regex = fr"I{{0,1}}({i})\s*\w+(?=\s*[-–—:<]|\.(?!\S))"
lst = [x.group(1) for x in re.finditer(regex, text)]
new_i = max(set(lst), key=lst.count)
print( new_i )                                             # => TEM
regex = fr"I{{0,1}}({new_i})\s*\w+(?=\s*[-–—:<]|\.(?!\S))"
print( [x.group() for x in re.finditer(regex, text)] )
# => ['ITEM 1', 'ITEM 2', 'ITEM 3', 'ITEM 4']

Here,

i = "tem|TEM" declares a variable containing the pattern variants under consideration
fr"I{{0,1}}({i})\s*\w+(?=\s*[-–—:<]|\.(?!\S))" defines the initial regex
lst = [x.group(1) for x in re.finditer(regex, text)] gets all the values of the first capturing group
new_i = max(set(lst), key=lst.count) finds the most frequent Group 1 value
regex = fr"I{{0,1}}({new_i})\s*\w+(?=\s*[-–—:<]|\.(?!\S))" is the new pattern.
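
The two passes can also be wrapped in a reusable function. The following is only a sketch under a few assumptions: the helper name filter_by_most_common_case and its variants parameter are hypothetical (they do not appear in the answer above), collections.Counter replaces the max(set(lst), key=lst.count) idiom, and I? is used as an equivalent of I{0,1}.

```python
import re
from collections import Counter

# Hypothetical helper (not part of the original answer): keeps only the
# matches whose case variant ("tem" or "TEM") is the most frequent one.
def filter_by_most_common_case(text, variants=("tem", "TEM")):
    # Shared tail of the pattern: optional whitespace, word characters,
    # then a lookahead for a dash/colon/"<" or a dot not followed by text.
    tail = r"\s*\w+(?=\s*[-–—:<]|\.(?!\S))"
    # Pass 1: count which variant each match used (capturing group 1).
    alternatives = "|".join(re.escape(v) for v in variants)
    first_pass = re.compile("I?(" + alternatives + ")" + tail)
    counts = Counter(m.group(1) for m in first_pass.finditer(text))
    if not counts:
        return []
    winner = counts.most_common(1)[0][0]
    # Pass 2: re-match the text with only the winning variant.
    second_pass = re.compile("I?(" + re.escape(winner) + ")" + tail)
    return [m.group() for m in second_pass.finditer(text)]

text = "ITEM 1. and& ITEM 2. Traceback xd Item 1. ff942> in <mITEM 3. ITEM 4."
result = filter_by_most_common_case(text)
print(result)  # => ['ITEM 1', 'ITEM 2', 'ITEM 3', 'ITEM 4']
```

If two variants are equally frequent, Counter.most_common returns the one seen first, so the outcome then depends on the order of the matches in the text.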

CodePudding user response：

I want to identify groups of common/identical matches coming from regex.finditer. The pattern ((\>|\s)I{0,1}(tem|TEM)(\s|)\w+((\s|)(\-|\–|\—|:|\<)|\.\s)) can produce many different combinations. Within each combination only the digits are allowed to vary, so >Item 6. and >Item 66. belong to the same pattern. The solution I found is simple and straightforward.

Explanation of the code

Given a list of re.Match objects, a tuple (match, re.sub("...", '', match.group())) is created for each of them. The idea is to attach to each re.Match a string that does not contain the characters which are allowed to vary (digits, in my case). I then simply look for identical strings and append the corresponding matches to the final result.

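import re

def rematch(
    matches: list,
    dynamic_chars: str = r"\d"
):
    chains = list()
    # Pair each match with its text stripped of the characters allowed to vary.
    matches = [(m, re.sub("[{}]+".format(dynamic_chars), '', m.group())) for m in matches]
    while len(matches) != 0:
        match = matches[0]
        # All remaining matches that share the same stripped text.
        correlates = [m for m in matches if m[1] == match[1]]
        if len(correlates) > 1:
            chains.append([c[0] for c in correlates])
            for c in correlates:
                matches.remove(c)
        else:
            # A form that occurs only once is treated as an outlier and dropped.
            matches.remove(match)

    # Largest group of identical forms first.
    return sorted(chains, key=len, reverse=True)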
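
The same grouping idea can also be expressed with a dictionary keyed on the digit-stripped text, which avoids repeatedly removing items from a list. This is a minimal sketch, not the function above: the helper name group_by_shape, the simplified pattern, and the sample text are all assumptions made for illustration, and unlike rematch it keeps groups that have only one member.

```python
import re
from collections import defaultdict

# Hypothetical helper: groups matches by their text with the variable
# characters (digits by default) removed.
def group_by_shape(matches, dynamic_chars=r"\d"):
    groups = defaultdict(list)
    for m in matches:
        shape = re.sub("[{}]+".format(dynamic_chars), "", m.group())
        groups[shape].append(m)
    # Largest group first; sorted() is stable, so ties keep first-seen order.
    return sorted(groups.values(), key=len, reverse=True)

# Simplified pattern and sample text, assumed for this example only.
pattern = re.compile(r"[>\s]I(?:TEM|tem)\s?\d{1,2}\.")
text = " >ITEM 2. body >Item 6. body >ITEM 23. body >ITEM 3. body ITEM69."
chains = group_by_shape(pattern.finditer(text))
print([m.group() for m in chains[0]])  # => ['>ITEM 2.', '>ITEM 23.', '>ITEM 3.']
```

Here chains[0] holds the matches of the most common form, and the remaining entries hold the less frequent forms.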