Count number of substrings in String by multiple delimiters

Imagine the following example Strings

‘John @ Mary John v Mary John vs Mary’
‘John v Mary Ben v Paul John v Mary’
‘Hello World / John v Mary John @ Mary John vs Mary’
‘John v Mary John vs Mary John @ Mary John v Mary’

There are 3 identified delimiters

' @ '
' v '
' vs '

For every field row in my file, I would like to iterate through each delimiter, look 4 characters to its left and 4 characters to its right, concatenate the two pieces, and return the number of concatenated substrings found, provided they all match (otherwise return 0).

Example 1: we would end up finding 'JohnMary' 3 times. Return = 3
Example 2: we would end up finding 'JohnMary', 'BenPaul' and 'JohnMary'. Return = 0
Example 3: we would end up finding 'JohnMary' 3 times. Note that 'Hello World' is irrelevant, as we only look 4 characters left and right. Return = 3
Example 4: we would end up finding 'JohnMary' 4 times. Return = 4
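
To make the rule above concrete, here is a minimal sketch of the extraction step only, assuming the three delimiters from the question. The names `DELIMITERS`, `WIDTH` and `extract_keys` are made up for illustration.

```python
import re
from typing import List

# Hypothetical names for illustration; only the three delimiters come from the question.
DELIMITERS = (" @ ", " v ", " vs ")
DELIMITER_RE = re.compile("|".join(re.escape(d) for d in DELIMITERS))
WIDTH = 4  # characters taken on each side of a delimiter


def extract_keys(row: str) -> List[str]:
    """Return the 4 characters before + 4 characters after every delimiter in row."""
    keys = []
    for m in DELIMITER_RE.finditer(row):
        left = row[max(m.start() - WIDTH, 0):m.start()]
        right = row[m.end():m.end() + WIDTH]
        keys.append(left + right)
    return keys


print(extract_keys("John v Mary Ben v Paul John v Mary"))
# ['JohnMary', ' BenPaul', 'JohnMary']
```

Note that a strict 4-character window picks up the preceding space when a name is shorter than 4 characters (' Ben' rather than 'Ben'). That does not change the outcome here, since the key still differs from 'JohnMary'.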

For this I'll need some sort of recursive/loop query to iterate through each delimiter in each row and count the number of matched substrings.

Note: if the first 2 substrings encountered aren't a match, we don't need to check any further and can return 0 (as in example 2).
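
The early exit described in this note could be sketched with a generator, so that the rest of the row is never scanned once a mismatch is seen. This is only one possible reading of the requirement: the function names are hypothetical, and returning 1 for a row with a single delimiter is an assumption, since the question does not cover that case.

```python
import re
from typing import Iterator

# Matches ' @ ', ' v ' or ' vs ' (the delimiters from the question).
DELIMITER_RE = re.compile("|".join(re.escape(d) for d in (" @ ", " v ", " vs ")))


def iter_keys(row: str, width: int = 4) -> Iterator[str]:
    """Lazily yield the left+right substring around each delimiter."""
    for m in DELIMITER_RE.finditer(row):
        yield row[max(m.start() - width, 0):m.start()] + row[m.end():m.end() + width]


def count_matching(row: str, width: int = 4) -> int:
    """Count the substrings in row, or return 0 as soon as one differs from the first."""
    first = None
    count = 0
    for key in iter_keys(row, width):
        if first is None:
            first = key
        elif key != first:
            return 0  # early exit: the remaining delimiters are never examined
        count += 1
    return count


rows = [
    "John @ Mary John v Mary John vs Mary",
    "John v Mary Ben v Paul John v Mary",
    "Hello World / John v Mary John @ Mary John vs Mary",
    "John v Mary John vs Mary John @ Mary John v Mary",
]
for row in rows:
    print(count_matching(row))  # prints 3, 0, 3, 4 (one per line)
```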

CodePudding user response：

Try this code, which assumes there is always a space before and after the delimiter:

#!/usr/bin/python3

import re
from typing import List, Optional, Tuple


def count_match(s: str, d: List[str]) -> Tuple[Optional[str], int, int]:
    """Find the earliest delimiter in s.

    Returns (key, matches, offset): key is the 4 characters before the
    delimiter joined with the 4 characters after it, matches is 1 if a
    delimiter was found (else 0), and offset is where the next search starts.
    """
    first = None
    for each in d:
        # re.escape keeps the delimiter literal even if it contains regex metacharacters
        match = re.search(re.escape(each), s)
        if match is not None and (first is None or match.start() < first.start()):
            first = match
    if first is None:
        return None, 0, 0

    idx = first.start()
    end = first.end()
    sub_string1 = s[max(idx - 4, 0): idx]
    sub_string2 = s[end: end + 4]
    sub_string = ''.join((sub_string1, sub_string2))
    return sub_string, 1, end + 4


if __name__ == '__main__':
    text = 'John @ Mary John v Mary John vs Mary John @ Mary'
    delimiter = [' @ ', ' v ', ' vs ']
    count = 0
    ref_string = ""
    while text:
        string, partial, start = count_match(text, delimiter)
        if partial == 0:
            # No delimiter left in the remaining text
            break
        if string != ref_string and ref_string != "":
            # A substring differs from the first one: the whole row counts as 0
            count = 0
            break
        ref_string = string
        count += partial
        text = text[start:]

    print(count)

CodePudding user response：

Got this answer from a Matthew Barnett on a Python help forum. It also works great :)

import re
from collections import defaultdict

text = '''\
John @ Mary John v Mary John vs Mary
John v Mary Ben v Paul John v Mary
Hello World / John v Mary John @ Mary John vs Mary
John v Mary John vs Mary John @ Mary John v Mary
'''

# 4 characters, one of the delimiters, then 4 more characters
pattern = re.compile('(.{4})( @ | v | vs )(.{4})')

for line in text.splitlines():
    found = defaultdict(int)

    for before, sep, after in pattern.findall(line):
        # Key each match by the 4 characters on either side of the delimiter
        found[before + after] += 1

    # Report the count only if every match produced the same key
    if len(found) == 1 and sum(found.values()) > 1:
        print(list(found.values())[0])
    else:
        print(0)
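
Since the question is about rows in a file, the same regular expression can be wrapped in a function and applied line by line. The sketch below is a hedged variation, not part of the original answer: `count_row` and `count_rows` are hypothetical names, and `io.StringIO` stands in for an open file.

```python
import io
import re
from collections import Counter
from typing import Iterable, List

# Same pattern as in the answer: 4 characters, a delimiter, 4 characters.
PATTERN = re.compile(r"(.{4})( @ | v | vs )(.{4})")


def count_row(line: str) -> int:
    """Return the number of matches if they all share one key and there are at least 2, else 0."""
    found = Counter(before + after for before, _sep, after in PATTERN.findall(line))
    total = sum(found.values())
    if len(found) == 1 and total > 1:
        return total
    return 0


def count_rows(lines: Iterable[str]) -> List[int]:
    """Apply count_row to every line of an iterable such as an open file."""
    return [count_row(line.rstrip("\n")) for line in lines]


# io.StringIO stands in for a file here; with a real file, pass the file object instead.
sample = io.StringIO(
    "John @ Mary John v Mary John vs Mary\n"
    "John v Mary Ben v Paul John v Mary\n"
    "Hello World / John v Mary John @ Mary John vs Mary\n"
    "John v Mary John vs Mary John @ Mary John v Mary\n"
)
print(count_rows(sample))  # [3, 0, 3, 4]
```

As in the answer, a row with only one match yields 0 because of the `> 1` check.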