Imagine the following example strings:
- ‘John @ Mary John v Mary John vs Mary’
- ‘John v Mary Ben v Paul John v Mary’
- ‘Hello World / John v Mary John @ Mary John vs Mary’
- ‘John v Mary John vs Mary John @ Mary John v Mary’
There are 3 identified delimiters:
- ' @ '
- ' v '
- ' vs '
For every row in my file, I would like to iterate through each delimiter, take the 4 characters to its left and the 4 characters to its right, concatenate the two, and return the number of concatenated substrings if they all match (otherwise 0). For the four example strings, in order:
- Example 1: we would find 'JohnMary' 3 times. Return = 3
- Example 2: we would find 'JohnMary', 'BenPaul' and 'JohnMary'. Return = 0
- Example 3: we would find 'JohnMary' 3 times. Note that 'Hello World' is irrelevant, as we only look 4 characters left and right. Return = 3
- Example 4: we would find 'JohnMary' 4 times. Return = 4
For this I'll need some sort of recursive/loop query to iterate through each delimiter in each row and count the number of matched substrings.
- Note: if the first 2 substrings encountered don't match, we don't need to check any further and can return 0 (as in example 2).
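The steps described above can be sketched as a single left-to-right scan. This is only a minimal sketch under stated assumptions: the function name `count_matching_pairs` and the constants `DELIMITERS` and `WIDTH` are hypothetical, delimiters are treated as literal strings, and a row containing exactly one delimiter returns 1, which the description above does not specify.

```python
import re
from typing import Iterable

# Delimiters taken from the question; WIDTH is the number of characters
# read on each side of a delimiter. Both names are illustrative only.
DELIMITERS = (" @ ", " v ", " vs ")
WIDTH = 4


def count_matching_pairs(row: str,
                         delimiters: Iterable[str] = DELIMITERS,
                         width: int = WIDTH) -> int:
    """Count delimiter hits if every left+right key is identical, else 0."""
    # One alternation pattern finds every delimiter in left-to-right order.
    pattern = re.compile("|".join(re.escape(d) for d in delimiters))
    first_key = None
    count = 0
    for match in pattern.finditer(row):
        left = row[max(0, match.start() - width):match.start()]
        right = row[match.end():match.end() + width]
        key = left + right
        if first_key is None:
            first_key = key
        elif key != first_key:
            # Early exit: the first mismatch decides the result.
            return 0
        count += 1
    return count


if __name__ == "__main__":
    rows = [
        "John @ Mary John v Mary John vs Mary",
        "John v Mary Ben v Paul John v Mary",
        "Hello World / John v Mary John @ Mary John vs Mary",
        "John v Mary John vs Mary John @ Mary John v Mary",
    ]
    for row in rows:
        print(count_matching_pairs(row))
```

Slicing around each match, rather than capturing the neighbouring characters in the pattern itself, leaves the left-hand context unconsumed, so consecutive pairs separated by a single space are still seen in order.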
CodePudding user response:
Try this code, which assumes there is always a space before and after the delimiter:
#!/usr/bin/python3
import re
from typing import List, Tuple, Union


def count_match(s: str, d: List[str]) -> Tuple[Union[None, str], int, int]:
    """Check the first occurrence of each delimiter in s.

    Returns (substring, count, offset); count is -1 when the substrings differ.
    """
    if len(s) == 0:
        return None, 0, 0
    counter = dict()
    offset = 0
    for each in d:
        match = re.search(re.escape(each), s)
        if match is None:
            # Stop at the first delimiter that is absent from s.
            break
        idx = match.start()
        # 4 characters to the left and to the right of the delimiter.
        sub_string1 = s[idx - 4: idx]
        sub_string2 = s[idx + len(each): idx + len(each) + 4]
        sub_string = ''.join((sub_string1, sub_string2))
        # Remember how far into s we have looked.
        offset = max(offset, idx + len(each) + 4)
        try:
            counter[sub_string] += 1
        except KeyError:
            counter[sub_string] = 1
    if not len(counter):
        return None, 0, 0
    if len(counter.keys()) > 1:
        return None, -1, 0
    return sub_string, list(counter.values())[0], offset


if __name__ == '__main__':
    text = 'John @ Mary John v Mary John vs Mary John @ Mary'
    delimiter = [' @ ', ' v ', ' vs ']
    count = 0
    ref_string = ""
    while text:
        string, partial, start = count_match(text, delimiter)
        # A substring that differs from the reference means no match at all.
        if string != ref_string and ref_string != "":
            count = 0
            break
        if partial == -1:
            count = 0
            break
        if partial == 0:
            break
        ref_string = string
        count += partial
        # Continue with the part of the text not yet examined.
        text = text[start:]
    print(count)
CodePudding user response:
Got this answer from a Matthew Barnett on a Python help forum. It also works great :)
text = '''\
John @ Mary John v Mary John vs Mary
John v Mary Ben v Paul John v Mary
Hello World / John v Mary John @ Mary John vs Mary
John v Mary John vs Mary John @ Mary John v Mary
'''
from collections import defaultdict
import re

# 4 characters, one of the delimiters, then 4 more characters.
pattern = re.compile(r'(.{4})( @ | v | vs )(.{4})')

for line in text.splitlines():
    found = defaultdict(lambda: 0)

    # Count each concatenated left+right substring.
    for before, sep, after in pattern.findall(line):
        found[before + after] += 1

    # Print the count only if a single substring occurred, more than once.
    if len(found) == 1 and sum(found.values()) > 1:
        print(list(found.values())[0])
    else:
        print(0)