Find whether a string array has words matching a pattern of digits, using Python

I have many string arrays whose words contain different numeric combinations:

array1 = ['nt1xz2.log','nt3xz32.log','nt5xz12.log','nt6xz3.log','nt9xz7.log','nt4xz3.log']
array2 = ['ztxz2.log','ztxz3.log','ztxz6.log','ztxz9.log','ztxz34.log','ztxz12.log']
array3 = ['joy_1.log','joy_2.log','joy_3.log','joy_6.log','joy_8.log','joy_2.log']
array4 = ['name_12_23.log','name_14_13.log','name_15_23.log','name_22_23.log','name_18_73.log','name_92_23.log']

Each of the arrays above follows some pattern, but the pattern changes from array to array.

I want to find whether the words of an array follow a pattern (taking the first word as the reference), and how many words in the array match that pattern.

E.g. array1 has the pattern 'nt\d+xz\d+.log' and the number of words matching that pattern is 6. The pattern can be determined from the first word of the array, using something like str1.replace('\d','*') where str1 = array[0].

CodePudding user response：

Here is one way to try to guess the patterns. Consider this only a quick proof of concept: it has only been tested on the provided samples and only works for simple character types.

The core idea is to group characters by type using unicodedata.category, then determine, across the corresponding sub-parts of the words in the array, whether each subpattern is identical, of the same type, or of the same length.
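To make the grouping step concrete, here is a minimal sketch of how a single word is split into runs of same-category characters. The helper name chunk_by_category is made up for this illustration and is not part of the answer's code.

```python
from itertools import groupby
from unicodedata import category

def chunk_by_category(word):
    """Split a word into (category, text) runs of same-category characters."""
    # chunk_by_category is a hypothetical helper used only for illustration
    return [(cat, ''.join(chars)) for cat, chars in groupby(word, key=category)]

chunks = chunk_by_category('nt1xz2.log')
print(chunks)
# [('Ll', 'nt'), ('Nd', '1'), ('Ll', 'xz'), ('Nd', '2'), ('Po', '.'), ('Ll', 'log')]
```

'Ll' is a lowercase letter, 'Nd' a decimal digit and 'Po' "other punctuation". An underscore falls in a different category ('Pc'), so it forms its own chunk.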

def guess_pattern(array):
    from itertools import groupby
    from unicodedata import category
    import re

    # group by character category, then align the chunks across all words
    chunks = zip(*([(k, ''.join(g)) for k, g in groupby(word, category)]
                   for word in array))

    # regex character class to use for each Unicode category
    types = {'Ll': '[a-zA-Z]', 'Nd': r'\d', 'Po': '''[!"#%&\'*,./:;?@\\]'''}

    pattern = ''
    for c in chunks:
        cat, sub = zip(*c)
        # identity: if all subpatterns match, then use the literal subpattern
        if len(set(c)) == 1:
            pattern += re.escape(sub[0])
            continue

        # are all subpatterns of the same type? If not, use catch-all "."
        if len(set(cat)) == 1:
            pattern += types.get(cat[0], '.')
        else:
            pattern += '.'

        # determine the range of repeats if non-literal
        lengths = [len(x) for x in sub]
        if len(set(lengths)) == 1:
            LENGTH = f'{{{lengths[0]}}}'
        else:
            LENGTH = f'{{{min(lengths)},{max(lengths)}}}'  # you can simplify and use "+"
        if LENGTH != '{1}':
            pattern += LENGTH
    return pattern

Example

patterns = [guess_pattern(array) for array in [array1, array2, array3, array4]]

output:

['nt\\dxz\\d{1,2}\\.log',
 'ztxz\\d{1,2}\\.log',
 'joy_\\d\\.log',
 'name_\\d{2}_\\d{2}\\.log']
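To get the number of matching words the question asks for, a guessed pattern can be applied back to the array. The sketch below hardcodes the first pattern from the output instead of calling guess_pattern, and count_matches is a hypothetical helper name.

```python
import re

array1 = ['nt1xz2.log', 'nt3xz32.log', 'nt5xz12.log',
          'nt6xz3.log', 'nt9xz7.log', 'nt4xz3.log']
pattern1 = r'nt\dxz\d{1,2}\.log'  # first guessed pattern, copied from the output

def count_matches(pattern, words):
    """Count the words that match the pattern over their full length."""
    # count_matches is a hypothetical helper, not part of the original answer
    regex = re.compile(pattern)
    return sum(1 for w in words if regex.fullmatch(w))

print(count_matches(pattern1, array1))                                 # 6
print(count_matches(pattern1, array1 + ['nt10xz2.log', 'other.txt']))  # 6
```

re.fullmatch is used because the guessed patterns carry no ^ or $ anchors. The two extra words in the second call do not fit the pattern, so the count stays at 6.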

CodePudding user response：

You can construct a regex to match all the strings in a list like so:

from os.path import commonprefix
import re 

array1 = ['nt1xz2.log','nt3xz32.log','nt5xz12.log','nt6xz3.log','nt9xz7.log','nt4xz3.log']
array2 = ['ztxz2.log','ztxz3.log','ztxz6.log','ztxz9.log','ztxz34.log','ztxz12.log']
array3 = ['joy_1.log','joy_2.log','joy_3.log','joy_6.log','joy_8.log','joy_2.log']
array4 = ['name_12_23.log','name_14_13.log','name_15_23.log','name_22_23.log','name_18_73.log','name_92_23.log']

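for a in (array1, array2, array3, array4):
    # longest prefix shared by every string in the list
    pre = commonprefix(a)
    # alternation of the escaped file extensions seen in the list
    suf = '|'.join({r"\." + e.split('.')[-1] for e in a})
    # character class built from every character between prefix and suffix
    cla = '[' + ''.join(set(''.join(re.sub(rf'{pre}(.*){suf}', r"\1", e) for e in a))) + ']'
    pat = rf'^{pre}{cla}+{suf}$'
    print(f'r"{pat}" => {sum(bool(re.match(pat,e)) for e in a)} matches for {a}')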

Prints:

r"^nt[63497z5x12]+\.log$" => 6 matches for ['nt1xz2.log', 'nt3xz32.log', 'nt5xz12.log', 'nt6xz3.log', 'nt9xz7.log', 'nt4xz3.log']
r"^ztxz[634912]+\.log$" => 6 matches for ['ztxz2.log', 'ztxz3.log', 'ztxz6.log', 'ztxz9.log', 'ztxz34.log', 'ztxz12.log']
r"^joy_[63182]+\.log$" => 6 matches for ['joy_1.log', 'joy_2.log', 'joy_3.log', 'joy_6.log', 'joy_8.log', 'joy_2.log']
r"^name_[34_975182]+\.log$" => 6 matches for ['name_12_23.log', 'name_14_13.log', 'name_15_23.log', 'name_22_23.log', 'name_18_73.log', 'name_92_23.log']

I suppose you would use this when you have one list of examples and want to filter another list against it.
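As a rough sketch of that filtering use case: the pattern below is written by hand in the same shape as the one generated for array3 (the order of characters inside the class is arbitrary, since it comes from a set), and the candidates list is invented for the example.

```python
import re

# same shape as the pattern built from array3; class order is irrelevant
pat = r"^joy_[12368]+\.log$"

# hypothetical second list to filter
candidates = ['joy_3.log', 'joy_12.log', 'joy_7.log', 'joy_a.log', 'name_1.log']

kept = [c for c in candidates if re.match(pat, c)]
print(kept)  # ['joy_3.log', 'joy_12.log']
```

Note that 'joy_7.log' is rejected: the character class only contains characters that occurred in the example list, so this approach is stricter than a generic \d+.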

If you just want to inspect the first element in the list and construct a regex that will match similar patterns in the remainder of the list, you could do this:

for a in (array1, array2, array3, array4):
    # split a[0] at every digit/non-digit boundary; digit runs become \d+
    new_a=[r'\d+' if s.isdigit() else re.escape(s)
                for s in re.split(r"(?<=\d)(?=\D)|(?=\d)(?<=\D)",a[0])]
    new_pat=rf"^{''.join(new_a)}$"
    print(f'r"{new_pat}" => {sum(bool(re.match(new_pat,e)) for e in a)} matches for {a}')

Prints:

r"^nt\d+xz\d+\.log$" => 6 matches for ['nt1xz2.log', 'nt3xz32.log', 'nt5xz12.log', 'nt6xz3.log', 'nt9xz7.log', 'nt4xz3.log']
r"^ztxz\d+\.log$" => 6 matches for ['ztxz2.log', 'ztxz3.log', 'ztxz6.log', 'ztxz9.log', 'ztxz34.log', 'ztxz12.log']
r"^joy_\d+\.log$" => 6 matches for ['joy_1.log', 'joy_2.log', 'joy_3.log', 'joy_6.log', 'joy_8.log', 'joy_2.log']
r"^name_\d+_\d+\.log$" => 6 matches for ['name_12_23.log', 'name_14_13.log', 'name_15_23.log', 'name_22_23.log', 'name_18_73.log', 'name_92_23.log']
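The first-element approach can also be wrapped in a small function. This is only a sketch: pattern_from_first and count_matching are hypothetical names, and it tokenizes with re.finditer rather than the zero-width re.split used above (which needs Python 3.7 or later), but it should produce the same patterns for the sample data.

```python
import re

def pattern_from_first(words):
    """Build an anchored regex from words[0], generalizing each run of digits."""
    pieces = []
    # group 1 captures a run of digits, group 2 a run of non-digits
    for m in re.finditer(r'(\d+)|(\D+)', words[0]):
        pieces.append(r'\d+' if m.group(1) else re.escape(m.group(2)))
    return '^' + ''.join(pieces) + '$'

def count_matching(words):
    """Count how many words match the pattern derived from the first word."""
    regex = re.compile(pattern_from_first(words))
    return sum(1 for w in words if regex.match(w))

# invented sample with two words that break the pattern
mixed = ['joy_1.log', 'joy_22.log', 'joy_x.log', 'joy_3.txt']
print(pattern_from_first(mixed))  # ^joy_\d+\.log$
print(count_matching(mixed))      # 2
```

Because the pattern is taken from the first word alone, a count lower than len(words) indicates that some words deviate from the reference.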