Building a string one character at a time, check if it contains a neighbouring repeating substring-CodePudding

Examples of invalid strings: AA, ABAA, ABACAA, ABACABAC, ABACABAB

Examples of valid strings: AB, ABC, ABA, ABACABADAB, ABACABCACBABCABACABCACBACAB

I call this function every time I add a new character to the string. So strings like ABABC will never get checked by the function since the string will already be invalid once the second B is added.

I need the absolute fastest way to check this in a function.

So far I've come up with 2 relatively slow functions:

def is_valid(s):
    if len(s) < 2:
        return True

    # Are last two chars the same
    if s[-1] == s[len(s)-2]:
        return False

    # Get array of indexes where last char appears and reverse it
    indexes = [m.start() for m in re.finditer(s[-1], s[:-1])]
    indexes = indexes[::-1]

    for index in indexes:
        a = s[index 1:]
        if index - len(a)   1 < 0:
            return True
        b = s[index-len(a) 1 : index 1]
        if a == b: 
            return False

    return True

and

def is_valid(s):
    # Reverse string
    s = s[::-1]

    substring = ""
    for i in range(len(s)):
        substring  = s[i]

        if len(s) >= i 1 len(substring) and s[i 1 : i 1 len(substring)] == substring:
            return False

    return True

If there's not much to be improved here, would there be a noticable performance increase in C /C# or Java?

CodePudding user response：

There is rather effective O(nlogn) algorithm by Main and Lorentz to search for tandem repetitions.

Python translation from this code returns only the fact of repeat.

def z_function(s):
    n = len(s)
    z = [0]*n
    l = 0
    r = 0
    for i in range(1, n):
        if (i <= r):
            z[i] = min (r-i 1, z[i-l])
        while (i z[i] < n) and (s[z[i]] == s[i z[i]]):
            z[i]  = 1
        if (i z[i]-1 > r):
            l = i
            r = i z[i]-1
    return z

def get_z(z, i):
    return z[i] if 0<=i<len(z) else 0

def find_tandems(s, shift = 0):
    n =  len(s)
    if (n == 1):
        return False

    nu = n//2
    nv = n-nu
    u = s[:nu]
    v = s[nu:]
    ru = u[::-1]
    rv = v[::-1]

    if find_tandems (u, shift):
        return True
    if find_tandems (v, shift   nu):
        return True

    z1 = z_function (ru)
    z2 = z_function (v   '#'   u)
    z3 = z_function (ru   '#'   rv)
    z4 = z_function (v)
    for cntr in range(n):
        if (cntr < nu):
            l = nu - cntr
            k1 = get_z (z1, nu-cntr)
            k2 = get_z (z2, nv 1 cntr)

        else:
            l = cntr - nu   1
            k1 = get_z (z3, nu 1   nv-1-(cntr-nu))
            k2 = get_z (z4, (cntr-nu) 1)

        if (k1   k2 >= l):
            return True
    return False

print(find_tandems("ABACABAC"))
print(find_tandems("ABACABADAB"))

>> True
>> False

Addition for incremental case:

def is_valid(s):
    s = s[::-1]
    z = z_function(s)
    for i in range(1,len(z)):
        if z[i]>=i:
            return False
    return True

CodePudding user response：

One approach is to use a regex:

import re


def is_valid(string):
    return re.search(r"(. )\1", string) is None


invalids = ["AA", "ABAA", "ABACAA", "ABACABAC", "ABACABAB"]
print("All invalids are invalid", not any(is_valid(invalid) for invalid in invalids))

valids = ["AB", "ABC", "ABA", "ABACABADAB", "ABACABCACBABCABACABCACBACAB"]
print("All valids are valid", all(is_valid(valid) for valid in valids))

Output

All invalids are invalid True
All valids are valid True

The pattern "(. )\1" tries to match a group of one or more characters followed by the same group.