Home > Software engineering >  Reducing a string by detecting patterns
Reducing a string by detecting patterns

Time:05-16

Given a string, I would like to detect the repeating substrings, and then reduce abab to (ab)2.

For instance, ababababacdecdecdeababab would reduce to (ab)4a(cde)3(ab)3. The string does not have the same character twice in a row. So, aaab is an invalid string.

Here is the Python that I wrote:

def superscript(n):
    return "".join(["⁰¹²³⁴⁵⁶⁷⁸⁹"[ord(c)-ord('0')] for c in str(n)]) 


signature = 'hdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfb'
d = {}

processed = []
for k in range(2, len(signature)):
    i = 0
    j = i   k
    while j <= len(signature):
        repeat_count = 0
        while signature[i:i k] == signature[j:j k]:
            repeat_count  = 1
            j  = k
        if repeat_count > 0 and i not in processed:
            d[i] = [i k, repeat_count   1]
            for j in range(i, (i k)*repeat_count   1):
                processed.append(j)
            
            i = j
            j = i   k
        else:
            i  = 1
            j = i   k
            
od = collections.OrderedDict(sorted(d.items()))
output = ''
for k,v in od.items():
    print(k, v)
    output  = '('   signature[k:v[0]]   ')'   superscript(v[1])

Which aims to detect the repeating substrings of length 2, 3, 4, and so on. I mark the start and the end of a repeating substring by using a dict. I also mark the index of the processed characters by keeping a list to avoid replacing (ab)4 by (abab)2 (since the latter one will overwrite the beginning index in the dict).

The example string I work with is hdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfb which should output (hd)4(cg)4c(bf)4b(ae)4a(dh)4d(cg)4c(bf)4b(ae)4a(dh)4d(cg)4c(bf)4b(ae)4a(dh)4d(cg)4cbfb.

However, I get this output: (hd)4(dcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdh)5(cg)4(ea)2(dh)4(hd)2(cg)4

I don't know whether this is a well-known problem, but I couldn't find any resources. I don't mind the time complexity of the algorithm. Where did I make a mistake?

The algorithm I try to describe looks like this:
First, find the repeating substrings of length 2, then 3, then 4, ..., up to the length of the input string.
Then, do the same operation until there is no repetition at all.

A step-by-step example looks like this:

abcabcefefefghabcdabcdefefefghabcabcefefefghabcdabcdefefefgh
abcabc(ef)²ghabcdabcd(ef)²ghabcabc(ef)²ghabcdabcd(ef)²gh
(abc)³(ef)²ghabcdabcd(ef)²gh(abc)³(ef)²ghabcdabcd(ef)²gh
(abc)³(ef)²gh(abcd)²(ef)²gh(abc)³(ef)²gh(abcd)²(ef)²gh
((abc)³(ef)²gh(abcd)²(ef)²gh)²

CodePudding user response:

You can use re.sub to match any repeating two chars and then pass a replacement function that formats the pattern you desire

import re

def superscript(n):
    return "".join(["⁰¹²³⁴⁵⁶⁷⁸⁹"[ord(c)-ord('0')] for c in str(n)])

s = 'hdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdhdcgcgcgcgcbfb'

max_length = 5

result = re.sub(
    rf'(\w{{2,{max_length}}}?)(\1 )', # Omit the second number in the repetition to match any number of repeating chars (\w{2,}?)(\1 )
    lambda m: f'({m.group(1)}){superscript(len(m.group(0))//len(m.group(1)))}',
    s
)

print(result) # (hd)⁴(cg)⁴c(bf)⁴b(ae)⁴a(dh)⁴d(cg)⁴c(bf)⁴b(ae)⁴a(dh)⁴d(cg)⁴c(bf)⁴b(ae)⁴a(dh)⁴d(cg)⁴c(bf)⁴b(ae)⁴a(dh)⁴d(cg)⁴c(bf)⁴b(ae)⁴a(dh)⁴d(cg)⁴cbfb

CodePudding user response:

The problem in your code happens when you put together the list of repeating patterns. When you are merging patterns of length 2 and patterns of length 3, you are using patterns that are not compatible with each other.

  • hdhdhdhd = (hd)4 starts at index 0 and ends at index 7 (included).
  • (dcgcgcgcgcbfbfbfbfbaeaeaeaeadhdhdhdh)5, which is a correct pattern in your string, starts at index 7 (included).

This means when you merge the two patterns, you get an incorrect end result because the letter at index 7 is shared.

This problem stems from the fact that one pattern is even in length, while the other is odd and their limits are not aligning. So, they don't even overwrite each other in d and you end up with your result.

I think you tried to solve this problem using the dictionary d with the starting index as key and with the processed list, but there is still a couple of problems.

  1. for j in range(i, (i k)*repeat_count 1): should be for l in range(i, j), otherwise you are skipping the last index of the pattern (j). Also, I changed the loop index to l because j was already used. This fixed the problem I described above.
  2. Even with that fixed, there is still an issue. You check for patterns starting from length 2 (for k in range(2, len(signature))), so single letters not belonging to any pattern, like the c in (hd)4(cg)4c(bf)4 will never make it in the dictionary and therefore you will still have overlapping patterns with different lengths at those positions.
  • Related