Construct string from its permutation with least jumps back-CodePudding

Find a polynomial time algorithm or prove np-hardness for the following problem: given two strings s1=a1, a2,...,ak and s2=b1,...,bk, where s2 is a random permutation of s1. We now want to build s1 out of s2. A construction works as follows. Pick a letter from s2, that is equal to a1 and remove it. Proceed, with letter a2 and remove it and so on, until s2 is empty. Note, that the same letter can occur multiple times in s2. Let C = c1, c2,...,ck be the constructed sequence, so that C = s1. We define l to be the number of times we need to jump back in s2 to pick the next letter. For example, if we chose c3 and c4, but the index of c4 in the original s2 is smaller than the index of c3 in the original s2, we increment l by one. Our task is, to find the optimum sequence, so that l is minimal. If we are given s1=abac and s2=acab for example, the output has to be 1, since we can pick the second "a" from s2, then choose "b", then jump back and pick the first "a", then add "c".

I don't know how to solve this problem in polynomial time. I thought maybe there is some way to calculate a perfect matching and read of the optimal sequence, but i am not sure. So far, i have only an exponential algorithm.

I tried finding a polynomial algorithm, but i didn't find it

The exponential algorithm looks as follows(not sure if it is correct, don't know how to test it):

def solvenaive(s1, s2, curr_ind):
    if len(s1) == 0:
        return 0
    first_s1 = s1[0]
    
    vorkommen = findOccurrences(s2, first_s1)
    results = []

    for i in vorkommen:
        new_s1 = s1[1:]
        new_s2 = stringPop(s2, i)
        res = solvenaive(new_s1, new_s2, i)
        
        if curr_ind > i:
            results.append(res 1)
        else:
            results.append(res)
    
    return min(results)

CodePudding user response：

Just to clarify my comment on branch&bound, here is my shot.

It is clearly faster than yours (while keeping the "naive" pure python implementation. For example, using immutable strings when we clearly want to "mutate" them :D). Yet, it is still clearly not polynomial. Just a faster exponential.

That is most of the time the case with branch&bound.

import itertools
import random
import numpy as np
import timeit
import matplotlib.pyplot as plt

W=10 # Size of a word

# Generate a random word
def randomWord():
    return ''.join(np.random.choice(list("abcdef"),W))

# And a random shuffling of it
def shuffle(s):
    return ''.join(random.sample(s,len(s)))

# Quick'n'dirty implementation of the function you are using
# It would have been better, btw, if you had provided them, as
# needed to create a minimal reproducible example
def findOccurrences(s, c):
    return [pos for pos, char in enumerate(s) if char == c]

def stringPop(s, i):
    return s[:i] s[i 1:]

# Your function
def solvenaive(s1, s2, curr_ind):
    if len(s1) == 0:
        return 0
    first_s1 = s1[0]
    
    vorkommen = findOccurrences(s2, first_s1)
    results = []

    for i in vorkommen:
        new_s1 = s1[1:]
        new_s2 = stringPop(s2, i)
        res = solvenaive(new_s1, new_s2, i)
        
        if curr_ind > i:
            results.append(res 1)
        else:
            results.append(res)
    
    return min(results)

# Mine. Almost the same.
# But I add 2 parameters
# bestScore which is what we have seen best
# and sofar, a partial score 
def solvecut(s1, s2, curr_ind, bestScore, sofar=0):
    if len(s1)==0: # If we reach a "leaf" of recursion, then
                   # return the score. (sofar is a real score in that case, not a partial)
        return sofar
    # Otherwise, we will recurse... unless it is a lost cause
    if sofar>=bestScore: # No need to try better, we are already over our best
        return 1e9

    first_s1 = s1[0]
    
    vorkommen = findOccurrences(s2, first_s1)


    for i in vorkommen:
        new_s1 = s1[1:] # I let that one here in the loop because I don't want to optimize anything other that my "cuts"
                        # so that improvement are clearly imputable to those cuts
                        # but, obviously that could have been done outside the loop
        new_s2 = stringPop(s2, i)
        if curr_ind>i:
            res=solvecut(new_s1, new_s2, i, bestScore, sofar 1)
        else:
            res=solvecut(new_s1, new_s2, i, bestScore, sofar)

        if res<bestScore:
            bestScore=res

    return bestScore # Sometimes we'll return that just because we did nothing best, but it doesn't matter


# Test if result are the same on some examples

for i in range(20):
    w=randomWord()
    s=shuffle(w)
    sc1=solvenaive(w,s,0)
    sc2=solvecut(w, s)
    print(w, s, sc1, sc2)

# Timeit
# Note that timing depends a lot of the word (letter repetition, etc.). So it is important to do them with the same words
# Especially since this is not very fast, so we need to timit with number=1
W=17
w1=randomWord()
s1=shuffle(w1)
print(timeit.timeit(lambda: solvenaive(w1, s1, 0), number=1))
print(timeit.timeit(lambda: solvecut(w1, s1, 0, 1e9), number=1))

Result

cbaaecffae aaebcffcae 2 2
eedeafdffb ffdbeefaed 3 3
aedefdceba adeefcabde 4 4
aaafccaafb aafacaafcb 2 2
eaffdfafef efdffefaaf 2 2
dedbfffbce bcdffbedfe 3 3
fedfcbaeed bedcfaefde 4 4
ffeebfcfab feaebcffbf 3 3
fcdaddbfbf ffddacbfbd 3 3
deffcdeaea eeafdadfce 4 4
cafeeebcdb beaedfbecc 4 4
beefdaeabd edaefbbdea 3 3
ddccbdbdae cdbdbcdead 3 3
bcbababdef decafbbbba 4 4
dfaeceefea edfcefaaee 2 2
abcdcbbdbf fbbadccbdb 4 4
acbedbaefc cbeacaebdf 3 3
aaffacfbde cfaaeafbfd 3 3
edceddebdc dedcecbded 2 2
ecafabdfea fafaeaedbc 3 3
4.060046362923458
0.05187216284684837

So we can trust reasonnably that my code returns always the same result as yours.

But timings are nearly 100 times faster.

Yet (I've tried with many W's value, unfortunately it is obvious, despite the randomness), it is still exponential:

1 cut=6.040092557668686e-06 
2 cut=8.61193984746933e-06 
3 cut=1.5990110114216805e-05 
4 cut=1.5911879017949104e-05 
5 cut=2.897903323173523e-05 
6 cut=3.5561854019761086e-05 
7 cut=5.8827921748161316e-05 
8 cut=0.00018196692690253258 
9 cut=0.0001191610936075449 
10 cut=0.0002572301309555769 
11 cut=0.0008078841492533684 
12 cut=0.0009584939107298851 
13 cut=0.008062224835157394 
14 cut=0.004587493138387799 
15 cut=0.08773919194936752 
16 cut=0.02209200686775148 
17 cut=0.045466484036296606 
18 cut=0.11209587100893259 
19 cut=0.40787983988411725 
20 cut=2.2789966259151697 
21 cut=1.8927371250465512 
22 cut=1.097903796005994

Not smooth... but we can clearly see that it is the log of the timing that is "linear with lot of noise", not the value. So, it is still exponential.

Note also that with some thinking, we should be able to estimate a better minimum bound of the reachable score than sofar (using sofar means that we still always hope that there is a solution without never going back. Which, for example, is impossible if curr_ind>0. So we could improve at least by saying that if curr_ind>0, then sofar 1 must be smaller than bestScore or else it is no use to try)

Edit: I could no resist and try it. So here are some slight optimization

def lessNaive(s1, s2, curr_ind, bestScore, sofar):
    if len(s1)==0:
        return sofar
    if sofar>=bestScore or (curr_ind>0 and sofar 1>=bestScore): # No need to try better, we are already over our best
        return 1e9

    first_s1 = s1[0]
    new_s1 = s1[1:]

    for i in range(len(s2)):
        if s2[i]!=first_s1: continue
        new_s2 = stringPop(s2, i)
        if curr_ind>i:
            res=lessNaive(new_s1, new_s2, i, bestScore, sofar 1)
        else:
            res=lessNaive(new_s1, new_s2, i, bestScore, sofar)

        if res<bestScore:
            bestScore=res

    return bestScore # Sometimes we'll return that just because we did nothing best, but it doesn't matter

Not much optimization here

I put the new_s1 computation outside the loop
I combined the iteration over s2 position with the search for those positions
As I said, if curr_ind>0 then we know that we can't reach less than sofar 1. Which is probably the most important optimization, even for just this 1, since it cuts whole branches of recursion that were about to equalize the best score.

Timings I got this time

11.236965486081317
0.4795242650434375
0.08659004187211394

[timings of naive, cut, and now lessNaive]

It was apparently a harder case (since it is for the same W as the previous 4.06/0.051). But the point is, even minor optimization such as this 1 change by an order of magnitude the timings! (Yet, it is still exponential. An even smaller exponential, but exponential. The thing with exponential, is that the ratio between two exponential is also exponential! So even going from one exponential to another exponential really worth it. Yet, the smallest exponential of course is still not polynomial)

CodePudding user response：

This sounds like the Minimal String Construction problem, which is possible to solve in polynomial time using dynamic programming. The idea is to use a 2D array to store the minimum number of jumps needed to build the first string up to the i-th character, starting from the j-th character of the second string.

def minimalStringConstruction(s1: str, s2: str) -> int:
    n = len(s1)
    m = len(s2)
    # Initialize the 2D array dp with infinity
    dp = [[float('inf') for _ in range(m)] for _ in range(n)]

    # Set the first row of dp to 0, since there are no jumps needed when starting from the first character of s2
    for j in range(m):
        dp[0][j] = 0

    # Iterate over the characters of s1
    for i in range(1, n):
        # Iterate over the characters of s2
        for j in range(m):
            # If the current characters of s1 and s2 match, no jump is needed
            if s1[i] == s2[j]:
                dp[i][j] = dp[i - 1][j - 1]
            # Otherwise, we need to jump back to the previous matching character in s2 and increment the number of jumps
            else:
                dp[i][j] = dp[i - 1][j - 1]   1

    # The minimum number of jumps is the minimum value in the last row of dp
    return min(dp[n - 1])

This has a time complexity of O(n * m), where n and m are the lengths of the strings s1 and s2, respectively. This means that it will run in polynomial time (as long as the input strings are of reasonable length).