Find a polynomial time algorithm or prove np-hardness for the following problem: given two strings s1=a1, a2,...,ak and s2=b1,...,bk, where s2 is a random permutation of s1. We now want to build s1 out of s2. A construction works as follows. Pick a letter from s2, that is equal to a1 and remove it. Proceed, with letter a2 and remove it and so on, until s2 is empty. Note, that the same letter can occur multiple times in s2. Let C = c1, c2,...,ck be the constructed sequence, so that C = s1. We define l to be the number of times we need to jump back in s2 to pick the next letter. For example, if we chose c3 and c4, but the index of c4 in the original s2 is smaller than the index of c3 in the original s2, we increment l by one. Our task is, to find the optimum sequence, so that l is minimal. If we are given s1=abac and s2=acab for example, the output has to be 1, since we can pick the second "a" from s2, then choose "b", then jump back and pick the first "a", then add "c".
I don't know how to solve this problem in polynomial time. I thought maybe there is some way to calculate a perfect matching and read of the optimal sequence, but i am not sure. So far, i have only an exponential algorithm.
I tried finding a polynomial algorithm, but i didn't find it
The exponential algorithm looks as follows(not sure if it is correct, don't know how to test it):
def solvenaive(s1, s2, curr_ind):
if len(s1) == 0:
return 0
first_s1 = s1[0]
vorkommen = findOccurrences(s2, first_s1)
results = []
for i in vorkommen:
new_s1 = s1[1:]
new_s2 = stringPop(s2, i)
res = solvenaive(new_s1, new_s2, i)
if curr_ind > i:
results.append(res 1)
else:
results.append(res)
return min(results)
CodePudding user response:
Just to clarify my comment on branch&bound, here is my shot.
It is clearly faster than yours (while keeping the "naive" pure python implementation. For example, using immutable strings when we clearly want to "mutate" them :D). Yet, it is still clearly not polynomial. Just a faster exponential.
That is most of the time the case with branch&bound.
import itertools
import random
import numpy as np
import timeit
import matplotlib.pyplot as plt
W=10 # Size of a word
# Generate a random word
def randomWord():
return ''.join(np.random.choice(list("abcdef"),W))
# And a random shuffling of it
def shuffle(s):
return ''.join(random.sample(s,len(s)))
# Quick'n'dirty implementation of the function you are using
# It would have been better, btw, if you had provided them, as
# needed to create a minimal reproducible example
def findOccurrences(s, c):
return [pos for pos, char in enumerate(s) if char == c]
def stringPop(s, i):
return s[:i] s[i 1:]
# Your function
def solvenaive(s1, s2, curr_ind):
if len(s1) == 0:
return 0
first_s1 = s1[0]
vorkommen = findOccurrences(s2, first_s1)
results = []
for i in vorkommen:
new_s1 = s1[1:]
new_s2 = stringPop(s2, i)
res = solvenaive(new_s1, new_s2, i)
if curr_ind > i:
results.append(res 1)
else:
results.append(res)
return min(results)
# Mine. Almost the same.
# But I add 2 parameters
# bestScore which is what we have seen best
# and sofar, a partial score
def solvecut(s1, s2, curr_ind, bestScore, sofar=0):
if len(s1)==0: # If we reach a "leaf" of recursion, then
# return the score. (sofar is a real score in that case, not a partial)
return sofar
# Otherwise, we will recurse... unless it is a lost cause
if sofar>=bestScore: # No need to try better, we are already over our best
return 1e9
first_s1 = s1[0]
vorkommen = findOccurrences(s2, first_s1)
for i in vorkommen:
new_s1 = s1[1:] # I let that one here in the loop because I don't want to optimize anything other that my "cuts"
# so that improvement are clearly imputable to those cuts
# but, obviously that could have been done outside the loop
new_s2 = stringPop(s2, i)
if curr_ind>i:
res=solvecut(new_s1, new_s2, i, bestScore, sofar 1)
else:
res=solvecut(new_s1, new_s2, i, bestScore, sofar)
if res<bestScore:
bestScore=res
return bestScore # Sometimes we'll return that just because we did nothing best, but it doesn't matter
# Test if result are the same on some examples
for i in range(20):
w=randomWord()
s=shuffle(w)
sc1=solvenaive(w,s,0)
sc2=solvecut(w, s)
print(w, s, sc1, sc2)
# Timeit
# Note that timing depends a lot of the word (letter repetition, etc.). So it is important to do them with the same words
# Especially since this is not very fast, so we need to timit with number=1
W=17
w1=randomWord()
s1=shuffle(w1)
print(timeit.timeit(lambda: solvenaive(w1, s1, 0), number=1))
print(timeit.timeit(lambda: solvecut(w1, s1, 0, 1e9), number=1))
Result
cbaaecffae aaebcffcae 2 2
eedeafdffb ffdbeefaed 3 3
aedefdceba adeefcabde 4 4
aaafccaafb aafacaafcb 2 2
eaffdfafef efdffefaaf 2 2
dedbfffbce bcdffbedfe 3 3
fedfcbaeed bedcfaefde 4 4
ffeebfcfab feaebcffbf 3 3
fcdaddbfbf ffddacbfbd 3 3
deffcdeaea eeafdadfce 4 4
cafeeebcdb beaedfbecc 4 4
beefdaeabd edaefbbdea 3 3
ddccbdbdae cdbdbcdead 3 3
bcbababdef decafbbbba 4 4
dfaeceefea edfcefaaee 2 2
abcdcbbdbf fbbadccbdb 4 4
acbedbaefc cbeacaebdf 3 3
aaffacfbde cfaaeafbfd 3 3
edceddebdc dedcecbded 2 2
ecafabdfea fafaeaedbc 3 3
4.060046362923458
0.05187216284684837
So we can trust reasonnably that my code returns always the same result as yours.
But timings are nearly 100 times faster.
Yet (I've tried with many W's value, unfortunately it is obvious, despite the randomness), it is still exponential:
1 cut=6.040092557668686e-06
2 cut=8.61193984746933e-06
3 cut=1.5990110114216805e-05
4 cut=1.5911879017949104e-05
5 cut=2.897903323173523e-05
6 cut=3.5561854019761086e-05
7 cut=5.8827921748161316e-05
8 cut=0.00018196692690253258
9 cut=0.0001191610936075449
10 cut=0.0002572301309555769
11 cut=0.0008078841492533684
12 cut=0.0009584939107298851
13 cut=0.008062224835157394
14 cut=0.004587493138387799
15 cut=0.08773919194936752
16 cut=0.02209200686775148
17 cut=0.045466484036296606
18 cut=0.11209587100893259
19 cut=0.40787983988411725
20 cut=2.2789966259151697
21 cut=1.8927371250465512
22 cut=1.097903796005994
Not smooth... but we can clearly see that it is the log of the timing that is "linear with lot of noise", not the value. So, it is still exponential.
Note also that with some thinking, we should be able to estimate a better minimum bound of the reachable score than sofar (using sofar means that we still always hope that there is a solution without never going back. Which, for example, is impossible if curr_ind
>0. So we could improve at least by saying that if curr_ind
>0, then sofar 1
must be smaller than bestScore
or else it is no use to try)
Edit: I could no resist and try it. So here are some slight optimization
def lessNaive(s1, s2, curr_ind, bestScore, sofar):
if len(s1)==0:
return sofar
if sofar>=bestScore or (curr_ind>0 and sofar 1>=bestScore): # No need to try better, we are already over our best
return 1e9
first_s1 = s1[0]
new_s1 = s1[1:]
for i in range(len(s2)):
if s2[i]!=first_s1: continue
new_s2 = stringPop(s2, i)
if curr_ind>i:
res=lessNaive(new_s1, new_s2, i, bestScore, sofar 1)
else:
res=lessNaive(new_s1, new_s2, i, bestScore, sofar)
if res<bestScore:
bestScore=res
return bestScore # Sometimes we'll return that just because we did nothing best, but it doesn't matter
Not much optimization here
- I put the
new_s1
computation outside the loop - I combined the iteration over s2 position with the search for those positions
- As I said, if
curr_ind>0
then we know that we can't reach less thansofar 1
. Which is probably the most important optimization, even for just this1
, since it cuts whole branches of recursion that were about to equalize the best score.
Timings I got this time
11.236965486081317
0.4795242650434375
0.08659004187211394
[timings of naive, cut, and now lessNaive]
It was apparently a harder case (since it is for the same W as the previous 4.06/0.051). But the point is, even minor optimization such as this 1
change by an order of magnitude the timings!
(Yet, it is still exponential. An even smaller exponential, but exponential. The thing with exponential, is that the ratio between two exponential is also exponential! So even going from one exponential to another exponential really worth it. Yet, the smallest exponential of course is still not polynomial)
CodePudding user response:
This sounds like the Minimal String Construction problem, which is possible to solve in polynomial time using dynamic programming. The idea is to use a 2D array to store the minimum number of jumps needed to build the first string up to the i-th character, starting from the j-th character of the second string.
def minimalStringConstruction(s1: str, s2: str) -> int:
n = len(s1)
m = len(s2)
# Initialize the 2D array dp with infinity
dp = [[float('inf') for _ in range(m)] for _ in range(n)]
# Set the first row of dp to 0, since there are no jumps needed when starting from the first character of s2
for j in range(m):
dp[0][j] = 0
# Iterate over the characters of s1
for i in range(1, n):
# Iterate over the characters of s2
for j in range(m):
# If the current characters of s1 and s2 match, no jump is needed
if s1[i] == s2[j]:
dp[i][j] = dp[i - 1][j - 1]
# Otherwise, we need to jump back to the previous matching character in s2 and increment the number of jumps
else:
dp[i][j] = dp[i - 1][j - 1] 1
# The minimum number of jumps is the minimum value in the last row of dp
return min(dp[n - 1])
This has a time complexity of O(n * m), where n and m are the lengths of the strings s1 and s2, respectively. This means that it will run in polynomial time (as long as the input strings are of reasonable length).