Given a list of strings, find the shortest unique substring of each string that doesn't occur i-CodePudding

The problem statement is:

Given a list of strings, find the shortest unique substring of each string such that the substring does not occur in any of the other strings.

For example:

Input: ["cheapair", "cheapoair", "peloton", "pelican"]
Output:
"cheapair": "pa"  // every other 1-2 length substring overlaps with cheapoair
"cheapoair": "po" // "oa" would also be acceptable
"pelican": "ca"   // "li", "ic", or "an" would also be acceptable
"peloton": "t"    // this single letter doesn't occur in any other string

I think this is solved with dynamic programming, but honestly, I have no idea how to do this other than brute force: Storing all substrings for each word and checking that they don't exist in any of the other ones, which is a terrible idea.

CodePudding user response：

Building all substrings to find those that appear in two (or more) strings. Then for each string, print a shortest substring that doesn't appear in two (or more). Maybe not the best (depends on your data and what you consider "best"), but works and is simple.

strings = ["cheapair", "cheapoair", "peloton", "pelican"]

def substrings(s):
    n = len(s)
    return {s[i:i k]
            for k in range(1, n 1)
            for i in range(n-k 1)}

one = set()
two = set()
for s in strings:
    subs = substrings(s)
    two |= one & subs
    one |= subs

for s in strings:
    subs = substrings(s)
    uniq = subs - two
    print(s, min(uniq, key=len))

Output (Try it online!):

cheapair pa
cheapoair oa
peloton t
pelican an

CodePudding user response：

Here's an alternate (but similar) solution that is O(L*N*N) (where N is the number of words in the list and L the maximum length of any of those words, assuming N > L). It iterates over substrings starting with length 1 and going up to the maximum length of the words in the list, so will terminate early once it finds the shortest unique string for each word. By iterating like this it also keeps the storage requirements low.

def shortest_substrings(words):
    # L = length of max word
    # N = number of words
    maxlen = max(len(w) for w in words)     # O(N)
    remainingwords = words[:]
    for ln in range(1, maxlen):     # O(L)
        # find all ln-character substrings of each word
        substrs = {}
        allsubstrs = defaultdict(set)
        for w in words:             # O(N)
            substrs.update({ w : set(w[i:i ln] for i in range(len(w) 1-ln)) })  # O(L)
            for w2 in remainingwords:      # O(N)
                if w2 != w:
                    allsubstrs[w2].update(substrs[w]) 
        # now see if any aren't in the global list
        for w in remainingwords[:]:     # O(N)
            unique = [s for s in substrs[w] if s not in allsubstrs[w]]      # O(N)
            if len(unique):
                print(f"{w} : {', '.join(unique)}")
                remainingwords.remove(w)
        if len(remainingwords) == 0:
            break

shortest_substrings(["cheapair", "cheapoair", "peloton", "pelican"])

Output:

peloton : t
cheapair : pa
cheapoair : oa, po
pelican : ic, ca, an, li