Home > front end >  Find strings which have a particular string as suffix in optimal time?
Find strings which have a particular string as suffix in optimal time?

Time:06-12

We have an array ‘A’ of strings and an array ‘B’ of strings. For each string ‘s’ in B, we have to find the number of strings in ‘A’ which have the suffix as ‘s’?

1≤size of A ≤10^5

1≤size of B ≤10^5

1≤|Ai|≤10^2

1≤|Bi|≤10^2

The naive approach is simply traversing through 'B' and for each string in B iterate over A to find a number but it has a time complexity of N^2. We need a solution with better time complexity.

CodePudding user response:

Construct a prefix tree based on A. In each vertex of the tree also keep information on how many strings 'pass' through it.

Then, for each s in B, find a vertex in the prefix tree that corresponds to s and just read how many strings from A passed through it (the information that is already there).

Add words from A to prefix tree reversed, so you can operate on suffixes, and not prefixes.

Time complexity is O(size(A) size(B))

Pseudo code:

struct node
{
    node* children[ALPHABET_SIZE]
    int num_words;
}

func solve(string[] a, string[] b)
{
    node* treeRoot = new node()
    
    for (string s in a)
    {
        string x = s.reverse()

        node* currNode = treeRoot
        
        for (char ch in x)
        {
            currNode.num_words  
            currNode.children[ch] = new node()
            currNode = currNode.children[ch]
        }
        currNode.num_words  
    }

    int answer[len(b)]

    for (int i=0;i<len(b);  i)
    {
        string x = b[i].reverse()
        node* currNode = treeRoot
        bool brk = false
        
        for (char ch in x)
        {
            if (currNode.children[ch] == null)
            {
                brk = true
                break
            }
            currNode = currNode.children[ch]
        }

        if (brk)
            answer[i] = 0
        else
            answer[i] = currNode.num_words
    }

    return answer
}

EDIT: By size(A) I mean total number of chars in all strings in A

CodePudding user response:

You can do this in O(N) time, where N is the size of the input (number of strings * average length), and without any complicated data structures.

First reverse all the strings to change this from a suffix problem into a prefix problem. The task is now to find the number of strings in A with each string in B as a prefix.

For each string s in B, remember these two strings:

  • s_start is just s. This is the highest string such that all strings with s as a prefix are lexographically >= s_start
  • s_end is the smallest string such that all strings with s as a prefix are < s_end. You can get this string by incrementing the last character of s.

For example, if s is "abc", then s_start = "abc" and s_end = "abd". Iff we have "abc" <= x < "abd", then "abc" is a prefix of s.

Now sort A, sort the list of B starts, and sort the list of B ends. If you use a radix sort, this takes O(N) time.

Then walk through the 3 lists in order as if you were merging them into one sorted list. If you find the same string in multiple lists, process the B strings before A strings.

Keep track of the number of As you consume, and:

  • Whenever you see an s_start, set start[s] = As_consumed_so_far
  • Whenever you see an s_end, set end[s] = As_consumed_so_far

When you're done, for each s in B, end[s]-start[s] is the number of strings in A with s as a prefix.

  • Related