Home > other >  Algorithm to find max occurrences of a substring with value of a given function
Algorithm to find max occurrences of a substring with value of a given function

Time:12-30

I have to find max(s.length * s.count) for any substring s of a given string t, where s.length is the length of the substring and s.count is the number of times s occurs within t. Substrings may overlap within t.

Example: For the string aaaaaa, the substring aaa has the max (occurrences * length), substrings and occurrences are:

a: 6
aa: 5 
aaa: 4
aaaa : 3 
aaaaa: 2 
aaaaaa: 1

So aaa is our winner with 3 occurrences * length 4 is 12. Yes, aaaa also has a score of 12, but aaa comes first.

I have tried the only means I know or can figure out, but I have an input string of 100,000 length, and just finding all the substrings is O(n^2), and this hangs my program:

var theSet = new HashSet<string>();
for (int i = 1; i < source.Length; i  )
{
    for (int start = 0; start <= source.Length - i; start  )
    {
        var sub = source.Substring(start, i);
        if (!theSet.Contains(sub))
        {
            theSet.Add(sub);
        }
    }
}
...
// Some not-noteworthy benchmark related code
...
int maxVal = 0;
foreach (var sub in subs)
{
    var count = 0;
    for (var i = 0; i < source.Length - sub.Length   1; i  )
    {
        if (source.Substring(i, sub.Length).Equals(sub)) count  ;
    }

    if (sub.Length * count > maxVal)
    {
        maxVal = sub.Length * count;
    }
}

I know I am looking for a relatively unknown algorithm and or data structure with this, as google yields no results that closely match the problem. In fact, Google is where I basically only found the costly algorithms I have attempted to use in the above code.

CodePudding user response:

Edit: Just realized that the problem has a solution on GFG: https://www.geeksforgeeks.org/substring-highest-frequency-length-product/

This can be solved in O(n) time by applying three well-known algorithms: Suffix Array, LCP Array and Largest Rectangular Area in a Histogram.

I will not provide any code as implementations of these algorithms can easily be found on the Internet. I will assume the input string is "banana" and try to explain the steps and how they work.

1. Run Suffix Array - O(n)

The Suffix Array algorithm sorts the suffixes of the string alphabetically. For the input "banana", the output is going to be the array [5, 3, 1, 0, 4, 2], where 5 corresponds to the suffix starting at position 5 ("a"), 3 corresponds to the suffix starting at position 3 ("ana"), 1 corresponds to the suffix starting at position 1 ("anana"), etc. After we compute this array, it becomes much easier to count the occurrences of a substring because the equal substrings are placed consecutively:

  1. a
  2. ana
  3. anana
  4. banana
  5. na
  6. nana

For example, we can immediately see that the substring "ana" occurs twice by looking at the 2nd and the 3rd suffixes in the above list. Similarly, we can say the substring "n" also occurs twice by looking at the 5th and the 6th.

2. Run LCP Array - O(n)

The LCP algorithm computes the length of the longest common prefix between every consecutive pair of suffixes in the suffix array. The output is going to be [1, 3, 0, 0, 2] for "banana":

  1. a
  2. ana // "a" and "ana" share the prefix "a", which is of length 1
  3. anana // "ana" and "anana" share the prefix "ana", which is of length 3
  4. banana // "anana" and "banana" share no prefix, so 0
  5. na // "banana" and "na" share no prefix, so 0
  6. nana // "na" and "nana" share the prefix "na", which is of length 2

Now if we plot the output of the LCP algorithm as an histogram:

  x
  x  x
 xx  x
 -----
 01234
 -----
aaabnn
 nnaaa
 aan n
  na a 
  an
   a

Now, here is the main observation: every rectangle in the histogram that touches the y axis corresponds to a substring and its occurences: the rectangle's width is equal to s.count - 1 and its height equals to s.length

For example consider this rectangle in the lower left corner, that corresponds to the substring "a".

xx
--
01

The rectangle is of height 1, which is "a".length and of width 2, which is "a".count - 1. And the value we need (s.count * s.length) is almost the area of the rectangle.

3. Find the largest rectangle in the histogram - O(n)

Now all we need to do is to find the largest rectangle in the histogram to find the answer to the problem, with the simple nuance that while calculating the area of the rectangle we need to add 1 to its width. This can be done by simply adding a 1 in the area calculation logic in the algorithm.

For the "banana" example, the largest rectangle is the following (considering we added 1 to every rectangle's width):

x
x
x
-
1

We add one to its width and calculate its area as 2 * 3 = 6, which equals to how many times the substring "ana" occurs times its length.

Each of the 3 steps take O(n) time, totalling to an overall time complexity of O(n).

CodePudding user response:

this does the trick despite not being very efficient O(n) complexity. I can't imagine more efficient way though...

 static void TestRegexes()
        {
            var n = CountSubs("aaaaaa", "a");
            var nn = CountSubs("aaaaaa", "aa");
            var nnn = CountSubs("aaaaaa", "aaa");
            var nnnn = CountSubs("aaaaaa", "aaaa");
            var nnnnn = CountSubs("aaaaaa", "aaaaa");
            var nnnnnn = CountSubs("aaaaaa", "aaaaaa");
            ;

        }

        private static int CountSubs( string content, string needle)
        {
            

            int l = content.Length;
            int i = 0;
            int count = 0;
            while (content.Length >= needle.Length)
            {
                if (content.StartsWith(needle))
                {
                    count  ;
                }

                content = content.Substring(1);
                i  ;
            }

            return count;
        }
  • Related