Home > Mobile >  Z-Function and unique substrings: broken algorithm parroted everywhere?
Z-Function and unique substrings: broken algorithm parroted everywhere?

Time:05-19

I am not a huge math nerd so I may easily be missing something, but let's take the algorithm from https://cp-algorithms.com/string/z-function.html and try to apply it to, say, string baz. This string definitely has a substring set of 'b','a','z', 'ba', 'az', 'baz'.

Let's see how z function works (at leas how I understand it):

  1. we take an empty string and add 'b' to it. By definition of the algo z[0] = 0 since it's undefined for size 1;
  2. we take 'b' and add 'a' to it, invert the string, we have 'ab'... now we calculate z-function... and it produces {0, 0}. First element is "undefined" as is supposed, second element should be defined as:

i-th element is equal to the greatest number of characters starting from the position i that coincide with the first characters of s.

so, at i = 1 we have 'b', our string starts with a, 'b' doesn't coincide with 'a' so of course z[i=1]=0. And this will be repeated for the whole word. In the end we are left with z-array of all zeroes that doesn't tell us anything despite the string having 6 substrings.

Am I missing something? There are tons of websites recommending z function for count of distinct substrings but it... doesn't work? Am I misunderstanding the meaning of distinct here?

See test case: https://pastebin.com/mFDrSvtm

CodePudding user response:

When you add a character x to the beginning of a string S, all the substrings of S are still substrings of xS, but how many new substrings do you get?

  • The new substrings are all prefixes of xS. There are length(xS) of these, but
  • max(Z(xS)) of these are already substrings of S, so
  • You get length(xS) - max(Z(xS)) new ones

So, given a string S, just add up all the length(P) - max(Z(P)) for every suffix P of S.

Your test case baz has 3 suffixes: z, az, and baz. All the letters are distinct, so their Z functions are zero everywhere. The result is that the number of distinct substrings is just the sum of the suffix lengths: 3 2 1 = 6.

Try baa: The only non-zero in the Z functions is Z('aa')[1] = 1, so the number of unique substrings is 3 2 - 1 1 = 5.

Note that the article you linked to mentions that this is an O(n2) algorithm. That is correct, although its overhead is low. It's possible to do this in O(n) time by building a suffix tree, but that is quite complicated.

  • Related