Home > database >  Generate minimum number of sets of substrings with max substring length
Generate minimum number of sets of substrings with max substring length

Time:08-20

Given a string, I'd like to create the minimum number of sets of substrings where:

  1. substrings have a length up to x
  2. if two sets differ when a substring in one set can be composed of smaller substrings in the other set, the second set can be excluded

For example, the substring "abcdef" with max substring length of 3 would result in the following:

[
  ['abc','def'],
  ['ab','cde','f'],
  ['ab','cd','ef'],
  ['a','bcd','ef'],
]

['abc','de','f'] is not included because of condition (2). eg, 'def' is compsed of 'de','f'

The following recursive allSubstrings() method isn't quite right and doesn't feel like the right approach. Also has the problem of being slow with longer strings.

const allSubstrings = (input,max_len = 3) => {

    if( input.length === 1) return [[input]];

    let result = [];

    const start = input.substring(1);

    allSubstrings(start).forEach(function(subresult) {
        let tmp = subresult.slice(0);

        if( tmp[0].length < max_len ){
            tmp[0] = input.charAt(0)   tmp[0];
            result.push(tmp);
        }

        if( subresult.length > 1 ){
            if( subresult.slice(0,2).join('').length <= max_len ){
                return;
            }

            if( subresult.slice(subresult.length-2).join('').length <= max_len ){
                return;
            }
        }


        tmp = subresult.slice(0);
        tmp.unshift(input.charAt(0));

        result.push(tmp);
    });

    return result;
}

CodePudding user response:

Here's one possible approach using recursion, not optimized for performance (e.g., there's no memoization):

function splits(s: string, maxLen: number, minInitLen: number = 1): string[][] {
  const ret: string[][] = s.length ? [] : [[]];
  for (let i = Math.min(maxLen, s.length); i >= minInitLen; i--) {
    for (const r of splits(s.slice(i), maxLen, maxLen   1 - i)) {
      ret.push([s.slice(0, i), ...r]);
    }
  }
  return ret;
}

The idea is that splits() takes not just the string s to split and the maximum length maxLen of substrings, but also the minimum allowable length minInitLen for the first substring. When you start splitting the string this initial length is 1, but inside the recursion it could be larger. After all, let's say that you've split "abcdef" into "a" so far. Now you need to split "bcdef", but the next substring has to have length 3 ("bcd"), since "b" would collapse into "ab" and "bc" would collapse into "abc". If you've split it into "ab" so far, then the next substring has to have length at least 2 ("cd" or "cde"), since "c" would collapse into "abc". The relevant math is that if the current substring is of length i, then the next substring must have at least length maxLen 1 - i.

The base case for the recursion is when you are asked to split an empty string. In this case you want to return [[]]; there is exactly one way to split an empty string. If you are splitting a non-empty string then you want to iterate your first substring length i between minInitLen and maxLen (well, you can't exceed s.length either), and calculate the recursive result splits(s.slice(i), maxLen, maxLen 1 - i), and then for each set of strings in the result, prepend the initial substring s.slice(0, i) to it.


Let's see if it works:

console.log(splits("abcdef", 3))
// [["abc", "def"], ["ab", "cde", "f"], ["ab", "cd", "ef"], ["a", "bcd", "ef"]] 

console.log(splits("abcdefgh", 5))
// [["abcde", "fgh"], ["abcd", "efgh"], ["abc", "defgh"], ["ab", "cdefg", "h"], 
//  ["ab", "cdef", "gh"], ["a", "bcdef", "gh"]] 

It looks good, at least for your example input. I'm not sure if there are edge cases, or if performance becomes a problem for large strings (it probably depends very much on what "large" means). And one could certainly refactor to use functional programming array methods instead of for loops if one wanted. But at least it works.

Playground link to code

  • Related