Generate A Random String With A Set of Banned Substrings-CodePudding

I want to generate a random string of a fixed length L. However, there is a set of "banned" substrings all of length b that cannot appear in the string. Is there a way to algorithmically generate this parent string?

Here is a small example:

I want a string that is 10 characters long -> XXXXXXXXXX The banned substrings are {'AA', 'CC', 'AD'} The string ABCDEFGHIJ is a valid string, but AABCDEFGHI is not.

For a small example it is relatively easy to randomly generate and then check the string, but as the set of banned substrings gets larger (or the length of the banned substrings gets smaller), the probability of randomly generating a valid string rapidly decreases.

CodePudding user response：

There are two ways.

Random String is created while repeating until the ban string does not appear.

#include <bits/stdc  .h>
using namespace std;
const int N = 1110000;

char ban[] = "AA";
char rnd[N];

int main() {
    srand(time(0));
    int n = 100;
    do {
        for (int i = 0; i < n; i  ) rnd[i] = 'A' rand()&;
        rnd[n] = 0;
    } while (strstr(rnd, ban));
    cout << rnd << endl;
}

I think this is the easiest way to implement. However, this method has complexity up to O((26/25)^n*(n b)), if the length of the string to be created is very long and the length of the ban string is very small. For example if ban="A", n=10000, then there will be time limit exceed!

You can proceed with the creation of one character and one character while checking whether there is a ban string.

If you want to use this way, you must know about KMP algorithm. In this way, we can not use the system default search function strstr.

#include <bits/stdc  .h>
using namespace std;
const int N = 1110000;

char ban[] = "A";
char rnd[N];

int b = strlen(ban);
int n = 10;

int *pre_kmp() {
    int *pie=new int [b];
    pie[0] = 0;
    int k=0;
    for(int i=1;i<b;i  )
    {
        while(k>0 && ban[k] != ban[i] )
        {
            k=pie[k-1];
        }
        if( ban[k] == ban[i] )
        {
            k=k 1;
        }
        pie[i]=k;
    }
    return pie;
}


int main() {
    srand(time(0));
    
    int *pie = pre_kmp();
    int matched_pos = 0;
    for (int cur = 0; cur < n; cur  ) {
        do {
            rnd[cur] = 'A' rand()&;
            while (matched_pos > 0 && ban[matched_pos] != rnd[cur])
                matched_pos = pie[matched_pos-1];
            if (ban[matched_pos] == rnd[cur])
                matched_pos = matched_pos 1;
        } while (matched_pos == b);
    }
    
    cout << rnd << endl;
}

This algorithm's time complexity will be O(26 * n * b). Of course you could also search manually without using the KMP algorithm. However, in this case, the time complexity becomes O(n*b), so the time complexity becomes very large when both the ban string length and the generator string length are long.

Hope this helps you.

CodePudding user response：

This will be a fairly efficient approach, but it requires a lot of theory.

First you can take your list of strings, like in this case AA, BB, AD. This can trivially be turned into a regular expression that matches any of them, namely /AA|BB|AD/. Which you can then turn into an NFA and a DFA for matching the regular expression. See these lecture notes for an example of how to do that. In this case the DFA will have the following states:

Matched nothing (yet)
Matched A
Matched B
End of match

And the transition rules will be:

If A go to state 2, if B go to state 3, else go to state 1.
If A or D go to state 4, else go to state 1.
If B go to state 4, else go to state 1.
Match complete, we're done. (Stay in state 4 forever.)

Now normally a DFA is used to find a match. We're going to use it as a way to find ways to avoid a match. So how will we do that?

The trick here is dynamic programming.

What we will do is create a table of:

by position in the string
    by state of the match
        how many ways there are to get here
        how many ways we got here from the previous (position, state) pairs

In other words we go forward and create a table which starts like this:

[
    {1: {'count': 1}},
    {1: {'count': 24, 'prev': {1: 24}},
     2: {'count': 1, 'prev': {1: 1}},
     3: {'count': 1, 'prev': {1: 1}},
    },
    {1: {'count': 625, 'prev': {1: 576, 2: 24, 3: 25}},
     2: {'count': 25, 'prev': {1: 24, 3: 1}},
     3: {'count': 25, 'prev': {1: 24, 2: 1}},
     4: {'count': 3, 'prev': {2: 2, 3: 1}},
    },
    ...         
]

By the time this is done, we know exactly how many ways we can wind up at the end of the string with a match (state 4), partial match (states 2 or 3) or not currently matching (state 1).

Now we generate the random string backwards. First we randomly pick the final state with odds based on the count. Then from that final state's prev entry we can pick the state we were on before that final one. Then we randomly pick a letter that would have done that. We are picking that letter/state combination completely randomly from all solutions.

Now we know what state we were in at the second to last letter. Pick the previous state, then pick a letter in the same way.

Continue back until finally we know the state we're in after we've picked the first letter, and then pick that letter.

It is a lot of logic to write, but it is all deterministic in time, and will let you pick random strings all day long once you've done the initial analysis.