Home > Software engineering >  Generating a random string with matched brackets
Generating a random string with matched brackets

Time:07-12

I need to generate a random string of a certain length – say ten characters, for the sake of argument – composed of the characters a, b, c, (, ), with the rule that parentheses must be matched.

So for example aaaaaaaaaa, abba()abba and ((((())))) are valid strings, but )aaaabbbb( is not.

What algorithm would generate a random string, uniformly sampled from the set of all strings consistent with those rules? (And run faster than 'keep generating strings without regard to the balancing rule, discard the ones that fail it', which could end up generating very many invalid strings before finding a valid one.)

CodePudding user response:

Use dynamic programming to generate a data structure that knows how many there are for each choice, recursively. Then use that data structure to find a random choice.

I seem to be the only person who uses the technique. And I always write it from scratch. But here is working code that hopefully explains it. It will take time O(length_of_string * (length_of_alphabet 2)) and similar data.

import random

class DPPath:
    def __init__ (self):
        self.count = 0
        self.next = None

    def add_option(self, transition, tail):
        if self.next is None:
            self.next = {}
        self.next[transition] = tail
        self.count  = tail.count

    def random (self):
        if 0 == self.count:
            return None
        else:
            return self.find(int(random.random() * self.count))

    def find (self, pos):
        result = self._find(pos)
        return "".join(reversed(result))

    def _find (self, pos):
        if self.next is None:
            return []

        for transition, tail in self.next.items():
            if pos < tail.count:
                result = tail._find(pos)
                result.append(transition)
                return result
            else:
                pos -= tail.count

        raise IndexException("find out of range")

def balanced_dp (n, alphabet):
    # Record that there is 1 empty string with balanced parens.
    base_dp = DPPath()
    base_dp.count = 1

    dps = [base_dp]

    for _ in range(n):
        # We are working backwards towards the start.
        prev_dps = [DPPath()]

        for i in range(len(dps)):
            # prev_dps needs to be bigger in case of closed paren.
            prev_dps.append(DPPath())
            # If there are closed parens, we can open one.
            if 0 < i:
                prev_dps[i-1].add_option('(', dps[i])

            # alphabet chars don't change paren balance.
            for char in alphabet:
                prev_dps[i].add_option(char, dps[i])

            # Add a closed paren.
            prev_dps[i 1].add_option(")", dps[i])

        # And we are done with this string position.
        dps = prev_dps

    # Return the one that wound up balanced.
    return dps[0]

# And a quick demo of several random strings.
for _ in range(10):
    print(balanced_dp(10, "abc").random())

CodePudding user response:

A string consisting only of balanced parentheses (for any arbitrary pair of characters representing an open and a close) is called a "Dyck string", and the number of such strings with p pairs of parentheses is the pth Catalan number, which can be computed as (2pCp)/(p 1), a formula which would be much easier to make readable if only SO allowed MathJax. If you want to also allow other characters, you need to consider, for each number pn of pairs of balanced parentheses, the number of different combinations of the non-parentheses characters (4(2n-2p)) and the number of ways you can interpolate 2n-2p characters in a string of total length 2n (2nC2p). If you sum all these counts for each possible value of p, you'll get the count of the total universe of possibilities, and you can then choose a random number in that range and select whichever of the individual p counts corresponds. Then you can select a random placement of random non-parentheses characters.

Finally, you need to get a uniformly distributed Dyck string; a simple procedure is to decompose the Dyck string into it's shortest balanced prefix and the remainder (i.e. (A)B, where A and B are balanced subsequences). Select a random length for (A), then recursively generate a random A and a random B.

Precomputing the tables of counts (or memoising the function which generates them) will produce a slight speedup if you expect to generate a lot of random strings.

  • Related