Home > Enterprise >  Generate random postfix expressions of an arbitrary length without producing duplicate expressions
Generate random postfix expressions of an arbitrary length without producing duplicate expressions

Time:08-28

For a research project, we are currently using the Gamblers Ruin Algorithm to produce random postfix expressions (terms) using term variables "x", "y", and "z" and an operator "*", as shown in the method below:

def gamblers_ruin_algorithm(prob=0.3,
                            min_term_length=1,
                            max_term_length=None):
    """
    Generate a random term using the gamblers ruin algorithm

    :type prop: float
    :param prob: Probability of growing the size of a random term
    :type max_term_length: int
    :param max_term_length: Maximum length of the generated term
    """
    term_variables = ["x", "y", "z"]
    substitutions = ("EE*", "I")
    term = "E"
    term_length = 0
    # randomly build a term
    while("E" in term):
        rand = uniform(0, 1)
        if rand < prob or term_length < min_term_length:
            index = 0
            term_length  = 1
        else:
            index = 1
        if (max_term_length is not None and
                term_length >= max_term_length):
            term = term.replace("E", "I")
            break
        term = term.replace("E", substitutions[index], 1)
    # randomly replace operands
    while("I" in term):
        term = term.replace("I", choice(term_variables), 1)
    return term

The following method will produce a random postfix term like the following:

xyz*xx***zyz*zzx*zzyy***x**x****x**

This issue with this method is that when run thousands of times, it tends to frequently produce duplicate expressions.

Is there a different algorithm for producing random posfix expressions of an arbitrary length that minimizes the probability of producing the same expression more than once?

CodePudding user response:

You have four digits: x, y, z, and * such that:

x = 1
y = 2
z = 3
* = 4

So any expression can be expressed as a number using those digits. For example, the postfix expression xy*z* is 12434. And every such number maps to a unique expression.

With this technique, you can map each expression to a unique 32 bit or 64 bit number. And there are many good techniques for generating unique random numbers. See, for example, https://stackoverflow.com/a/34420445/56778.

So:

  1. Generate a bunch of unique random numbers.
  2. For each random number:
  3. Convert to that modified base 5 number.
  4. Generate the expression from that number.

You can of course combine the 3rd and 4th steps. That is, instead of generating a '1', generate an 'x'.

There will of course be some limit on the length of the expressions. Each digit requires two bits to represent, so the maximum length of an expression from a 32 bit number will be 16 characters. You can extend to longer expressions easily enough by generating 64 bit random numbers. Or 128. Or whatever you like. The basic algorithm remains the same.

CodePudding user response:

Your basic problem is not the algorithm; it's the way you force the minimize size of the resulting expression. That procedure introduces an important bias into the generation of the first min_term_length operators. If the expression manages to grow further, that bias will slowly decrease, but it will never disappear.

Until the expression reaches the minimum length, you replace the first E with EE*. So the first few expressions are always:

E
EE*
EE*E*
EE*E*E*
...

When the minimum length is reached, the function starts replacing E with I with probability 1-prob, which is 70% using the default argument. If this succeeds for all the Es, the function will return a tree with the above shape.

Suppose that min_term_length is 5. The probability of five successive tests choosing to not extend the expression is 0.75, or about 16.8%. At that point, the expression will be II*I*I*I*I, and the six Is will be randomly replaced by a variable name. There are three variables, making a total of 36 = 729 different postfix expressions. If you do a large number of samples, the fact that a sixth of the samples fall into 729 possible expressions will certainly create lots of duplicates. That's unnecessary because there are actually 42 possible shapes of a postfix expression with five operands (the fifth Catalan number), so there were actually 30618 possible postfix expressions. If all of those could be produced, you'd expect less than one duplicate in a hundred thousand samples.

Note that the bias introduced by forcing a particular replacement for the first min terms will continue to show up for longer strings as well. For example, If the algorithm happens to expand the string exactly once during the first six steps, which has a probability of about 10%, then it will choose one of six shapes, although there are 132 possibilities. So you can expect duplicates of that size as well, although somewhat fewer.

Instead of forcing a choice when the string is still short, you should let the algorithm just continue until the gambler is ruined or the maximum length occurs. If the gambler is ruined too soon, throw out that sample and start over. That will slow things down a bit, but it's still quite practical. If tossing out so many possibilities annoys you, you could instead pre-generate all possible patterns of the minimum length -- as noted above, if the minimum length is six operators, then that's 132 shapes, which are easy to enumerate -- and select one of those at random as the starting point.

  • Related