Why does a random function succeed on the first try most often in Python?

Time:10-19

I'm running some code to figure out the probability of an event happening on the first attempt, on the second attempt, etc.

The problem I'm facing doesn't have to do with the code itself, but rather the random library, I believe.

import random
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np


denom = 227
trials = 30000

options = [x + 1 for x in range(denom)]

itn_list = []

for x in range(trials):
    itn = 0
    
    while True:
        itn += 1
        num = random.choice(options)
        if num == 1:
            itn_list.append(itn)
            break

data = []

# Attempt counts start at 1, and the largest observed count must be included.
for x in range(1, max(itn_list) + 1):
    occurances = itn_list.count(x)
    data.append([x, occurances])

data = pd.DataFrame(np.array(data), columns=['Attempts', 'Occurances'])

plt.title('Occurances')
plt.plot(data.Attempts, data.Occurances)
plt.xlim(-1, max(data.Attempts) + 1)
plt.ylim(-1, max(data.Occurances) + 1)
plt.show()

I had initially expected the number of times it succeeded on the first try to be very low, spiking around 227 tries (since the odds are 1/227), then slowly dwindling off to the right. Instead, the graph was highest in the 1-20 tries range, meaning an event with odds of 1/227 was more likely to succeed on the first try than on the 227th try. Is this an issue with the random library, or am I not understanding the math correctly?

Feel free to try out the code; I'm not sure how to attach images on Stack Overflow. Sorry if I didn't explain it well.

CodePudding user response:

Looks like you have a geometric distribution. The chance of drawing a 1 is the same (1/denom) on each individual attempt, but you are counting the number of attempts up to and including the first successful one.

> Instead, the graph was highest in the 1-20 tries range, meaning an event with odds of 1/227 was more likely to succeed on the first try than on the 227th try.

That is to be expected. Think about it this way: the probability of hitting your first one at attempt i is 1/227 (getting a one on that attempt) times $(226/227)^{i-1}$ (not hitting a one on any of the previous i-1 attempts). So while the first factor is the same for any i, the second factor makes the probability go down as i grows.
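As a rough illustration of that formula, here is a minimal sketch that evaluates it for a few attempt numbers. The helper name first_success_prob is made up for this example; only the 1/227 figure comes from the question.

```python
# Probability that the first success lands exactly on attempt i,
# for a geometric distribution with success probability p = 1/227.
p = 1 / 227


def first_success_prob(i, p=p):
    """Hypothetical helper: p * (1 - p)**(i - 1)."""
    return p * (1 - p) ** (i - 1)


# The values shrink steadily as i grows, so attempt 1 is the single
# most likely outcome even though it is still unlikely in absolute terms.
for i in (1, 2, 20, 227, 1000):
    print(i, first_success_prob(i))
```

Summing first_success_prob(i) over all i gives 1, which is a quick sanity check that this really is a probability distribution.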

You may notice that even for small values of i the numbers are not that big. So in an intuitive sense, every single specific number of attempts is fairly unlikely. That contributes to the fact that you are still likely to wait quite a while before your first one, most of the time.

> I had initially expected the number of occurrences that it succeeded on the first try to be very low, spiking around 227 tries (since it's 1/227), then slowly dwindling off to the right.

You should probably be thinking about this in terms of expected value. While the probability distribution doesn't have this kind of spike, the expected value is 227, the inverse of the probability on each individual attempt.
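One way to check the expected value numerically is to add up i times the probability of succeeding exactly on attempt i. This is only a sketch: the cutoff of 20000 terms is an assumption chosen so that the ignored tail is negligible, and the variable names are illustrative.

```python
# Approximate the expected number of attempts by a truncated sum
# E[X] = sum over i of i * p * (1 - p)**(i - 1), with p = 1/227.
p = 1 / 227
cutoff = 20000  # assumed cutoff; terms beyond this are vanishingly small

expected_attempts = sum(i * p * (1 - p) ** (i - 1) for i in range(1, cutoff + 1))

# The result should agree closely with 1 / p.
print(expected_attempts, 1 / p)
```

The sum lands on 1/p even though no individual attempt number near 227 is especially likely; the mean is pulled up by the long tail of late successes.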

Slight detour: $(226/227)^{157} \approx 1/2$, so the chances of not getting a one in any of the first 157 attempts are very close to 50:50. That median, or 50% quantile, gives you another indication that waiting a long time is expected. Now you may wonder why 157 is so much smaller than 227. Intuitively speaking, that's because the median ignores distance. It only cares about whether you get a one in the first 157 attempts or not. It doesn't care about how much sooner or later you might get it. Since in theory you could wait arbitrarily long (although with diminishing probability), some of the iterations have a very high number of attempts. The expected value takes the actual numbers into account and is therefore considerably higher than the median. The really high numbers more than balance the really low numbers, despite the smaller probability of each single one of them. There simply are a lot more of them.
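The median can be checked the same way. The sketch below solves for the smallest number of attempts whose cumulative success probability reaches one half; prob_within is a hypothetical helper name used only here.

```python
import math

p = 1 / 227


def prob_within(n, p=p):
    """Hypothetical helper: chance of at least one success in the first n attempts."""
    return 1 - (1 - p) ** n


# Smallest n with (1 - p)**n <= 1/2, i.e. n >= log(1/2) / log(1 - p).
median_attempts = math.ceil(math.log(0.5) / math.log(1 - p))

print(median_attempts)                   # 157
print(prob_within(median_attempts - 1))  # just below 0.5
print(prob_within(median_attempts))      # just above 0.5
```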

CodePudding user response:

For anyone looking to do what I've done here, but get useful information out of it, I've made a version that displays what % of the occurrences have happened by certain trial numbers. Note that running the code with more trials takes longer, linearly I believe, but will provide more accurate data. Here's the code:

import random
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np


denom = 227
trials = 25000

options = [x + 1 for x in range(denom)]

itn_list = []

for x in range(trials):
    itn = 0
    
    while True:
        itn += 1
        num = random.choice(options)
        if num == 1:
            itn_list.append(itn)
            break
        

data = []
total_occurences = 0
percentile = 0

# Attempt counts start at 1, and the largest observed count must be included.
for x in range(1, max(itn_list) + 1):
    occurences = itn_list.count(x)
    data.append([x, occurences])
    
    total_occurences += occurences
    
    if total_occurences/trials > percentile and occurences > 0:
        percentile += 0.005
        print("Percentile {:.2%}".format(total_occurences/trials), f"at {x}")

data = pd.DataFrame(np.array(data), columns=['Attempts', 'Occurances'])

plt.title('Occurances')
plt.plot(data.Attempts, data.Occurances)
plt.xlim(-1, max(data.Attempts) + 1)
plt.ylim(-1, max(data.Occurances) + 1)
plt.show()
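The counting loop above calls itn_list.count(x) once per attempt number, which rescans the whole list each time. A possible alternative, sketched here with collections.Counter, tallies everything in one pass and then accumulates the shares. The function names, the seed, and the smaller trial count are assumptions for this example, and random.Random.randrange is swapped in for random.choice on a list; it is not the code from the post.

```python
import random
from collections import Counter


def attempts_until_success(denom, rng):
    """Count draws until a 1-in-denom event first happens (hypothetical helper)."""
    attempts = 1
    while rng.randrange(denom) != 0:
        attempts += 1
    return attempts


def cumulative_share(samples):
    """Return (attempts, share of samples finished by then), sorted by attempts."""
    counts = Counter(samples)
    total = len(samples)
    running = 0
    result = []
    for attempts in sorted(counts):
        running += counts[attempts]
        result.append((attempts, running / total))
    return result


rng = random.Random(0)  # fixed seed so repeated runs match
samples = [attempts_until_success(227, rng) for _ in range(5000)]

# Print every 50th entry to keep the output short.
for attempts, share in cumulative_share(samples)[::50]:
    print(f"{share:.2%} of trials had succeeded by attempt {attempts}")
```

With this layout the tallying cost grows with the number of trials plus the cost of sorting the distinct attempt counts, rather than with the product of the two.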