I have a dataframe where I want to create random numbers in a new column. The random numbers must fulfill two constraints:
- The random numbers must add up to a specified sum (in the example, the sum is 300)
- For each observation, the random numbers must not exceed a value in the constraint column.
In the example below, the constraints are fulfilled because the sum is 300 and the random number does not exceed the constraint column.
Example:
GEOID | CONSTRAINT | RANDOM |
---|---|---|
010010000001 | 100 | 80 |
010010000002 | 50 | 40 |
010010000003 | 75 | 60 |
010010000004 | 75 | 60 |
010010000005 | 100 | 60 |
It seems having random numbers totaling a sum has been demonstrated but I do not see an example with a second constraint.
Edit for clarity: The new column must be integers. The minimum lower bound value is 0.
CodePudding user response:
You could use the multinomial distribution to build an approximate answer:
def sample(total, constraints):
import numpy as np
rng = np.random.default_rng()
samples = rng.multinomial(total, constraints / constraints.sum(), size=100)
return next(val for val in samples if np.all(val < constraints))
df["RANDOM"] = sample(300, df["CONSTRAINT"].values)
print(df)
Output
GEOID CONSTRAINT RANDOM
0 10010000001 100 81
1 10010000002 50 42
2 10010000003 75 57
3 10010000004 75 53
4 10010000005 100 67
Thanks goes to @Michael Szczesny for testing the solution.
The key to solve this, relies in (quote from numpy docs):
Its values, X_i = [X_0, X_1, ..., X_p], represent the number of times the outcome was i.
see more details in this blog post.
CodePudding user response:
Here is my solution, let me know what you think about it:
total_sum = 300 # total we want to obtain, is dynamical
# We check if the problem is realistic
if sum(df["CONSTRAINT"]) < total_sum:
raise Exception("IMPOSSIBLE TO REACH")
random_value_list = [] # values for the new column RANDOM
# We shuffle just to make sure the function works well in any case, (can keep the original order with older indexes)
df = df.sample(frac=1)
df = df.reset_index(drop=False)
for i in range(df.shape[0] - 1):
total_reached = sum(random_value_list)
total_yet_to_reach = total_sum - total_reached
max_value = min(df.at[i, "CONSTRAINT"], total_yet_to_reach)
# we put constraints or max value to not go above 300 as conditions
sum_next_constraints = sum(df.loc[i 1:, :]["CONSTRAINT"])
random_value = random.randint(0, max_value)
# this while condition is a conditions for the minimum of the random.randint
while random_value sum_next_constraints total_reached < total_sum:
random_value = random.randint(0, max_value)
random_value_list.append(random_value)
# complete the list to get 300 (can be 0)
random_value_list.append(total_sum - sum(random_value_list))
df["RANDOM"] = random_value_list
df
Here is just one of the example of what you can get :
index GEOID CONSTRAINT RANDOM
0 3 010010000004 75 49
1 0 010010000001 100 94
2 1 010010000002 50 10
3 2 010010000003 75 69
4 4 010010000005 100 78
We can check that:
sum(df["RANDOM"])
Out : 300