Looping through a second column using a probability input-CodePudding

I have a similar question to one I posed here, but subtly different as it includes an extra step to the process involving a probability:

Using a Python pandas dataframe column as input to a loop through another column

I've got two pandas dataframes: one has these variables

Year  Count   Probability
 1      8         25%
 2      26        19%
 3      17        26%
 4      9         10%

Another is a table with these variables:

Essentially I need to use the Count x in the first dataframe to loop through the 2nd dataframe x times, but only pull a value from the 2nd dataframe y percent of the times (using random number generation) - and then create a new column in the first dataframe that represents the sum of the values in the loop.

So - just to demonstrate - in that first column, we'd loop through the 2nd table 8 times, but only pull a random value from that table 25% of the time - so we might get output of:

0 100 0 0 25 0 0 0

...which sums to 125 - so we our added column to the first table looks like

Year Count  Probability  Sum
 1     8        25%      125

....and so on. Thanks in advance.

CodePudding user response：

We'll use numpy binomial and pandas sample to get this done.

import pandas as pd
import numpy as np

# Set up dataframes
vals = pd.DataFrame([[1,8,'25%'], [2,26,'19%'], [3,17,'26%'],[4,9,'10%']])
vals.columns = ['Year', 'Count', 'Probability']
temp = pd.DataFrame([[1,100], [2,25], [3,50], [4,15], [5,75]])
temp.columns = ['ID', 'Value']

# Get probability fraction from string
vals['Numeric_Probability'] = pd.to_numeric(vals['Probability'].str.replace('%', '')) / 100

# Total rows is binomial random variable with n=Count, p=Probability.
vals['Total_Rows'] = np.random.binomial(n=vals['Count'], p=vals['Numeric_Probability'])

# Sample "total rows" from other DataFrame and sum.
vals['Sum'] = vals['Total_Rows'].apply(lambda x: temp['Value'].sample(
    n=x, replace=True).sum())

# Drop intermediate rows
vals.drop(columns=['Numeric_Probability', 'Total_Rows'], inplace=True)

print(vals)

   Year  Count Probability  Sum
0     1      8         25%   15
1     2     26         19%  350
2     3     17         26%  190
3     4      9         10%    0

CodePudding user response：

You could use pass a probabilities list to np.random.choice:

In [1]: import numpy as np
   ...: import pandas as pd
In [2]: d_1 = {
   ...:     'Year': [1, 2, 3, 4],
   ...:     'Count': [8, 26, 17, 9],
   ...:     'Probability': ['25%', '19%', '26%', '10%'],
   ...: }
   ...: df_1 = pd.DataFrame(data=d_1)
In [3]: d_2 = {
   ...:     'ID': [1, 2, 3, 4, 5],
   ...:     'Value': [100, 25, 50, 15, 75],
   ...: }
   ...: df_2 = pd.DataFrame(data=d_2)
In [4]: def get_probabilities(values: pd.Series, percentage: float) -> list[float]:
   ...:     percentage /= 100
   ...:     perecent_per_val = percentage / values.size
   ...:     return [perecent_per_val] * values.size   [1 - percentage]
   ...: 
In [5]: df_1['Sum'] = [
   ...:     np.random.choice(a=pd.concat([df_2['Value'], pd.Series([0])]),
   ...:                      size=n,
   ...:                      p=get_probabilites(values=df_2['Value'],
   ...:                                         percentage=float(percent[:-1]))).sum()
   ...:     for n, percent in zip(df_1['Count'], df_1['Probability'])
   ...: ]
   ...: df_1
Out[5]: 
   Year  Count Probability  Sum
0     1      8         25%  100
1     2     26         19%  375
2     3     17         26%  275
3     4      9         10%   50

CodePudding user response：

import numpy as np
df1["Sum"] = np.round(df1.Count * pd.to_numeric(df1.Probability.str[:-1]) / 100).astype(int).apply(lambda x: np.random.choice(df2.Value, x).sum())
df1

output:

Year    Count   Probability Sum
0   1   8   25% 175
1   2   26  19% 215
2   3   17  26% 215
3   4   9   10% 100