I have a similar question to one I posed here, but subtly different as it includes an extra step to the process involving a probability:
Using a Python pandas dataframe column as input to a loop through another column
I've got two pandas dataframes: one has these variables
Year Count Probability
1 8 25%
2 26 19%
3 17 26%
4 9 10%
Another is a table with these variables:
ID Value
1 100
2 25
3 50
4 15
5 75
Essentially I need to use the Count x in the first dataframe to loop through the 2nd dataframe x times, but only pull a value from the 2nd dataframe y percent of the times (using random number generation) - and then create a new column in the first dataframe that represents the sum of the values in the loop.
So - just to demonstrate - in that first column, we'd loop through the 2nd table 8 times, but only pull a random value from that table 25% of the time - so we might get output of:
0 100 0 0 25 0 0 0
...which sums to 125 - so we our added column to the first table looks like
Year Count Probability Sum
1 8 25% 125
....and so on. Thanks in advance.
CodePudding user response:
We'll use numpy binomial and pandas sample to get this done.
import pandas as pd
import numpy as np
# Set up dataframes
vals = pd.DataFrame([[1,8,'25%'], [2,26,'19%'], [3,17,'26%'],[4,9,'10%']])
vals.columns = ['Year', 'Count', 'Probability']
temp = pd.DataFrame([[1,100], [2,25], [3,50], [4,15], [5,75]])
temp.columns = ['ID', 'Value']
# Get probability fraction from string
vals['Numeric_Probability'] = pd.to_numeric(vals['Probability'].str.replace('%', '')) / 100
# Total rows is binomial random variable with n=Count, p=Probability.
vals['Total_Rows'] = np.random.binomial(n=vals['Count'], p=vals['Numeric_Probability'])
# Sample "total rows" from other DataFrame and sum.
vals['Sum'] = vals['Total_Rows'].apply(lambda x: temp['Value'].sample(
n=x, replace=True).sum())
# Drop intermediate rows
vals.drop(columns=['Numeric_Probability', 'Total_Rows'], inplace=True)
print(vals)
Year Count Probability Sum
0 1 8 25% 15
1 2 26 19% 350
2 3 17 26% 190
3 4 9 10% 0
CodePudding user response:
You could use pass a probabilities list to np.random.choice
:
In [1]: import numpy as np
...: import pandas as pd
In [2]: d_1 = {
...: 'Year': [1, 2, 3, 4],
...: 'Count': [8, 26, 17, 9],
...: 'Probability': ['25%', '19%', '26%', '10%'],
...: }
...: df_1 = pd.DataFrame(data=d_1)
In [3]: d_2 = {
...: 'ID': [1, 2, 3, 4, 5],
...: 'Value': [100, 25, 50, 15, 75],
...: }
...: df_2 = pd.DataFrame(data=d_2)
In [4]: def get_probabilities(values: pd.Series, percentage: float) -> list[float]:
...: percentage /= 100
...: perecent_per_val = percentage / values.size
...: return [perecent_per_val] * values.size [1 - percentage]
...:
In [5]: df_1['Sum'] = [
...: np.random.choice(a=pd.concat([df_2['Value'], pd.Series([0])]),
...: size=n,
...: p=get_probabilites(values=df_2['Value'],
...: percentage=float(percent[:-1]))).sum()
...: for n, percent in zip(df_1['Count'], df_1['Probability'])
...: ]
...: df_1
Out[5]:
Year Count Probability Sum
0 1 8 25% 100
1 2 26 19% 375
2 3 17 26% 275
3 4 9 10% 50
CodePudding user response:
import numpy as np
df1["Sum"] = np.round(df1.Count * pd.to_numeric(df1.Probability.str[:-1]) / 100).astype(int).apply(lambda x: np.random.choice(df2.Value, x).sum())
df1
output:
Year Count Probability Sum
0 1 8 25% 175
1 2 26 19% 215
2 3 17 26% 215
3 4 9 10% 100