I've got two dataframes. One looks like:
Year Count
1 3
2 2
3 1
4 5
5 4
The other looks like
ID Value
1 100
2 50
3 0
4 25
5 50
I'm looking to use the Count in first dataframe to loop through the second dataframe. I want to use the count value in each row to randomly select from the Value column in the second dataframe N times where N is the value in the Count column - and add these values up, giving a new column in the first dataframe thus:
Year Count RandSum
1 3 200
2 2 50
3 1 100
4 5 225
5 4 200
i.e. so the RandSum column added to the first dataframe is the sum of "Count" random selections from the Value column in the 2nd dataframe (i.e. in the first row, Count = 3, so drawing randomly from the Value column in the 2nd table drew 100, 50 and 50 = 200)
Any help appreciated for this relative python novice.
CodePudding user response:
Here's another idea using numpy.random.choice
in a list comprehension:
import numpy as np
np.random.seed(0)
df1['RandSum'] = [np.random.choice(df2['Value'], n).sum() for n in df1['Count']]
[out]
Year Count RandSum
0 1 3 175
1 2 2 50
2 3 1 50
3 4 5 275
4 5 4 200
CodePudding user response:
I believe you can use sample
for this, then take the sum rowwise
import pandas as pd
df = pd.DataFrame({'Year': [1, 2, 3, 4, 5], 'Count': [3, 2, 1, 5, 4]})
df2 = pd.DataFrame({'ID': [1, 2, 3, 4, 5], 'Value': [100, 50, 0, 25, 50]})
df['RandSum'] = df.apply(lambda x: df2['Value'].sample(x['Count']), axis=1).sum(1)
print(df)
Output (Will vary by run)
Year Count RandSum
0 1 3 150.0
1 2 2 150.0
2 3 1 50.0
3 4 5 225.0
4 5 4 225.0
CodePudding user response:
you should import pandas and create data frames. then, you select the column you want to iterate through with a for
loop function. Afterward, you randomly choose a number of values in the target column and sum it. finally, put it into the output column named RandSum
.
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'Year': [1, 2, 3, 4, 5], 'Count': [3, 2, 1, 5, 4]})
df2 = pd.DataFrame({'ID': [1, 2, 3, 4, 5], 'Value': [100, 50, 0, 25, 50]})
df1["RandSum"] = np.nan
for i, count in enumerate(df1["Count"]):
sample = df2["Value"].sample(n = count)
df1.iloc[i, df1.columns.get_loc("RandSum")] = sample.sum()
df1.head()
The output will be different every time we run it.
Year | Count | RandSum |
---|---|---|
1 | 3 | 125.0 |
2 | 2 | 150.0 |
3 | 1 | 25.0 |
4 | 5 | 225.0 |
5 | 4 | 175.0 |