Using a Python pandas dataframe column as input to a loop through another column-CodePudding

I've got two dataframes. One looks like:

Year  Count
1      3
2      2
3      1
4      5
5      4

The other looks like

I'm looking to use the Count in first dataframe to loop through the second dataframe. I want to use the count value in each row to randomly select from the Value column in the second dataframe N times where N is the value in the Count column - and add these values up, giving a new column in the first dataframe thus:

Year  Count  RandSum
1      3      200
2      2       50
3      1      100
4      5      225
5      4      200

i.e. so the RandSum column added to the first dataframe is the sum of "Count" random selections from the Value column in the 2nd dataframe (i.e. in the first row, Count = 3, so drawing randomly from the Value column in the 2nd table drew 100, 50 and 50 = 200)

Any help appreciated for this relative python novice.

CodePudding user response：

Here's another idea using numpy.random.choice in a list comprehension:

import numpy as np

np.random.seed(0)

df1['RandSum'] = [np.random.choice(df2['Value'], n).sum() for n in df1['Count']]

[out]

   Year  Count  RandSum
0     1      3      175
1     2      2       50
2     3      1       50
3     4      5      275
4     5      4      200

CodePudding user response：

I believe you can use sample for this, then take the sum rowwise

import pandas as pd
df = pd.DataFrame({'Year': [1, 2, 3, 4, 5], 'Count': [3, 2, 1, 5, 4]})
df2 = pd.DataFrame({'ID': [1, 2, 3, 4, 5], 'Value': [100, 50, 0, 25, 50]})
df['RandSum'] = df.apply(lambda x: df2['Value'].sample(x['Count']), axis=1).sum(1)

print(df)

Output (Will vary by run)

   Year  Count  RandSum
0     1      3    150.0
1     2      2    150.0
2     3      1     50.0
3     4      5    225.0
4     5      4    225.0

CodePudding user response：

you should import pandas and create data frames. then, you select the column you want to iterate through with a for loop function. Afterward, you randomly choose a number of values in the target column and sum it. finally, put it into the output column named RandSum.

import pandas as pd
import numpy as np
df1 = pd.DataFrame({'Year': [1, 2, 3, 4, 5], 'Count': [3, 2, 1, 5, 4]})
df2 = pd.DataFrame({'ID': [1, 2, 3, 4, 5], 'Value': [100, 50, 0, 25, 50]})

df1["RandSum"] = np.nan
for i, count in enumerate(df1["Count"]):
    sample = df2["Value"].sample(n = count)
    df1.iloc[i, df1.columns.get_loc("RandSum")] = sample.sum()

df1.head()

The output will be different every time we run it.

Year	Count	RandSum
1	3	125.0
2	2	150.0
3	1	25.0
4	5	225.0
5	4	175.0