Calculating average of percentages after group by-CodePudding

I have a dataset that looks like this

Question No     Question     Answers    Country   Response 
Q5               Text q5?    Option 1    Ireland.  9%  
Q5               Text q5?    Option 1    Poland    56%    
Q5               Text q5?    Option 1    Spain     78%
Q5               Text q5?    Option 1    France     23%
Q5               Text q5?    Option 1    Chile      22%
Q5               Text q5?    Option 2    Ireland    19%
Q5               Text q5?    Option 2    Poland     44%
Q5               Text q5?    Option 2    Spain      65%
Q5               Text q5?    Option 2    France     33%
Q5               Text q5?    Option 2    Chile      44%
Q5               Text q5?    Option 3    Ireland    78%
Q5               Text q5?    Option 3    Poland     88%
Q5               Text q5?    Option 3    Spain      66%
Q5               Text q5?    Option 3    France     54%
Q5               Text q5?    Option 3    Chile      97%
Q5               Text q5?    Option 4    Ireland    43%  
Q5               Text q5?    Option 4    Poland     32%
Q5               Text q5?    Option 4    Spain      67%
Q5               Text q5?    Option 4    France     23%
Q5               Text q5?    Option 4    Chile      21%
Q6               Text q6?    Option 1    Ireland    39%  
Q6               Text q6?    Option 1    Poland     16%    
Q6               Text q6?    Option 1    Spain      38%
Q6               Text q6?    Option 1    France     13%
Q6               Text q6?    Option 1    Chile      22%
Q6               Text q6?    Option 2    Ireland    38%
Q6               Text q6?    Option 2    Poland     82% 
Q6               Text q6?    Option 2    Spain      64%
Q6               Text q6?    Option 2    France     54%
Q6               Text q6?    Option 2    Chile      97%
Q6               Text q6?    Option 3    Ireland    53% 
Q6               Text q6?    Option 3    Poland     12%
Q6               Text q6?    Option 3    Spain      97%
Q6               Text q6?    Option 3    France     13% 
Q6               Text q6?    Option 3    Chile      91%

I want to group by the results and find the average of each answer for each question. The tricky part is that the sample size is not the same. For Ireland it's 150 and 200 for all the other countries. So, I have created a new column to add sample size information to table.

df['Sample_Size'] = [150 if x == 'Ireland' else 200 for x in df['Country']]

Now, the issue is that I want to group by the results to see global values instead of country breakdown and for that I need to take the average of percentages. The formula to take the average is

Percentage 1 * Sample Size1 Percentage 2 * Sample Size2 / Sum of Sample Sizes

 Grouped = df.groupby(['Question', 'Answer'])['Response','Sample_Size'].apply(lambda x:sum(unpivot_surveys_Statista['Response']*unpivot_surveys_Statista['Sample_Size'])/sum(unpivot_surveys_Statista['Sample_Size'])).reset_index()

But the results are showing Nan

Ideally, I would like my end result to look like this (showing only For Q5 Option 1 sample):

Question No     Question     Answers     Response 
Q5               Text q5?    Option 1     39%

CodePudding user response：

Wouldn't group by country by the average of the percentage column pd.groupby['country'].mean() and then merging the answers into a single df to get the global average solve your problem?

CodePudding user response：

You can use:

df['Response'] = df['Response'].str[:-1].astype(int)
df['temp'] = df['Response']*df['Sample_Size']
df = df.groupby(['Question', 'Answers']).agg({'Sample_Size': 'sum', 'temp': 'sum'}).reset_index()
df['Response '] = (df['temp']/df['Sample_Size']).round(2).astype(str)   '%'
df.drop(columns=['temp', 'Sample_Size'], inplace=True)

OUTPUT

   Question   Answers Response 
0  Text q5?  Option 1    39.11%
1  Text q5?  Option 2    42.16%
2  Text q5?  Option 3    76.53%
3  Text q5?  Option 4    36.89%
4  Text q6?  Option 1    24.89%
5  Text q6?  Option 2    68.53%
6  Text q6?  Option 3    53.21%