I have a dataset that looks like this
Question No Question Answers Country Response
Q5 Text q5? Option 1 Ireland 9%
Q5 Text q5? Option 1 Poland 56%
Q5 Text q5? Option 1 Spain 78%
Q5 Text q5? Option 1 France 23%
Q5 Text q5? Option 1 Chile 22%
Q5 Text q5? Option 2 Ireland 19%
Q5 Text q5? Option 2 Poland 44%
Q5 Text q5? Option 2 Spain 65%
Q5 Text q5? Option 2 France 33%
Q5 Text q5? Option 2 Chile 44%
Q5 Text q5? Option 3 Ireland 78%
Q5 Text q5? Option 3 Poland 88%
Q5 Text q5? Option 3 Spain 66%
Q5 Text q5? Option 3 France 54%
Q5 Text q5? Option 3 Chile 97%
Q5 Text q5? Option 4 Ireland 43%
Q5 Text q5? Option 4 Poland 32%
Q5 Text q5? Option 4 Spain 67%
Q5 Text q5? Option 4 France 23%
Q5 Text q5? Option 4 Chile 21%
Q6 Text q6? Option 1 Ireland 39%
Q6 Text q6? Option 1 Poland 16%
Q6 Text q6? Option 1 Spain 38%
Q6 Text q6? Option 1 France 13%
Q6 Text q6? Option 1 Chile 22%
Q6 Text q6? Option 2 Ireland 38%
Q6 Text q6? Option 2 Poland 82%
Q6 Text q6? Option 2 Spain 64%
Q6 Text q6? Option 2 France 54%
Q6 Text q6? Option 2 Chile 97%
Q6 Text q6? Option 3 Ireland 53%
Q6 Text q6? Option 3 Poland 12%
Q6 Text q6? Option 3 Spain 97%
Q6 Text q6? Option 3 France 13%
Q6 Text q6? Option 3 Chile 91%
I want to group the results and find the average of each answer for each question. The tricky part is that the sample sizes are not the same: 150 for Ireland and 200 for all the other countries. So I have created a new column to add the sample size to the table.
df['Sample_Size'] = [150 if x == 'Ireland' else 200 for x in df['Country']]
Now, the issue is that I want to group the results to see global values instead of a country breakdown, and for that I need to take the weighted average of the percentages. The formula is
(Percentage 1 * Sample Size 1 + Percentage 2 * Sample Size 2 + ...) / Sum of Sample Sizes
Grouped = df.groupby(['Question', 'Answer'])['Response','Sample_Size'].apply(lambda x:sum(unpivot_surveys_Statista['Response']*unpivot_surveys_Statista['Sample_Size'])/sum(unpivot_surveys_Statista['Sample_Size'])).reset_index()
But the results are showing NaN.
Ideally, I would like my end result to look like this (showing only the Q5 Option 1 row as a sample):
Question No Question Answers Response
Q5 Text q5? Option 1 39%
CodePudding user response:
Wouldn't grouping by country and taking the average of the percentage column, e.g. df.groupby('Country')['Response'].mean(),
and then merging the answers into a single df
to get the global average solve your problem?
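One thing worth checking with a plain mean is that it ignores the sample sizes. The following is a minimal sketch, using only the Q5 Option 1 rows from the question, that compares the plain mean with the sample-size-weighted mean from the question's formula. The variable names (sub, plain_mean, weighted_mean) are illustrative and not from the original post.

```python
import pandas as pd

# Q5 / Option 1 rows from the question, with the sample sizes stated there
# (150 for Ireland, 200 for every other country).
sub = pd.DataFrame({
    "Country": ["Ireland", "Poland", "Spain", "France", "Chile"],
    "Response": [9, 56, 78, 23, 22],
    "Sample_Size": [150, 200, 200, 200, 200],
})

# Plain mean: every country counts equally.
plain_mean = sub["Response"].mean()

# Weighted mean: sum(percentage * sample size) / sum(sample sizes).
weighted_mean = (sub["Response"] * sub["Sample_Size"]).sum() / sub["Sample_Size"].sum()

print(round(plain_mean, 2), round(weighted_mean, 2))  # 37.6 39.11
```

The two values differ here, so a plain mean would only match the expected 39% if all sample sizes were equal.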
CodePudding user response:
You can use:
df['Response'] = df['Response'].str[:-1].astype(int)
df['temp'] = df['Response']*df['Sample_Size']
df = df.groupby(['Question', 'Answers']).agg({'Sample_Size': 'sum', 'temp': 'sum'}).reset_index()
df['Response'] = (df['temp']/df['Sample_Size']).round(2).astype(str) + '%'
df.drop(columns=['temp', 'Sample_Size'], inplace=True)
OUTPUT
Question Answers Response
0 Text q5? Option 1 39.11%
1 Text q5? Option 2 42.16%
2 Text q5? Option 3 76.53%
3 Text q5? Option 4 36.89%
4 Text q6? Option 1 24.89%
5 Text q6? Option 2 68.53%
6 Text q6? Option 3 53.21%
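As an alternative, the same weighted average can be computed per group with numpy.average and its weights argument, without a temporary column. This is only a sketch under some assumptions: it uses a small subset of the question's data (Q5, Options 1 and 2), the column names Question and Answers follow the code above, and the helper weighted_mean is a hypothetical name introduced for illustration.

```python
import numpy as np
import pandas as pd

# Small sample taken from the question's table (Q5, Options 1 and 2 only).
df = pd.DataFrame({
    "Question": ["Text q5?"] * 10,
    "Answers": ["Option 1"] * 5 + ["Option 2"] * 5,
    "Country": ["Ireland", "Poland", "Spain", "France", "Chile"] * 2,
    "Response": ["9%", "56%", "78%", "23%", "22%",
                 "19%", "44%", "65%", "33%", "44%"],
})

# Sample sizes as described in the question: 150 for Ireland, 200 otherwise.
df["Sample_Size"] = np.where(df["Country"] == "Ireland", 150, 200)

# Strip the trailing '%' so the percentages can be used in arithmetic.
df["Response"] = df["Response"].str.rstrip("%").astype(float)


def weighted_mean(group):
    """Hypothetical helper: sample-size-weighted mean of one group's responses."""
    return np.average(group["Response"], weights=group["Sample_Size"])


# Build one output row per (Question, Answers) group.
rows = []
for (question, answer), group in df.groupby(["Question", "Answers"]):
    rows.append({
        "Question": question,
        "Answers": answer,
        "Response": round(weighted_mean(group), 2),
    })

result = pd.DataFrame(rows)
print(result)
```

For these two groups the Response values come out as 39.11 and 42.16, matching the first two rows of the output above.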