I have a pandas DataFrame with 24 columns (questions) and 207 rows (people responding) from a survey. The column values (answers) come from a fixed set: 'Completely agree', 'Rather agree', 'No idea / no opinion', 'Rather not agree', 'Completely disagree'.
I've tried this to get the frequency of values (answers) for a specific column (question):
df['Q1'].value_counts()
I've tried a lot of things to get the frequency of values (answers) for a selection of columns (questions), but nothing has worked.
Desired outcome:
- Is there a way to see the frequency of values (answers) for multiple columns (questions) in one overview?
- I want to know the 3 questions with the most support (i.e. the sum of the frequencies of 'Completely agree' and 'Rather agree'), ordered from most to least support.
Here's the code to create a sample dataset:
import pandas as pd
# df = pd.read_csv('sampledata.csv')  # original load from file; replaced by the inline sample below
df = pd.DataFrame({
    'Q1': ['Completely disagree', 'Completely disagree', 'Rather not agree', 'Rather agree'],
    'Q2': ['Completely disagree', 'Rather not agree', 'Rather agree', 'Rather agree'],
    'Q3': ['No idea / no opinion', 'Rather not agree', 'Rather agree', 'Rather agree'],
    'Q4': ['Completely disagree', 'Rather not agree', 'Completely disagree', 'Completely agree']
})
df.head()
Here's the sample dataset:
                    Q1                   Q2                    Q3                   Q4
0  Completely disagree  Completely disagree  No idea / no opinion  Completely disagree
1  Completely disagree     Rather not agree      Rather not agree     Rather not agree
2     Rather not agree         Rather agree          Rather agree  Completely disagree
3         Rather agree         Rather agree          Rather agree     Completely agree
CodePudding user response:
Here's a way to do the two things you've asked:
print( "way to see the frequency of values (answers) for multiple columns (questions) in one overview",
df.apply(lambda col: col.value_counts()), "", sep='\n' )
print( "3 questions that have most support, in order of most support",
df.isin(['Completely agree', 'Rather agree']).sum().sort_values(ascending=False).head(3), "", sep='\n' )
Output:
way to see the frequency of values (answers) for multiple columns (questions) in one overview
Q1 Q2 Q3 Q4
Completely agree NaN NaN NaN 1.0
Completely disagree 2.0 1.0 NaN 2.0
No idea / no opinion NaN NaN 1.0 NaN
Rather agree 1.0 2.0 2.0 NaN
Rather not agree 1.0 1.0 1.0 1.0
3 questions that have most support, in order of most support
Q2 2
Q3 2
Q1 1
dtype: int64
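If the NaN cells in that overview are distracting, they can be shown as zero counts instead; this is just a cosmetic variation on the same apply call:
# same overview, with missing question/answer combinations shown as 0 instead of NaN
overview = df.apply(lambda col: col.value_counts()).fillna(0).astype(int)
print(overview)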
CodePudding user response:
For your first question, apply pandas.Series.value_counts:
out1 = df.apply(pd.Series.value_counts)
And for your second question, use pandas.Series.nlargest:
out2 = out1.loc[['Completely agree', 'Rather agree']].sum(axis=0).nlargest(3).index.tolist()
# Output:
print(out1)
Q1 Q2 Q3 Q4
Completely agree NaN NaN NaN 1.0
Completely disagree 2.0 1.0 NaN 2.0
No idea / no opinion NaN NaN 1.0 NaN
Rather agree 1.0 2.0 2.0 NaN
Rather not agree 1.0 1.0 1.0 1.0
print(out2)
['Q2', 'Q3', 'Q1']
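As a side note, the two steps can also be chained into one expression; this is only a compact restatement of the approach above, not a different method:
# count answers per question, keep the two 'agree' rows, sum per question, take the top 3
top3 = (
    df.apply(pd.Series.value_counts)
      .loc[['Completely agree', 'Rather agree']]
      .sum(axis=0)
      .nlargest(3)
)
print(top3.index.tolist())  # ['Q2', 'Q3', 'Q1'] for the sample data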
CodePudding user response:
1. Tidy Data:
I'd start by moving the data over to a tidy format which makes it easier to work with:
df_stacked = df.stack().reset_index().drop(columns="level_0").reset_index(drop=True)
df_stacked.columns = ["question", "answer"]
Output:
question answer
0 Q1 Completely disagree
1 Q2 Completely disagree
2 Q3 No idea / no opinion
3 Q4 Completely disagree
4 Q1 Completely disagree
5 Q2 Rather not agree
6 Q3 Rather not agree
7 Q4 Rather not agree
8 Q1 Rather not agree
9 Q2 Rather agree
10 Q3 Rather agree
11 Q4 Completely disagree
12 Q1 Rather agree
13 Q2 Rather agree
14 Q3 Rather agree
15 Q4 Completely agree
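For what it's worth, df.melt produces the same long format in one step; the row order differs (melt groups rows by question rather than by respondent), but the content is identical. A minimal sketch, using a hypothetical df_long name so the outputs below stay unchanged:
# equivalent long-format reshape; no manual column renaming needed
df_long = df.melt(var_name="question", value_name="answer")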
2. Frequency counts:
Then we can get the frequencies of answers per question like this. The multi-index view in the output provides a useful overview:
df_stacked.groupby(by="question").apply(lambda x: x.answer.value_counts())
Output:
question
Q1 Completely disagree 2
Rather not agree 1
Rather agree 1
Q2 Rather agree 2
Completely disagree 1
Rather not agree 1
Q3 Rather agree 2
No idea / no opinion 1
Rather not agree 1
Q4 Completely disagree 2
Rather not agree 1
Completely agree 1
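The same overview can also be obtained without the lambda by selecting the answer column before counting; value_counts on the grouped Series returns the same multi-indexed counts:
# per-question answer frequencies without apply
df_stacked.groupby("question")["answer"].value_counts()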
3. Ranking:
For ranking the questions based on the strength of the answers, I'd assign a value to each answer:
support_mapper = {
"Completely disagree": -1,
"Rather not agree": -0.5,
"No idea / no opinion": 0,
"Rather agree": 0.5,
"Completely agree": 1,
}
df_stacked = df_stacked.assign(answer_weight=df_stacked.answer.replace(to_replace=support_mapper))
Output:
question answer answer_weight
0 Q1 Completely disagree -1.0
1 Q2 Completely disagree -1.0
2 Q3 No idea / no opinion 0.0
3 Q4 Completely disagree -1.0
4 Q1 Completely disagree -1.0
5 Q2 Rather not agree -0.5
6 Q3 Rather not agree -0.5
7 Q4 Rather not agree -0.5
8 Q1 Rather not agree -0.5
9 Q2 Rather agree 0.5
10 Q3 Rather agree 0.5
11 Q4 Completely disagree -1.0
12 Q1 Rather agree 0.5
13 Q2 Rather agree 0.5
14 Q3 Rather agree 0.5
15 Q4 Completely agree 1.0
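A small aside: Series.map with the same dictionary is a drop-in alternative to replace here, and it has the useful property that any answer missing from support_mapper becomes NaN, which makes typos in the raw data easy to spot:
# same weighting via map; unmapped answers would show up as NaN
df_stacked = df_stacked.assign(answer_weight=df_stacked["answer"].map(support_mapper))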
Now we can compute a mean 'agreement' score for each question, which lets us rank the questions by amount of support (a higher score means more agreement):
ranked_questions = df_stacked[['question', 'answer_weight']].groupby(by=["question"]).mean().sort_values(by="answer_weight", ascending=False)
Output:
answer_weight
question
Q3 0.125
Q2 -0.125
Q4 -0.375
Q1 -0.500
This is quite informative, as we can see at a glance that most of the answers are in disagreement with our questions (negative scores). This new value could also be helpful in, for example, a heatmap visualisation to give an intuitive overview of the responses, as sketched below.
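For instance, a question-by-answer counts matrix for such a heatmap could be built with pd.crosstab; the heatmap_data name is just illustrative, and the seaborn call is only indicative since no plotting library is used elsewhere in this thread:
# counts of each answer per question, suitable as heatmap input
heatmap_data = pd.crosstab(df_stacked["question"], df_stacked["answer"])
# e.g. with seaborn, if available:
# import seaborn as sns
# sns.heatmap(heatmap_data, annot=True)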
Interestingly, abokey's answer above produces a different ranking. By assigning a weight to each answer we allow the strength of the answers to be summed, so the negative responses cancel out some of the positive responses for the same question.
It's probably obvious how to get the top three now, but for the sake of completeness, slice the first three rows of the DataFrame with iloc:
ranked_questions.iloc[:3]
Output:
answer_weight
question
Q3 0.125
Q2 -0.125
Q4 -0.375
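Equivalently, DataFrame.nlargest picks the top three directly and doesn't rely on the frame already being sorted:
# top three questions by mean answer_weight
ranked_questions.nlargest(3, "answer_weight")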
Conclusion:
Thinking about the layout of our data before we start processing, and checking that each column is a single variable and each row is a single observation, makes the data much easier to work with.