Pandas: Analyse frequency of values in multiple DataFrame columns


I have a pandas DataFrame with 24 columns (questions) and 207 rows (people responding) from a survey. The column values (answers) come from a fixed set: Completely agree, Rather agree, No idea / no opinion, Rather not agree, Completely disagree.

I’ve tried this to get the frequency of values (answers) for a specific column (question):

df['Q1'].value_counts()

I’ve tried a lot of things to get the frequency of values (answers) for a selection of columns (questions), but none of them work.

Desired outcome:

  • Is there a way to see the frequency of values (answers) for multiple columns (questions) in one overview?
  • I want to know the 3 questions with the most support (i.e. the sum of the frequencies of ‘Completely agree’ and ‘Rather agree’), ordered from most to least support.

Here's the code to create a sample dataset:

import pandas as pd
df = pd.DataFrame({
'Q1':['Completely disagree','Completely disagree','Rather not agree','Rather agree'],
'Q2':['Completely disagree','Rather not agree','Rather agree','Rather agree'],
'Q3':['No idea / no opinion','Rather not agree','Rather agree','Rather agree'],
'Q4':['Completely disagree','Rather not agree','Completely disagree','Completely agree']
})
df.head()

Here's the sample dataset:

                    Q1                   Q2                    Q3                   Q4
0  Completely disagree  Completely disagree  No idea / no opinion  Completely disagree
1  Completely disagree     Rather not agree      Rather not agree     Rather not agree
2     Rather not agree         Rather agree          Rather agree  Completely disagree
3         Rather agree         Rather agree          Rather agree     Completely agree

CodePudding user response:

Here's a way to do the two things you've asked:

print( "way to see the frequency of values (answers) for multiple columns (questions) in one overview",
    df.apply(lambda col: col.value_counts()), "", sep='\n' )
print( "3 questions that have most support, in order of most support", 
    df.isin(['Completely agree', 'Rather agree']).sum().sort_values(ascending=False).head(3), "", sep='\n' )

Output:

way to see the frequency of values (answers) for multiple columns (questions) in one overview
                       Q1   Q2   Q3   Q4
Completely agree      NaN  NaN  NaN  1.0
Completely disagree   2.0  1.0  NaN  2.0
No idea / no opinion  NaN  NaN  1.0  NaN
Rather agree          1.0  2.0  2.0  NaN
Rather not agree      1.0  1.0  1.0  1.0

3 questions that have most support, in order of most support
Q2    2
Q3    2
Q1    1
dtype: int64
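
If the NaN entries in that overview are distracting, one optional tweak (not part of the original answer) is to fill them with zeros and cast back to integers, so every question-answer combination shows a plain count:

overview = df.apply(lambda col: col.value_counts()).fillna(0).astype(int)  # missing combinations become 0
print(overview)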

CodePudding user response:

For your first question, apply pandas.Series.value_counts to every column:

out1 = df.apply(pd.Series.value_counts)

And for your second question, sum the two 'agree' rows and use pandas.Series.nlargest:

out2 = out1.loc[['Completely agree', 'Rather agree']].sum(axis=0).nlargest(3).index.tolist()

# Outputs:

print(out1)
                      Q1  Q2  Q3  Q4
Completely agree     NaN NaN NaN 1.0
Completely disagree  2.0 1.0 NaN 2.0
No idea / no opinion NaN NaN 1.0 NaN
Rather agree         1.0 2.0 2.0 NaN
Rather not agree     1.0 1.0 1.0 1.0

print(out2)

['Q2', 'Q3', 'Q1']
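
If you also want to keep the support counts next to the question labels (an optional variation on the line above, not part of the original answer), stop before taking .index:

support = out1.loc[['Completely agree', 'Rather agree']].sum(axis=0).nlargest(3)  # Series mapping question -> support count
print(support)  # same questions as out2, but with their support counts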

CodePudding user response:

1. Tidy Data:

I'd start by moving the data over to a tidy format, which makes it easier to work with:

# Stack the wide table into one (question, answer) pair per row, dropping the original row index
df_stacked = df.stack().reset_index().drop(columns="level_0").reset_index(drop=True)
df_stacked.columns = ["question", "answer"]

Output:

   question                answer
0        Q1   Completely disagree
1        Q2   Completely disagree
2        Q3  No idea / no opinion
3        Q4   Completely disagree
4        Q1   Completely disagree
5        Q2      Rather not agree
6        Q3      Rather not agree
7        Q4      Rather not agree
8        Q1      Rather not agree
9        Q2          Rather agree
10       Q3          Rather agree
11       Q4   Completely disagree
12       Q1          Rather agree
13       Q2          Rather agree
14       Q3          Rather agree
15       Q4      Completely agree
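
As an aside (not part of the original answer), pandas.DataFrame.melt reaches the same two-column tidy shape in a single call; it unpivots column by column, so the row order differs from the stacked version above:

df_tidy = df.melt(var_name="question", value_name="answer")  # same columns, ordered Q1 rows first, then Q2, ...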

2. Frequency counts:

Then we can get the frequencies of answers per question like this. The multi-index view in the output provides a useful overview:

df_stacked.groupby(by="question").apply(lambda x: x.answer.value_counts())

Output:

question                      
Q1        Completely disagree     2
          Rather not agree        1
          Rather agree            1
Q2        Rather agree            2
          Completely disagree     1
          Rather not agree        1
Q3        Rather agree            2
          No idea / no opinion    1
          Rather not agree        1
Q4        Completely disagree     2
          Rather not agree        1
          Completely agree        1
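
From the tidy frame you can also build a wide question-by-answer overview, similar to the table in the first answer but with zeros instead of NaN, using pandas.crosstab (an extra suggestion, not part of the original answer):

print(pd.crosstab(df_stacked["question"], df_stacked["answer"]))  # rows: questions, columns: answers, cells: counts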

3. Ranking

For ranking the questions based on the strength of the answers, I'd assign a value to each answer:

support_mapper = {
    "Completely disagree": -1,
    "Rather not agree": -0.5,
    "No idea / no opinion": 0,
    "Rather agree": 0.5,
    "Completely agree": 1,
}
df_stacked = df_stacked.assign(answer_weight=df_stacked.answer.replace(to_replace=support_mapper))

Output:

   question                answer  answer_weight
0        Q1   Completely disagree           -1.0
1        Q2   Completely disagree           -1.0
2        Q3  No idea / no opinion            0.0
3        Q4   Completely disagree           -1.0
4        Q1   Completely disagree           -1.0
5        Q2      Rather not agree           -0.5
6        Q3      Rather not agree           -0.5
7        Q4      Rather not agree           -0.5
8        Q1      Rather not agree           -0.5
9        Q2          Rather agree            0.5
10       Q3          Rather agree            0.5
11       Q4   Completely disagree           -1.0
12       Q1          Rather agree            0.5
13       Q2          Rather agree            0.5
14       Q3          Rather agree            0.5
15       Q4      Completely agree            1.0
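
A note on the replace call above (an optional alternative, not part of the original answer): because support_mapper covers every possible answer, Series.map produces the same weights and reads as a plain lookup table; just be aware that map turns any unmapped value into NaN, whereas replace leaves it unchanged:

df_stacked.answer.map(support_mapper)  # same values as the answer_weight column built above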

Now we can compute a mean 'agreement' score for each question, which lets us rank the questions by amount of support, with a higher score meaning more agreement:

ranked_questions = df_stacked[['question', 'answer_weight']].groupby(by=["question"]).mean().sort_values(by="answer_weight", ascending=False)

Output:

              answer_weight
question               
Q3                0.125
Q2               -0.125
Q4               -0.375
Q1               -0.500

This is quite informative, as we can see at a glance that most of the answers are in disagreement with our questions (negative scores). This new value could also be helpful in, for example, a heatmap visualisation to give an intuitive overview of the responses.

Interestingly, abokey's answer above arrives at a different ranking. By assigning a weight to each answer we allow the strength of the answers to be summed, so negative responses cancel out some of the positive responses to the same question.
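
To see where the difference comes from, the raw support counts that drive abokey's ranking can be recomputed from the tidy frame; this comparison snippet is our own addition, not part of either original answer:

raw_support = (
    df_stacked[df_stacked.answer.isin(["Completely agree", "Rather agree"])]  # keep only supportive answers
    .groupby("question")
    .size()                                                                   # supportive answers per question
    .sort_values(ascending=False)
)
print(raw_support)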

It's probably obvious how to get the top three now, but for the sake of completeness, slice the first three rows of the DataFrame with iloc:

ranked_questions.iloc[:3]

Output:

      answer_weight
question               
Q3                0.125
Q2               -0.125
Q4               -0.375

Conclusion:

Before we start processing, it is worth considering the layout of the data: checking that each column is a single variable and each row is a single observation makes the data much easier to work with.
