Home > Net >  Pandas data analysis: finding frequency of responses
Pandas data analysis: finding frequency of responses

Time:08-01

My data looks like this:

question answer
Q1 1
Q1 1
Q1 1
Q1 2
Q1 1
Q1 3
Q2 3
Q2 2
Q2 4
Q2 1
Q2 4
Q2 1
Q3 4
Q3 2
Q3 4
Q3 3
Q3 2
Q3 5

Every question is repeated six times and answer can only take values between 1-5. Here is the data in a csv format:

question,answer
Q1,1
Q1,1
Q1,1
Q1,2
Q1,1
Q1,3
Q2,3
Q2,2
Q2,4
Q2,1
Q2,4
Q2,1
Q3,4
Q3,2
Q3,4
Q3,3
Q3,2
Q3,5

What I'm trying to do is find the number of answers (1-5) given to each question and output this information to arrays. For example, the result for the data above should be:

[4, 1, 1, 0, 0] # which means for question 1, answer 1 was chosen 4 times, answer 2 was chosen 1 time, answer 3 was chosen 1 time, answers 4 and 5 were chosen 0 times. 
[2, 1, 1, 2, 0] # for question 2, answer 1 was chosen 2 times, answer 2 was chosen 1 time, answer 3 was chosen 1 time, answer 4 was chosen 2 times and answer 5 was chosen 0 times. 
[0, 2, 1, 2, 1] # etc
# the sum of the elements of each array should equal 6

I looked into pd.value_counts but I have no idea how to actually use it here. Can anyone help?

CodePudding user response:

You can use crosstab and convert to_numpy:

pd.crosstab(df['question'], df['answer']).to_numpy()

Output:

array([[4, 1, 1, 0, 0],
       [2, 1, 1, 2, 0],
       [0, 2, 1, 2, 1]])

There is however a pitfall in case an answer was never chosen in any of the questions. To ensure having all columns you must reindex:

(pd.crosstab(df['question'], df['answer'])
   .reindex(range(1,5 1), axis=1, fill_value=0)
   .to_numpy()
)
  • Related