My data looks like this:
question | answer |
---|---|
Q1 | 1 |
Q1 | 1 |
Q1 | 1 |
Q1 | 2 |
Q1 | 1 |
Q1 | 3 |
Q2 | 3 |
Q2 | 2 |
Q2 | 4 |
Q2 | 1 |
Q2 | 4 |
Q2 | 1 |
Q3 | 4 |
Q3 | 2 |
Q3 | 4 |
Q3 | 3 |
Q3 | 2 |
Q3 | 5 |
Every question is repeated six times and answer
can only take values between 1-5. Here is the data in a csv format:
question,answer
Q1,1
Q1,1
Q1,1
Q1,2
Q1,1
Q1,3
Q2,3
Q2,2
Q2,4
Q2,1
Q2,4
Q2,1
Q3,4
Q3,2
Q3,4
Q3,3
Q3,2
Q3,5
What I'm trying to do is find the number of answers (1-5) given to each question and output this information to arrays. For example, the result for the data above should be:
[4, 1, 1, 0, 0] # which means for question 1, answer 1 was chosen 4 times, answer 2 was chosen 1 time, answer 3 was chosen 1 time, answers 4 and 5 were chosen 0 times.
[2, 1, 1, 2, 0] # for question 2, answer 1 was chosen 2 times, answer 2 was chosen 1 time, answer 3 was chosen 1 time, answer 4 was chosen 2 times and answer 5 was chosen 0 times.
[0, 2, 1, 2, 1] # etc
# the sum of the elements of each array should equal 6
I looked into pd.value_counts
but I have no idea how to actually use it here. Can anyone help?
CodePudding user response:
You can use crosstab
and convert to_numpy
:
pd.crosstab(df['question'], df['answer']).to_numpy()
Output:
array([[4, 1, 1, 0, 0],
[2, 1, 1, 2, 0],
[0, 2, 1, 2, 1]])
There is however a pitfall in case an answer was never chosen in any of the questions. To ensure having all columns you must reindex
:
(pd.crosstab(df['question'], df['answer'])
.reindex(range(1,5 1), axis=1, fill_value=0)
.to_numpy()
)