I have the following CSV data:
question,answer
m2020_s,3
m2020_s,3
m2020_s,3
m2020_s,3
m2020_s,3
m2020_s,3
a2020_k,1
a2020_k,2
a2020_k,1
a2020_k,4
a2020_k,1
a2020_k,1
d2015_a,5
d2015_a,4
d2015_a,4
d2015_a,4
d2015_a,4
d2015_a,4
I'm using pd.crosstab to count the number of times each answer was given, but the function is messing with the order of my data. Here is my code:
import pandas as pd
df = pd.read_csv('example.csv')
output_array = pd.crosstab(df['question'], df['answer']).to_numpy()
print(output_array)
Expected result:
[[0 0 6 0 0]
 [4 1 0 1 0]
 [0 0 0 5 1]]
Actual result:
[[4 1 0 1 0]
 [0 0 0 5 1]
 [0 0 6 0 0]]
Why is this happening? And how can I preserve the data's order?
CodePudding user response:
Could you try reindexing the crosstab result by the questions in their original order:
pd.crosstab(df['question'], df['answer']).reindex(df['question'].unique()).to_numpy()
Output:
array([[0, 0, 6, 0, 0],
       [4, 1, 0, 1, 0],
       [0, 0, 0, 5, 1]], dtype=int64)
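Explanation: pd.crosstab sorts the row labels, so the rows come back in alphabetical order (a2020_k, d2015_a, m2020_s) rather than the order in the file. df['question'].unique() returns the labels in order of first occurrence, and reindex rearranges the rows to match that order.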
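For anyone who wants to check this without the file, here is a minimal self-contained sketch. It assumes the data from the question is pasted inline and read through io.StringIO instead of example.csv; the names csv_text, counts and ordered are only illustrative.

```python
import io

import pandas as pd

# Same rows as in the question, inlined so the snippet runs without example.csv.
csv_text = """question,answer
m2020_s,3
m2020_s,3
m2020_s,3
m2020_s,3
m2020_s,3
m2020_s,3
a2020_k,1
a2020_k,2
a2020_k,1
a2020_k,4
a2020_k,1
a2020_k,1
d2015_a,5
d2015_a,4
d2015_a,4
d2015_a,4
d2015_a,4
d2015_a,4
"""
df = pd.read_csv(io.StringIO(csv_text))

# crosstab returns its rows sorted by label, not in file order.
counts = pd.crosstab(df['question'], df['answer'])
print(list(counts.index))

# unique() keeps first-appearance order, so reindex restores the file order.
ordered = counts.reindex(df['question'].unique())
output_array = ordered.to_numpy()
print(output_array)
```

The columns (answers 1 to 5) are left untouched here; only the rows are put back in the order they appear in the data.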