I have a dataframe that looks like this:
student school class answer question
a scl first True x
a scl first False y
a scl first True y
b scl first False x
c scl sec False y
c scl sec True z
d scl sec True x
d scl sec True z
e scl third True z
e scl third False z
Note that it is possible to answer a question multiple times. Note also that not everyone may answer the same set of questions. I want to see which class performed better per question. So for each question, a ranking of the classes, one time when I consider only the first answer of a student, and one time overall.
What I did so far is just a ranking of the classes independent of what question was answered:
#only the first answer is considered
df1 = df.drop_duplicates(subset=["student", "scl", "class", "question"], keep="first")
(df1.groupby(['school', 'class'])
['answer'].mean()
.rename('ClassRanking')
.sort_values(ascending=False)
.reset_index()
)
#all the answers are considered
(df.groupby(['school', 'class'])
['answer'].mean()
.rename('ClassRanking')
.sort_values(ascending=False)
.reset_index()
)
So I do indeed have a ranking of the classes. But I don't know how to compare these classes judging by each question, because I wouldn't create a dataframe with 50 columns when I have 50 classes.
Edit:
I would imagine a dataframe like this, but this is a bit ugly when I have 50 classes:
df_all=
question class_first_res class_sec_res class_third_res
x 0.5 1 None
y 0.5 0 None
z None 1 0.5
df_first_attempt=
question class_first_res class_sec_res class_third_res
x 0.5 1 None
y 0 0 None
z None 1 1
CodePudding user response:
If I understood you correctly:
df_first = df.drop_duplicates(subset=['student', 'class', 'question'], keep='first').groupby(['class', 'question'])['answer'].apply(lambda x: x.sum()/len(x)).reset_index()
df_first = df_first.sort_values(by=['question']).rename(columns={'answer': 'ClassRanking'})
df_first = df_first.pivot_table(index='question', columns='class', values='ClassRanking').reset_index().rename_axis(None, axis=1)
df_overall = df.groupby(['class', 'question'])['answer'].apply(lambda x: x.sum()/len(x)).reset_index()
df_overall = df_overall.sort_values(by=['question']).rename(columns={'answer': 'ClassRanking'})
df_overall = df_overall.pivot_table(index='question', columns='class', values='ClassRanking').reset_index().rename_axis(None, axis=1)
df_first:
question first sec third
0 x 0.5 1.0 NaN
1 y 0.0 0.0 NaN
2 z NaN 1.0 1.0
df_overall:
question first sec third
0 x 0.5 1.0 NaN
1 y 0.5 0.0 NaN
2 z NaN 1.0 0.5
CodePudding user response:
You could try this.
pd.pivot_table(df, index="class", columns="question", values="answer")
It is similar to your examples, but rows instead of columns, but the content is the same.
On the other hand, if you would want a ranking of all the classes based on the average success on the questions, you could do this here:
pd.pivot_table(df, index="question", columns="class", values="answer").mean()