Home > Enterprise >  Avoiding writing a dataframe with a large number of columns
Avoiding writing a dataframe with a large number of columns

Time:10-19

I have a dataframe that looks like this:

student     school      class      answer      question
a           scl         first      True        x
a           scl         first      False       y
a           scl         first      True        y
b           scl         first      False       x
c           scl         sec        False       y
c           scl         sec        True        z
d           scl         sec        True        x
d           scl         sec        True        z
e           scl         third      True        z
e           scl         third      False       z

Note that it is possible to answer a question multiple times. Note also that not everyone may answer the same set of questions. I want to see which class performed better per question. So for each question, a ranking of the classes, one time when I consider only the first answer of a student, and one time overall.

What I did so far is just a ranking of the classes independent of what question was answered:

#only the first answer is considered
df1 = df.drop_duplicates(subset=["student", "scl", "class", "question"], keep="first")
(df1.groupby(['school', 'class'])
   ['answer'].mean()
   .rename('ClassRanking')
   .sort_values(ascending=False)  
   .reset_index()
    )

#all the answers are considered
(df.groupby(['school', 'class'])
       ['answer'].mean()
       .rename('ClassRanking')
       .sort_values(ascending=False)  
       .reset_index()
    )

So I do indeed have a ranking of the classes. But I don't know how to compare these classes judging by each question, because I wouldn't create a dataframe with 50 columns when I have 50 classes.

Edit:

I would imagine a dataframe like this, but this is a bit ugly when I have 50 classes:

df_all=
question     class_first_res     class_sec_res    class_third_res
x                        0.5                 1               None
y                        0.5                 0               None
z                       None                 1                0.5


df_first_attempt=
question     class_first_res     class_sec_res    class_third_res
      x                  0.5                 1               None
      y                    0                 0               None
      z                 None                 1                  1
    

CodePudding user response:

If I understood you correctly:

df_first = df.drop_duplicates(subset=['student', 'class', 'question'], keep='first').groupby(['class', 'question'])['answer'].apply(lambda x: x.sum()/len(x)).reset_index()
df_first = df_first.sort_values(by=['question']).rename(columns={'answer': 'ClassRanking'})
df_first = df_first.pivot_table(index='question', columns='class', values='ClassRanking').reset_index().rename_axis(None, axis=1)


df_overall = df.groupby(['class', 'question'])['answer'].apply(lambda x: x.sum()/len(x)).reset_index()
df_overall = df_overall.sort_values(by=['question']).rename(columns={'answer': 'ClassRanking'})
df_overall = df_overall.pivot_table(index='question', columns='class', values='ClassRanking').reset_index().rename_axis(None, axis=1)

df_first:

  question  first  sec  third
0        x    0.5  1.0    NaN
1        y    0.0  0.0    NaN
2        z    NaN  1.0    1.0

df_overall:

  question  first  sec  third
0        x    0.5  1.0    NaN
1        y    0.5  0.0    NaN
2        z    NaN  1.0    0.5

CodePudding user response:

You could try this.

pd.pivot_table(df, index="class", columns="question", values="answer")

It is similar to your examples, but rows instead of columns, but the content is the same.

On the other hand, if you would want a ranking of all the classes based on the average success on the questions, you could do this here:

pd.pivot_table(df, index="question", columns="class", values="answer").mean()
  • Related