How to summarise dataframe by way of majority votes of a column-CodePudding

This is a really tricky statistics that I want to produce. My dataframe contains information about true classes and prediction results of a machine learning model, for trips and corresponding trips' segments. The problem can best be explained with example, so I give the following example df:

df = pd.DataFrame(
    {'trip': [25, 25, 25, 25, 25, 25, 25, 25, 25, 54, 54, 54, 54,73,73,73,75,75], 
     'segment': [0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2,0,0,1,1,3], 
     'class': [3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1,2,2,2,1,1], 
     'prediction': [0, 0, 3, 3, 3, 4, 4, 2, 2, 0, 0, 1, 1,4,2,4,0,2]
     }
)

df
   trip segment class prediction
0   25    0      3      0
1   25    0      3      0
2   25    0      3      3
3   25    0      3      3
4   25    0      3      3
5   25    1      3      4
6   25    1      3      4
7   25    1      3      2
8   25    1      3      2
9   54    2      1      0
10  54    2      1      0
11  54    2      1      1
12  54    2      1      1
13  73    0      2      4
14  73    0      2      2
15  73    1      2      4
16  75    1      1      0
17  75    3      1      2

From the given df, I would like to produce statistics of model's predictions at trip and segment levels, using prediction's majority votes, considering the actual class a trip or segment belongs to.

Segment's statistics

So considering the above df, I would like to produce the below segment's statistics (explanation given below):

class total-segments  correctly-predicted  accuracy-rate
  0       -                  -                  -
  1       3                  1                 0.33
  2       2                  1                 0.5
  3       2                  1                 0.5
  4       -                  -                  -

no segment of class 0, so the dash.
there are 3 distinct segments of class type 1(segment 2 of trip 54 and segments 1 & 3 of trip 75). Of all the 3, only one (segment 2 of trip 54 has majority votes of its prediction correct, so 1 correctly-predicted and 0.33 (i.e. 1/3) accuracy-rate.
there're 2 segments belonging to class type 2 ( segments 0& 1 of trip 73). Segment 0 has majority votes correct, so 1 correctly-predicted and 0.5 (i.e. 1/2) accuracy-rate.
there're 2 segments of class 3 (segments 0 & 1 of trip 25). Segment 0 has majority votes correct, so 1 correctly-predicted and 0.5 (i.e. 1/5) accuracy-rate.
no segment of class type 4.

Trip-level statistics

Similarly, considering the class type of distinct trips in df and their prediction, I want to produce the following trip-level statistics (also explained below):

class total-trips  correctly-predicted  accuracy-rate
  0       -                  -                  -
  1       2                  1                 0.5
  2       1                  0                 0.0
  3       1                  1                 1.0
  4       -                  -                  -

no trip belongs to class 0.
2 trips of class type 1(trip 54 & 75). 1 trip was predicted correct (majority votes of trip 54), so 1 correctly-predicted trip, and 0.5 accuracy-rate.
1 trip of class 2 (trip 73). Its majority votes prediction is incorrect, so 0 correctly-predicted trip, and 0.0 accuracy-rate.
1 trip of class 3 (trip 25). Its majority votes prediction is correct (3), so 1 correctly-predicted trip, and 1.0 accuracy-rate.
no trip of class 4.

Please forgive the long grammar, but this is a problem that one can understand only when well-explained.

CodePudding user response：

You can do it this way. you can comment all but the first line and then uncomment one by one to see what is happening with the command line.

res_seg = (
    df['class'].eq(df['prediction'])
      .groupby([df['class'],df['segment']]).mean()
      .ge(0.5)
      .groupby(level='class').agg(['size','sum'])
      .rename(columns={'size':'total_segments','sum':'correctly_predicted'})\
      .assign(accuracy_rate = lambda x: x['correctly_predicted']/x['total_segments'])
      .reindex(range(5), fill_value='-')
      .reset_index()
)
print(res_seg)
#    class total_segments correctly_predicted accuracy_rate
# 0      0              -                   -             -
# 1      1              3                   1      0.333333
# 2      2              2                   1           0.5
# 3      3              2                   1           0.5
# 4      4              -                   -             -

and similar for the trips, you would have to change the df['segment'] to df['trip'] in the groupby and maybe the name of the columns in the rename as well as the assign