Home > database >  How to summarise dataframe by way of majority votes of a column
How to summarise dataframe by way of majority votes of a column

Time:11-18

This is a really tricky statistics that I want to produce. My dataframe contains information about true classes and prediction results of a machine learning model, for trips and corresponding trips' segments. The problem can best be explained with example, so I give the following example df:

df = pd.DataFrame(
    {'trip': [25, 25, 25, 25, 25, 25, 25, 25, 25, 54, 54, 54, 54,73,73,73,75,75], 
     'segment': [0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2,0,0,1,1,3], 
     'class': [3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1,2,2,2,1,1], 
     'prediction': [0, 0, 3, 3, 3, 4, 4, 2, 2, 0, 0, 1, 1,4,2,4,0,2]
     }
)

df
   trip segment class prediction
0   25    0      3      0
1   25    0      3      0
2   25    0      3      3
3   25    0      3      3
4   25    0      3      3
5   25    1      3      4
6   25    1      3      4
7   25    1      3      2
8   25    1      3      2
9   54    2      1      0
10  54    2      1      0
11  54    2      1      1
12  54    2      1      1
13  73    0      2      4
14  73    0      2      2
15  73    1      2      4
16  75    1      1      0
17  75    3      1      2

From the given df, I would like to produce statistics of model's predictions at trip and segment levels, using prediction's majority votes, considering the actual class a trip or segment belongs to.

Segment's statistics

So considering the above df, I would like to produce the below segment's statistics (explanation given below):

class total-segments  correctly-predicted  accuracy-rate
  0       -                  -                  -
  1       3                  1                 0.33
  2       2                  1                 0.5
  3       2                  1                 0.5
  4       -                  -                  -
  • no segment of class 0, so the dash.
  • there are 3 distinct segments of class type 1(segment 2 of trip 54 and segments 1 & 3 of trip 75). Of all the 3, only one (segment 2 of trip 54 has majority votes of its prediction correct, so 1 correctly-predicted and 0.33 (i.e. 1/3) accuracy-rate.
  • there're 2 segments belonging to class type 2 ( segments 0& 1 of trip 73). Segment 0 has majority votes correct, so 1 correctly-predicted and 0.5 (i.e. 1/2) accuracy-rate.
  • there're 2 segments of class 3 (segments 0 & 1 of trip 25). Segment 0 has majority votes correct, so 1 correctly-predicted and 0.5 (i.e. 1/5) accuracy-rate.
  • no segment of class type 4.

Trip-level statistics

Similarly, considering the class type of distinct trips in df and their prediction, I want to produce the following trip-level statistics (also explained below):

class total-trips  correctly-predicted  accuracy-rate
  0       -                  -                  -
  1       2                  1                 0.5
  2       1                  0                 0.0
  3       1                  1                 1.0
  4       -                  -                  - 
  • no trip belongs to class 0.
  • 2 trips of class type 1(trip 54 & 75). 1 trip was predicted correct (majority votes of trip 54), so 1 correctly-predicted trip, and 0.5 accuracy-rate.
  • 1 trip of class 2 (trip 73). Its majority votes prediction is incorrect, so 0 correctly-predicted trip, and 0.0 accuracy-rate.
  • 1 trip of class 3 (trip 25). Its majority votes prediction is correct (3), so 1 correctly-predicted trip, and 1.0 accuracy-rate.
  • no trip of class 4.

Please forgive the long grammar, but this is a problem that one can understand only when well-explained.

CodePudding user response:

You can do it this way. you can comment all but the first line and then uncomment one by one to see what is happening with the command line.

res_seg = (
    df['class'].eq(df['prediction'])
      .groupby([df['class'],df['segment']]).mean()
      .ge(0.5)
      .groupby(level='class').agg(['size','sum'])
      .rename(columns={'size':'total_segments','sum':'correctly_predicted'})\
      .assign(accuracy_rate = lambda x: x['correctly_predicted']/x['total_segments'])
      .reindex(range(5), fill_value='-')
      .reset_index()
)
print(res_seg)
#    class total_segments correctly_predicted accuracy_rate
# 0      0              -                   -             -
# 1      1              3                   1      0.333333
# 2      2              2                   1           0.5
# 3      3              2                   1           0.5
# 4      4              -                   -             -

and similar for the trips, you would have to change the df['segment'] to df['trip'] in the groupby and maybe the name of the columns in the rename as well as the assign

  • Related