This is a really tricky statistics that I want to produce. My dataframe
contains information about true classes and prediction results of a machine learning model, for trips and corresponding trips' segments. The problem can best be explained with example, so I give the following example df
:
df = pd.DataFrame(
{'trip': [25, 25, 25, 25, 25, 25, 25, 25, 25, 54, 54, 54, 54,73,73,73,75,75],
'segment': [0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2,0,0,1,1,3],
'class': [3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1,2,2,2,1,1],
'prediction': [0, 0, 3, 3, 3, 4, 4, 2, 2, 0, 0, 1, 1,4,2,4,0,2]
}
)
df
trip segment class prediction
0 25 0 3 0
1 25 0 3 0
2 25 0 3 3
3 25 0 3 3
4 25 0 3 3
5 25 1 3 4
6 25 1 3 4
7 25 1 3 2
8 25 1 3 2
9 54 2 1 0
10 54 2 1 0
11 54 2 1 1
12 54 2 1 1
13 73 0 2 4
14 73 0 2 2
15 73 1 2 4
16 75 1 1 0
17 75 3 1 2
From the given df
, I would like to produce statistics of model's predictions at trip
and segment
levels, using prediction
's majority votes, considering the actual class
a trip
or segment
belongs to.
Segment's statistics
So considering the above df
, I would like to produce the below segment's statistics (explanation given below):
class total-segments correctly-predicted accuracy-rate
0 - - -
1 3 1 0.33
2 2 1 0.5
3 2 1 0.5
4 - - -
- no segment of
class
0
, so the dash. - there are 3 distinct segments of
class
type1
(segment2
of trip54
and segments1
&3
of trip75
). Of all the 3, only one (segment2
of trip54
has majority votes of itsprediction
correct, so1
correctly-predicted
and0.33
(i.e.1/3
) accuracy-rate. - there're 2 segments belonging to
class
type2
( segments0
&1
of trip73
). Segment0
has majority votes correct, so1
correctly-predicted
and0.5
(i.e.1/2
) accuracy-rate. - there're 2 segments of
class
3
(segments0
&1
of trip25
). Segment0
has majority votes correct, so1
correctly-predicted
and0.5
(i.e.1/5
) accuracy-rate. - no segment of
class
type4
.
Trip-level statistics
Similarly, considering the class
type of distinct trips in df
and their prediction
, I want to produce the following trip-level statistics (also explained below):
class total-trips correctly-predicted accuracy-rate
0 - - -
1 2 1 0.5
2 1 0 0.0
3 1 1 1.0
4 - - -
- no trip belongs to
class
0
. - 2 trips of
class
type1
(trip54
&75
). 1 trip was predicted correct (majority votes of trip54
), so1
correctly-predicted
trip, and0.5
accuracy-rate
. - 1 trip of
class
2
(trip73
). Its majority votes prediction is incorrect, so0
correctly-predicted
trip, and0.0
accuracy-rate
. - 1 trip of
class
3
(trip25
). Its majority votes prediction is correct (3
), so1
correctly-predicted
trip, and1.0
accuracy-rate
. - no trip of
class 4
.
Please forgive the long grammar, but this is a problem that one can understand only when well-explained.
CodePudding user response:
You can do it this way. you can comment all but the first line and then uncomment one by one to see what is happening with the command line.
res_seg = (
df['class'].eq(df['prediction'])
.groupby([df['class'],df['segment']]).mean()
.ge(0.5)
.groupby(level='class').agg(['size','sum'])
.rename(columns={'size':'total_segments','sum':'correctly_predicted'})\
.assign(accuracy_rate = lambda x: x['correctly_predicted']/x['total_segments'])
.reindex(range(5), fill_value='-')
.reset_index()
)
print(res_seg)
# class total_segments correctly_predicted accuracy_rate
# 0 0 - - -
# 1 1 3 1 0.333333
# 2 2 2 1 0.5
# 3 3 2 1 0.5
# 4 4 - - -
and similar for the trips, you would have to change the df['segment']
to df['trip']
in the groupby and maybe the name of the columns in the rename
as well as the assign