I have a dataframe with
agent_id  ts  pred  gt
0         0   0     0
0         1   0     0
0         2   0     1
0         3   1     0
1         0   0     0
1         1   1     0
1         2   2     1
1         3   3     0
agent_id and ts are index levels (a MultiIndex) and pred and gt are columns.
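For reference, the frame can be reconstructed like this (a minimal sketch; the name df matches the answer below):

import pandas as pd

df = pd.DataFrame({
    'agent_id': [0, 0, 0, 0, 1, 1, 1, 1],
    'ts':       [0, 1, 2, 3, 0, 1, 2, 3],
    'pred':     [0, 0, 0, 1, 0, 1, 2, 3],
    'gt':       [0, 0, 1, 0, 0, 0, 1, 0],
}).set_index(['agent_id', 'ts'])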
Now I want to:
- Group by agent_id
- Compute the fraction of unique values shared by both columns, ignoring order (the Jaccard similarity of their value sets)
I've already implemented a similar metric, where the order matters:

grouped_df.apply(lambda df: df['gt'].eq(df['pred']).mean()).to_dict()

(gt needs bracket access here, since df.gt is a built-in DataFrame method.)
and I've also implemented the metric I want, assuming pred and gt were plain lists without any grouping:
unordered_matches = len(set(pred) & set(gt)) / len(set(pred) | set(gt))
How can I achieve this with the grouping in pandas (and ideally convert the result to a dict)?
For a better understanding, here is how the results look for the sample data above:
agent 0:
- set(pred) -> {0, 1}; set(gt) -> {0, 1}
- unordered_matches = 1 (100%)
agent 1:
- set(pred) -> {0, 1, 2, 3}; set(gt) -> {0, 1}
- unordered_matches = 0.5 (50%)
I'd be interested in a problem-specific pandas solution as well as a more general recipe for translating my plain-Python formulas so they work with pandas grouping.
CodePudding user response:
Using set operations and groupby.apply:
# sets are built once per group via the walrus operator (requires Python 3.8+)
(df.groupby('agent_id')
   .apply(lambda x: len((S1 := set(x['pred'])) & (S2 := set(x['gt']))) / len(S1 | S2))
)
Output:

agent_id
0    1.0
1    0.5
dtype: float64
As a dictionary, add .to_dict(): {0: 1.0, 1: 0.5}
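For the more generic pattern the question also asks about: any plain-Python formula over pred and gt can be wrapped in a regular function and handed to groupby.apply, so the formula itself stays unchanged. A minimal sketch (the helper name unordered_matches is just illustrative):

def unordered_matches(pred, gt):
    # Jaccard similarity of the unique values, ignoring order
    s_pred, s_gt = set(pred), set(gt)
    return len(s_pred & s_gt) / len(s_pred | s_gt)

result = (
    df.groupby('agent_id')
      .apply(lambda g: unordered_matches(g['pred'], g['gt']))
      .to_dict()
)
# result == {0: 1.0, 1: 0.5}

This keeps the metric testable on plain lists while reusing it unchanged per group.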