Percentage of unique same values in two columns without order pandas


I have a dataframe with the following data:

agent_id  ts  pred  gt
0         0   0     0
0         1   0     0
0         2   0     1
0         3   1     0
1         0   0     0
1         1   1     0
1         2   2     1
1         3   3     0

agent_id and ts are indices and pred and gt are columns.
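
For reference, here is a minimal sketch of how the sample frame above can be built (the exact construction is just an illustration; the values are taken from the table):

import pandas as pd

# Sketch: sample data from the table, with agent_id and ts as index levels
df = pd.DataFrame({
    'agent_id': [0, 0, 0, 0, 1, 1, 1, 1],
    'ts':       [0, 1, 2, 3, 0, 1, 2, 3],
    'pred':     [0, 0, 0, 1, 0, 1, 2, 3],
    'gt':       [0, 0, 1, 0, 0, 0, 1, 0],
}).set_index(['agent_id', 'ts'])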

Now I want to:

  • Group by agent_id
  • Get the percentage of unique values shared by pred and gt, without caring about order

I've already implemented a similar metric, where the order matters:

grouped_df.apply(lambda df: df['gt'].eq(df['pred']).mean()).to_dict()
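
(For the sketch below, grouped_df is taken to be df.groupby('agent_id'). Bracket access is used because df.gt resolves to the DataFrame.gt comparison method rather than the gt column. On the sample data this gives:)

grouped_df = df.groupby('agent_id')   # assumption: grouping by agent_id as above
grouped_df.apply(lambda g: g['gt'].eq(g['pred']).mean()).to_dict()
# -> {0: 0.5, 1: 0.25} on the sample data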

and I've also implemented the metric that I want, assuming pred and gt were plain lists without any grouping:

unordered_matches = len(set(pred) & set(gt)) / len(set(pred) | set(gt))
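
For example, with agent 1's values from the table above:

pred = [0, 1, 2, 3]   # agent 1's pred column
gt = [0, 0, 1, 0]     # agent 1's gt column
unordered_matches = len(set(pred) & set(gt)) / len(set(pred) | set(gt))
# set(pred) is {0, 1, 2, 3}, set(gt) is {0, 1}; the intersection has 2 elements,
# the union has 4 -> 0.5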

How can I achieve this now with the grouping in pandas (and converting to dict ideally)?

Just for better understanding, here is how the results would look for the sample data above:

agent 0:

  • set(pred) -> {0, 1}; set(gt) -> {0, 1}
  • unordered_matches = 1 (100%)

agent 1:

  • set(pred) -> {0, 1, 2, 3}; set(gt) -> {0, 1}
  • unordered_matches = 0.5 (50%)

I would be interested in a problem-specific pandas solution as well as a more generic approach for translating my Python formulas so that they work with pandas grouping.

CodePudding user response:

Using set operations and groupby.apply:

(df.groupby('agent_id')
   .apply(lambda x: len((S1:=set(x['pred'])) & (S2:=set(x['gt'])))/len(S1|S2))
)

output:

agent_id
0    1.0
1    0.5
dtype: float64

As a dictionary, add .to_dict(): {0: 1.0, 1: 0.5}
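
For the more generic translation you asked about: wrap the plain-Python formula in a function that takes one group's sub-frame and returns a scalar, then pass it to groupby.apply. A minimal sketch (the function name is just illustrative):

def unordered_matches(g):
    # g is the sub-frame for one agent_id; same set formula as in the question
    s_pred, s_gt = set(g['pred']), set(g['gt'])
    return len(s_pred & s_gt) / len(s_pred | s_gt)

df.groupby('agent_id').apply(unordered_matches).to_dict()
# {0: 1.0, 1: 0.5}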
