Home > Mobile >  Compare records within each group in python dataframe
Compare records within each group in python dataframe

Time:01-12

I have a dataframe that looks like this -

id amount date group master_id
a201 100 12-10-2022 a a201
a101 70 12-10-2022 a a201
a102 90 12-10-2022 a a201
b24 150 12-10-2022 b b24
b13 120 12-10-2022 b b24
c71 10 12-10-2022 c c71
c12 100 12-10-2022 c c71
c1 12 12-10-2022 c c71

I need to compare the records within each group (a, b, c here) and generate a similarity score. The value to be checked is under master_id column (i.e. for group a, a201 is to be checked with a101 and a102). The attributes to be checked is the amount and date. If date is within 5 days, score is 1 and if the amount is within a difference of 10, the score is 1 in all other cases, score is 0. The similarity score will be the sum of the 2 individual scores.

Expected output

id amount date group master_id amount_score date_score similarity_score
a201 100 12-10-2022 a a201 1 1 2
a101 70 11-10-2022 a a201 0 1 1
a102 90 12-10-2022 a a201 0 0 0
b24 150 12-10-2022 b b24 1 1 2
b13 145 25-10-2022 b b24 1 0 1
c71 10 12-10-2022 c c71 1 1 2
c12 100 30-10-2022 c c71 0 0 0
c1 12 11-10-2022 c c71 1 1 2

What is the best way to do this operation?

Thanks in advance.

CodePudding user response:

scores = (df[["amount", "date"]]
            .sub(df.groupby("group")[["amount", "date"]].transform("first"))
            .abs()
            .lt([10, pd.Timedelta("5 days")])
            .astype(int)
            .add_suffix("_score"))

scores["similarity_score"] = scores.sum(axis="columns")

new_df = df.join(scores)
  • take the difference of each amount & date values from that of each group's first values
  • take the absolute value of differences and compare against thresholds
  • True/False is obtained; map them to 1/0 via type casting
  • add the "score" suffix to the new columns
  • compute the total scores and join with the original dataframe

to get

>>> new_df

     id  amount       date group master_id  amount_score  date_score  similarity_score
0  a201     100 2022-12-10     a      a201             1           1                 2
1  a101      70 2022-12-10     a      a201             0           1                 1
2  a102      90 2022-12-10     a      a201             0           1                 1
3   b24     150 2022-12-10     b       b24             1           1                 2
4   b13     120 2022-12-10     b       b24             0           1                 1
5   c71      10 2022-12-10     c       c71             1           1                 2
6   c12     100 2022-12-10     c       c71             0           1                 1
7    c1      12 2022-12-10     c       c71             1           1                 2
  • Related