I have a dataframe that looks like this -
id | amount | date | group | master_id |
---|---|---|---|---|
a201 | 100 | 12-10-2022 | a | a201 |
a101 | 70 | 12-10-2022 | a | a201 |
a102 | 90 | 12-10-2022 | a | a201 |
b24 | 150 | 12-10-2022 | b | b24 |
b13 | 120 | 12-10-2022 | b | b24 |
c71 | 10 | 12-10-2022 | c | c71 |
c12 | 100 | 12-10-2022 | c | c71 |
c1 | 12 | 12-10-2022 | c | c71 |
I need to compare the records within each group (a, b, c here) and generate a similarity score. The value to be checked is under master_id column (i.e. for group a, a201 is to be checked with a101 and a102). The attributes to be checked is the amount and date. If date is within 5 days, score is 1 and if the amount is within a difference of 10, the score is 1 in all other cases, score is 0. The similarity score will be the sum of the 2 individual scores.
Expected output
id | amount | date | group | master_id | amount_score | date_score | similarity_score |
---|---|---|---|---|---|---|---|
a201 | 100 | 12-10-2022 | a | a201 | 1 | 1 | 2 |
a101 | 70 | 11-10-2022 | a | a201 | 0 | 1 | 1 |
a102 | 90 | 12-10-2022 | a | a201 | 0 | 0 | 0 |
b24 | 150 | 12-10-2022 | b | b24 | 1 | 1 | 2 |
b13 | 145 | 25-10-2022 | b | b24 | 1 | 0 | 1 |
c71 | 10 | 12-10-2022 | c | c71 | 1 | 1 | 2 |
c12 | 100 | 30-10-2022 | c | c71 | 0 | 0 | 0 |
c1 | 12 | 11-10-2022 | c | c71 | 1 | 1 | 2 |
What is the best way to do this operation?
Thanks in advance.
CodePudding user response:
scores = (df[["amount", "date"]]
.sub(df.groupby("group")[["amount", "date"]].transform("first"))
.abs()
.lt([10, pd.Timedelta("5 days")])
.astype(int)
.add_suffix("_score"))
scores["similarity_score"] = scores.sum(axis="columns")
new_df = df.join(scores)
- take the difference of each amount & date values from that of each group's first values
- take the absolute value of differences and compare against thresholds
- True/False is obtained; map them to 1/0 via type casting
- add the "score" suffix to the new columns
- compute the total scores and join with the original dataframe
to get
>>> new_df
id amount date group master_id amount_score date_score similarity_score
0 a201 100 2022-12-10 a a201 1 1 2
1 a101 70 2022-12-10 a a201 0 1 1
2 a102 90 2022-12-10 a a201 0 1 1
3 b24 150 2022-12-10 b b24 1 1 2
4 b13 120 2022-12-10 b b24 0 1 1
5 c71 10 2022-12-10 c c71 1 1 2
6 c12 100 2022-12-10 c c71 0 1 1
7 c1 12 2022-12-10 c c71 1 1 2