How to give weights to matching dataframe columns?

I have a dataframe with multiple columns and rows. I want to ask a user to enter a row number and output the 3 most similar rows. The ranking strategy I want to use is:

1- Each attribute match has weight 1

2- Weight the match based on the position of the attribute.

Let's say the user entered the row number "30". Rows "12" and "5" both have 5 matching elements, which is the highest match score. Row "23" has 4 matching elements, which is the second-highest match score.

row-30 =[ 2 13  7  1  7 10  1  8  7  1]

row-12 =[11  5  4  1  7 13  1  8  7  4]

row-5 =[ 2 13  7  1 12  5  6  8 15  8]

row-23 =[ 2 10  5  1  3 10  9 10  7  6]

Then I want to calculate the weights based on the position of the matches. The left-most match should get the highest score and the right-most match should get the lowest score.

As a result, the ranking should be 5-12-23.

I'm able to get the correct ranking based on the first requirement using the following code block:

sorted(total_matchs, key=lambda x: x[1], reverse=True)[:3]

where total_matchs is a list of tuples, each consisting of a row number and its match score. But I'm having trouble building the correct algorithm for the second requirement, the position-based match weights.

Can someone help me with the correct algorithm?

CodePudding user response:

You already have match counts, good.

"Then I want to calculate the weights based on the position of the matches. The left-most match should get the highest score and the right-most match should get the lowest score."

We want to sort on two keys: the match count first, then the position of the matches. In general the keys could be any comparable values, such as strings or reals, and we would have to sort 2-tuples.
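As a rough sketch of that general 2-tuple approach, using the sample rows from the question as plain Python lists: the names `rows` and `sort_key` are made up for illustration, and weighting a match at index i by k - i is just one possible way to favour left-most matches, not something the question prescribes.

```python
# Sample rows from the question, keyed by row number.
rows = {
    30: [2, 13, 7, 1, 7, 10, 1, 8, 7, 1],
    12: [11, 5, 4, 1, 7, 13, 1, 8, 7, 4],
    5: [2, 13, 7, 1, 12, 5, 6, 8, 15, 8],
    23: [2, 10, 5, 1, 3, 10, 9, 10, 7, 6],
}


def sort_key(target, candidate):
    """Return (match_count, position_weight) for one candidate row."""
    k = len(target)
    # 0-based positions, left to right, where the two rows agree.
    matched = [i for i, (a, b) in enumerate(zip(target, candidate)) if a == b]
    # Assumption: a match at index i is worth k - i, so left-most matches weigh most.
    return len(matched), sum(k - i for i in matched)


query = 30
# Tuples compare element by element: count decides first, position weight breaks ties.
ranked = sorted(
    (n for n in rows if n != query),
    key=lambda n: sort_key(rows[query], rows[n]),
    reverse=True,
)
print(ranked[:3])  # [5, 12, 23]
```

Because Python compares tuples element by element, the position weight only matters when two rows have the same match count.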

But here we happen to have integer counts, which can be promoted to floating-point scores. Define a very small number eps and a maximum index k:

eps = .001
k = 10

and use them to pack the two values into a single floating-point number:

match_count + eps * (k - index)

Such scores sort correctly with a plain descending sort, and then you just extract the top three. For rows with two matches, for example, this might give "double match" scores of [2.007, 2.005, 2.001].
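A minimal sketch of this single-score idea, run against the rows from the question. It does not say how `index` should be handled when a row has several matches, so this sketch assumes the bonus is added once per matched position. The `score` function and the `candidates` dict are hypothetical names.

```python
eps = 0.001  # small enough that position bonuses never outweigh one extra match
k = 10       # number of columns per row


def score(target, candidate):
    """Match count plus a small bonus for every matched position."""
    total = 0.0
    for index, (a, b) in enumerate(zip(target, candidate)):
        if a == b:
            # Assumption: the bonus eps * (k - index) is applied per match and summed.
            total += 1 + eps * (k - index)
    return total


target = [2, 13, 7, 1, 7, 10, 1, 8, 7, 1]  # row 30
candidates = {
    12: [11, 5, 4, 1, 7, 13, 1, 8, 7, 4],
    5: [2, 13, 7, 1, 12, 5, 6, 8, 15, 8],
    23: [2, 10, 5, 1, 3, 10, 9, 10, 7, 6],
}

# Same shape as the question's list: (row number, score) tuples.
total_matchs = [(n, score(target, row)) for n, row in candidates.items()]
top3 = sorted(total_matchs, key=lambda x: x[1], reverse=True)[:3]
print([n for n, _ in top3])  # [5, 12, 23]
```

With this summed variant the total bonus is at most eps * k * (k + 1) / 2, which is 0.055 for these values, so it stays below 1 and cannot outrank a row with more matches. For much wider rows, eps would need to shrink accordingly.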