I have a pandas dataframe as follows:
col_1 | col_2 | col_3 | col_4 | col_5 | col_6 |
---|---|---|---|---|---|
a | b | c | d | e | f |
a | b | c | h | j | f |
a | b | c | k | e | l |
x | b | c | d | e | f |
And I want to score every row by how similar it is to the first one, considering:
- First, the number of columns whose values match the first row.
- If two rows have the same number of matching values, consider digit importance: go from left to right in column order and give more importance to rows whose matching columns are further to the left.
In the example above, the rows should be ranked in the following order:
- 4th row (the last one), as it has 5 column values in common with row 1
- 3rd row, as it has 4 values in common with row 1 and its non-matching columns are 4 and 6, while in row 2 the non-matching columns are 4 and 5
- 2nd row, as it has the same number of matches as row 3, but column 5 matches in row 3 and not in row 2
I want to solve this using a lambda function that assigns the score to each row of the dataframe, taking the first row as a constant.
CodePudding user response:
Set up the problem:
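import numpy as np
import pandas as pd
# Each string holds one column's values; transpose so they become columns.
df = pd.DataFrame([list(s) for s in
                   ["aaax", "bbbb", "cccc", "dhkd", "ejee", "fflf"]]).T
df.columns = [f"col_{n}" for n in range(1, 7)]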
Solution:
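# Number of columns matching row 0, for every other row.
ranking = (df == df.loc[0]).sum(1).iloc[1:]
# Index labels of the rows whose match count is shared with another row.
ties = ranking.index[ranking.duplicated(keep=False)]
ranking[~ranking.duplicated(keep=False)] *= 10
# Re-score the ties with weights 6..1 so that leftmost matches count most.
ranking.update((df.loc[ties] == df.loc[0]).mul(np.arange(6, 0, -1)).sum(1))
ranking.argsort()[::-1]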
Explanation:
We first count, for each row, how many columns match the first row. Then we split ties from non-ties. The non-ties are multiplied by 10. The ties are rescored with a descending scale of weights, so that a column counts more the further left it is; the weights of the matching columns are summed to give each tied row's score, which replaces its entry in the original ranking. Finally, we return the reversed argsort to show the desired order.
You don't need a lambda here; applying one row by row would be slower than these vectorized operations.
Result:
3 2
2 1
1 0
The left column is the row index, listed from most to least similar to the first row.
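One caveat with keeping the ×10 scores and the weighted sums in the same Series is that they live on different scales: a row with a unique count of 1 scores 10, while two tied rows that match only the two rightmost columns score 3 each. A possible way around this, sketched below under my own assumptions rather than taken from the answer above, is to keep the two criteria as separate sort keys. The names `n_matches`, `leftness` and `order` are made up for illustration, and the power-of-two weights are a swapped-in choice that makes the tie-break strictly left-to-right.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [list("abcdef"), list("abchjf"), list("abckel"), list("xbcdef")],
    columns=[f"col_{n}" for n in range(1, 7)],
)

# True where a cell equals the value in the same column of row 0.
matches = df.eq(df.iloc[0])

# Primary key: number of matching columns per row.
n_matches = matches.sum(axis=1)

# Secondary key (assumption): read each row's match pattern as a binary
# number whose most significant bit is the leftmost column.
weights = 2 ** np.arange(df.shape[1] - 1, -1, -1)  # [32, 16, 8, 4, 2, 1]
leftness = pd.Series((matches.to_numpy() * weights).sum(axis=1), index=df.index)

# Sort on both keys, skipping row 0 itself.
scores = pd.DataFrame({"n_matches": n_matches, "leftness": leftness}).iloc[1:]
order = scores.sort_values(["n_matches", "leftness"], ascending=False).index.tolist()
print(order)  # [3, 2, 1]
```

Because the count is a separate key, the tie-break value can never outweigh it, whatever the number of columns.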
CodePudding user response:
You could use np.lexsort for this. It allows a nested sort: first on the count of columns that match row 0, then on a weighted sum of the matching column positions, where leftmost matches are worth more.
import pandas as pd
import numpy as np
df = pd.DataFrame({'col_1': ['a', 'a', 'a', 'x'],
'col_2': ['b', 'b', 'b', 'b'],
'col_3': ['c', 'c', 'c', 'c'],
'col_4': ['d', 'h', 'k', 'd'],
'col_5': ['e', 'j', 'e', 'e'],
'col_6': ['f', 'f', 'l', 'f']})
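matches = df.eq(df.iloc[0])  # True where a cell equals the value in row 0
weights = np.arange(len(df.columns))[::-1]  # leftmost column weighs most
order = np.lexsort(((matches * weights).sum(1).values, matches.sum(1).values))[::-1]
df.loc[order, 'rank'] = range(len(df))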
print(df)
Output
col_1 col_2 col_3 col_4 col_5 col_6 rank
0 a b c d e f 0.0
1 a b c h j f 3.0
2 a b c k e l 2.0
3 x b c d e f 1.0
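A detail that is easy to miss with np.lexsort is that the last key in the tuple is the primary one. The small sketch below isolates that step; the two arrays are the values the expressions above evaluate to for this example frame, and the names `n_matches`, `tie_break`, `order` and `ranks` are only illustrative.

```python
import numpy as np

# Values for the four example rows (row 0 matches itself in all 6 columns).
n_matches = np.array([6, 4, 4, 5])      # primary key: matching column count
tie_break = np.array([15, 12, 13, 10])  # secondary key: weighted sum, weights 5..0

# lexsort sorts ascending by the LAST key first, then by the earlier ones,
# so the primary key goes last. Reversing gives most-similar first.
order = np.lexsort((tie_break, n_matches))[::-1]
print(order)  # [0 3 2 1]

# Turn the ordering into a rank per row: ranks[i] is the rank of row i.
ranks = np.empty_like(order)
ranks[order] = np.arange(len(order))
print(ranks)  # [0 3 2 1]
```

Rows 1 and 2 tie on the count (4 each), so the secondary key decides between them: 13 beats 12, which puts row 2 ahead of row 1.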