Python - Group (cluster/sort) arrays based on ranking information

I have a dataframe that looks like this:

      A         B          C          D
0    5         4           3         2
1    4         5           3         2
2    3         5           2         1
3    4         2           5         1
4    4         5           2         1
5    4         3           5         1
...

I converted the dataframe into a 2D array like this:

[[5 4 3 2]
 [4 5 3 2]
 [3 5 2 1]
 [4 2 5 1]
 [4 5 2 1]
 [4 3 5 1]
 ...]

Each row holds the scores (1-5) that one person gave to items A, B, C and D. I would like to identify the people who share the same ranking, for example everyone who thinks A > B > C > D, and regroup the rows by ranking like this:

2DArray1: [[5 4 3 2]]
2DArray2: [[4 5 3 2]
           [3 5 2 1]
           [4 5 2 1]]
2DArray3: [[4 2 5 1]
           [4 3 5 1]]

For example, 2DArray2 contains the people who think B > A > C > D, and 2DArray3 the people who think C > A > B > D. I tried different sort functions in numpy but could not find a suitable one. How should I do this?

CodePudding user response:

Numpy doesn't have a groupby function, because a groupby would return a list of lists of different sizes, whereas numpy mostly deals only with rectangular arrays.

A workaround would be to sort the rows so that similar rows are adjacent, then produce an array of the indices of the beginning of each group.

Since I'm too lazy to do that, here is a solution without numpy instead:

Index by the permutation directly

For each row, we compute the corresponding permutation of 'ABCD'. Then, we add the row to a dict of lists of rows, where the dictionary keys are the corresponding permutations.

from collections import defaultdict

a = [[5, 4, 3, 2], [4, 5, 3, 2], [3, 5, 2, 1], [4, 2, 5, 1], [4, 5, 2, 1], [4, 3, 5, 1]]

groups = defaultdict(list)
for row in a:
    groups[tuple(sorted(range(len(row)), key=lambda i: row[i], reverse=True))].append(row)

print(groups)

Output:

defaultdict(<class 'list'>, {
    (0, 1, 2, 3): [[5, 4, 3, 2]],
    (1, 0, 2, 3): [[4, 5, 3, 2], [3, 5, 2, 1], [4, 5, 2, 1]],
    (2, 0, 1, 3): [[4, 2, 5, 1], [4, 3, 5, 1]]
})

Note that with this solution, the results might not be what you expect if a user gives the same score to two different items: sorted does not keep ties as ties. Because the sort is stable, it breaks them by order of appearance, which here means that tied items are ordered alphabetically.
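If tied scores should form groups of their own instead of being broken alphabetically, one possible variation is to key on the rank of each item rather than on the sorted order of the items. This is only a sketch: rank_key is a hypothetical helper name, and the row [4, 5, 2, 2] is made up here to show a tie.

```python
from collections import defaultdict

def rank_key(row):
    # Hypothetical helper: dense rank of each item (0 = highest score).
    # Items with equal scores share the same rank.
    distinct = sorted(set(row), reverse=True)
    return tuple(distinct.index(score) for score in row)

# The last row is an invented example with a tie between C and D.
a = [[5, 4, 3, 2], [4, 5, 3, 2], [3, 5, 2, 1], [4, 5, 2, 2]]

groups = defaultdict(list)
for row in a:
    groups[rank_key(row)].append(row)

for key, rows in groups.items():
    print(key, rows)
```

With this key, (1, 0, 2, 2) reads as "B first, A second, C and D tied for third", so the tied row is kept apart from the B > A > C > D group.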

Index by the index of the permutation

The permutations of 'ABCD' can be ordered lexicographically: 'ABCD' comes first, then 'ABDC' comes second, then 'ACBD' comes third...

As it turns out, there is an algorithm to compute the index at which a given permutation would come in that sequence! And that algorithm is implemented in the Python package more_itertools:

more_itertools.permutation_index

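So, we can replace our tuple key tuple(sorted(range(len(row)), key=lambda i: row[i], reverse=True)) by a simple integer key permutation_index(row, sorted(row, reverse=True)), which is the position of the row among the permutations of its own scores sorted in descending order.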

from collections import defaultdict
from more_itertools import permutation_index

a = [[5, 4, 3, 2], [4, 5, 3, 2], [3, 5, 2, 1], [4, 2, 5, 1], [4, 5, 2, 1], [4, 3, 5, 1]]

groups = defaultdict(list)
for row in a:
    groups[permutation_index(row, sorted(row, reverse=True))].append(row)

print(groups)

Output:

defaultdict(<class 'list'>, {
    0: [[5, 4, 3, 2]],
    6: [[4, 5, 3, 2], [3, 5, 2, 1], [4, 5, 2, 1]],
    8: [[4, 2, 5, 1], [4, 3, 5, 1]]
})
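To read these integer keys back as rankings, one option is to enumerate the permutations with itertools. The sketch below assumes there are no tied scores; decode is a hypothetical helper name. Note that with the key used above, the permutation at a given index lists the rank of each item (0 = best), not the items in ranked order.

```python
from itertools import permutations

items = "ABCD"
# All permutations of (0, 1, 2, 3) in lexicographic order.
all_perms = list(permutations(range(len(items))))

def decode(index):
    # Hypothetical helper: all_perms[index][j] is the rank of item j.
    rank_of_item = all_perms[index]
    # Order the item names by their rank to get the ranking as a string.
    return "".join(sorted(items, key=lambda c: rank_of_item[items.index(c)]))

for idx in (0, 6, 8):
    print(idx, all_perms[idx], decode(idx))
```

For key 8 the permutation is (1, 2, 0, 3): A is second, B third, C first and D last, which is the ranking C > A > B > D.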

Mixing permutation_index and pandas

Since the output of permutation_index is a simple number, we can easily include it in a numpy array or a pandas dataframe as a new column:

import pandas as pd
from more_itertools import permutation_index

df = pd.DataFrame({'A': [5,4,3,4,4,4], 'B': [4,5,5,2,5,3], 'C': [3,3,2,5,2,5], 'D': [2,2,1,1,1,1]})

df['perm_idx'] = df.apply(lambda row: permutation_index(row, sorted(row, reverse=True)), axis=1)

print(df)

Output:

   A  B  C  D  perm_idx
0  5  4  3  2         0
1  4  5  3  2         6
2  3  5  2  1         6
3  4  2  5  1         8
4  4  5  2  1         6
5  4  3  5  1         8

for idx, sub_df in df.groupby('perm_idx'):
    print(idx)
    print(sub_df)

Output:

0
   A  B  C  D  perm_idx
0  5  4  3  2         0
6
   A  B  C  D  perm_idx
1  4  5  3  2         6
2  3  5  2  1         6
4  4  5  2  1         6
8
   A  B  C  D  perm_idx
3  4  2  5  1         8
5  4  3  5  1         8
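For completeness, the NumPy workaround described at the top of this answer (sort the rows so that equal rankings are adjacent, then split at the start of each group) could look roughly like this. It is a sketch rather than a tuned implementation: the variable names are illustrative, and ties are broken by column order.

```python
import numpy as np

a = np.array([[5, 4, 3, 2], [4, 5, 3, 2], [3, 5, 2, 1],
              [4, 2, 5, 1], [4, 5, 2, 1], [4, 3, 5, 1]])

# Ranking of each row: column indices ordered from highest to lowest score.
ranks = np.argsort(-a, axis=1, kind="stable")

# One integer label per row identifying its ranking.
uniq, labels = np.unique(ranks, axis=0, return_inverse=True)
labels = labels.reshape(-1)  # make sure the labels are 1-D

# Reorder rows so that equal labels are adjacent (stable keeps input order).
order = np.argsort(labels, kind="stable")

# Positions where the label changes mark the start of each new group.
starts = np.flatnonzero(np.diff(labels[order])) + 1

groups = np.split(a[order], starts)
for ranking, group in zip(uniq, groups):
    print(ranking, group.tolist())
```

Each entry of groups is a separate 2D array, which sidesteps the fact that the groups have different sizes.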

CodePudding user response:

You can:

(i) transpose df and convert it to a dictionary, so that each person maps to a dict of item scores,

(ii) sort each of these dicts by value in descending order and keep the keys,

(iii) join the sorted keys for each person into a string and assign the result to df['ranks'],

(iv) collect each person's scores into a list and assign it to df['pref'],

(v) groupby('ranks') and aggregate pref into lists.

import pandas as pd

df = pd.DataFrame({'A': {0: 5, 1: 4, 2: 3, 3: 4, 4: 4, 5: 4},
                   'B': {0: 4, 1: 5, 2: 5, 3: 2, 4: 5, 5: 3},
                   'C': {0: 3, 1: 3, 2: 2, 3: 5, 4: 2, 5: 5},
                   'D': {0: 2, 1: 2, 2: 1, 3: 1, 4: 1, 5: 1}})

# Label each person with the item names sorted by descending score, e.g. 'BACD'
df['ranks'] = pd.Series({k : ''.join(list(zip(*sorted(v.items(), key=lambda d:d[1],
                                                      reverse=True)))[0])
                         for k,v in df.T.to_dict().items()})
# Keep each person's scores as a list
df['pref'] = df.loc[:,'A':'D'].values.tolist()
# Collect the score lists of everyone who shares the same ranking
out = df[['ranks','pref']].groupby('ranks').agg(list).to_dict()['pref']

Output:

{'ABCD': [[5, 4, 3, 2]],
 'BACD': [[4, 5, 3, 2], [3, 5, 2, 1], [4, 5, 2, 1]],
 'CABD': [[4, 2, 5, 1], [4, 3, 5, 1]]}
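A possibly simpler way to build the same labels is to let numpy's argsort produce the column order for every row at once. The sketch below is a variation on the answer above, not the original method; the names cols, order and labels are illustrative, and ties are broken by column order.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [5, 4, 3, 4, 4, 4],
                   'B': [4, 5, 5, 2, 5, 3],
                   'C': [3, 3, 2, 5, 2, 5],
                   'D': [2, 2, 1, 1, 1, 1]})

cols = df.columns.to_numpy()
# Column positions ordered from highest to lowest score, one row per person.
order = np.argsort(-df.to_numpy(), axis=1, kind="stable")
# Turn each ordering into a label such as 'BACD'.
labels = pd.Series(["".join(cols[o]) for o in order], index=df.index, name="ranks")

# Group the score rows by label and collect them as lists of lists.
out = {label: sub.to_numpy().tolist() for label, sub in df.groupby(labels)}
print(out)
```

Grouping by a separate Series keeps df itself unchanged, so no helper columns need to be dropped afterwards.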