Sorting/Cluster a 2D numpy array in ordered sequence based on multiple columns

Time:11-04

I have a 2D numpy array like this:

 [[4 5 2] 
  [5 5 1]
  [5 4 5]
  [5 3 4]
  [5 4 4]
  [4 3 2]]

I would like to sort/cluster this array by the ordering pattern of the elements within each row, e.g. row[0]>=row[1]>=row[2], then row[0]>=row[2]>row[1], and so on, so that rows sharing the same pattern end up next to each other.

I tried the following code: lexdf = df[np.lexsort((df[:,2], df[:,1],df[:,0]))][::-1], but it does not give what I want. The output of lexsort:

 [[5 5 1]
  [5 4 5]
  [5 4 4]
  [5 3 4]
  [4 5 2] 
  [4 3 2]]

The output I would like to have:

 [[5 5 1]
  [5 4 4]
  [4 3 2]
  [5 4 5]
  [5 3 4]
  [4 5 2]] 

or cluster it into three parts:

 [[5 5 1]
 [5 4 4]
 [4 3 2]]

 [[5 4 5]
 [5 3 4]]

 [[4 5 2]]

I would also like to apply this to arrays with more columns, so a solution without explicit iteration would be preferable. Any ideas on how to generate this kind of output?

CodePudding user response:

I don't know how to do it in numpy, except maybe with some hacks built around numpy.split.
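For reference, here is a rough sketch of what a numpy.split-based approach could look like. It assumes the goal is to group rows by the order of their columns from largest to smallest value, with ties resolved in favor of the lower column index; that reading of the question is an assumption, and the names perm, order and boundaries are only illustrative.

```python
import numpy as np

a = np.array([[4, 5, 2],
              [5, 5, 1],
              [5, 4, 5],
              [5, 3, 4],
              [5, 4, 4],
              [4, 3, 2]])

# For each row, the column indices ordered from largest to smallest value.
# kind="stable" keeps the lower column index first when values are tied.
# (Negating assumes a signed numeric dtype.)
perm = np.argsort(-a, axis=1, kind="stable")

# Sort the rows by their permutation. np.lexsort uses the LAST key as the
# primary one, hence the reversed column order. The sort is stable, so rows
# with the same permutation keep their original relative order.
order = np.lexsort(perm.T[::-1])
a_sorted = a[order]
perm_sorted = perm[order]

# Split wherever the permutation changes between consecutive rows.
changed = np.any(perm_sorted[1:] != perm_sorted[:-1], axis=1)
boundaries = np.flatnonzero(changed) + 1
groups = np.split(a_sorted, boundaries)

print(a_sorted)
# [[5 5 1]
#  [5 4 4]
#  [4 3 2]
#  [5 4 5]
#  [5 3 4]
#  [4 5 2]]
```

On the sample array this gives three groups of 3, 2 and 1 rows. Because it keys on the full ordering of the columns, it can produce different groups than a key that only compares neighbouring elements.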

Here is a way to get your groups with python lists:

from itertools import groupby, pairwise

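def f(sublist):
    # Key function: one boolean per adjacent pair, True where x <= y.
    # Rows with the same True/False pattern end up in the same group.
    return [x <= y for x,y in pairwise(sublist)]

# NOTE: itertools.pairwise requires python>=3.10
# For python<=3.9, use one of these alternatives:
# * more_itertools.pairwise(sublist)
# * zip(sublist, sublist[1:])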

a = [[4, 5, 2],
     [5, 5, 1],
     [5, 4, 5],
     [5, 3, 4],
     [5, 4, 4],
     [4, 3, 2]]

# sorted puts rows with equal keys next to each other; groupby then collects them
b = [list(g) for _,g in groupby(sorted(a, key=f), key=f)]

print(b)
# [[[4, 3, 2]],
#  [[5, 4, 5], [5, 3, 4], [5, 4, 4]],
#  [[4, 5, 2], [5, 5, 1]]]

Note: Combining groupby with sorted is slightly suboptimal, because sorted takes O(n log n) time. A linear-time alternative is to group using a dictionary of lists. See for instance the function itertoolz.groupby from the toolz module.
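A minimal sketch of the dictionary-of-lists idea, using only the standard library. The helper names pattern and group_by_pattern are made up for illustration. The key is returned as a tuple rather than a list, because dictionary keys must be hashable.

```python
from collections import defaultdict

def pattern(row):
    # Tuple of pairwise "x <= y" comparisons between neighbouring elements.
    return tuple(x <= y for x, y in zip(row, row[1:]))

def group_by_pattern(rows):
    # Single pass over the rows: each row is appended to the list for its key.
    groups = defaultdict(list)
    for row in rows:
        groups[pattern(row)].append(row)
    return dict(groups)

a = [[4, 5, 2],
     [5, 5, 1],
     [5, 4, 5],
     [5, 3, 4],
     [5, 4, 4],
     [4, 3, 2]]

groups = group_by_pattern(a)

# Sorting only the distinct keys is cheap when there are few patterns.
for key in sorted(groups):
    print(key, groups[key])
# (False, False) [[4, 3, 2]]
# (False, True) [[5, 4, 5], [5, 3, 4], [5, 4, 4]]
# (True, False) [[4, 5, 2], [5, 5, 1]]
```

Grouping is a single pass over the rows. Only the distinct keys are sorted afterwards, and that step can be dropped if the order of the groups does not matter.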
