How to use fromiter and ndenumerate together

I'm trying to manually implement a function that represents the KNN graph of a set of points as an incidence matrix. My idea was to take each row of an affinity matrix (an n x n matrix representing the distance between the n points), enumerate and sort its entries, then return the indices of the first K elements.

for node in range(node_count):
    # neighbor_indices[:, node] =
    print(
        np.fromiter(
            np.ndenumerate(affinity_matrix[node, :]),
            dtype=(np.intp, np.float64),
            count=node_count,
        )  # .sort(
        #     reverse=True, key=lambda x: x[1]
        # )[1 :: k + 1][0]
    )

The errors I get depend on the value of dtype. The obvious choice, I thought, was dtype=(np.intp, np.float64) or dtype=(int, np.float64), but this raises ValueError: setting an array element with a sequence. I take that to mean I'm trying to put multiple values into a single slot.

When inspecting the output of ndenumerate in a loop, the first item of each pair turns out to be a single index wrapped in a tuple:

for x in np.ndenumerate(affinity_matrix[node, :]):
    print(x)
    print(type(x), " ", type(x[0]), " ", type(x[0][0]))

((990,), 0.9958856990164133)
<class 'tuple'>   <class 'tuple'>   <class 'int'>

But setting dtype=((int,), np.float64) raises TypeError: Tuple must have size 2, but has size 1

Is there a way to use fromiter and ndenumerate together, or are they somehow incompatible?

CodePudding user response：

ndenumerate produces, for each element, an indexing tuple and the value.

In [163]: x = np.arange(6)

In [164]: list(np.ndenumerate(x))
Out[164]: [((0,), 0), ((1,), 1), ((2,), 2), ((3,), 3), ((4,), 4), ((5,), 5)]

That makes more sense when the array is 2d or more. The indexing tuples will have 2 or more values:

In [165]: list(np.ndenumerate(x.reshape(3,2)))
Out[165]: [((0, 0), 0), ((0, 1), 1), ((1, 0), 2), ((1, 1), 3), ((2, 0), 4), ((2, 1), 5)]

With 'plain' enumerate, you get a 2 element tuple:

In [166]: list(enumerate(x))
Out[166]: [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5)]

With fromiter and the compound dtype:

In [167]: np.fromiter(enumerate(x), dtype=np.dtype("i,f"))
Out[167]: 
array([(0, 0.), (1, 1.), (2, 2.), (3, 3.), (4, 4.), (5, 5.)],
      dtype=[('f0', '<i4'), ('f1', '<f4')])

The `dtype` shows the full specification that your short hand produces.  With that dtype, you get a structured array, which can be accessed field by field:

    In [169]: _['f0'], _['f1']
    Out[169]: 
    (array([0, 1, 2, 3, 4, 5], dtype=int32),
     array([0., 1., 2., 3., 4., 5.], dtype=float32))
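
The same kind of structured dtype can also be fed from `ndenumerate`, provided the one-element index tuple is unpacked first so that each item is a flat `(index, value)` pair. A minimal sketch, assuming a 1d input; the field names `index` and `value` and the sample `row` are my own choices for illustration, not something from the question:

```python
import numpy as np

# Hypothetical stand-in for one row of the affinity matrix.
row = np.array([0.2, 0.9, 0.5])

# Explicit field names and 64-bit types instead of the "i,f" shorthand.
pair_dtype = np.dtype([("index", np.intp), ("value", np.float64)])

# ndenumerate yields ((i,), value); unpack the 1-tuple so fromiter
# receives flat (i, value) tuples that match the two fields.
pairs = np.fromiter(
    ((idx[0], val) for idx, val in np.ndenumerate(row)),
    dtype=pair_dtype,
    count=row.size,
)

# Sort by the value field, largest first.
by_value = np.sort(pairs, order="value")[::-1]
print(by_value["index"])
```

Whether this is worth it over plain `argsort` is a separate matter; it only shows that the two functions are compatible once the item shape matches the dtype.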

I've never seen `fromiter` used with `enumerate`.  Admittedly `enumerate/ndenumerate` are generators, and `fromiter` is supposed to be the better way of creating an array from generators.  Let's try some timings:

    In [170]: y = np.random.rand(10000)
    In [171]: timeit np.fromiter(enumerate(y), dtype=np.dtype("i,f"))
    2.39 ms ± 68.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    In [172]: timeit list(enumerate(y))
    1.37 ms ± 41.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Just 'listing' the generator is faster.  `ndenumerate` is slower.
    
    In [173]: timeit list(np.ndenumerate(y))
    4.58 ms ± 383 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
But if your goal is an array, not just a list, then `fromiter` is faster:

    In [174]: timeit np.array(list(enumerate(y)))
    9.99 ms ± 557 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

I can't find the source code for `ndenumerate` (it's buried in some file redirections), but I suspect it uses `ndindex` to create the indexing tuples, and then makes a new tuple from that plus the value:

    In [179]: list(np.ndindex(x.shape))
    Out[179]: [(0,), (1,), (2,), (3,), (4,), (5,)]
    
    In [180]: list(np.ndindex(3,2))
    Out[180]: [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]

For a 1d array, it's easy to create the index: `np.arange(x.shape[0])`.  For higher dimensions, `meshgrid`, `mgrid` etc. can generate all the indexing arrays.
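
As a rough illustration of the higher-dimensional case, here is a sketch that uses `np.mgrid` to build the index arrays for a 2d array and stacks them next to the values. The array `a` and the `(row, col, value)` layout are assumptions made for the example:

```python
import numpy as np

a = np.arange(6).reshape(3, 2)

# mgrid returns one index array per dimension, each shaped like `a`.
rows, cols = np.mgrid[0:a.shape[0], 0:a.shape[1]]

# One (row, col, value) triple per element, in C order.
triples = np.stack([rows.ravel(), cols.ravel(), a.ravel()], axis=1)

# The same content built element by element with ndenumerate, for comparison.
from_ndenumerate = [(*idx, val) for idx, val in np.ndenumerate(a)]
print(triples)
```

Stacking into one integer array only works here because the values are integers too; with float values a structured array (as above) or separate arrays would be the safer layout.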

edit

For a 1d array, this function produces the same structured array as your fromiter call:

def foo(x):
    n = x.shape[0]
    res = np.empty(n, 'i,f')
    res['f0'] = np.arange(n)
    res['f1'] = x
    return res

In [216]: foo(x)
Out[216]: 
array([(0, 0.), (1, 1.), (2, 2.), (3, 3.), (4, 4.), (5, 5.)],
      dtype=[('f0', '<i4'), ('f1', '<f4')])

In [217]: foo(y)
Out[217]: 
array([(   0, 0.08351453), (   1, 0.86144197), (   2, 0.6635565 ), ...,
       (9997, 0.52427566), (9998, 0.7808558 ), (9999, 0.5060718 )],
      dtype=[('f0', '<i4'), ('f1', '<f4')])

In [218]: timeit foo(y)
51.8 µs ± 1.66 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

CodePudding user response：

This isn't technically the answer to the question I asked, but for anyone who tries something similar: you can get the indices of the sorted array by using np.argsort.

import numpy as np
from sklearn.metrics import pairwise_distances  # assumed source of pairwise_distances

affinity_matrix = np.exp(
    -1.0 / (2 * 1) * pairwise_distances(X, metric="sqeuclidean")
)
# Integer dtype, since the array stores indices.
neighbor_indices = np.zeros((node_count, k), dtype=np.intp)

for node in range(node_count):
    # Note on stop: we want k elements and start from -2 (to exclude the node itself).
    # Since stop is exclusive, we end on -(k + 2).
    neighbor_indices[node] = affinity_matrix[node].argsort()[-2 : -(k + 2) : -1]
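
The per-node loop can probably be dropped as well, since `argsort` accepts an `axis` argument. A sketch under a few assumptions: the sample points `X` and `k` are made up, the squared distances are computed by broadcasting as a stand-in for `pairwise_distances`, and no two points coincide (otherwise a tie could push the node itself out of the first column):

```python
import numpy as np

# Hypothetical sample data: four points on a line, k = 2 neighbours each.
X = np.array([[0.0], [1.0], [3.0], [7.0]])
k = 2

# Squared Euclidean distances via broadcasting (stands in for
# pairwise_distances(X, metric="sqeuclidean")).
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
affinity_matrix = np.exp(-0.5 * sq_dists)

# Sort every row by descending affinity in one call; column 0 is the
# node itself (affinity 1.0), so keep columns 1..k.
neighbor_indices = np.argsort(-affinity_matrix, axis=1)[:, 1 : k + 1]
print(neighbor_indices)
```

For large n and small k, `np.argpartition` might be cheaper than a full sort, at the cost of the neighbours coming back unordered.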

Also, to answer my actual question in case it's helpful to someone: the syntax I found that works with enumerate is to specify the dtype as a type string.

print(
    np.fromiter(
        enumerate(affinity_matrix[node, :]),
        dtype=np.dtype("i,f"),
        count=node_count,
    )
)
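
One caveat with the `"i,f"` type string: as the output in the other answer shows, it expands to 32-bit fields (`<i4`, `<f4`), so float64 affinities get rounded on the way in. A small sketch of the difference, assuming `"i8,f8"` as the wider alternative; the sample value is the one printed in the question:

```python
import numpy as np

values = [0.9958856990164133]  # value taken from the output in the question

short = np.fromiter(enumerate(values), dtype=np.dtype("i,f"))
wide = np.fromiter(enumerate(values), dtype=np.dtype("i8,f8"))

print(short.dtype, wide.dtype)

# Converting back to a Python float makes the rounding visible.
print(float(short["f1"][0]) == values[0])  # float32 field: value was rounded
print(float(wide["f1"][0]) == values[0])   # float64 field: value preserved
```

For sorting neighbours the rounding rarely matters, but it can reorder entries whose affinities differ only beyond float32 precision.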