I'm trying to manually implement a function that represents the KNN graph of a set of points as an incidence matrix. My idea was to take the rows of an affinity matrix (an n x n matrix representing the distance between the n points), enumerate and sort them, then return the indices of the first K elements:
for node in range(node_count):
    # neighbor_indices[:, node] =
    print(
        np.fromiter(
            np.ndenumerate(affinity_matrix[node, :]),
            dtype=(np.intp, np.float64),
            count=node_count,
        )  # .sort(
        #     reverse=True, key=lambda x: x[1]
        # )[1 :: k + 1][0]
    )
The errors I get depend on the value of dtype.
The obvious choice, I thought, was dtype=(np.intp, np.float64) or dtype=(int, np.float64), but this raises: ValueError: setting an array element with a sequence.
As I understand it, that means I'm trying to put multiple values into a single slot.
When inspecting the output of ndenumerate in a loop, the first value appears to be a single value inside a tuple:
for x in np.ndenumerate(affinity_matrix[node, :]):
    print(x)
    print(type(x), " ", type(x[0]), " ", type(x[0][0]))
((990,), 0.9958856990164133)
<class 'tuple'> <class 'tuple'> <class 'int'>
But setting dtype=((int,), np.float64) throws the error: TypeError: Tuple must have size 2, but has size 1
Is there a way to use fromiter and ndenumerate together, or are they somehow incompatible?
CodePudding user response:
ndenumerate produces, for each element, an indexing tuple and the value.
In [163]: x = np.arange(6)
In [164]: list(np.ndenumerate(x))
Out[164]: [((0,), 0), ((1,), 1), ((2,), 2), ((3,), 3), ((4,), 4), ((5,), 5)]
That makes more sense when the array is 2d or more. The indexing tuples will have 2 or more values:
In [165]: list(np.ndenumerate(x.reshape(3,2)))
Out[165]: [((0, 0), 0), ((0, 1), 1), ((1, 0), 2), ((1, 1), 3), ((2, 0), 4), ((2, 1), 5)]
With 'plain' enumerate, you get a 2 element tuple:
In [166]: list(enumerate(x))
Out[166]: [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5)]
With fromiter and the compound dtype:
In [167]: np.fromiter(enumerate(x), dtype=np.dtype("i,f"))
Out[167]:
array([(0, 0.), (1, 1.), (2, 2.), (3, 3.), (4, 4.), (5, 5.)],
dtype=[('f0', '<i4'), ('f1', '<f4')])
The `dtype` in the output shows the full specification that the shorthand `"i,f"` produces. With that dtype, you get a structured array, which can be accessed field by field:
In [169]: _['f0'], _['f1']
Out[169]:
(array([0, 1, 2, 3, 4, 5], dtype=int32),
array([0., 1., 2., 3., 4., 5.], dtype=float32))
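To tie this back to the question: `fromiter` and `ndenumerate` can be combined if the 1-tuple index is unpacked before the pair reaches `fromiter`. The following is only a minimal sketch under my own assumptions: the field names `idx` and `val`, the sample `row`, and the choice of `np.intp`/`np.float64` fields are illustrative, not something from the question.

```python
import numpy as np

# Hypothetical sample row standing in for affinity_matrix[node, :]
row = np.array([0.5, 0.25, 0.75])

# Named fields (my choice) with the integer/float widths the question asked for
pair_dtype = np.dtype([("idx", np.intp), ("val", np.float64)])

# ndenumerate yields ((i,), value); flatten each item to (i, value)
flat_pairs = ((index[0], value) for index, value in np.ndenumerate(row))

pairs = np.fromiter(flat_pairs, dtype=pair_dtype, count=row.size)
print(pairs["idx"], pairs["val"])
```

For a 1d row, plain `enumerate` gives the same pairs without the unpacking step, which is why it is the simpler choice here.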
I've never seen `fromiter` used with `enumerate`. Admittedly `enumerate/ndenumerate` are generators, and `fromiter` is supposed to be the better way of creating an array from generators. Let's try some times:
In [170]: y = np.random.rand(10000)
In [171]: timeit np.fromiter(enumerate(y), dtype=np.dtype("i,f"))
2.39 ms ± 68.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [172]: timeit list(enumerate(y))
1.37 ms ± 41.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Just 'listing' the generator is faster. `ndenumerate` is slower.
In [173]: timeit list(np.ndenumerate(y))
4.58 ms ± 383 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
But if your goal is an array, not just a list, then `fromiter` is faster:
In [174]: timeit np.array(list(enumerate(y)))
9.99 ms ± 557 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I can't find the source code for `ndenumerate` (it's buried in some file redirections), but I suspect it uses `ndindex` to create the indexing tuples, and then makes a new tuple from that plus the value:
In [179]: list(np.ndindex(x.shape))
Out[179]: [(0,), (1,), (2,), (3,), (4,), (5,)]
In [180]: list(np.ndindex(3,2))
Out[180]: [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]
For a 1d array, it's easy to create the index: `np.arange(x.shape[0])`. For higher dimensions, `meshgrid`, `mgrid`, etc. can generate all the indexing arrays.
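As a rough illustration of the higher-dimensional case, here is a sketch for a 2d array. It uses `np.indices`, which is my pick from the "etc." (`mgrid` would work similarly); the function name `enumerate_2d` and the field names `row`, `col`, `val` are hypothetical.

```python
import numpy as np

def enumerate_2d(a):
    """Return a structured array of (row, col, value) for every element of a 2d array."""
    # np.indices returns one index array per dimension, each shaped like `a`
    rows, cols = np.indices(a.shape)
    out = np.empty(a.size, dtype=[("row", np.intp), ("col", np.intp), ("val", a.dtype)])
    # ravel() flattens in C order, the same order ndenumerate iterates in
    out["row"] = rows.ravel()
    out["col"] = cols.ravel()
    out["val"] = a.ravel()
    return out

a = np.arange(6.0).reshape(3, 2)
res = enumerate_2d(a)
print(res)
```

This should hold the same index/value information as `list(np.ndenumerate(a))`, just with the index tuple split into separate fields.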
edit
For a 1d array, this function produces the same structured array as your fromiter call:
def foo(x):
    n = x.shape[0]
    res = np.empty(n, 'i,f')
    res['f0'] = np.arange(n)
    res['f1'] = x
    return res
In [216]: foo(x)
Out[216]:
array([(0, 0.), (1, 1.), (2, 2.), (3, 3.), (4, 4.), (5, 5.)],
dtype=[('f0', '<i4'), ('f1', '<f4')])
In [217]: foo(y)
Out[217]:
array([( 0, 0.08351453), ( 1, 0.86144197), ( 2, 0.6635565 ), ...,
(9997, 0.52427566), (9998, 0.7808558 ), (9999, 0.5060718 )],
dtype=[('f0', '<i4'), ('f1', '<f4')])
In [218]: timeit foo(y)
51.8 µs ± 1.66 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
CodePudding user response:
This isn't technically the answer to the question I asked, but for anyone who tries something similar: you can get the indices of the sorted array by using np.argsort:
import numpy as np
from sklearn.metrics import pairwise_distances  # assuming scikit-learn's pairwise_distances

affinity_matrix = np.exp(
    -1.0 / (2 * 1) * pairwise_distances(X, metric="sqeuclidean")
)
# Integer dtype, since the array stores indices
neighbor_indices = np.zeros((node_count, k), dtype=np.intp)
for node in range(node_count):
    # Note on the stop value: we want k elements, we start from -2 (to exclude
    # the node itself), and the stop is exclusive, so we end at -(k + 2).
    neighbor_indices[node] = affinity_matrix[node].argsort()[-2 : -(k + 2) : -1]
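The loop above could also be written as a single `argsort` along axis 1. This is only a sketch under a few assumptions: the sample points `X` are made up, the affinity is computed with plain NumPy broadcasting instead of `pairwise_distances` so the snippet runs on its own, and each node's affinity with itself is taken to be the strict maximum of its row (no duplicate points).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 2))  # hypothetical sample points
k = 3

# Squared Euclidean distances via broadcasting, then the same kernel as above
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
affinity_matrix = np.exp(-sq_dists / 2.0)

# Sort every row at once; drop the last column (the node itself) and take
# the next k columns in descending order of affinity
neighbor_indices = np.argsort(affinity_matrix, axis=1)[:, -2 : -(k + 2) : -1]
print(neighbor_indices)
```

The result is already an integer array of shape `(node_count, k)`, so no preallocation is needed.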
Also, to answer my actual question in case it's helpful to someone: the syntax I found that works with enumerate is specifying the dtype as a type string:
print(
    np.fromiter(
        enumerate(affinity_matrix[node, :]),
        dtype=np.dtype("i,f"),
        count=node_count,
    )
)
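For completeness, the original enumerate-sort-slice idea can be finished on top of that structured array. The sketch below is hedged: the field names `idx` and `val`, the sample `row`, and `k = 2` are made up for illustration, and I use explicit `np.intp`/`np.float64` fields because `"i,f"` stores the values as 32-bit floats (the `'<f4'` in the outputs earlier in the thread).

```python
import numpy as np

# Hypothetical affinity row; index 1 plays the role of the node itself (affinity 1.0)
row = np.array([0.2, 1.0, 0.7, 0.05, 0.9])
k = 2

pair_dtype = np.dtype([("idx", np.intp), ("val", np.float64)])
pairs = np.fromiter(enumerate(row), dtype=pair_dtype, count=row.size)

# Sort by the value field, then reverse to get descending affinity
pairs_sorted = np.sort(pairs, order="val")[::-1]

# Skip the first entry (the node itself) and keep the indices of the next k
top_k = pairs_sorted[1 : k + 1]["idx"]
print(top_k)  # [4 2]
```

In practice the `argsort` version is shorter, but this shows that the structured-array route does work end to end.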