I have a matrix of distance between 12000 atoms (pairwise euclidean distance). And I want to convert that to a list of nodes adjacency, with the ith element of the list being a list of nodes that are within a threshold distance. For example if I have three points :
(0,0) (1,0) (1,1)
I will have the matrix :
[[0. 1. 1.41421356]
[1. 0. 1. ]
[1.41421356 1. 0. ]]
Then all pairs that satisfy the condition; distance <= 1; will be:
[[0 0]
[0 1]
[1 0]
[1 1]
[1 2]
[2 1]
[2 2]]
Then finally the adjacency list will be:
[[0,1],[0,1,2],[1,2]]
Here is a code that work:
from scipy.spatial import distance
import numpy as np
def voisinage(xyz):
#xyz is a vector of positions in 3d space
# matrice de distance
dist = distance.cdist(xyz,xyz,'euclidean')
# extract i,j pairs where distance < threshold
paires = np.argwhere(dist<threshold)
# prepare the adjacency list
Vvoisinage = [[] for i in range(len(xyz))]
# fill the adjacency list
for p in paires:
Vvoisinage[p[0]].append(p[1])
This code runs in approximately 4 to 5 seconds for around 12100 points in 3D space. I want to make it as fast as possible because it needs to run for thousands of sets of 12100 points, and there are other calculations for each set. I have tried to use networkX but it's a lot slower than this method.
The section to optimize is the last one because it takes in average 2.7 seconds so half the computation time.
Also maybe there is a faster way of doing all of this.
Thanks
CodePudding user response:
As discussed in comments, a cheap optimization is to compare the squares of distances. This gives the same result and avoids extracting square roots.
As discussed by @inarighas it is redundant to calculate the full matrix. You can avoid calculating the leading diagonal and the entire upper triangle. This will double your performance.
If you are interested in performance, you should not use an interpreted language like python. A compiled language such as C can give you as much as a 50 times performance boost. Here is the C code for your problem.
class cLoc
{
public:
float x, y, z;
cLoc(float X, float Y, float Z)
: x(X), y(Y), z(Z)
{
}
float dist2(const cLoc &other) const
{
float dx = x - other.x;
float dy = y - other.y;
float dz = z - other.z;
return dx *dx dy *dy dz *dz;
}
};
class cAtoms
{
public:
std::vector<cLoc> myAtomLocs;
std::vector<std::vector<int>> myClose;
float myMaxDist2;
void generateTest1();
void neighbors();
std::string text();
};
void cAtoms::generateTest1()
{
myAtomLocs = {
cLoc(0, 0, 0),
cLoc(1, 0, 0),
cLoc(1, 1, 0)};
myMaxDist2 = 1;
}
void cAtoms::neighbors()
{
for (int ks = 0; ks < myAtomLocs.size(); ks )
{
std::vector<int> v;
v.push_back(ks);
for (int kd = ks 1; kd < myAtomLocs.size(); kd )
{
if (myAtomLocs[ks].dist2(myAtomLocs[kd]) <= myMaxDist2)
v.push_back(kd);
}
myClose.push_back(v);
}
}
std::string cAtoms::text()
{
std::stringstream ss;
for( auto& v : myClose ) {
if( !v.size() )
continue;
ss << v[0] <<": ";
for( int k = 1; k < v.size(); k )
{
ss << v[k] << " ";
}
ss << "\n";
}
return ss.str();
}
main()
{
cAtoms myAtoms;
myAtoms.generateTest1();
myAtoms.neighbors();
std::cout << myAtoms.text();
}
output is
0: 1
1: 2
If I generate 12,100 atoms as follows
void cAtoms::generateRandom()
{
srand(time(NULL));
for( int k = 0; k < 12100; k ) {
myAtomLocs.push_back( cLoc(
(rand() % 100 ) / 10.0f,
(rand() % 100 ) / 10.0f,
(rand() % 100 ) / 10.0f
));
}
myMaxDist2 = 1;
}
then the neighbors() method runs in 125 milliseconds on my laptop.
Code for the complete application at https://github.com/JamesBremner/atoms
CodePudding user response:
First the diagonal of you distance matrix is not that useful since it's always equal to zero. To make your procedure faster, I used only numpy functions since they are usually faster than vanilla python list operations when dealing with arrays and matrices.
so first ignored the dist
matrix diagonal by setting to np.nan
Then, I grouped paires
by first index (see Is there any numpy group by function?).
Here is my code:
from scipy.spatial import distance
import numpy as np
xyz = np.array([[0,0],[1,0],[1,1]])
threshold = 1
# distance matrix
dist = distance.cdist(xyz,xyz,'euclidean')
# ignore diagonal values
np.fill_diagonal(dist, np.nan)
# extract i,j pairs where distance < threshold
paires = np.argwhere(dist<=threshold)
# groupby index
tmp = np.unique(paires[:, 0], return_index=True)
neighbors = np.split(paires[:,1], tmp[1])[1:]
indices = tmp[0]
The output corresponds to a list of lists, such as each list corresponds to nodes that are neighbors to the node corresponding to the index.
In terms of quatitative performance (on my computer ofc), your function takes ~4.5s on a 12000 random generated points whereas mine takes ~1.3s.