I have a NumPy array centers, which is N rows by 3 columns and contains the 3D coordinates of the centers of the spheres. Then another Nx1 array radii, which contains the radius corresponding to each sphere. Finally, I have a third array points, containing the points that I want to check to see whether they are inside a sphere. To do the check, I know that I need to find

d = np.sqrt((points[0]-sphere[0])**2 + (points[1]-sphere[1])**2 + (points[2]-sphere[2])**2)

for all the points and all the spheres, but I'm not sure how to do this in a vectorized manner. Is this possible? Thank you.
Additional info:
- I would like to run this on the GPU using CuPy, or otherwise have extremely fast performance.
- There are between 2-200 spheres, and 300k points.
CodePudding user response:
Since CuPy locks us out of the scipy.spatial functions, we can fall back to NumPy-only functions. In this case, based on the first part of the answer here and (inexpertly) converted to CuPy methods:
d = cupy.sqrt(
    cupy.einsum('ij,ij->i', centers, centers)[:, None]
    + cupy.einsum('ij,ij->i', points, points)
    - 2 * cupy.dot(centers, points.T)
)
This distance matrix can then be checked against radii by d < radii[:, None].
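The same expansion can be sketched in plain NumPy (CuPy follows the same broadcasting rules, so np can be swapped for cp); the array sizes here are just illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
centers = rng.random((5, 3))     # N sphere centers
radii = rng.random(5)            # N radii
points = rng.random((1000, 3))   # M query points

# ||c - p||^2 = ||c||^2 + ||p||^2 - 2 c.p, expanded so no (N, M, 3)
# temporary array is ever materialized
sq_dist = (
    np.einsum('ij,ij->i', centers, centers)[:, None]
    + np.einsum('ij,ij->i', points, points)
    - 2 * np.dot(centers, points.T)
)
# Rounding can make sq_dist slightly negative, so clamp before the sqrt
d = np.sqrt(np.maximum(sq_dist, 0))
inside = d < radii[:, None]      # (N, M) boolean collision matrix
```

Note the clamp before the square root: the expanded form can go marginally below zero for nearly coincident points due to floating-point cancellation.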
CodePudding user response:
When the number of spheres is large, it is generally a good idea to use a data structure like a quadtree, a k-d tree, or especially a ball tree. This is what SciPy uses internally, and it results in an algorithm running in O(M log(N)) time, where N is the number of spheres and M is the number of points.
The thing is, implementing such a data structure efficiently on a GPU is pretty insane (because these structures are not SIMD-friendly). In your case, the number of spheres is small enough that a naive algorithm running in O(N·M) time should be relatively fast.
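For reference, here is a CPU-side sketch of the tree-based approach using scipy.spatial.cKDTree (which, as the first answer notes, is not available under CuPy); the sizes are reduced for illustration, and reporting only "inside any sphere" is just one possible output shape:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
centers = rng.random((50, 3))    # sphere centers
radii = rng.random(50) * 0.2     # per-sphere radii
points = rng.random((2000, 3))   # query points

# Build the tree over the sphere centers, gather candidate spheres
# within the largest radius, then filter each candidate by its own radius.
tree = cKDTree(centers)
candidates = tree.query_ball_point(points, radii.max())
inside_any = np.zeros(len(points), dtype=bool)
for p, cand in enumerate(candidates):
    for c in cand:
        if np.sum((points[p] - centers[c]) ** 2) <= radii[c] ** 2:
            inside_any[p] = True
            break
```

The Python-level loop makes this slow for 300k points; it is meant to show the structure of the algorithm, not to compete with the GPU kernel below.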
The solution of @DanielF is a good start, but it is not very efficient, as it creates several temporary arrays and uses a memory access pattern that is not SIMD-friendly (which is critical on GPUs in order to get reasonable performance). The temporary arrays cause slow allocations and memory accesses, while GPUs shine at heavy computational operations (typically on large arrays).
To solve this problem, you can write your own kernel. The thing is, the input arrays are not stored efficiently: they should be transposed so the computation can make more SIMD-friendly memory accesses, and transposing arrays is relatively slow on a GPU (as is creating new arrays), so please consider using a more suitable input data layout. Additionally, computing the square root is not required, since you can compare the squared distance to the squared radius. Here is an example computing a "collision" matrix (where rows are spheres and columns are points):
import cupy as cp
points = cp.random.rand(300_000, 3)
spheres = cp.random.rand(200, 3)
radii = cp.random.rand(200)
checkInSphere = cp.RawKernel(r'''
extern "C" __global__
void checkInSphere(const double* points, int pointCount, const double* spheres, const double* radii, int sphereCount, bool* found) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if(tid < pointCount)
    {
        const double px = points[tid];
        const double py = points[tid + pointCount];
        const double pz = points[tid + pointCount*2];
        for(int i = 0 ; i < sphereCount ; ++i)
        {
            const double r = radii[i];
            const double sx = spheres[i];
            const double sy = spheres[i + sphereCount];
            const double sz = spheres[i + sphereCount*2];
            const double dx = px - sx;
            const double dy = py - sy;
            const double dz = pz - sz;
            const double sqDist = dx * dx + dy * dy + dz * dz;
            found[i*pointCount + tid] = sqDist <= r * r;
        }
    }
}
''', 'checkInSphere')
threadPerBlock = 64
blockCount = (points.shape[0] + (threadPerBlock - 1)) // threadPerBlock
found = cp.empty((spheres.shape[0], points.shape[0]), dtype=cp.bool_)
checkInSphere((blockCount,), (threadPerBlock,), (points.T.copy(), points.shape[0], spheres.T.copy(), radii, spheres.shape[0], found))
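On small inputs, the kernel can be sanity-checked against a plain NumPy broadcast of the same collision-matrix definition (a validation sketch: found_ref would be compared with cp.asnumpy(found) computed on matching inputs):

```python
import numpy as np

rng = np.random.default_rng(0)
points_h = rng.random((1000, 3))   # host-side copies of the inputs
spheres_h = rng.random((20, 3))
radii_h = rng.random(20)

# Same definition as the kernel: rows are spheres, columns are points,
# comparing squared distance against squared radius (no sqrt needed).
sq_dist = ((spheres_h[:, None, :] - points_h[None, :, :]) ** 2).sum(axis=-1)
found_ref = sq_dist <= (radii_h ** 2)[:, None]
```

This brute-force broadcast materializes an intermediate (N, M, 3) array, so it is only practical for small N and M, but it is a useful reference when debugging the kernel's index arithmetic.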
This code is about 5 times faster in double precision than the one of @DanielF. The thing is, my GPU, like most GPUs, is VERY slow at operating on double-precision (DP) floating-point numbers. In fact, mainstream client-side GPUs are not designed for such workloads and are generally slower than CPUs at them. However, (expensive) server-side GPUs can operate efficiently on double-precision numbers. Thus, please consider working with single-precision (SP) floating-point numbers if you are using a basic mainstream GPU. If you really need DP, then this computation is certainly faster on a CPU using tools like Numba. Note that SP operations are generally faster on server-side GPUs and on CPUs as well. On my machine, the SP version of this code takes less than 1 ms and is about 20 times faster than the one of @DanielF.
For better performance, please consider changing the input layout, pre-allocating all the buffers, pre-computing the squares of radii, and making use of shared memory so as to load the sphere locations only once. Loop unrolling and register tiling should also help to improve performance even further, though they will make the code barely readable. One could also store only the index of the closest sphere for each point, or even just a boolean saying whether the point is in any sphere, so as to reduce the size of the output (memory accesses are expensive).