I am looking for some help speeding up some code that I have written in Numpy. Here is the code:
def TimeChunks(timevalues, num):
avg = len(timevalues) / float(num)
out = []
last = 0.0
while last < len(timevalues):
out.append(timevalues[int(last):int(last avg)])
last = avg
return out
### chunk i can be called by out[i] ###
NumChunks = 100000
t1chunks = TimeChunks(t1, NumChunks)
t2chunks = TimeChunks(t2, NumChunks)
NumofBins = 2000
CoincAllChunks = 0
for i in range(NumChunks):
CoincOneChunk = 0
Hist1, something1 = np.histogram(t1chunks[i], NumofBins)
Hist2, something2 = np.histogram(t2chunks[i], NumofBins)
Mask1 = (Hist1>0)
Mask2 = (Hist2>0)
MaskCoinc = Mask1*Mask2
CoincOneChunk = np.sum(MaskCoinc)
CoincAllChunks = CoincAllChunks CoincOneChunk
Is there anything that can be done to improve this to make it more efficient for large arrays?
To explain the point of the code in a nutshell, the purpose of the code is simply to find the average "coincidences" between two NumPy arrays, representing time values of two channels (divided by some normalisation constant). This "coincidence" occurs when there is at least one time value in each of the two channels in a certain time interval.
For example:
t1 = [.4, .7, 1.1]
t2 = [0.8, .9, 1.5]
There is a coincidence in the window [0,1] and one coincidence in the interval [1, 2].
I want to find the average number of these "coincidences" when I break down my time array into a number of equally distributed bins. So for example if:
t1 = [.4, .7, 1.1, 2.1, 3, 3.3]
t2 = [0.8, .9, 1.5, 2.2, 3.1, 4]
And I want 4 bins, the intervals I'll consider are ([0,1], [1,2], [2,3], [3,4]). Therefore the total coincidences will be 4 (because there is a coincidence in each bin), and therefore the average coincidences will be 4.
This code is an attempt to do this for large time arrays for very small bin sizes, and as a result, to make it work I had to break down my time arrays into smaller chunks, and then for-loop through each of these chunks.
I've tried making this as vectorized as I can, but it still is very slow... Any ideas what can be done to speed it up further?
Any suggestions/hints will be appreciated. Thanks!.
CodePudding user response:
This is 17X faster and more correct using a custom made numba_histogram
function that beats the generic np.histogram
. Note that you are computing and comparing histograms of two different series separately, which is not accurate for your purpose. So, in my numba_histogram
function I use the same bin edges to compute the histograms of both series simultaneously.
We can still optimize it even further if you provide more precise details about the algorithm. Namely, if you provide specific details about the parameters and the criteria on which you decide that two intervals coincide.
import numpy as np
from numba import njit
@njit
def numba_histogram(a, b, n):
hista, histb = np.zeros(n, dtype=np.intp), np.zeros(n, dtype=np.intp)
a_min, a_max = min(a[0], b[0]), max(a[-1], b[-1])
for x, y in zip(a, b):
bin = n * (x - a_min) / (a_max - a_min)
if x == a_max:
hista[n - 1] = 1
elif bin >= 0 and bin < n:
hista[int(bin)] = 1
bin = n * (y - a_min) / (a_max - a_min)
if y == a_max:
histb[n - 1] = 1
elif bin >= 0 and bin < n:
histb[int(bin)] = 1
return np.sum( (hista > 0) * (histb > 0) )
@njit
def calc_coincidence(t1, t2):
NumofBins = 2000
NumChunks = 100000
avg = len(t1) / NumChunks
CoincAllChunks = 0
last = 0.0
while last < len(t1):
t1chunks = t1[int(last):int(last avg)]
t2chunks = t2[int(last):int(last avg)]
CoincAllChunks = numba_histogram(t1chunks, t2chunks, NumofBins)
last = avg
return CoincAllChunks
Test with 10**8
arrays:
t1 = np.arange(10**8) np.random.rand(10**8)
t2 = np.arange(10**8) np.random.rand(10**8)
CoincAllChunks = calc_coincidence(t1, t2)
print( CoincAllChunks )
# 34793890 Time: 24.96140170097351 sec. (Original)
# 34734897 Time: 1.499996423721313 sec. (Optimized)