I'm building a text-similarity project that takes a list of names (100,000 records) and finds the best match in a larger piece of text (the document). I've already done this successfully with the rapidfuzz and fuzzyset libraries, but I want to see if there's a faster way for my specific use case. I'm playing around with tri-grams and hashing the tri-gram strings, and I want to avoid loops as much as possible for performance (Pythonic). Below is a snippet where a might be the tri-gram representation of the document and b would be the representation of one name.
import numpy as np
a = np.array([(1,2,3), (1,3,3), (3,3,3), (3,3,4)])
b = np.array((1,2,3))
print(np.sum(a == b))
The output is 6, but I want it to be [3, 2, 1, 0], or alternatively just the max result, 3.
EDIT:
Looking at the problem in more detail, a static tuple length won't work for names of differing lengths. E.g. MATTHEW is represented as (MAT, ATT, TTH, THE, HEW) while CALEB is represented as (CAL, ALE, LEB).
My current thinking is that it might be best to break the document down into tri-grams with a sliding window, sized either to the max length of all the names in the list or somehow dynamically to the length of the name currently being searched. Any ideas welcome.
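For reference, the sliding-window tri-gram decomposition described above can be sketched in a few lines (the function name trigrams is just an illustrative choice):

```python
def trigrams(s):
    """Split a string into overlapping 3-character windows."""
    return [s[i:i + 3] for i in range(len(s) - 2)]

print(trigrams("MATTHEW"))  # ['MAT', 'ATT', 'TTH', 'THE', 'HEW']
print(trigrams("CALEB"))    # ['CAL', 'ALE', 'LEB']
```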
CodePudding user response:
You can use the axis argument:
>>> np.sum(a==b, axis=1)
array([3, 2, 1, 0])
This will sum across rows and return a 1D array. Then, to get the max, either of the following will work.
>>> np.max(np.sum(a==b, axis=1))
3
>>> np.sum(a==b, axis=1).max()
3
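If you ultimately need to know which row matched best rather than just the max score, np.argmax on the same per-row sums gives the index of the winning row. A minimal sketch using the arrays from the question:

```python
import numpy as np

a = np.array([(1, 2, 3), (1, 3, 3), (3, 3, 3), (3, 3, 4)])
b = np.array((1, 2, 3))

# b is broadcast against each row of a; summing along axis=1
# counts per-row element matches.
scores = np.sum(a == b, axis=1)  # [3, 2, 1, 0]
best_row = np.argmax(scores)     # 0 -> a[0] is the closest row to b
print(scores, best_row)
```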
CodePudding user response:
Since you want to count non-zero values, one alternative is to use np.count_nonzero:
import numpy as np
a = np.array([(1, 2, 3), (1, 3, 3), (3, 3, 3), (3, 3, 4)])
b = np.array((1, 2, 3))
res = np.count_nonzero(a == b, axis=1)
print(res)
Output
[3 2 1 0]
Afterwards you can find the max value with .max:
res.max() # 3
It seems both approaches are comparable in terms of speed for not-so-small arrays; see below for a (not very thorough) comparison:
import numpy as np
a = np.array([(1, 2, 3), (1, 3, 3), (3, 3, 3), (3, 3, 4)] * 1000)
b = np.array((1, 2, 3))
%timeit np.count_nonzero(a == b, axis=1)
110 µs ± 744 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.sum(a == b, axis=1)
108 µs ± 379 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
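To address the edit about names of differing lengths: if the document's tri-grams are hashed to integers (as you describe), numpy's sliding_window_view can build all document windows of the current name's length at once, so the per-row comparison above still applies without a Python loop. A sketch, assuming the integer values below stand in for hashed tri-grams:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Hypothetical hashed tri-gram sequences (any integer hash would do).
doc = np.array([7, 1, 2, 3, 9, 4, 1, 2])  # document as a sequence of tri-gram hashes
name = np.array([1, 2, 3])                # one name's tri-gram hashes

# All windows of the name's length, shape (len(doc) - len(name) + 1, len(name)).
windows = sliding_window_view(doc, len(name))
scores = np.count_nonzero(windows == name, axis=1)
best = scores.argmax()
print(scores, best)  # [0 3 0 0 0 0] 1 -> doc[1:4] is the best-matching window
```

This sizes the window dynamically per name; the window construction is a view, so no document data is copied.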