I have a relatively large array called allListings and want to select all rows where row[14] == listingID.
This is the code I am using:
tempRows = list(filter(lambda x: x[14] == listingID, allListings))
The filtering is repeated in a for loop for each distinct listingID. Profiling shows that this line consumes 95% of the runtime of the loop. Is there any other way to filter large arrays more efficiently?
CodePudding user response:
As suggested in the comments, if you are performing multiple lookups keyed on that column, you may want to sort the list and group it by that column once, then reuse the groups. Note that the sort is required: itertools.groupby only groups runs of consecutive equal keys.
>>> from itertools import groupby
>>> a = [[1, 2, 3, 5],
... [4, 6, 2, 8],
... [1, 5, 7, 9],
... [3, 5, 8, 2]]
>>> b = sorted(a, key=lambda x: x[0])
>>> b
[[1, 2, 3, 5], [1, 5, 7, 9], [3, 5, 8, 2], [4, 6, 2, 8]]
>>> c = groupby(b, key=lambda x: x[0])
>>> c
<itertools.groupby object at 0x106b763e0>
>>> d = {k: list(v) for k, v in c}
>>> d
{1: [[1, 2, 3, 5], [1, 5, 7, 9]], 3: [[3, 5, 8, 2]], 4: [[4, 6, 2, 8]]}
Now, if you need all lists where the first element is 1, you simply need:
>>> d[1]
[[1, 2, 3, 5], [1, 5, 7, 9]]
Or, if you wanted everything except 1 in that first position:
>>> [x for k, v in d.items()
... if k != 1
... for x in v]
[[3, 5, 8, 2], [4, 6, 2, 8]]
This is obviously a simplified example, but it maps directly onto your situation: key on x[14] instead of x[0], as in the sketch below.
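A minimal sketch applied to the question's data (allListings and listingID are the asker's names; listing_ids is a hypothetical stand-in for whatever collection the outer loop iterates over):
from itertools import groupby

def key(row):
    return row[14]

# one O(n log n) sort plus one O(n) grouping pass, done once up front
grouped = {k: list(rows)
           for k, rows in groupby(sorted(allListings, key=key), key=key)}

for listingID in listing_ids:
    tempRows = grouped.get(listingID, [])  # O(1) dict lookup instead of a full scan
If you do not need the groups ordered by key, a collections.defaultdict(list) filled in a single append pass achieves the same grouping without the sort.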
CodePudding user response:
I got about a 33% improvement by moving the filter into a Cython file and compiling it. The primary speedup, I think, comes from eliminating the reload of listingID on each comparison, but that is just a guess.
test.pyx
def all_listings_filter(list data, int listingID):
    # listingID is declared as a C int, so it is not re-fetched as a
    # Python object on every comparison
    return [row for row in data if row[14] == listingID]
command line
cython3 test.pyx
gcc -shared -pthread -fPIC -fwrapv -O2 -Wall -fno-strict-aliasing -I/usr/include/python3.10 -o test.so test.c
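To call it from Python once compiled (a minimal sketch; the module name test and the function come from the files above, allListings and listingID from the question):
import test

tempRows = test.all_listings_filter(allListings, listingID)
Note that this still scans the whole list for every listingID; it only lowers the constant factor, so the groupby/dict approach above scales better when many different IDs are queried.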