I have a relatively large array called allListings and want to select all rows where row[14] == listingID.
This is the code I am using:
tempRows = list(filter(lambda x: x[14] == listingID, allListings))
The filtering is repeated in a for loop for each distinct listingID. Profiling shows that this line consumes 95% of the runtime of the loop. Is there any other way to filter large arrays more efficiently?
CodePudding user response:
As suggested in the comments, if you are performing multiple lookups keyed on that column, you may want to sort the list and group it by that column once, then reuse the groups. Note that the sort is required: itertools.groupby only groups runs of consecutive equal keys.
>>> from itertools import groupby
>>> a = [[1, 2, 3, 5],
... [4, 6, 2, 8],
... [1, 5, 7, 9],
... [3, 5, 8, 2]]
>>> b = sorted(a, key=lambda x: x[0])
>>> b
[[1, 2, 3, 5], [1, 5, 7, 9], [3, 5, 8, 2], [4, 6, 2, 8]]
>>> c = groupby(b, key=lambda x: x[0])
>>> c
<itertools.groupby object at 0x106b763e0>
>>> d = {k: list(v) for k, v in c}
>>> d
{1: [[1, 2, 3, 5], [1, 5, 7, 9]], 3: [[3, 5, 8, 2]], 4: [[4, 6, 2, 8]]}
Now, if you need all lists where the first element is 1, you simply need:
>>> d[1]
[[1, 2, 3, 5], [1, 5, 7, 9]]
Or, if you wanted everything except 1 in that first position:
>>> [x for k, v in d.items()
... if k != 1
... for x in v]
[[3, 5, 8, 2], [4, 6, 2, 8]]
This is obviously a simplified example, but it maps directly onto your situation: key on x[14] instead of x[0], as in the sketch below.
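A minimal sketch applied to the question's data (allListings and listingID are the asker's names; listing_ids is a hypothetical stand-in for whatever collection the outer loop iterates over):
from itertools import groupby

def key(row):
    return row[14]

# one O(n log n) sort plus one O(n) grouping pass, done once up front
grouped = {k: list(rows)
           for k, rows in groupby(sorted(allListings, key=key), key=key)}

for listingID in listing_ids:
    tempRows = grouped.get(listingID, [])  # O(1) dict lookup instead of a full scan
If you do not need the groups ordered by key, a collections.defaultdict(list) filled in a single append pass achieves the same grouping without the sort.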
CodePudding user response:
I got about a 33% improvement by moving the filter into a Cython file and compiling it. The primary speedup, I think, comes from eliminating the reload of listingID on each comparison, but that is just a guess.
test.pyx
def all_listings_filter(list data, int listingID):
    # listingID is declared as a C int, so it is not re-fetched as a
    # Python object on every comparison
    return [row for row in data if row[14] == listingID]
command line
cython3 test.pyx
gcc -shared -pthread -fPIC -fwrapv -O2 -Wall -fno-strict-aliasing -I/usr/include/python3.10 -o test.so test.c
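To call it from Python once compiled (a minimal sketch; the module name test and the function come from the files above, allListings and listingID from the question):
import test

tempRows = test.all_listings_filter(allListings, listingID)
Note that this still scans the whole list for every listingID; it only lowers the constant factor, so the groupby/dict approach above scales better when many different IDs are queried.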