Is there a more efficient way to remove items from a list based on another list?

PUNCT_CHARS = {'(', ')', ',', '，', '、', ':', '：', '[', ']', '#'}

words = ['a', '#good', 'student']

for word in words.copy():
    for char in PUNCT_CHARS:
        if char in word:
            words.remove(word)
            break

print(words)

['a', 'student']

I want to remove words that contain punctuation. Can the inner for loop be replaced with an any() or all() call to make it more efficient?

CodePudding user response:

Take advantage of your PUNCT_CHARS set: set.isdisjoint() accepts any iterable, so you can check each word's characters against it directly:

out = [w for w in words if PUNCT_CHARS.isdisjoint(w)]

Output: ['a', 'student']

To modify your original object:

words[:] = [w for w in words if PUNCT_CHARS.isdisjoint(w)]

CodePudding user response:

for word in words:
    if not any(char in word for char in PUNCT_CHARS):
        print(word)

Or you can use list comprehension to make such a list:

words = [i for i in words if not any(k in i for k in PUNCT_CHARS)]

CodePudding user response:

list(filter(lambda word: not (set(word) & PUNCT_CHARS), words))

CodePudding user response:

You could join your punctuation characters into a regular expression and use that to filter your list.

>>> import re
>>> r = re.compile(f"[{re.escape(''.join(PUNCT_CHARS))}]")
>>> [w for w in words if not r.search(w)]
['a', 'student']
>>>

CodePudding user response:

There are already several solutions, all of which seem to yield the result you want. One that has not been mentioned yet is:

words = [w for w in words if not [c for c in PUNCT_CHARS if c in w]]

But you probably want to know which of the proposed solutions is fastest, so I have timed them for comparison.

from timeit import timeit
import re

def f1():
    PUNCT_CHARS = {'(', ')', ',', '，', '、', ':', '：', '[', ']', '#'}
    words = ['a', '#good', 'student']
    for word in words.copy():
        for char in PUNCT_CHARS:
            if char in word:
                words.remove(word)
                break

def f2():
    PUNCT_CHARS = {'(', ')', ',', '，', '、', ':', '：', '[', ']', '#'}
    words = ['a', '#good', 'student']
    words = [w for w in words if not [c for c in PUNCT_CHARS if c in w]]

def f3():
    PUNCT_CHARS = {'(', ')', ',', '，', '、', ':', '：', '[', ']', '#'}
    words = ['a', '#good', 'student']
    words = list(filter(lambda word: not (set(word) & PUNCT_CHARS), words))

def f4():
    PUNCT_CHARS = {'(', ')', ',', '，', '、', ':', '：', '[', ']', '#'}
    words = ['a', '#good', 'student']
    words[:] = [w for w in words if PUNCT_CHARS.isdisjoint(w)]

def f5():
    PUNCT_CHARS = {'(', ')', ',', '，', '、', ':', '：', '[', ']', '#'}
    words = ['a', '#good', 'student']
    words = [i for i in words if not any(k in i for k in PUNCT_CHARS)]

def f6():
    PUNCT_CHARS = {'(', ')', ',', '，', '、', ':', '：', '[', ']', '#'}
    words = ['a', '#good', 'student']
    r = re.compile(f"[{re.escape(''.join(PUNCT_CHARS))}]")
    words = [w for w in words if not r.search(w)]

print(timeit(f1))
print(timeit(f2))
print(timeit(f3))
print(timeit(f4))
print(timeit(f5))
print(timeit(f6))

Here are the results:

8.882432685000822
14.432570434000809
9.28880862799997
5.890979707000952
140.1832275639972
10.627997105999384

(I re-initialized the lists in each function in case there were any side effects on optimization that might skew the performance results in particular cases, and to keep each version as close as possible to the original code.)

However, for the regular-expression solution, compiling the pattern just once, outside the timed function, yields an even faster time:

3.643752547002805
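
That variant would look along these lines (a minimal sketch, assuming f6 with the pattern hoisted to module level; the names f7 and PUNCT_RE are illustrative):

from timeit import timeit
import re

PUNCT_CHARS = {'(', ')', ',', '，', '、', ':', '：', '[', ']', '#'}
# Compile the pattern once, at module level, instead of inside the function.
PUNCT_RE = re.compile(f"[{re.escape(''.join(PUNCT_CHARS))}]")

def f7():
    words = ['a', '#good', 'student']
    words = [w for w in words if not PUNCT_RE.search(w)]

print(timeit(f7))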

Especially for a longer list of words, the regular expression might turn out to be better, though I am unsure; it may also depend on whether your punctuation characters change or only your list of words does. If you need to perform this step just once, on a short list of words, using isdisjoint() seems best. If you need to perform it many times, on a long list of words, with fixed punctuation, compile the regular expression once and reuse it. And if you care about performance in the first place, yours is probably the latter case.

The versions with isdisjoint() and with the regular expression seem to be the fastest of those proposed, with the version using any() trailing far behind. I am using Python 3.10 under macOS (2015 MacBook, with a 2.9 GHz Intel CPU). I am not sure how to properly proceed with my answer in such a case; the answers whose code I cited probably deserve the votes.

(This is an aside: judging from code like this, which I have encountered many times before, what you may originally be intending to do is to tokenize a text and then filter out words. In that case, the fastest approach might be "none of the above": it could be better to exclude words containing those punctuation characters as you tokenize the text in the first place, again probably using a regular expression, so that you never need the second filtering pass at all. That gives your filtering step the fastest possible run-time of "zero". A minimal sketch of that single-pass idea follows; it is illustrative only, and the pattern shown is just one way to keep only whitespace-delimited tokens that contain no punctuation.)
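
import re

PUNCT_CHARS = {'(', ')', ',', '，', '、', ':', '：', '[', ']', '#'}
punct = re.escape(''.join(PUNCT_CHARS))
# A token is a maximal run of characters that are neither whitespace nor
# punctuation, bounded by whitespace or the string edges; any word that
# touches punctuation is never emitted, so no second filtering pass is needed.
TOKEN_RE = re.compile(rf"(?<!\S)[^\s{punct}]+(?!\S)")

print(TOKEN_RE.findall('a #good student'))  # ['a', 'student']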
