Ways to speed up this list comprehension

I need to compare a large list of file paths that are still to be processed against a list of files that have already been processed. I don't want to just rewrite the strings in remaining to match the strings in the original list, because that would break portability (the real strings are also a lot more complicated than in this example). Is there a more efficient way to generate the final jobs variable?

original_list = [f'{x}.wav' for x in range(100000)]
output_list = [f'{x}.npy' for x in range(100000)]
already_done = [f'{x}.npy' for x in range(10000)]
jobs = list(zip(original_list, output_list))
remaining = list(set(output_list).difference(already_done))
jobs = [(x,y) for x,y in jobs if y in remaining]

Answer:

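Yes! Testing membership in a list scans the list element by element, so each "y in remaining" check is O(n), and the whole comprehension can take up to len(jobs) × len(remaining) comparisons. Keep remaining as a set instead of converting it to a list, and each check becomes O(1) on average. You also don't need to convert jobs to a list: leave it as the iterator returned by zip(), and its elements are produced only once, when the list comprehension consumes them.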

original_list = [f'{x}.wav' for x in range(100000)]
output_list = [f'{x}.npy' for x in range(100000)]
already_done = [f'{x}.npy' for x in range(10000)]
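# zip() returns a lazy iterator; the pairs are produced once, inside the comprehension
jobs = zip(original_list, output_list)
# keep remaining as a set so that each membership test is O(1) on average
remaining = set(output_list).difference(already_done)
jobs = [(x, y) for x, y in jobs if y in remaining]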
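
A possible further simplification, offered as a sketch rather than a tested drop-in for the real paths: every y comes from output_list, so "y in remaining" is equivalent to "y not in already_done". That means you can skip building remaining and test directly against a set made from already_done. The helper name remaining_jobs is hypothetical and only there to keep the example self-contained.

```python
def remaining_jobs(inputs, outputs, done):
    """Pair each input with its output and drop pairs whose output is already done."""
    done_set = set(done)  # built once; membership tests are O(1) on average
    # zip() is consumed lazily here, so no intermediate list of pairs is created
    return [(src, dst) for src, dst in zip(inputs, outputs) if dst not in done_set]


original_list = [f'{x}.wav' for x in range(100000)]
output_list = [f'{x}.npy' for x in range(100000)]
already_done = [f'{x}.npy' for x in range(10000)]

jobs = remaining_jobs(original_list, output_list, already_done)
print(len(jobs))  # 90000
print(jobs[0])    # ('10000.wav', '10000.npy')
```

The resulting list keeps the order of the original zip, and the set is built from the smaller already_done list rather than from all of output_list.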