I need to compare a large list of file paths to be processed against a list of files that have already been processed. I don't want to just rewrite the strings in remaining to match the strings in original_list, because that would break portability (and the real strings are a lot more complicated than in this example). Is there a more efficient way to generate the final jobs variable?
original_list = [f'{x}.wav' for x in range(100000)]
output_list = [f'{x}.npy' for x in range(100000)]
already_done = [f'{x}.npy' for x in range(10000)]
jobs = list(zip(original_list, output_list))
remaining = list(set(output_list).difference(already_done))
jobs = [(x,y) for x,y in jobs if y in remaining]
CodePudding user response:
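Yes! Checking for membership in a list is expensive: every y in remaining test scans the list, so the comprehension does work proportional to len(jobs) * len(remaining). Keep remaining as a set instead of converting it to a list, since set lookups take constant time on average. Also, you don't need to convert jobs to a list. Leave it as the iterator returned by zip(); its elements are then produced only once, when the list comprehension consumes them.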
original_list = [f'{x}.wav' for x in range(100000)]
output_list = [f'{x}.npy' for x in range(100000)]
already_done = [f'{x}.npy' for x in range(10000)]
jobs = zip(original_list, output_list)  # lazy iterator, no intermediate list
remaining = set(output_list).difference(already_done)  # stays a set for fast lookups
jobs = [(x, y) for x, y in jobs if y in remaining]
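If the only goal is the filtered pairs, it may be simpler to skip building remaining altogether and test each output path against a set of the finished files. Here is a minimal sketch under that assumption; pending_jobs is a hypothetical helper name, not something from the question, and it assumes the paths are plain hashable strings.

```python
def pending_jobs(inputs, outputs, done):
    """Return (input, output) pairs whose output path is not in `done`."""
    done_set = set(done)  # built once; membership tests are O(1) on average
    # zip() pairs the paths lazily; the original order is preserved
    return [(src, dst) for src, dst in zip(inputs, outputs) if dst not in done_set]


original_list = [f'{x}.wav' for x in range(100000)]
output_list = [f'{x}.npy' for x in range(100000)]
already_done = [f'{x}.npy' for x in range(10000)]

jobs = pending_jobs(original_list, output_list, already_done)
print(len(jobs))  # 90000
print(jobs[0])    # ('10000.wav', '10000.npy')
```

This gives the same pairs as the set-difference version, but only hashes the already_done entries rather than all of output_list.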