Multiprocessing to count frequency in multiple text files at once


I have 2 text files. I want to find the frequency of a letter (e.g. "L") in both of them. Is there a way to apply ThreadPoolExecutor or ProcessPoolExecutor to make this faster?

So far, everything I've tried has only increased the time taken.

def countFreq(data):
    res = {i : data.count(i) for i in set(data)}
    print(res)

This is the frequency-count function I'm using. I've already read the text files into strings.

#Normal method    
start = time.time()

countFreq(str1)
countFreq(str2)
end = time.time()

print(f"Time taken: {end-start:.5f} seconds\n")

The above code is faster than the code below. Why is that?

#Method multiprocessing
start = time.time()

p1 = multiprocessing.Process(countFreq(str1))
p2 = multiprocessing.Process(countFreq(str2))

p1.start()
p2.start()
p1.join()
p2.join()

end = time.time()
print(f"Time taken: {end-start:.5f} seconds\n")

Any ideas on how to run them faster? Is this an I/O-bound or a CPU-bound problem?

CodePudding user response:

Using parallel/concurrent programming won't necessarily speed up your program; sometimes it is better to keep it sequential, especially when all the threads/processes have to do is count letters in two text files.

Creating a new process consumes a lot of resources. Spawning and managing processes takes significantly more time and CPU than doing the same with threads, and even threads are not guaranteed to help here.

For counting only 2 files, I would use threads or keep it sequential. Only as the number of files grows would you start to see the parallel version pull ahead of the sequential one.
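For the many-files case, a pool is simpler than managing Process objects by hand. Here is a minimal sketch using concurrent.futures (not from the original post; the file names are hypothetical placeholders, and each worker reads its own file so the parent doesn't have to pickle large strings to the children):

import concurrent.futures


def count_freq_file(path):
    # read the file inside the worker process, then count
    # occurrences of each distinct character
    with open(path, 'r') as fh:
        data = fh.read()
    return {c: data.count(c) for c in set(data)}


if __name__ == '__main__':
    paths = [f'./text{i}' for i in range(1, 101)]  # placeholder file names
    with concurrent.futures.ProcessPoolExecutor() as pool:
        # map preserves input order, so results line up with paths
        for path, freq in zip(paths, pool.map(count_freq_file, paths)):
            print(path, freq)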

For more information I would highly recommend reading about Amdahl's law.
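Roughly, Amdahl's law says that if a fraction p of the work can be parallelized across n workers, the best possible overall speedup is 1 / ((1 - p) + p / n). A quick back-of-the-envelope check (my sketch, not part of the law itself) shows why two tiny files don't benefit:

def amdahl_speedup(p, n):
    # p: parallelizable fraction of the work, n: number of workers
    return 1 / ((1 - p) + p / n)

# if process startup dominates, p is small and 2 workers barely help:
print(amdahl_speedup(0.2, 2))  # ~1.11x at best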

As a side note, you should pass the function itself to the target parameter of multiprocessing.Process and its argument to the args parameter. In your snippet, multiprocessing.Process(countFreq(str1)) calls countFreq(str1) immediately in the parent process, so all the counting runs sequentially before the (empty) processes are even started — that is why that version is slower. Also note that args must be a tuple, so a single argument needs a trailing comma: target=countFreq, args=(str1,)

import time
import multiprocessing


def count_freq(data):
    # count occurrences of each distinct character in the string
    res = {i: data.count(i) for i in set(data)}
    print(res)


def text_to_string(path):
    # read the whole file into a single string
    with open(path, 'r') as file_handler:
        return file_handler.read()


def main():
    start = time.time()

    count_freq(text_to_string('./text1'))
    count_freq(text_to_string('./text2'))
    # the sequential version finishes in about 0.001 s on these inputs
    end = time.time()

    print(f'sequential: {end - start} s')

    start = time.time()

    # the files are read in the parent; the strings are then pickled
    # and sent to each child process, adding to the spawn overhead
    p1 = multiprocessing.Process(target=count_freq, args=(text_to_string('./text1'),))
    p2 = multiprocessing.Process(target=count_freq, args=(text_to_string('./text2'),))

    p1.start()
    p2.start()

    p1.join()
    p2.join()

    end = time.time()

    print(f'concurrent: {end - start} s')


if __name__ == '__main__':
    main()
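One last note on the counting itself: the dict comprehension calls data.count(i) once per distinct character, re-scanning the whole string each time. collections.Counter does the same job in a single pass, which for large files may matter more than any parallelism. A drop-in sketch (Counter is standard library, but this rewrite is my suggestion, not the original code):

from collections import Counter

def count_freq(data):
    # one pass over the string instead of one full scan per distinct character
    return Counter(data)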
