Consider the following example from the Python documentation. Does multiprocessing.Process
use serialization (pickle) to put items in the shared queue?
from multiprocessing import Process, Queue

def f(q):
    q.put([42, None, 'hello'])

if __name__ == '__main__':
    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    print(q.get())    # prints "[42, None, 'hello']"
    p.join()
I understand that multiprocessing.Process uses pickle to serialize / deserialize data when communicating with the main process, and that threading.Thread does not need serialization, so it does not use pickle. But I'm not sure how communication through a Queue happens with a multiprocessing.Process.
Additional Context
I want multiple workers to fetch data from a database (or local storage) to fill a shared queue whose items are consumed by the main process sequentially. Each fetched record is large (1-1.5 MB). The problem with using multiprocessing.Process is that serialization / deserialization of the data takes a long time. PyTorch's DataLoader uses multiprocessing internally and is therefore unsuitable for my use case.
Is multi-threading the best alternative for such a use case?
CodePudding user response:
Yes, multiprocessing's queues do use pickle internally. This can be seen in multiprocessing/queues.py of the CPython implementation. In fact, AFAIK, CPython uses pickle for transferring any object between interpreter processes. The only way to avoid this is to use shared memory, but it introduces strong limitations and basically cannot be used with arbitrary object types.
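For large, flat payloads (raw bytes, buffers), the standard-library multiprocessing.shared_memory module (Python 3.8+) can sidestep the pickling cost: the payload is written into a named shared-memory block, and only the block's short name needs to travel through the queue. A minimal sketch, with both sides shown in one process for brevity (the 1.5 MB payload size mirrors the record size mentioned in the question; in practice the producer and consumer would be separate processes):

```python
from multiprocessing import shared_memory

# A ~1.5 MB payload, similar in size to one fetched record (assumption).
payload = b"x" * 1_500_000

# Producer side: copy the payload into a named shared-memory block.
# Only shm.name (a short string) would need to go through the queue,
# so the 1.5 MB body is never pickled.
shm = shared_memory.SharedMemory(create=True, size=len(payload))
shm.buf[:len(payload)] = payload

# Consumer side: attach to the block by name and read the bytes directly.
view = shared_memory.SharedMemory(name=shm.name)
data = bytes(view.buf[:len(payload)])
assert data == payload

# Cleanup: every attachment closes; the creator also unlinks.
view.close()
shm.close()
shm.unlink()
```

Note the limitation mentioned above: this only works for data you can lay out as a flat buffer yourself; arbitrary Python objects still have to be pickled into it one way or another.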
Multithreading is limited by the Global Interpreter Lock (GIL), which basically prevents any parallel speed-up except for operations that release the GIL (e.g. some NumPy functions) and I/O-bound ones.
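Since the fetching described in the question is I/O-bound (database or disk reads release the GIL while waiting), threads feeding a queue.Queue may fit: threads share one address space, so the queue passes object references with no pickling at all. A sketch, where fetch_record is a hypothetical placeholder for the real database read:

```python
import queue
import threading

def fetch_record(record_id):
    # Stand-in for an I/O-bound database read (hypothetical helper);
    # real drivers release the GIL while blocked on the socket or disk.
    return {"id": record_id, "blob": b"x" * 1024}

def worker(ids, out_q):
    for record_id in ids:
        # queue.Queue hands over the object reference directly --
        # no serialization, however large the record is.
        out_q.put(fetch_record(record_id))

q = queue.Queue(maxsize=8)  # bounded, so workers don't run ahead of the consumer
t = threading.Thread(target=worker, args=(range(3), q), daemon=True)
t.start()

# Main process (here: main thread) consumes sequentially.
for _ in range(3):
    record = q.get()
    print(record["id"])

t.join()
```

Multiple worker threads can share the same queue; as long as each worker spends most of its time waiting on I/O rather than running Python bytecode, the GIL is not the bottleneck.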
Python (and especially CPython) is not the best language for parallel computing (nor for high performance). It was not designed with that in mind, and this is nowadays a pretty strong limitation given the sharp increase in the number of cores per processor.