I'm confused trying to understand multiprocessing and multithreading.
Will this code create 5 threads, one for each element in the list? If the list has a million items, will it create a million threads? Do threads have limits?
And how many threads will be created if I specify ThreadPoolExecutor(10): 5 or 10?
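import concurrent.futures
def function(item):  # placeholder for the real work
    return item * 2
items = [1, 2, 3, 4, 5]  # renamed from `list` so the built-in is not shadowed
with concurrent.futures.ThreadPoolExecutor() as executor:
    results = list(executor.map(function, items))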
When it comes to multiprocessing, it's even more confusing. For example, does this code run 5 processes, each one on a different core? And if the CPU has only 4 cores, where does the last process run?
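import concurrent.futures
def function(item):  # placeholder for the real work
    return item * 2
if __name__ == "__main__":  # guard needed where worker processes are spawned (e.g. Windows)
    items = [1, 2, 3, 4, 5]
    with concurrent.futures.ProcessPoolExecutor() as executor:
        results = list(executor.map(function, items))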
My goal is to process many folders (thousands), read the files inside each folder, and write the data to a database.
CodePudding user response:
A CPU core executes one thread at a time (two with hyperthreading), so for CPU-bound work there is little to gain from running more workers than cores. Take the workload, divide it into sections, and assign each worker to a section. The multiprocessing pool library does a great job of doing this automatically.
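As a rough illustration of "divide it up into sections", here is a minimal sketch. The helper names (split_into_sections, process_section, run_sections) are made up for this example, and it uses a thread pool only to keep the sketch short; multiprocessing.Pool.map does comparable chunking on its own through its chunksize parameter.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def split_into_sections(items, n_sections):
    """Split items into at most n_sections contiguous, nearly equal slices."""
    size, extra = divmod(len(items), n_sections)
    sections, start = [], 0
    for i in range(n_sections):
        # The first `extra` sections get one additional item each.
        end = start + size + (1 if i < extra else 0)
        if end > start:
            sections.append(items[start:end])
        start = end
    return sections


def process_section(section):
    # Hypothetical stand-in for the real per-section work.
    return sum(x * x for x in section)


def run_sections(items, n_workers=None):
    """Hand one section to each worker and collect one result per section."""
    n_workers = n_workers or os.cpu_count() or 1
    sections = split_into_sections(items, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as executor:
        return list(executor.map(process_section, sections))


if __name__ == "__main__":
    print(run_sections(list(range(10)), n_workers=4))
```

Whether one section per worker is the right granularity depends on the workload; smaller sections balance uneven work better at the cost of more scheduling overhead.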
CodePudding user response:
so this code will create 5 threads for each element in the list?
items = [1, 2, 3, 4, 5]
with concurrent.futures.ThreadPoolExecutor() as executor:
    executor.map(function, items)
That code will create one ThreadPoolExecutor, and then it will submit five tasks to the executor.
if the list has a million items will this create a million threads?
A task is not a thread. A task is some piece of work that needs to be done. In your example above, the tasks are "call function(1)," "call function(2)," etc. The ThreadPoolExecutor object will create some number of worker threads that it uses to perform those tasks, but the number of worker threads can be much smaller than the number of tasks that it eventually performs for you.
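One way to see the difference between tasks and threads for yourself is to have every task report which thread ran it. This is only a sketch; record_thread and count_threads_used are names invented for the example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def record_thread(_task_number):
    # Each task just reports the name of the worker thread that ran it.
    return threading.current_thread().name


def count_threads_used(n_tasks, max_workers):
    """Run n_tasks tasks and return how many distinct threads executed them."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        names = set(executor.map(record_thread, range(n_tasks)))
    return len(names)


if __name__ == "__main__":
    # A thousand tasks, but never more than four distinct worker threads.
    print(count_threads_used(n_tasks=1000, max_workers=4))
```

The exact count can vary from run to run, but it should not exceed max_workers no matter how many tasks are submitted.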
And how many threads will be created if I specify ThreadPoolExecutor(10)?
The documentation for ThreadPoolExecutor describes it as using a pool of at most max_workers threads, so in that case the executor will never have more than ten workers running at the same time. With only five tasks to run, it has no use for more than five of them.
Each worker runs only one task at a time. When the program submits more tasks than the maximum number of worker threads that the executor is allowed to manage (i.e., more than ten tasks in this case) then the most recently submitted tasks will wait in a queue. As soon as a worker finishes one task, it tries to pick another from the queue and perform it. If a worker finds the queue to be empty, then the worker becomes idle.
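The queueing behaviour described above can be checked with a small experiment: count how many tasks are in progress at once and remember the highest value seen. This is a sketch with made-up names (measure_peak_concurrency, task), and the short sleep merely stands in for real work.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def measure_peak_concurrency(n_tasks, max_workers):
    """Return the largest number of tasks that were running simultaneously."""
    lock = threading.Lock()
    running = 0
    peak = 0

    def task(_):
        nonlocal running, peak
        with lock:
            running += 1
            peak = max(peak, running)
        time.sleep(0.01)  # stand-in for real work
        with lock:
            running -= 1

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Consume the iterator so that every task has finished before returning.
        list(executor.map(task, range(n_tasks)))
    return peak


if __name__ == "__main__":
    # Thirty tasks are submitted, but at most three run at the same time.
    print(measure_peak_concurrency(n_tasks=30, max_workers=3))
```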
Thread pools in other libraries that I have used employ various policies regarding idle workers. Some kill off workers that have been idle for too long, and then re-create them later when new tasks are submitted. The simplest keep an exact number of worker threads at all times, and allow them to remain idle for as long as they are not needed. I don't know about Python's ThreadPoolExecutor. The documentation does not say what it does with idle workers.
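Since the documentation is silent on this, one option is simply to observe what your interpreter does. The sketch below counts live threads before any task is submitted, after a few tasks have run, and after the executor has shut down. What it shows is an implementation detail of the interpreter you run it on (here assumed to be CPython), not documented behaviour, and observe_worker_threads is a name invented for the example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def observe_worker_threads(max_workers, n_tasks):
    """Return extra live threads: after creation, after tasks, after shutdown."""
    baseline = threading.active_count()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        after_create = threading.active_count() - baseline
        for i in range(n_tasks):
            # Wait for each task so that the workers are idle between submissions.
            executor.submit(pow, i, 2).result()
        after_tasks = threading.active_count() - baseline
    # Leaving the `with` block shuts the executor down and joins its workers.
    after_shutdown = threading.active_count() - baseline
    return after_create, after_tasks, after_shutdown


if __name__ == "__main__":
    print(observe_worker_threads(max_workers=10, n_tasks=3))
```

On the CPython versions I am aware of, no worker exists until the first task is submitted, idle workers stay alive while the executor is open, and they are gone after shutdown; treat that as an observation to verify rather than a guarantee.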
does this code run 5 processes each one on a different core? and what if the CPU has only 4 cores...?
Python does not assign threads or processes to cores. Deciding how and when to use the host's CPU cores is the job of the operating system scheduler. Even a host that has just one core can run a program that has many threads. Every so often (maybe 100 times each second) the scheduler is awakened by a timer interrupt. It considers all of the threads that actually are running at that moment (at most, one thread on each logical core) and it considers all of the threads that are ready-to-run (a.k.a., "runnable"), and it may preempt a running thread (i.e., move it off its core, and back into the ready-to-run queue) so that another thread can have a turn. The same holds for processes: ProcessPoolExecutor hands the five tasks to a pool of worker processes (by default, at most as many as the machine has processors), and on a 4-core machine the scheduler shares the cores among whichever workers are runnable.
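To make the point about the scheduler concrete, the sketch below deliberately starts several times more threads than the machine has logical cores and checks that every one of them still finishes. The function name and the threads_per_core parameter are invented for the example.

```python
import os
import threading


def run_more_threads_than_cores(threads_per_core=4):
    """Start more threads than logical cores and report how many finished."""
    n_cores = os.cpu_count() or 1
    n_threads = n_cores * threads_per_core
    finished = []
    lock = threading.Lock()

    def work(thread_index):
        sum(range(100_000))  # a little CPU work so the thread needs a turn
        with lock:
            finished.append(thread_index)

    threads = [threading.Thread(target=work, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return n_cores, n_threads, len(finished)


if __name__ == "__main__":
    cores, threads, done = run_more_threads_than_cores()
    print(f"{cores} logical cores, {threads} threads started, {done} finished")
```

Nothing in the program says which core a thread runs on; the operating system gives each runnable thread a turn until all of them are done.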