Based on what I have read - for example here - I understand that I/O operations release the GIL. So, if I have to read a large number of files from the local filesystem, my understanding is that threaded execution should speed things up.
To test this, I have a folder (input) with ~100k files - each file has just one line containing a single random integer. I have two functions - one "sequential" and one "concurrent" - that just add up all the numbers:
import glob
import concurrent.futures

ALL_FILES = glob.glob('./input/*.txt')

def extract_num_from_file(fname):
    #time.sleep(0.1)
    with open(fname, 'r') as f:
        file_contents = int(f.read().strip())
    return file_contents

def seq_sum_map_based():
    return sum(map(extract_num_from_file, ALL_FILES))

def conc_sum_map_based():
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        return sum(executor.map(extract_num_from_file, ALL_FILES))
While both functions give me the same result, the "concurrent" version is about 3-4 times slower.
In [2]: %timeit ss.seq_sum_map_based()
3.77 s ± 50.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [3]: %timeit ss.conc_sum_map_based()
12.8 s ± 240 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Is there something wrong with my code or in my understanding?
CodePudding user response:
The problem is that while the threads may be operating in parallel, the data still has to come off the hard drive sequentially - there is only a single read head. Worse, because you've parallelized the I/O, the OS schedules those requests so that each file is only partially processed before it switches to another thread - even if a file holds just a single integer, its metadata still has to be read as well - which makes the read head jump around far more than in your strictly sequential code. All of this adds up to massively increased overhead compared to simply reading each file in full, one after another, which requires far fewer jumps.
This wouldn't be as much of a problem if, for instance, you had a single thread loading large amounts of data from disk while a second thread performed some time-intensive processing on it, as that would allow the time-intensive processing to continue unblocked by the I/O operations. Your particular scenario is just a really, really bad case where you've given up a GIL bottleneck in exchange for a horrifically slow I/O bottleneck.
In short, you've understood correctly that I/O operations release the GIL; you just came to the wrong conclusion about parallelizing file reads.
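To make that contrast concrete, here is a minimal sketch of the pattern where threading does pay off: one thread streams file contents into a queue while the main thread does CPU-heavy work on whatever has already arrived. (This isn't from the question - cpu_heavy_transform and the queue size are just illustrative placeholders.)

import queue
import threading

def read_files(paths, out_q):
    # I/O-bound producer: the GIL is released while blocked inside read()
    for path in paths:
        with open(path, 'rb') as f:
            out_q.put(f.read())
    out_q.put(None)  # sentinel: no more data

def cpu_heavy_transform(data):
    # placeholder for some genuinely time-intensive processing
    return sum(data)

def process_all(paths):
    q = queue.Queue(maxsize=10)
    reader = threading.Thread(target=read_files, args=(paths, q))
    reader.start()
    results = []
    while True:
        chunk = q.get()
        if chunk is None:
            break
        results.append(cpu_heavy_transform(chunk))
    reader.join()
    return results

Here the processing thread rarely waits on the disk and the reader rarely waits on the processor, which is exactly the situation where releasing the GIL during I/O buys you something.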
CodePudding user response:
The other answer stated this quite well:
"you've given up a GIL bottleneck in exchange for a horrifically slow I/O bottleneck. In short, you've understood correctly that I/O operations release the GIL; you just came to the wrong conclusion about parallelizing file reads."
I'll add that threaded file reading can be more performant if you have I/O to spare, like you might have on a very fast SSD.
Using an implementation like so:
import queue
import threading
from collections import deque

class ThreadedFileReader:
    # Workers pull filenames off a shared deque and push per-file results onto a queue.
    def __init__(self, files, n=5):
        self.files = deque(files)
        self.threads = []
        self.results = queue.Queue()
        for _ in range(n):
            t = threading.Thread(target=self.worker)
            t.start()
            self.threads.append(t)

    def worker(self):
        while self.files:
            try:
                fname = self.files.pop()
            except IndexError:
                # another worker drained the deque between the check and the pop
                break
            with open(fname, encoding='utf-8') as f:
                data = f.read()
            self.results.put(len(data))
        return

    def join(self):
        for t in self.threads:
            t.join()
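Note that join() only waits for the worker threads; the per-file results still sit in the queue. Assuming ALL_FILES is a list of paths as in the question, you would drain them afterwards along these lines:

reader = ThreadedFileReader(ALL_FILES, n=2)
reader.join()

total_chars = 0
while not reader.results.empty():  # safe here because every worker has already exited
    total_chars += reader.results.get()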
I tested this on a relatively fast SSD (Samsung 970 EVO 1TB; secondary drive with no activity) reading from ~1000 files of various sizes, averaging 8800 characters. In testing, adding a second thread beat the single-threaded run, but the gains evaporated quickly as I added more threads.
In [1]: len(ALL_FILES)
995
In [2]: %timeit ThreadedFileReader(ALL_FILES, n=1).join() # single threaded
61.8 ms ± 305 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [3]: %timeit ThreadedFileReader(ALL_FILES, n=2).join()
54.7 ms ± 158 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [4]: %timeit ThreadedFileReader(ALL_FILES, n=3).join()
56.1 ms ± 135 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [5]: %timeit ThreadedFileReader(ALL_FILES, n=4).join()
57.8 ms ± 131 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [6]: %timeit ThreadedFileReader(ALL_FILES, n=5).join()
58.9 ms ± 236 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [7]: %timeit ThreadedFileReader(ALL_FILES, n=50).join()
68.6 ms ± 378 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
So, your idea is sound in principle, but only if the drive has I/O capacity to spare beyond what a single sequential reader already uses. Unless you have very fast storage, you're unlikely to benefit from more than an extra thread or two. If your storage is not fast at all, the single-threaded approach may be the way to go.
Remember that if you have many threads reading files at the same time, especially small files, you're likely to be bottlenecked by your drive's random read performance. A single-threaded pass over large files, by comparison, is bottlenecked closer to your drive's sequential read performance. Depending on the hardware, these can be substantially different numbers.
So, depending on your hardware and the characteristics of the data you're reading, the benefit of sequential reads may outweigh any potential gains from parallelizing.
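If you want to see which side of that trade-off your own setup lands on, a rough stdlib-only comparison - reusing seq_sum_map_based and conc_sum_map_based from the question - could look like this:

import timeit

# seq_sum_map_based and conc_sum_map_based are the functions defined in the question
for fn in (seq_sum_map_based, conc_sum_map_based):
    best = min(timeit.repeat(fn, number=1, repeat=3))
    print(f"{fn.__name__}: best of 3 runs = {best:.2f} s")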