Zip file - Parallelism

I want to zip all the files in a folder using C#. I'm not sure whether it's a good idea to use a parallel loop to add files to a zip archive, something like the code below. I need to handle almost 20k files of about 2 MB each. Looking for a suggestion.

    // po is a ParallelOptions instance defined elsewhere
    using (var archive = ZipFile.Open(zf, ZipArchiveMode.Create))
    {
        Parallel.ForEach(files, po, fPath =>
        {
            archive.CreateEntryFromFile(fPath, Path.GetFileName(fPath));
        });
    }

CodePudding user response:

System.IO.Compression.ZipArchive is not thread safe, so you cannot use it from a parallel loop. The common convention is that static methods should be thread safe but instances are not, unless the documentation notes otherwise. So never put things in parallel loops without verifying that every class involved is thread safe.
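
For what it's worth, the straightforward safe version just drops the parallel loop (reusing the `zf` and `files` from the question):

    using System.IO;
    using System.IO.Compression;

    // Safe: the archive is only ever touched by one thread.
    using (var archive = ZipFile.Open(zf, ZipArchiveMode.Create))
    {
        foreach (var fPath in files)
        {
            archive.CreateEntryFromFile(fPath, Path.GetFileName(fPath));
        }
    }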

It is perfectly possible for the implementation to be parallelized internally, but as far as I know, it is not. There might be other libraries available that support multithreading; you could ask at https://softwarerecs.stackexchange.com/ for such a library.

The benefit of multithreading will depend on the compression algorithm used. Some algorithms are very lightweight and will be limited by disk speed, while others will be CPU limited. The most common algorithm is Deflate, which is fairly fast, but not as fast as something like LZ4.

If you are archiving already compressed data, like images, you should specify CompressionLevel.NoCompression to improve speed, since compression would not help anyway.
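
For example, reusing `archive` and `fPath` from the question's snippet, CreateEntryFromFile has an overload that takes a compression level:

    // Store already-compressed files (e.g. JPEGs) as-is instead of deflating them again.
    archive.CreateEntryFromFile(fPath, Path.GetFileName(fPath), CompressionLevel.NoCompression);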

You might also split your work in a different way, e.g. compress different folders concurrently, or make multiple archives of, say, 1k files each that can be created in parallel, as sketched below.
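
A rough sketch of the multiple-archives approach (the chunk size of 1000 and the archive_{n}.zip naming scheme are just placeholders; the key point is that each ZipArchive is created and used by exactly one thread):

    using System.IO;
    using System.IO.Compression;
    using System.Linq;
    using System.Threading.Tasks;

    // Split the file list into chunks of 1000 and build one archive per chunk.
    var chunks = files
        .Select((path, index) => (path, index))
        .GroupBy(x => x.index / 1000, x => x.path)
        .ToList();

    Parallel.ForEach(chunks, chunk =>
    {
        var zipPath = $"archive_{chunk.Key}.zip"; // hypothetical naming scheme
        using (var archive = ZipFile.Open(zipPath, ZipArchiveMode.Create))
        {
            foreach (var fPath in chunk)
            {
                archive.CreateEntryFromFile(fPath, Path.GetFileName(fPath));
            }
        }
    });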

CodePudding user response:

This is not a hands-on/technical answer. In general, trying to parallelize operations that are mostly I/O-bound doesn't yield significant performance improvements, because the bottleneck is not the CPU but the capabilities of the storage device. Some hardware, like SSDs, reacts better to parallelization than others. Classic hard disks in particular react rather poorly to it, with performance usually being even worse than doing the work sequentially.

Compressing a file involves both I/O and computation, but the CPU part is pretty lightweight, so overall it's mostly an I/O-bound operation. You could attempt to implement the producer-consumer pattern, as suggested in this answer, but don't expect any spectacular performance gains from it.
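
To illustrate, here is a minimal producer-consumer sketch (assuming the `zf` and `files` from the question): parallel producers do the file reading, while a single consumer owns the ZipArchive, so thread safety is preserved. Compression still runs on the consumer thread, which is why the gains stay modest:

    using System.Collections.Concurrent;
    using System.IO;
    using System.IO.Compression;
    using System.Threading.Tasks;

    // Bounded queue keeps memory use in check while readers run ahead of the writer.
    var queue = new BlockingCollection<(string Name, byte[] Data)>(boundedCapacity: 16);

    // Producers: read the files in parallel (the I/O-bound part).
    var producer = Task.Run(() =>
    {
        try
        {
            Parallel.ForEach(files, fPath =>
            {
                queue.Add((Path.GetFileName(fPath), File.ReadAllBytes(fPath)));
            });
        }
        finally
        {
            queue.CompleteAdding(); // signal the consumer even if a read fails
        }
    });

    // Consumer: a single thread owns the archive, so ZipArchive is never shared.
    using (var archive = ZipFile.Open(zf, ZipArchiveMode.Create))
    {
        foreach (var (name, data) in queue.GetConsumingEnumerable())
        {
            var entry = archive.CreateEntry(name);
            using (var stream = entry.Open())
            {
                stream.Write(data, 0, data.Length);
            }
        }
    }

    producer.Wait(); // surface any exceptions from the readers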
