Why is docker pull not extracting layers in parallel?-CodePudding

Does extracting (untarring) of docker image layers by docker pull have to be conducted sequentially or could it be parallelized?

Example

docker pull mirekphd/ml-cpu-r40-base - an image which had to be split into more than 50 layers for build performance reasons - it contains around 4k R packages precompiled as DEB's (the entire CRAN Task Views contents), that would be impossible to build in docker without splitting these packages into multiple layers of roughly equal size, which cuts the build time from a whole day to minutes. Extraction stage - if parallelized - could become up to 50 times faster...

Context

When you observe docker pull for a large multi-layer image (gigabytes in size), you will notice that the download of each layer can be performed separately, in parallel. Not so for subsequent extracting (untarring) of each of these layers, which is performed sequentially. Do we know why?

From my anecdotal observations with such large images it would speed up the execution of the docker pull operation considerably.

Moreover, if splitting image into more layers would let you spin up containers faster, people would start writing Dockerfiles that are more readable and faster to both debug/test and pull/run, rather than trying to pile all instructions onto a single slow-bulding, impossibly convoluted and cache-busting string of instructions only to save a few megabytes of extra layers overhead (which will be easily recouped by parallel extraction).

CodePudding user response：

From the discussion at https://github.com/moby/moby/issues/21814, there are two main reasons that layers are not extracted in parallel:

It would not work on all storage drivers.
It would potentially use lots of CPu.

See the related comments below:

Note that not all storage drivers would be able to support parallel extraction. Some are snapshotting filesystems where the first layer must be extracted and snapshotted before the next one can be applied.

@aaronlehmann

We also don't really want a pull operation consuming tons of CPU on a host with running containers.

@cpuguy83

And the user who closed the linked issue wrote the following:

This isn't going to happen for technical reasons. There's no room for debate here. AUFS would support this, but most of the other storage drivers wouldn't support this. This also requires having specific code to implement at least two different code paths: one with this parallel extraction and one without it.

An image is basically something like this graph A->B->C->D and most Docker storage drivers can't handle extracting any layers which depend on layers which haven't been extracted already.

Should you want to speed up docker pull, you most certainly want faster storage and faster network. Go itself will contribute to performance gains once Go 1.7 is out and we start using it.

I'm going to close this right now because any gains from parallel extraction for specific drivers aren't worth the complexity for the code, the effort needed to implement it and the effort needed to maintain this in the future.

@unclejack