Generalized question
Is there any downside to compressing large folders in a Docker image and then extracting them at the container level in the image's entrypoint?
Cons:
- Not as reusable (but assuming it's a one-off image, this isn't much of a problem).
- Slow container startup due to the delay in extracting the folders.
Pros:
- Significantly smaller image sizes in some cases!
Example dockerfile:
FROM some/image:latest
WORKDIR /app
COPY ./venv.tar.gz .
COPY ./some_python_script.py .
SHELL ["/bin/bash", "-c"]
# if the venv directory does not exist: extract it, then remove the tar file
ENTRYPOINT [ ! -d "venv" ] && tar xzf venv.tar.gz && rm venv.tar.gz || \
# else: it already exists, so we don't have to waste time doing it again
echo 'venv already extracted from tar' && \
# these always run
source venv/bin/activate && \
cd /app && \
python "some_python_script.py"
Why I'm asking
I've dockerized an entire PyTorch geospatial training application, excluding the data. However, geospatial training packages are MASSIVE, not to mention the size of the PyTorch and CUDA libraries. The virtual environment for my image was 10.9GB on its own, resulting in a total image size of 11.5GB, and that was already the optimized environment produced by conda-pack in a multi-stage build. Compressing the environment beforehand reduced the image size to 5.2GB.
The running container clearly ends up occupying the original 11.5GB while it's running. However, the reduced image size makes it much easier to manage, especially in terms of speed when pushing to and pulling from Docker Hub.
Complete gist exemplifying a compressed conda virtual environment: https://gist.github.com/NoahTarr/cdc0af59ebc84fc9d936eece35ebfaf7
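For context, the build stage that produces venv.tar.gz with conda-pack looks roughly like this (the builder image, environment name, and environment.yml here are illustrative placeholders; the full working version is in the gist above):
# --- build stage: create the conda environment and pack it into a tarball ---
FROM continuumio/miniconda3 AS build
COPY environment.yml .
RUN conda env create -f environment.yml -n venv && \
    conda install -y -c conda-forge conda-pack && \
    conda pack -n venv -o /venv.tar.gz
# the runtime stage can then COPY --from=build /venv.tar.gz . -- or, as in the
# example dockerfile above, the tarball can be built outside Docker and copied
# in from the build context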
CodePudding user response:
I'd suggest there are three problems with this setup:
1. In terms of local disk space, this will require 1.5x the storage space to run a single instance of the container, and then each additional instance will require additional space equal to the (large) size of the entire virtual environment. That is, you need one copy of the compressed venv, plus one copy of the uncompressed venv per container. If the uncompressed venv is in the image, you only need one copy of it across all containers.
2. Decompressing a large tar file will take time, and this could make your container startup noticeably slower.
3. I've had practical problems with docker push and docker pull of individual layers larger than about 1 GB. There are tricks you can do to make the individual layers smaller, for example RUN pip install torch before you RUN pip install -r requirements.txt (sketched below). With this setup you have no choice but to have a single 5 GB tar-file layer.
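For reference, the layer-splitting trick mentioned in point 3 looks something like this in a pip-based image (the requirements file name and the choice of torch as the "heavy" package are illustrative, not taken from the question):
COPY requirements.txt .
# one large, cache-friendly layer for the heaviest dependency on its own...
RUN pip install --no-cache-dir torch
# ...and a separate, much smaller layer for everything else
RUN pip install --no-cache-dir -r requirements.txt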
All of that having been said, I believe this approach will fundamentally work, again assuming you can docker push and docker pull it successfully. The one change I'd make is to break the complicated ENTRYPOINT line out into a separate shell script:
#!/bin/sh
# entrypoint.sh
# Unpack the virtual environment if it doesn't already exist
if [ ! -d venv ]; then
tar xzf venv.tar.gz
else
echo 'venv already extracted from tar'
fi
# Add the virtual environment into $PATH
. venv/bin/activate
# Run the main container CMD
exec "$@"
# Dockerfile
...
WORKDIR /app
ENTRYPOINT ["./entrypoint.sh"]
CMD ["python", "some_python_script.py"]
This will let you do things like docker run --rm your-image ls venv/lib/python3.10/site-packages to look inside the unpacked tree without actually running your main application.
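For completeness, here is one way the elided part of that Dockerfile could look, tying the pieces together (the base image and file names are just the ones used earlier in this post; the chmod is only needed if the script isn't already executable in your build context):
FROM some/image:latest
WORKDIR /app
COPY venv.tar.gz entrypoint.sh some_python_script.py ./
# make sure the entrypoint script is executable inside the image
RUN chmod +x entrypoint.sh
ENTRYPOINT ["./entrypoint.sh"]
CMD ["python", "some_python_script.py"]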