I'm trying to uncompress a file and delete the original compressed archive in my Dockerfile image build instructions. I need to do this because the file in question is larger than the 2GB limit GitHub sets on file sizes (see here). The solution I'm pursuing is to compress the file (bringing it under the 2GB limit) and then decompress it when I build the application. I know it's bad practice to build large images, and I plan to integrate an external database into the project, but I don't have time to do that now.
I've tried various options, but have been unsuccessful.
- Compress the file in .zip format, use apt-get to install unzip, and then decompress the file with unzip:
FROM python:3.8-slim
#install unzip
RUN apt-get update && apt-get install unzip
WORKDIR /app
COPY /data/databases/file.db.zip /data/databases
RUN unzip /data/databases/file.db.zip && rm -f /data/databases/file.db.zip
COPY ./ ./
This fails with unzip: cannot find or open /data/databases/file.db.zip, /data/databases/file.db.zip.zip or /data/databases/file.db.zip.ZIP. I don't understand this, as I thought COPY added files to the image.
- Following this advice, I compressed the large file with gzip and tried to use Docker's native ADD command to decompress it, i.e.:
FROM python:3.8-slim
WORKDIR /app
ADD /data/databases/file.db.gz /data/databases/file.db
COPY ./ ./
While this builds without error, it does not decompress the file, which I can see by using docker exec -t -i clean-dash /bin/bash to explore the image's directory structure (see the check sketched below). Since the large file is a gzip file, my understanding from the docs is that ADD should decompress it.
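For reference, this is roughly how I'm inspecting the result (container name clean-dash and paths as above):
docker exec -t -i clean-dash /bin/bash
# inside the container:
ls -la /data/databases
# the file is there, but it has not been decompressed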
How can I meet these requirements?
CodePudding user response:
ADD only decompresses local tar files, not necessarily compressed single files. It may work to package the contents in a tar file, even if it only contains a single file:
# in the Dockerfile:
ADD ./data/databases/file.tar.gz /data/databases/
# on the host, before building:
(cd data/databases && tar cvzf file.tar.gz file.db)
docker build .
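Folded into your existing Dockerfile, that approach would look roughly like this (base image, WORKDIR, and paths taken from your question; treat it as a sketch):
FROM python:3.8-slim
WORKDIR /app
# ADD auto-extracts local tar archives, so this leaves file.db in /data/databases/
ADD ./data/databases/file.tar.gz /data/databases/
COPY ./ ./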
If you're using the first approach, you must use a multi-stage build here. The problem is that each RUN command generates a new image layer, so the resulting image is always the previous layer plus whatever changes the RUN command makes; RUN rm a-large-file will actually result in an image that's slightly larger than the image that contains the large file.
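To make that concrete with your first approach (paths from the question; illustration only, assuming unzip is already installed):
# this layer holds the full zip and keeps that size permanently
COPY data/databases/file.db.zip /data/databases/
# this layer adds the unpacked database; the rm only hides the zip,
# it cannot shrink the COPY layer above
RUN unzip /data/databases/file.db.zip -d /data/databases \
 && rm -f /data/databases/file.db.zip
docker history on the built image will show the COPY layer at roughly the size of the zip and the RUN layer at roughly the size of the unpacked database.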
The BusyBox tool set includes, among other things, an implementation of unzip(1), so you should be able to split this up into a stage that just unpacks the large file and then a stage that copies the result in:
# first stage: use BusyBox's unzip to unpack the archive
FROM busybox AS unpack
WORKDIR /unpack
COPY data/databases/file.db.zip /
RUN unzip /file.db.zip

# final stage: copy in only the unpacked result, never the zip
FROM python:3.8-slim
COPY --from=unpack /unpack/ /data/databases/
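You can verify that the zip never reaches the final image by building this and listing the layers (the tag is arbitrary):
docker build -t clean-dash .
# one line per layer with its size; only the single COPY --from=unpack layer
# should be large, and no layer contains file.db.zip
docker history clean-dash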
In terms of the Docker image, any of these approaches will create a single very large layer. In the past I've run into operational problems with single layers larger than about 1 GiB, things like docker push hanging up halfway through. With the multi-stage build approach, if you have multiple files you're trying to copy, you could have several COPY steps that break the batch of files into multiple layers. (But if it's a single SQLite file, there's nothing you can really do.)
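In the multi-file case that would look something like this (file1.db and file2.db are hypothetical names):
COPY --from=unpack /unpack/file1.db /data/databases/
COPY --from=unpack /unpack/file2.db /data/databases/
# each COPY becomes its own layer, so no single layer has to hold everything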
CodePudding user response:
Based on @David Maze's answer, the following worked, which I post here for completeness.
# unpack the zipped database
FROM busybox AS unpack
WORKDIR /unpack
COPY data/databases/file.db.zip /
RUN unzip /file.db.zip

FROM python:3.8-slim
COPY --from=unpack /unpack/file.db /
WORKDIR /app
COPY ./ ./
# move the unpacked db and delete the original zip
RUN mv /file.db ./data/databases && rm -f ./data/databases/file.db.zip
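To confirm it worked, I build the image and list the target directory (the tag is just what I use locally):
docker build -t clean-dash .
docker run --rm clean-dash ls -l /app/data/databases
# file.db is present and the zip has been removed from the filesystem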