
I need to create a Docker image (and consequently containers from that image) that uses large files (containing genomic data, and therefore reaching ~10 GB in size).

How am I supposed to optimize their usage? Am I supposed to include them in the container (such as COPY large_folder large_folder_in_container)? Is there a better way of referencing such files? The point is that it seems strange to me to push such a container (which would be >10 GB) to my private repository. I wonder if there is a way of attaching a sort of volume to the container, without packing all those GBs together.

Thank you.

Eleanore

3 Answers


Is there a better way of referencing such files?

If you already have some way to distribute the data, I would use a "bind mount" to attach a volume to the containers.

docker run -v /path/to/data/on/host:/path/to/data/in/container <image> ...

That way you can change the image and you won't have to re-download the large data set each time.
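
For example, assuming the genomic data sits under /data/genomes on the host (that path, the image name, and the command below are all hypothetical), you can mount it read-only so containers cannot modify the shared files:

docker run --rm -v /data/genomes:/dataset:ro my-genomics-image \
    my-analysis --input /dataset/sample.fa

The data never enters the image or the registry; each host only needs the directory in place before the container starts.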

If you wanted to use the registry to distribute the large data set, but want to manage changes to the data set separately, you could use a data volume container with a Dockerfile like this:

FROM tianon/true
COPY dataset /dataset
VOLUME /dataset

From your application container you can attach that volume using:

docker run -d --name dataset <data volume image name>
docker run --volumes-from dataset <image> ...
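
If you go the registry route, the data image is built and pushed like any other image; a rough sketch, where registry.example.com/genomics-dataset:v1 is a hypothetical tag on your private registry:

docker build -t registry.example.com/genomics-dataset:v1 .
docker push registry.example.com/genomics-dataset:v1

Because the dataset lives in its own image, pushing a new version of your application image does not re-upload the 10 GB layer; only the data image needs a new push when the data itself changes.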

Either way, I think Docker volumes (https://docs.docker.com/engine/tutorials/dockervolumes/) are what you want.

dnephin
  • Doesn't work; you cannot start up a container without any entrypoint. https://hub.docker.com/r/tianon/true is an alternative. – stackoverflowed Feb 28 '19 at 11:11

Am I supposed to include them in the container (such as COPY large_folder large_folder_in_container)?

If you do so, that would include them in the image, not the container: you could launch 20 containers from that image, and the actual disk space used would still be 10 GB.

If you were to make another image from your first image, the layered filesystem would reuse the layers from the parent image, and the new image would still be "only" 10 GB.
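
One way to convince yourself of the layer reuse (the image names and path below are hypothetical): build a second image on top of the first,

FROM genomics-base
COPY analysis_scripts /scripts

tag it as genomics-tools, and compare the layer digests of both images:

docker image inspect --format '{{.RootFS.Layers}}' genomics-base genomics-tools

The digest of the 10 GB data layer appears in both lists, but it is stored on disk (and in the registry) only once.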

VonC
  • That's useful for sure. But I am worried about the snappiness of the system when I pull the image from the registry (to run a container). I am trying to fit this into a CI/CD pipeline, which would require (during the deploy phase) loading the container onto a new OpenStack instance (via Packer). Since such an instance is always different, every pass through the CD pipeline would have to load a huge image onto the newly created OpenStack instance (without any previously loaded layers), and thus move 10 GB at every commit. Is this the best solution one could find? – Eleanore Sep 15 '16 at 12:41
  • @Eleanore Once the image has been loaded into the local docker registry of your slave, the container starts immediately. But if the image changes, what would be best is to build incrementally a new image based on the previous one, and including only the changes. That being said, if *all* 10GB changes from one image to the next... you have a problem indeed. – VonC Sep 15 '16 at 12:58
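
A sketch of that incremental approach (the tag and path are hypothetical): build each new data release on top of the previous image, so only the changed files land in a new layer and only that layer has to be pushed or pulled by hosts that already have the older layers:

FROM registry.example.com/genomics-data:2016-09-14
COPY changed_samples /dataset/changed_samples

A freshly provisioned OpenStack instance still has to pull the full 10 GB once, though; the savings apply to registry storage and to any host that already holds the earlier layers.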

I was having trouble with a 900 MB JSON file; changing the Memory limit in the preferences fixed it.

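If you prefer the CLI over the Desktop preferences, memory can also be capped per container; a minimal sketch, where the image name and the 8g value are only illustrations:

docker run -m 8g --memory-swap 8g my-json-processing-image

Note that on Docker Desktop the VM-wide memory setting in the preferences still bounds what any single container can use.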

kizziah
  • Reference this answer https://stackoverflow.com/questions/44533319/how-to-assign-more-memory-to-docker-container – kizziah Apr 21 '21 at 21:39