
I understand that Concourse is meant to be stateless, but is there nevertheless any way to reuse already-pulled Docker images? In my case, I build ~10 Docker images that share the same base image, but each time a build is triggered, Concourse pulls the base image 10 times.

Is it possible to pull that image once and reuse it later (at least within the scope of the same build) using the standard docker resource?

Yes, it should be possible to do that with a custom image and a shell script, but I'm not fond of reinventing the wheel.

If the standard docker resource does not allow that, is it possible to extend it somehow to enable such behaviour?

--cache-from is not helpful, as CI spends most of its time pulling the image, not building new layers.

Max Romanovsky
  • Hi, I am a bit confused as to what the problem is... If you store that base image as a resource, then concourse should cache it and use the cached image to build your new images. – Josh Zarrabi Jun 27 '17 at 03:45
  • @JoshZarrabi I don't store it as a resource, as it's referenced only in the `Dockerfile`'s `FROM` statement – Max Romanovsky Jun 27 '17 at 09:05

1 Answer


Theory

First, some Concourse theory (at least as of v3.3.1):

People often talk about Concourse having a "cache", but misinterpret what that means. Every Concourse worker keeps a set of volumes on disk, and together these form a volume cache. The volume cache contains volumes that have been populated by resource get and put steps and by task outputs.

People also often misunderstand how the docker-image-resource uses Docker. There is no global Docker server running with your Concourse installation. In fact, Concourse containers are not Docker containers at all; they are runC containers. Every docker-image-resource process (check, get, put) runs inside its own runC container, which in turn runs its own local Docker server. As a result, no shared Docker server is pulling images and caching their layers for later use.

This implies that "caching" with the docker-image-resource really means loading or pre-pulling images into a local Docker server.


Practice

Now to the options for optimizing build times:

load_base

Background

The load_base param in your docker-image-resource put tells the resource to first docker load an image (retrieved via a get) into its local docker server, before building the image specified via your put params.

This is useful when you need to pre-populate an image into your "docker cache." In your case, you would want to preload the image used in the FROM directive. This is more efficient because it uses Concourse's own volume caching to only pull the "base" once, making it available to the docker server during the execution of the FROM command.

Usage

You can use load_base as follows:

Suppose you want to build a custom python image, and you have a git repository with a file ci/Dockerfile as follows:

FROM ubuntu

RUN apt-get update
RUN apt-get install -y python python-pip

If you wanted to automate building/pushing of this image while taking advantage of Concourse volume caching as well as Docker image layer caching:

resources:
- name: ubuntu
  type: docker-image
  source:
    repository: ubuntu

- name: python-image
  type: docker-image
  source:
    repository: mydocker/python

- name: repo
  type: git
  source:
    uri: ...

jobs:
- name: build-image-from-base
  plan:
  - get: repo
  - get: ubuntu
    params: {save: true}
  - put: python-image
    params:
      load_base: ubuntu
      dockerfile: repo/ci/Dockerfile
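
For the situation in the question (several images sharing one base), the same idea might extend to several puts in a single job: the base image is fetched once by the get and then loaded into each put's local Docker server. The following is only a sketch under assumptions. The resource names `image-a` and `image-b`, their `mydocker/...` repositories, and the `repo/images/...` paths are hypothetical, and the `build` param (the build context directory) is used here as an assumption where the example above sets only `dockerfile`. Note that the value of `load_base` is the name of the get step, i.e. the directory holding the saved image, not the image name in the registry.

```yaml
resources:
- name: ubuntu
  type: docker-image
  source:
    repository: ubuntu

# Hypothetical image resources that share the ubuntu base.
- name: image-a
  type: docker-image
  source:
    repository: mydocker/image-a

- name: image-b
  type: docker-image
  source:
    repository: mydocker/image-b

- name: repo
  type: git
  source:
    uri: ...

jobs:
- name: build-images-from-shared-base
  plan:
  - get: repo
  # Fetched once per job; save: true writes the image to disk so it can be loaded later.
  - get: ubuntu
    params: {save: true}
  # Each put loads the saved base into its own local Docker server before building.
  - put: image-a
    params:
      load_base: ubuntu
      build: repo/images/a   # hypothetical path; assumed to contain a Dockerfile
  - put: image-b
    params:
      load_base: ubuntu
      build: repo/images/b   # hypothetical path; assumed to contain a Dockerfile
```

Each put still performs a docker load of the base, but that reads from the worker's volume rather than from the registry.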

cache & cache_tag

Background

The cache and cache_tag params in your docker-image-resource put tell the resource to first pull a particular image+tag from your remote source, before building the image specified via your put params.

This is useful when it's easier to pull down the image than it is to build it from scratch, e.g. when you have a very long build process, such as an expensive compilation.

This DOES NOT use Concourse's volume caching. Instead, it relies on Docker's --cache-from feature, which runs the risk of needing to first perform a docker pull during every put.

Usage

You can use cache and cache_tag as follows:

Suppose you want to build a custom ruby image, where you compile ruby from source, and you have a git repository with a file ci/Dockerfile as follows:

FROM ubuntu

# Install Ruby
RUN mkdir /tmp/ruby;\
  cd /tmp/ruby;\
  curl ftp://ftp.ruby-lang.org/pub/ruby/2.0/ruby-2.0.0-p247.tar.gz | tar xz;\
  cd ruby-2.0.0-p247;\
  chmod +x configure;\
  ./configure --disable-install-rdoc;\
  make;\
  make install;\
  gem install bundler --no-ri --no-rdoc

RUN gem install nokogiri

If you wanted to automate building/pushing of this image while taking advantage of only Docker image layer caching:

resources: 
- name: compiled-ruby-image
  type: docker-image
  source:
    repository: mydocker/ruby
    tag: 2.0.0-compiled

- name: repo
  type: git
  source:
    uri: ...

jobs:
- name: build-image-from-cache
  plan:
  - get: repo
  - put: compiled-ruby-image
    params:
      dockerfile: repo/ci/Dockerfile
      cache: mydocker/ruby
      cache_tag: 2.0.0-compiled

Recommendation

If you want to build Docker images more efficiently, my personal belief is that load_base should be used in most cases. Because it relies on a resource get, it takes advantage of Concourse volume caching and avoids extra docker pulls.
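
As raised in the comments, newer versions of the docker-image-resource also document a `cache_from` param. It takes a list of directories produced by a get with `save: true` and uses them for `docker build --cache-from`, loading from disk rather than pulling over the network. A sketch of combining it with `load_base` might look like the following. It reuses the resource names from the examples above, assumes a resource version that supports `cache_from` (it has not been checked against v3.3.1), and adds `build: repo/ci` as an assumption about the build context.

```yaml
jobs:
- name: build-image-with-base-and-layer-cache
  plan:
  - get: repo
  # Base image used in FROM, saved to disk for load_base.
  - get: ubuntu
    params: {save: true}
  # Previously pushed image, saved to disk so its layers can seed the build cache.
  - get: compiled-ruby-image
    params: {save: true}
  - put: compiled-ruby-image
    params:
      build: repo/ci            # assumed build context containing the Dockerfile
      load_base: ubuntu         # name of the get step, not the registry image
      cache_from:
      - compiled-ruby-image     # name of the get step holding the saved image
```

The get of `compiled-ruby-image` needs a version to already exist in the registry, so the very first build would have to run without that step.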

Tom Fenech
materialdesigner
  • Thanks for the detailed answer. Unfortunately, I won't be able to check it out in the upcoming weeks. – Max Romanovsky Jul 13 '17 at 08:09
  • Tested, works well. Avoids the [`file not found` error](https://github.com/concourse/concourse/issues/1457#issuecomment-358491394) when using `skip_download` if the image does not already exist on the worker. Requires that `params: {save:true}` be set on the `get` for the input Docker image resource. – emc Jan 18 '18 at 22:16
  • This is a super useful answer. Could you also update it with how `cache_from` is related to `load_base` and `cache`? It looks like `cache_from` has a similar effect as `cache` in that it translates to `docker build --cache-from` but it's specifically supposed to pull from the local cache volume instead of over the network. So `cache_from` is for caching build steps and `load_base` is for caching pull layers. So I'd imagine most people will want to do both, but then do the base image layers get pulled twice? – chrishiestand Jul 04 '18 at 06:51
  • I edited to add `save: true`, as the pipeline failed otherwise, with `open ubuntu/image: no such file or directory`. – Tom Fenech Jul 05 '18 at 10:23
  • This concourse resource is old, confusing, and crufty. Note the current work to rewrite/re-engineer this: https://github.com/concourse/docker-image-resource/issues/190 – chrishiestand Jul 05 '18 at 17:31
  • this is easily the most useful information i've encountered for building docker images with concourse. thank you! – Lex Scarisbrick Sep 23 '18 at 02:32
  • First of all, thanks for this, it is amazing and very useful! I have a question and an observation. **Question**: Can't you combine `load_base` to avoid re-pulling the base image AND `cache_from` to take advantage of Docker's layer cache? (I'm going to try that, as it could both help avoid unnecessary downloads and reduce build time if cached layers can be used.) **Observation**: It may be good to add a note that the `ubuntu` in `load_base` is the name of the `get` step, not the name of the base image (or at least that's my understanding). Thanks again ❤️ – Aldo 'xoen' Giambelluca Nov 02 '18 at 15:35
  • oh ... if there _just_ would be a way to use host docker instead of `runC` ... – Slaus Apr 14 '21 at 03:25