Offline CI

In the article about Testing Terminology, I explained how Terminology is tested, but said very little about the CI and the concepts I applied.

The main goal of the CI is to validate the code that will be released at some point. By running tests as often as possible, it is possible to catch bugs soon after they have been written. Using the CI as a gatekeeper helps avoid introducing them into the final code that will be released. Thus, this should improve the quality of the releases.

No test is perfect and complete enough to represent the full scale of how the code will be stressed in the wild, so we can not be fully confident that no bugs exist. However, one major concern is when one can not even rely on the test results.


Flakiness

Flakiness occurs when a test does not behave in a consistent manner. Usually, it is when a test fails while the tested code has not changed. Sometimes, it is also when a test passes while it was usually failing, even though no change was intentionally introduced to alter this behavior.

This produces some important issues:

Sources of flakiness

There are many reasons why a test can fail, but let me try to give some common ones I have encountered:

Learning on retries

Flakiness is a topic serious enough that Big Tech companies have teams working on trying to fix it. Most of the literature is about handling retries:

All of those are probably great, but they look far too complex just to test Terminology.

Flakiness as a risk

In my personal life, while interacting with the cycling community, I encountered the concept of the Hierarchy of hazard controls: when trying to protect cyclists, it is more efficient to separate them from motor vehicles (risk elimination) than to ask them to wear a helmet.

Hierarchy of hazard controls

I am trying to apply this hierarchy to how flakiness is dealt with in the industry:

Hierarchy of flakiness counter measures

To limit the impact of flakiness, the best would be to eliminate it!

Elimination of some sources of flakiness

There are different sources of flakiness and some can be removed systematically.

In Terminology’s CI, fetching dependencies is not a source of flakiness since all the dependencies are already stored in the image used to compile and run the tests. Instead of being fetched remotely, the dependencies are installed from the image itself. This is what I call an offline CI. It is very powerful and can enable reproducible builds. By constructing the images outside of the path used to test the code, issues found while building those images do not impact the way the code is tested; this clearly decouples testing the code from creating the environment to test it.

At some point, Terminology was tested against 4 different versions of the Enlightenment Foundation Libraries using this system. Terminology is also tested with 2 different compilers (gcc and clang) and many different compilation options (mostly through the use of sanitizers).

Terminology’s dependencies are rather small and rely only on alpine packages, but one could imagine using this same concept with cargo, pip, npm, go modules, …
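As a sketch of that idea with pip, a hypothetical Dockerfile could bake pinned Python dependencies into the image at build time, so that the CI itself never reaches out to PyPI (the image and file names here are assumptions, not Terminology's actual setup):

```dockerfile
# hypothetical base image for an offline CI of a Python project
FROM python:3.12-alpine
# requirements.txt pins exact versions so the image is reproducible
COPY requirements.txt /tmp/requirements.txt
# dependencies are fetched once, when the image is built;
# test runs using this image stay fully offline
RUN pip install --no-cache-dir -r /tmp/requirements.txt
```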

However, doing just this is not enough. If a dependency is updated upstream, this offline CI would not be able to test the code against it.

Using the CI to test … the CI

When the tests pass and after a review, we tend to consider that the code being tested is good enough. These same tests can also be used to validate a newer dependency: if only a dependency has changed and the code has not, then green tests mean that this newer dependency is working well.

This is represented by the following workflow:

Introducing dependencies

When the CI gets complex due to multiple sets of dependencies, it is useful to try to decouple those sets using image inheritance.

Image inheritance

Let’s consider the following example with 3 simple images based on alpine, where fetching (and installing) a different dependency is done in each image:

  1. step1.Dockerfile:
     FROM alpine:latest
     RUN apk add --no-cache gcc meson ninja
  2. step2.Dockerfile:
     FROM foo:step1-pre
     RUN apk add --no-cache netcat-openbsd
  3. step3.Dockerfile:
     FROM foo:step2-pre
     RUN apk add --no-cache patch

The CI will use the image from step3.Dockerfile when it is validated.

Validating the steps

There can still be flakiness while building those images, but it’s not directly impacting the CI used to test changes in the code.

A simple script to build those images and automatically publish them when they are good could be the following:

#!/bin/sh
# suffix to add to the images. useful to coordinate with change in code
SFX="v1"        # placeholder value
# branch used to test
BRANCH="master" # placeholder value

# build Step 1.  Exit early on failure
if ! LOG1=$(docker build --pull --no-cache -q -f step1.Dockerfile -t foo:step1-pre .); then
    echo "$LOG1"
    exit 1
fi
# build Step 2.  Exit early on failure
if ! LOG2=$(docker build --no-cache -q -f step2.Dockerfile -t foo:step2-pre .); then
    echo "$LOG2"
    exit 1
fi
# build Step 3.  Exit early on failure
if ! LOG3=$(docker build --no-cache -q -f step3.Dockerfile -t foo:step3-pre .); then
    echo "$LOG3"
    exit 1
fi
# Run CI on defined branch on image `foo:step3-pre`
# (run_ci stands for whatever script runs the test suite against a given image)
if ! CILOG=$(run_ci "$BRANCH" foo:step3-pre); then
    echo "failed to run tests on newer images: $CILOG"
    exit 2
fi
docker tag foo:step1-pre "foo:step1-latest-$SFX" && docker push "foo:step1-latest-$SFX"
docker tag foo:step2-pre "foo:step2-latest-$SFX" && docker push "foo:step2-latest-$SFX"
docker tag foo:step3-pre "foo:step3-latest-$SFX" && docker push "foo:step3-latest-$SFX"

One issue with this approach is that we have to disable the docker cache. One improvement could be to only disable it when a dependency is known to have changed.
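One way to sketch that improvement: record a checksum of the dependency list at each build and only emit `--no-cache` when the list changed since last time. The helper and file names below are hypothetical, not part of the actual script:

```shell
#!/bin/sh
# Print "--no-cache" only when the dependency list changed since the last
# build; print nothing (i.e. allow the docker cache) otherwise.
# $1: file listing the dependencies; $2: file storing the last checksum.
cache_flag() {
    deps=$1
    stamp=$2
    new_sum=$(sha256sum "$deps" | cut -d' ' -f1)
    old_sum=$(cat "$stamp" 2>/dev/null)
    if [ "$new_sum" != "$old_sum" ]; then
        printf '%s\n' "$new_sum" > "$stamp"
        echo "--no-cache"
    fi
}
```

It could then be used as `docker build $(cache_flag deps.txt .deps.sum) -f step1.Dockerfile ...`, where the unquoted substitution intentionally expands to either the flag or nothing.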

By changing the FROM clause of the intermediate images to the tags ending in latest-* and enabling the docker cache, it is possible to test whether the last step is the one failing and, if not, to move upward through the steps. This is not very practical.

To counter this, one could try to add tests to each step. However, if the script runs often enough, there should not be many changes between runs and finding the root cause of a failure should be easy. Also, if one step is clearly broken upstream, using the latest tag for it and enabling the cache should still allow testing the other steps.

When should it run?

This workflow can run daily or weekly, be triggered by an external tool that alerts about an updated dependency, or simply be run when a dependency needs to be added or changed. In Terminology, this is done every 2 weeks since the dependencies do not change much.

Sometimes, an incompatibility also needs to be fixed alongside the code. The solution is to store, in the code repository, a tag indicating which image to use. In the example above, this would be set through the SFX variable. That way, the CI will only use the newer image once the dependency is known to work well with the code.
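A minimal sketch of that lookup, assuming the suffix is committed in a file such as `.ci/image-suffix` (the file name and `foo:step3-latest-*` naming scheme follow the example above; both are assumptions):

```shell
#!/bin/sh
# Resolve the CI image to use from a suffix stored in the repository,
# so that code and images advance in lockstep.
# $1: path to the checked-out repository.
ci_image() {
    repo=$1
    sfx=$(cat "$repo/.ci/image-suffix")
    echo "foo:step3-latest-$sfx"
}
```

The CI would then start its test container with something like `docker run --rm "$(ci_image .)" ...`.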

The lifecycle of those images is left as an exercise to the reader.


In Terminology, a copy of the repository is stored in the base image and updated whenever the image has to change. This can limit the amount of data to transfer when testing new code.
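As a sketch of how such a pre-seeded clone might be refreshed at test time (the path and helper name are hypothetical), only the new commits are fetched instead of transferring the whole repository:

```shell
#!/bin/sh
# Refresh a clone that was baked into the image: fetch only the delta
# from the remote and check out the branch tip to be tested.
# $1: path of the baked-in clone; $2: branch to test.
refresh_clone() {
    clone=$1
    branch=$2
    git -C "$clone" fetch origin "$branch"
    git -C "$clone" checkout -q -f FETCH_HEAD
}
```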