From c93779accb4a302896485039805a3510de244374 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Kamil=20Trzci=C5=84ski?= Date: Thu, 4 Apr 2019 17:31:56 +0200 Subject: [PATCH] Add usefull tips about big repositories --- doc/ci/README.md | 1 + doc/ci/large_repositories/index.md | 235 +++++++++++++++++++++++++++++ 2 files changed, 236 insertions(+) create mode 100644 doc/ci/large_repositories/index.md diff --git a/doc/ci/README.md b/doc/ci/README.md index 913e9d4d789..06c6b883909 100644 --- a/doc/ci/README.md +++ b/doc/ci/README.md @@ -62,6 +62,7 @@ into more features: | [ChatOps](chatops/README.md) | Trigger CI jobs from chat, with results sent back to the channel. | | [Interactive web terminals](interactive_web_terminal/index.md) | Open an interactive web terminal to debug the running jobs. | | [Review Apps](review_apps/index.md) | Configure GitLab CI/CD to preview code changes in a per-branch basis. | +| [Optimising GitLab for large repositories](large_repositories/index.md) | Useful tips on how to optimise GitLab and GitLab Runner for big repositories. | | [Deploy Boards](https://docs.gitlab.com/ee/user/project/deploy_boards.html) **[PREMIUM]** | Check the current health and status of each CI/CD environment running on Kubernetes. | | [GitLab CI/CD for external repositories](https://docs.gitlab.com/ee/ci/ci_cd_for_external_repos/index.html) **[PREMIUM]** | Get the benefits of GitLab CI/CD combined with repositories in GitHub and BitBucket Cloud. | diff --git a/doc/ci/large_repositories/index.md b/doc/ci/large_repositories/index.md new file mode 100644 index 00000000000..576914e9dc3 --- /dev/null +++ b/doc/ci/large_repositories/index.md @@ -0,0 +1,235 @@ +# Optimising GitLab for large repositories + +Large repositories consisting of more than 50k files in a worktree +often require special consideration because of +the time required to clone and check out. + +GitLab and GitLab Runner handle this scenario well +but require optimised configuration to efficiently perform its +set of operations. + +The general guidelines for handling big repositories are simple. +Each guideline is described in more detail in the sections below: + +- Always fetch incrementally. Do not clone in a way that results in recreating all of the worktree. +- Always use shallow clone to reduce data transfer. Be aware that this puts more burden + on GitLab instance due to higher CPU impact. +- Control the clone directory if you heavily use a fork-based workflow. +- Optimise `git clean` flags to ensure that you remove or keep data that might affect or speed-up your build. + +## Shallow cloning + +> Introduced in GitLab Runner 8.9. + +GitLab and GitLab Runner always perform a full clone by default. +While it means that all changes from GitLab are received, +it often results in receiving extra commit logs. + +Ideally, you should always use `GIT_DEPTH` with a small number +like 10. This will instruct GitLab Runner to perform shallow clones. +Shallow clones makes Git request only the latest set of changes for a given branch, +up to desired number of commits as defined by the `GIT_DEPTH` variable. + +This significantly speeds up fetching of changes from Git repositories, +especially if the repository has a very long backlog consisting of number +of big files as we effectively reduce amount of data transfer. + +The following example makes GitLab Runner shallow clone to fetch only a given branch, +it does not fetch any other branches nor tags. + +```yaml +variables: + GIT_DEPTH: 10 + +test: + script: + - ls -al +``` + +## Git strategy + +> Introduced in GitLab Runner 8.9. + +By default, GitLab is configured to always prefer the `GIT_STRATEGY: fetch` strategy. +The `GIT_STRATEGY: fetch` strategy will re-use existing worktrees if found +on disk. This is different to the `GIT_STRATEGY: clone` strategy +as in case of clones, if a worktree is found, it is removed before clone. + +Usage of `fetch` is preferred because it reduces the amount of data to transfer and +does not really impact the operations that you might do on a repository from CI. + +However, `fetch` does require access to the previous worktree. This works +well when using the `shell` or `docker` executor because these +try to preserve worktrees and try to re-use them by default. + +This does not work today for `kubernetes` executor and has limitations when using +`docker+machine`. `kubernetes` executor today always clones into ephemeral directory. + +GitLab also offers the `GIT_STRATEGY: none` strategy. This disables any `fetch` and `checkout` commands +done by GitLab, requiring you to do them. + +## Git clone path + +> Introduced in GitLab Runner 11.10. + +`GIT_CLONE_PATH` allows you to control where you clone your sources. +This can have implications if you heavily use big repositories with fork workflow. + +Fork workflow from GitLab Runner's perspective is stored as a separate repository +with separate worktree. That means that GitLab Runner cannot optimise the usage +of worktrees and you might have to instruct GitLab Runner to use that. + +In such cases, ideally you want to make the GitLab Runner executor be used only used only +for the given project and not shared across different projects to make this +process more efficient. + +The `GIT_CLONE_PATH` has to be within the `$CI_BUILDS_DIR`. Currently, +it is impossible to pick any path from disk. + +## Git clean flags + +> Introduced in GitLab Runner 11.10. + +`GIT_CLEAN_FLAGS` allows you to control whether or not you require +the `git clean` command to be executed for each CI job. +By default, GitLab ensures that you have your worktree on the given SHA, +and that your repository is clean. + +`GIT_CLEAN_FLAGS` is disabled when set to `none`. On very big repositories, this +might be desired because `git clean` is disk I/O intensive. Controlling that +with `GIT_CLEAN_FLAGS: -ffdx -e .build/`, for example, allows you to control and +disable removal of some directories within the worktree between subsequent runs, +which can speed-up the incremental builds. This has the biggest effect +if you re-use existing machines, and have an existing worktree that you can re-use +for builds. + +For exact parameters accepted by `GIT_CLEAN_FLAGS`, see the documentation +for [git clean](https://git-scm.com/docs/git-clean). The +available parameters are dependent on Git version. + +## Fork-based workflow + +> Introduced in GitLab Runner 11.10. + +Following the guidelines above, lets imagine that we want to: + +- Optimise for a big project (more than 50k files in directory). +- Use forks-based workflow for contributing. +- Reuse existing worktrees. Have preconfigured runners that are pre-cloned with repositories. +- Runner assigned only to project and all forks. + +Lets consider the following two examples, one using `shell` executor and +other using `docker` executor. + +### `shell` executor example + +Lets assume that you have the following [config.toml](https://docs.gitlab.com/runner/configuration/advanced-configuration.html). + +```toml +concurrent = 4 + +[[runners]] + url = "GITLAB_URL" + token = "TOKEN" + executor = "shell" + builds_dir = "/builds" + cache_dir = "/cache" + + [runners.custom_build_dir] + enabled = true +``` + +This `config.toml`: + +- Uses the `shell` executor, +- Specifies a custom `/builds` directory where all clones will be stored. +- Enables the ability to specify `GIT_CLONE_PATH`, +- Runs at most 4 jobs at once. + +### `docker` executor example + +Lets assume that you have the following [config.toml](https://docs.gitlab.com/runner/configuration/advanced-configuration.html). + +```toml +concurrent = 4 + +[[runners]] + url = "GITLAB_URL" + token = "TOKEN" + executor = "docker" + builds_dir = "/builds" + cache_dir = "/cache" + + [runners.docker] + volumes = ["/builds:/builds", "/cache:/cache"] +``` + +This `config.toml`: + +- Uses the `docker` executor, +- Specifies a custom `/builds` directory on disk where all clones will be stored. + We host mount the `/builds` directory to make it reusable between subsequent runs + and be allowed to override the cloning strategy. +- Doesn't enable the ability to specify `GIT_CLONE_PATH` as it is enabled by default. +- Runs at most 4 jobs at once. + +### Our `.gitlab-ci.yml` + +Once we have the executor configured, we need to fine tune our `.gitlab-ci.yml`. + +Our pipeline will be most performant if we use the following `.gitlab-ci.yml`: + +```yaml +variables: + GIT_DEPTH: 10 + GIT_CLONE_PATH: $CI_BUILDS_DIR/$CI_CONCURRENT_ID/$CI_PROJECT_NAME + +build: + script: ls -al +``` + +The above configures a: + +- Shallow clone of 10, to speed up subsequent `git fetch` commands. +- Custom clone path to make it possible to re-use worktrees between parent project and all forks + because we use the same clone path for all forks. + +Why use `$CI_CONCURRENT_ID`? The main reason is to ensure that worktrees used are not conflicting +between projects. The `$CI_CONCURRENT_ID` represents a unique identifier within the given executor, +so as long as we use it to construct the path, it is guaranteed that this directory will not conflict +with other concurrent jobs running. + +### Store custom clone options in `config.toml` + +Ideally, all job-related configuration should be stored in `.gitlab-ci.yml`. +However, sometimes it is desirable to make these schemes part of Runner configuration. + +In the above example of Forks, making this configuration discoverable for users may be preferred, +but this brings administrative overhead as the `.gitlab-ci.yml` needs to be updated for each branch. +In such cases, it might be desirable to keep the `.gitlab-ci.yml` clone path agnostic, but make it +a configuration of Runner. + +We can extend our [config.toml](https://docs.gitlab.com/runner/configuration/advanced-configuration.html) +with the following specification that will be used by Runner if `.gitlab-ci.yml` will not override it: + +```toml +concurrent = 4 + +[[runners]] + url = "GITLAB_URL" + token = "TOKEN" + executor = "docker" + builds_dir = "/builds" + cache_dir = "/cache" + + environment = [ + "GIT_DEPTH=10", + "GIT_CLONE_PATH=$CI_BUILDS_DIR/$CI_CONCURRENT_ID/$CI_PROJECT_NAME" + ] + + [runners.docker] + volumes = ["/builds:/builds", "/cache:/cache"] +``` + +This makes the cloning configuration to be part of given Runner, +and does not require us to update each `.gitlab-ci.yml`.