From c93779accb4a302896485039805a3510de244374 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Kamil=20Trzci=C5=84ski?= <ayufan@ayufan.eu>
Date: Thu, 4 Apr 2019 17:31:56 +0200
Subject: [PATCH] Add usefull tips about big repositories

---
 doc/ci/README.md                   |   1 +
 doc/ci/large_repositories/index.md | 235 +++++++++++++++++++++++++++++
 2 files changed, 236 insertions(+)
 create mode 100644 doc/ci/large_repositories/index.md

diff --git a/doc/ci/README.md b/doc/ci/README.md
index 913e9d4d789..06c6b883909 100644
--- a/doc/ci/README.md
+++ b/doc/ci/README.md
@@ -62,6 +62,7 @@ into more features:
 | [ChatOps](chatops/README.md)                                                                                              | Trigger CI jobs from chat, with results sent back to the channel.                          |
 | [Interactive web terminals](interactive_web_terminal/index.md)                                                            | Open an interactive web terminal to debug the running jobs.                                |
 | [Review Apps](review_apps/index.md)                                                                                       | Configure GitLab CI/CD to preview code changes in a per-branch basis.                      |
+| [Optimising GitLab for large repositories](large_repositories/index.md) | Useful tips on how to optimise GitLab and GitLab Runner for big repositories. |
 | [Deploy Boards](https://docs.gitlab.com/ee/user/project/deploy_boards.html) **[PREMIUM]**                                 | Check the current health and status of each CI/CD environment running on Kubernetes.       |
 | [GitLab CI/CD for external repositories](https://docs.gitlab.com/ee/ci/ci_cd_for_external_repos/index.html) **[PREMIUM]** | Get the benefits of GitLab CI/CD combined with repositories in GitHub and BitBucket Cloud. |
 
diff --git a/doc/ci/large_repositories/index.md b/doc/ci/large_repositories/index.md
new file mode 100644
index 00000000000..576914e9dc3
--- /dev/null
+++ b/doc/ci/large_repositories/index.md
@@ -0,0 +1,235 @@
+# Optimising GitLab for large repositories
+
+Large repositories consisting of more than 50k files in a worktree
+often require special consideration because of
+the time required to clone and check out.
+
+GitLab and GitLab Runner handle this scenario well
+but require optimised configuration to efficiently perform its
+set of operations.
+
+The general guidelines for handling big repositories are simple.
+Each guideline is described in more detail in the sections below:
+
+- Always fetch incrementally. Do not clone in a way that results in recreating all of the worktree.
+- Always use shallow clone to reduce data transfer. Be aware that this puts more burden
+  on GitLab instance due to higher CPU impact.
+- Control the clone directory if you heavily use a fork-based workflow.
+- Optimise `git clean` flags to ensure that you remove or keep data that might affect or speed-up your build.
+
+## Shallow cloning
+
+> Introduced in GitLab Runner 8.9.
+
+GitLab and GitLab Runner always perform a full clone by default.
+While it means that all changes from GitLab are received,
+it often results in receiving extra commit logs.
+
+Ideally, you should always use `GIT_DEPTH` with a small number
+like 10. This will instruct GitLab Runner to perform shallow clones.
+Shallow clones makes Git request only the latest set of changes for a given branch,
+up to desired number of commits as defined by the `GIT_DEPTH` variable.
+
+This significantly speeds up fetching of changes from Git repositories,
+especially if the repository has a very long backlog consisting of number
+of big files as we effectively reduce amount of data transfer.
+
+The following example makes GitLab Runner shallow clone to fetch only a given branch,
+it does not fetch any other branches nor tags.
+
+```yaml
+variables:
+  GIT_DEPTH: 10
+
+test:
+  script:
+    - ls -al
+```
+
+## Git strategy
+
+> Introduced in GitLab Runner 8.9.
+
+By default, GitLab is configured to always prefer the `GIT_STRATEGY: fetch` strategy.
+The `GIT_STRATEGY: fetch` strategy will re-use existing worktrees if found
+on disk. This is different to the `GIT_STRATEGY: clone` strategy
+as in case of clones, if a worktree is found, it is removed before clone.
+
+Usage of `fetch` is preferred because it reduces the amount of data to transfer and
+does not really impact the operations that you might do on a repository from CI.
+
+However, `fetch` does require access to the previous worktree. This works
+well when using the `shell` or `docker` executor because these
+try to preserve worktrees and try to re-use them by default.
+
+This does not work today for `kubernetes` executor and has limitations when using
+`docker+machine`. `kubernetes` executor today always clones into ephemeral directory.
+
+GitLab also offers the `GIT_STRATEGY: none` strategy. This disables any `fetch` and `checkout` commands
+done by GitLab, requiring you to do them.
+
+## Git clone path
+
+> Introduced in GitLab Runner 11.10.
+
+`GIT_CLONE_PATH` allows you to control where you clone your sources.
+This can have implications if you heavily use big repositories with fork workflow.
+
+Fork workflow from GitLab Runner's perspective is stored as a separate repository
+with separate worktree. That means that GitLab Runner cannot optimise the usage
+of worktrees and you might have to instruct GitLab Runner to use that.
+
+In such cases, ideally you want to make the GitLab Runner executor be used only used only
+for the given project and not shared across different projects to make this
+process more efficient.
+
+The `GIT_CLONE_PATH` has to be within the `$CI_BUILDS_DIR`. Currently,
+it is impossible to pick any path from disk.
+
+## Git clean flags
+
+> Introduced in GitLab Runner 11.10.
+
+`GIT_CLEAN_FLAGS` allows you to control whether or not you require
+the `git clean` command to be executed for each CI job.
+By default, GitLab ensures that you have your worktree on the given SHA,
+and that your repository is clean.
+
+`GIT_CLEAN_FLAGS` is disabled when set to `none`. On very big repositories, this
+might be desired because `git clean` is disk I/O intensive. Controlling that
+with `GIT_CLEAN_FLAGS: -ffdx -e .build/`, for example, allows you to control and
+disable removal of some directories within the worktree between subsequent runs,
+which can speed-up the incremental builds. This has the biggest effect
+if you re-use existing machines, and have an existing worktree that you can re-use
+for builds.
+
+For exact parameters accepted by `GIT_CLEAN_FLAGS`, see the documentation
+for [git clean](https://git-scm.com/docs/git-clean). The
+available parameters are dependent on Git version.
+
+## Fork-based workflow
+
+> Introduced in GitLab Runner 11.10.
+
+Following the guidelines above, lets imagine that we want to:
+
+- Optimise for a big project (more than 50k files in directory).
+- Use forks-based workflow for contributing.
+- Reuse existing worktrees. Have preconfigured runners that are pre-cloned with repositories.
+- Runner assigned only to project and all forks.
+
+Lets consider the following two examples, one using `shell` executor and
+other using `docker` executor.
+
+### `shell` executor example
+
+Lets assume that you have the following [config.toml](https://docs.gitlab.com/runner/configuration/advanced-configuration.html).
+
+```toml
+concurrent = 4
+
+[[runners]]
+  url = "GITLAB_URL"
+  token = "TOKEN"
+  executor = "shell"
+  builds_dir = "/builds"
+  cache_dir = "/cache"
+
+  [runners.custom_build_dir]
+    enabled = true
+```
+
+This `config.toml`:
+
+- Uses the `shell` executor,
+- Specifies a custom `/builds` directory where all clones will be stored.
+- Enables the ability to specify `GIT_CLONE_PATH`,
+- Runs at most 4 jobs at once.
+
+### `docker` executor example
+
+Lets assume that you have the following [config.toml](https://docs.gitlab.com/runner/configuration/advanced-configuration.html).
+
+```toml
+concurrent = 4
+
+[[runners]]
+  url = "GITLAB_URL"
+  token = "TOKEN"
+  executor = "docker"
+  builds_dir = "/builds"
+  cache_dir = "/cache"
+
+  [runners.docker]
+    volumes = ["/builds:/builds", "/cache:/cache"]
+```
+
+This `config.toml`:
+
+- Uses the `docker` executor,
+- Specifies a custom `/builds` directory on disk where all clones will be stored.
+   We host mount the `/builds` directory to make it reusable between subsequent runs
+   and be allowed to override the cloning strategy.
+- Doesn't enable the ability to specify `GIT_CLONE_PATH` as it is enabled by default.
+- Runs at most 4 jobs at once.
+
+### Our `.gitlab-ci.yml`
+
+Once we have the executor configured, we need to fine tune our `.gitlab-ci.yml`.
+
+Our pipeline will be most performant if we use the following `.gitlab-ci.yml`:
+
+```yaml
+variables:
+  GIT_DEPTH: 10
+  GIT_CLONE_PATH: $CI_BUILDS_DIR/$CI_CONCURRENT_ID/$CI_PROJECT_NAME
+
+build:
+  script: ls -al
+```
+
+The above configures a:
+
+- Shallow clone of 10, to speed up subsequent `git fetch` commands.
+- Custom clone path to make it possible to re-use worktrees between parent project and all forks
+  because we use the same clone path for all forks.
+
+Why use `$CI_CONCURRENT_ID`? The main reason is to ensure that worktrees used are not conflicting
+between projects. The `$CI_CONCURRENT_ID` represents a unique identifier within the given executor,
+so as long as we use it to construct the path, it is guaranteed that this directory will not conflict
+with other concurrent jobs running.
+
+### Store custom clone options in `config.toml`
+
+Ideally, all job-related configuration should be stored in `.gitlab-ci.yml`.
+However, sometimes it is desirable to make these schemes part of Runner configuration.
+
+In the above example of Forks, making this configuration discoverable for users may be preferred,
+but this brings administrative overhead as the `.gitlab-ci.yml` needs to be updated for each branch.
+In such cases, it might be desirable to keep the `.gitlab-ci.yml` clone path agnostic, but make it
+a configuration of Runner.
+
+We can extend our [config.toml](https://docs.gitlab.com/runner/configuration/advanced-configuration.html)
+with the following specification that will be used by Runner if `.gitlab-ci.yml` will not override it:
+
+```toml
+concurrent = 4
+
+[[runners]]
+  url = "GITLAB_URL"
+  token = "TOKEN"
+  executor = "docker"
+  builds_dir = "/builds"
+  cache_dir = "/cache"
+
+  environment = [
+    "GIT_DEPTH=10",
+    "GIT_CLONE_PATH=$CI_BUILDS_DIR/$CI_CONCURRENT_ID/$CI_PROJECT_NAME"
+  ]
+
+  [runners.docker]
+    volumes = ["/builds:/builds", "/cache:/cache"]
+```
+
+This makes the cloning configuration to be part of given Runner,
+and does not require us to update each `.gitlab-ci.yml`.