Using NFS with GitLab

NFS can be used as an alternative to object storage, but this isn't typically recommended for performance reasons. Note, however, that NFS is required for GitLab Pages.

For data objects such as LFS, Uploads, Artifacts, etc., an Object Storage service is recommended over NFS where possible, due to better performance.

WARNING: From GitLab 13.0, using NFS for Git repositories is deprecated. In GitLab 14.0, support for NFS for Git repositories is scheduled to be removed. Upgrade to Gitaly Cluster as soon as possible.

Filesystem performance can impact overall GitLab performance, especially for actions that read or write to Git repositories. For steps you can use to test filesystem performance, see Filesystem Performance Benchmarking.
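As a rough first check before following the full benchmarking guide, the sketch below times the creation of many small files, which is the access pattern Git tends to stress. The function name bench_small_files and the default of 1000 files are assumptions made for this example, and the timer only has one-second resolution, so treat the result as a coarse indicator rather than a benchmark.

```shell
# Rough small-file write test (a sketch, not the official benchmark procedure).
# Usage: bench_small_files <directory> [file_count]
bench_small_files() {
  dir="$1"
  count="${2:-1000}"
  # Work in a throwaway subdirectory so existing data is never touched.
  workdir="$(mktemp -d "$dir/nfs-bench.XXXXXX")" || return 1
  start="$(date +%s)"
  i=0
  while [ "$i" -lt "$count" ]; do
    echo "test" > "$workdir/file$i.txt"
    i=$((i + 1))
  done
  end="$(date +%s)"
  rm -rf "$workdir"
  echo "wrote $count files in $((end - start))s"
}
```

For example, running bench_small_files /gitlab-nfs on the NFS mount and bench_small_files /tmp on local disk gives two numbers to compare.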

Known kernel version incompatibilities

Red Hat Enterprise Linux (RHEL) and CentOS v7.7 and v7.8 ship with kernel version 3.10.0-1127, which contains a bug that causes uploads to fail to copy over NFS. The following GitLab versions include a fix to work properly with that kernel version:

If you are using that kernel version, be sure to upgrade GitLab to avoid errors.

Fast lookup of authorized SSH keys

The fast SSH key lookup feature can improve performance of GitLab instances even if they're using block storage.

Fast SSH key lookup is a replacement for authorized_keys (in /var/opt/gitlab/.ssh) using the GitLab database.

NFS increases latency, so fast lookup is recommended if /var/opt/gitlab is moved to NFS.

We are investigating the use of fast lookup as the default.

NFS server

Installing the nfs-kernel-server package allows you to share directories with the clients running the GitLab application:

sudo apt-get update
sudo apt-get install nfs-kernel-server

Required features

File locking: GitLab requires advisory file locking, which is only supported natively in NFS version 4. NFSv3 also supports locking as long as Linux Kernel 2.6.5+ is used. We recommend using version 4 and do not specifically test NFSv3.

When you define your NFS exports, we recommend you also add the following options:

  • no_root_squash - NFS normally changes the root user to nobody. This is a good security measure when NFS shares will be accessed by many different users. However, in this case only GitLab uses the NFS share, so it is safe to turn root squashing off. GitLab recommends the no_root_squash setting because we need to manage file permissions automatically. Without the setting you may receive errors when the Omnibus package tries to alter permissions. Note that GitLab and other bundled components do not run as root but as non-privileged users. The recommendation for no_root_squash is to allow the Omnibus package to set ownership and permissions on files, as needed. In some cases where the no_root_squash option is not available, the root flag can achieve the same result.
  • sync - Force synchronous behavior. Default is asynchronous and under certain circumstances it could lead to data loss if a failure occurs before data has synced.
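To make the two options concrete, the sketch below prints an /etc/exports entry that uses them. The helper name make_export_line, the export path, and the client address 10.1.0.2 are placeholders invented for this example, and the rw option is an assumption added so that the client can write to the share.

```shell
# Print an /etc/exports entry with the options recommended above.
# Arguments: <export path> <client address>. Both are placeholders for
# your own values; "rw" is an assumption added for this sketch.
make_export_line() {
  printf '%s %s(rw,sync,no_root_squash)\n' "$1" "$2"
}

make_export_line /var/opt/gitlab/git-data 10.1.0.2
```

After adding a line like this to /etc/exports on the NFS server, running sudo exportfs -ra re-exports the configured directories.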

Due to the complexities of running Omnibus with LDAP and the complexities of maintaining ID mapping without LDAP, in most cases you should enable numeric UIDs and GIDs (which is off by default in some cases) for simplified permission management between systems:

Disable NFS server delegation

We recommend that all NFS users disable the NFS server delegation feature. This is to avoid a Linux kernel bug which causes NFS clients to slow precipitously due to excessive network traffic from numerous TEST_STATEID NFS messages.

To disable NFS server delegation, do the following:

  1. On the NFS server, run:

    echo 0 > /proc/sys/fs/leases-enable
    sysctl -w fs.leases-enable=0
    
  2. Restart the NFS server process. For example, on CentOS run service nfs restart.
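To confirm the setting took effect, the value can be read back from /proc. The sketch below wraps that check in a small function. The name leases_state and its optional file argument are assumptions made so the check can be tried against any file.

```shell
# Report whether NFS server delegation (file leases) is disabled.
# The optional argument lets the check be pointed at a different file.
leases_state() {
  file="${1:-/proc/sys/fs/leases-enable}"
  if [ "$(cat "$file")" = "0" ]; then
    echo "disabled"
  else
    echo "enabled"
  fi
}
```

Running leases_state on the NFS server should print disabled after step 1. The commands in step 1 do not persist across reboots by themselves. One common approach, assuming your distribution reads /etc/sysctl.d, is a drop-in file there containing fs.leases-enable = 0.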

NOTE: The kernel bug may be fixed in more recent kernels with this commit. Red Hat Enterprise 7 shipped a kernel update on August 6, 2019 that may also have resolved this problem. You may not need to disable NFS server delegation if you know you are using a version of the Linux kernel that has been fixed. That said, GitLab still encourages instance administrators to keep NFS server delegation disabled.

Improving NFS performance with GitLab

NFS performance with GitLab can in some cases be improved with direct Git access using Rugged.

NOTE: From GitLab 12.1, GitLab automatically detects, per storage, whether Rugged can and should be used.

If you previously enabled Rugged using the feature flag, you will need to unset the feature flag by using:

sudo gitlab-rake gitlab:features:unset_rugged

If the Rugged feature flag is explicitly set to either true or false, GitLab will use the value explicitly set.

Improving NFS performance with Puma

NOTE: From GitLab 12.7, Rugged is not automatically enabled if Puma thread count is greater than 1.

If you want to use Rugged with Puma, set Puma thread count to 1.

If you want to use Rugged with Puma thread count more than 1, Rugged can be enabled using the feature flag.

NFS client

The nfs-common package provides NFS functionality without installing server components, which we don't need running on the application nodes:

sudo apt-get update
sudo apt-get install nfs-common

Mount options

Here is an example snippet to add to /etc/fstab:

10.1.0.1:/var/opt/gitlab/.ssh /var/opt/gitlab/.ssh nfs4 defaults,vers=4.1,hard,rsize=1048576,wsize=1048576,noatime,nofail,lookupcache=positive 0 2
10.1.0.1:/var/opt/gitlab/gitlab-rails/uploads /var/opt/gitlab/gitlab-rails/uploads nfs4 defaults,vers=4.1,hard,rsize=1048576,wsize=1048576,noatime,nofail,lookupcache=positive 0 2
10.1.0.1:/var/opt/gitlab/gitlab-rails/shared /var/opt/gitlab/gitlab-rails/shared nfs4 defaults,vers=4.1,hard,rsize=1048576,wsize=1048576,noatime,nofail,lookupcache=positive 0 2
10.1.0.1:/var/opt/gitlab/gitlab-ci/builds /var/opt/gitlab/gitlab-ci/builds nfs4 defaults,vers=4.1,hard,rsize=1048576,wsize=1048576,noatime,nofail,lookupcache=positive 0 2
10.1.0.1:/var/opt/gitlab/git-data /var/opt/gitlab/git-data nfs4 defaults,vers=4.1,hard,rsize=1048576,wsize=1048576,noatime,nofail,lookupcache=positive 0 2

You can view information and options set for each of the mounted NFS file systems by running nfsstat -m and cat /etc/fstab.

Note there are several options that you should consider using:

| Setting | Description |
|---------|-------------|
| vers=4.1 | NFS v4.1 should be used instead of v4.0 because there is a Linux NFS client bug in v4.0 that can cause significant problems due to stale data. |
| nofail | Don't halt the boot process waiting for this mount to become available. |
| lookupcache=positive | Tells the NFS client to honor positive cache results but invalidates any negative cache results. Negative cache results cause problems with Git. Specifically, a git push can fail to register uniformly across all NFS clients. The negative cache causes the clients to 'remember' that the files did not exist previously. |
| hard | Use instead of soft. See the soft mount option section for further details. |
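A quick way to audit an existing /etc/fstab entry against these settings is sketched below. The function name missing_nfs_opts is an assumption for this example. It is meant for the options field of /etc/fstab, not for the output of nfsstat -m, where the kernel may display options in a different form.

```shell
# Print each recommended option that is missing from a comma-separated
# mount option string. Prints nothing when all of them are present.
missing_nfs_opts() {
  # Wrap in commas so each option can be matched as a whole word.
  opts=",$1,"
  for want in vers=4.1 hard nofail lookupcache=positive; do
    case "$opts" in
      *",$want,"*) ;;
      *) echo "$want" ;;
    esac
  done
}

# Example: check every uncommented NFS entry in /etc/fstab.
# awk '$1 !~ /^#/ && $3 ~ /^nfs/ { print $4 }' /etc/fstab |
#   while read -r o; do missing_nfs_opts "$o"; done
```

An entry that uses soft instead of hard would be reported with the single line hard.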

soft mount option

It's recommended that you use hard in your mount options, unless you have a specific reason to use soft.

On GitLab.com, we use soft because there were times when we had NFS servers reboot and soft improved availability, but everyone's infrastructure is different. If your NFS is provided by on-premise storage arrays with redundant controllers, for example, you shouldn't need to worry about NFS server availability.

The NFS man page states:

"soft" timeout can cause silent data corruption in certain cases

Read the Linux man page to understand the difference, and if you do use soft, ensure that you've taken steps to mitigate the risks.

If you experience behavior that might have been caused by writes to disk on the NFS server not occurring, such as commits going missing, use the hard option, because (from the man page):

use the soft option only when client responsiveness is more important than data integrity

Other vendors make similar recommendations, including SAP and NetApp's knowledge base. They highlight that if the NFS client driver caches data, soft means there is no certainty that writes by GitLab are actually on disk.

Mount points set with the option hard may not perform as well, and if the NFS server goes down, hard will cause processes to hang when interacting with the mount point. Use SIGKILL (kill -9) to deal with hung processes. The intr option stopped working in the 2.6 kernel.
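Processes blocked on an unresponsive hard mount usually show up in uninterruptible sleep, reported by ps as state D. The sketch below filters ps output down to those processes so you can decide which ones to kill. The function name filter_blocked is an assumption for this example, and only the first word of the command name is printed.

```shell
# Keep only processes whose state starts with "D" (uninterruptible sleep).
# Reads "PID STAT COMMAND" lines on stdin and prints "PID COMMAND".
filter_blocked() {
  awk '$2 ~ /^D/ { print $1, $3 }'
}

# Example (the trailing "=" suppresses the ps header line):
# ps -eo pid=,stat=,comm= | filter_blocked
```

A process in state D is not necessarily waiting on NFS, so check what each listed process is doing before sending it SIGKILL.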

A single NFS mount

It's recommended to nest all GitLab data directories within a single mount, which allows automatic restore of backups without manually moving existing data.

mountpoint
└── gitlab-data
    ├── builds
    ├── git-data
    ├── shared
    └── uploads

To do so, mount /gitlab-nfs, then use the following Omnibus configuration to move each data location to a subdirectory nested in the mount point:

git_data_dirs({"default" => { "path" => "/gitlab-nfs/gitlab-data/git-data"} })
gitlab_rails['uploads_directory'] = '/gitlab-nfs/gitlab-data/uploads'
gitlab_rails['shared_path'] = '/gitlab-nfs/gitlab-data/shared'
gitlab_ci['builds_directory'] = '/gitlab-nfs/gitlab-data/builds'

Run sudo gitlab-ctl reconfigure to start using the central location. Be aware that if you had existing data, you'll need to manually copy or rsync it to these new locations, and then restart GitLab.
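Before copying existing data, the nested directories need to exist on the mount. The sketch below creates the layout shown in the tree above. The function name make_gitlab_layout is an assumption for this example, and on a real system it would need to run with sufficient privileges (for example, from a root shell).

```shell
# Create the nested data directory layout under a mount point.
# Usage: make_gitlab_layout <mountpoint>
make_gitlab_layout() {
  for d in builds git-data shared uploads; do
    mkdir -p "$1/gitlab-data/$d" || return 1
  done
}
```

After make_gitlab_layout /gitlab-nfs, existing data could then be copied with a command along the lines of sudo rsync -a /var/opt/gitlab/git-data/ /gitlab-nfs/gitlab-data/git-data/, ideally while GitLab is stopped so that nothing changes during the copy.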

Bind mounts

As an alternative to changing the configuration in Omnibus, bind mounts can be used to store the data on an NFS mount.

Bind mounts provide a way to specify just one NFS mount and then bind the default GitLab data locations to the NFS mount. Start by defining your single NFS mount point as you normally would in /etc/fstab. Let's assume your NFS mount point is /gitlab-nfs. Then, add the following bind mounts in /etc/fstab:

/gitlab-nfs/gitlab-data/git-data /var/opt/gitlab/git-data none bind 0 0
/gitlab-nfs/gitlab-data/.ssh /var/opt/gitlab/.ssh none bind 0 0
/gitlab-nfs/gitlab-data/uploads /var/opt/gitlab/gitlab-rails/uploads none bind 0 0
/gitlab-nfs/gitlab-data/shared /var/opt/gitlab/gitlab-rails/shared none bind 0 0
/gitlab-nfs/gitlab-data/builds /var/opt/gitlab/gitlab-ci/builds none bind 0 0

Using bind mounts will require manually making sure the data directories are empty before attempting a restore. Read more about the restore prerequisites.
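After editing /etc/fstab and running sudo mount -a, it can be worth confirming that every bind target is actually mounted before starting GitLab. The sketch below does this with the mountpoint utility from util-linux. The function name check_mounts is an assumption for this example.

```shell
# Report whether each given path is currently a mount point.
check_mounts() {
  for p in "$@"; do
    if mountpoint -q "$p" 2>/dev/null; then
      echo "$p: mounted"
    else
      echo "$p: NOT mounted"
    fi
  done
}

# Example:
# check_mounts /var/opt/gitlab/git-data /var/opt/gitlab/.ssh \
#   /var/opt/gitlab/gitlab-rails/uploads /var/opt/gitlab/gitlab-rails/shared \
#   /var/opt/gitlab/gitlab-ci/builds
```

Any path reported as NOT mounted means GitLab would write to local disk at that location instead of the NFS share.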

Multiple NFS mounts

When using the default Omnibus configuration, you need to share 4 data locations between all GitLab cluster nodes. No other locations should be shared. The following 4 locations need to be shared:

| Location | Description | Default configuration |
|----------|-------------|-----------------------|
| /var/opt/gitlab/git-data | Git repository data. This will account for a large portion of your data. | git_data_dirs({"default" => { "path" => "/var/opt/gitlab/git-data"} }) |
| /var/opt/gitlab/gitlab-rails/uploads | User uploaded attachments. | gitlab_rails['uploads_directory'] = '/var/opt/gitlab/gitlab-rails/uploads' |
| /var/opt/gitlab/gitlab-rails/shared | Build artifacts, GitLab Pages, LFS objects, temp files, etc. If you're using LFS this may also account for a large portion of your data. | gitlab_rails['shared_path'] = '/var/opt/gitlab/gitlab-rails/shared' |
| /var/opt/gitlab/gitlab-ci/builds | GitLab CI/CD build traces. | gitlab_ci['builds_directory'] = '/var/opt/gitlab/gitlab-ci/builds' |

Other GitLab directories should not be shared between nodes. They contain node-specific files and GitLab code that does not need to be shared. To ship logs to a central location consider using remote syslog. Omnibus GitLab packages provide configuration for UDP log shipping.

Having multiple NFS mounts will require manually making sure the data directories are empty before attempting a restore. Read more about the restore prerequisites.

Testing NFS

Once you've set up the NFS server and client, you can verify NFS is configured correctly by running the following commands:

sudo mkdir /gitlab-nfs/test-dir
sudo chown git /gitlab-nfs/test-dir
sudo chgrp root /gitlab-nfs/test-dir
sudo chmod 0700 /gitlab-nfs/test-dir
sudo chgrp gitlab-www /gitlab-nfs/test-dir
sudo chmod 0751 /gitlab-nfs/test-dir
sudo chgrp git /gitlab-nfs/test-dir
sudo chmod 2770 /gitlab-nfs/test-dir
sudo chmod 2755 /gitlab-nfs/test-dir
sudo -u git mkdir /gitlab-nfs/test-dir/test2
sudo -u git chmod 2755 /gitlab-nfs/test-dir/test2
sudo ls -lah /gitlab-nfs/test-dir/test2
sudo -u git rm -r /gitlab-nfs/test-dir

Any Operation not permitted error means you should investigate your NFS server export options.
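One export option worth checking first is root squashing. With root squashing active, files created by root on the share are typically owned by an anonymous account instead of root. The sketch below interprets the owner of such a file. The function name classify_owner and the account names nobody and nfsnobody are assumptions based on common defaults, and your server may map root to a different account.

```shell
# Interpret the owner of a file that root created on the NFS share.
classify_owner() {
  case "$1" in
    root) echo "root is not squashed" ;;
    nobody|nfsnobody) echo "root appears to be squashed" ;;
    *) echo "unexpected owner: $1" ;;
  esac
}

# Example (stat -c is the GNU coreutils form):
# sudo touch /gitlab-nfs/squash-test
# classify_owner "$(stat -c %U /gitlab-nfs/squash-test)"
# sudo rm /gitlab-nfs/squash-test
```

If the touch command itself fails with a permission error, that also suggests root is being mapped to an unprivileged account.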

NFS in a Firewalled Environment

If the traffic between your NFS server and NFS client(s) is subject to port filtering by a firewall, then you will need to reconfigure that firewall to allow NFS communication.

This guide from TLDP covers the basics of using NFS in a firewalled environment. Additionally, we encourage you to search for and review the specific documentation for your operating system or distribution and your firewall software.

Example for Ubuntu:

Check that NFS traffic from the client is allowed by the firewall on the host by running the command: sudo ufw status. If it's being blocked, then you can allow traffic from a specific client with the command below.

sudo ufw allow from <client_ip_address> to any port nfs
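To confirm from the client side that the firewall change worked, you can test whether the NFS port on the server accepts TCP connections. NFSv4 uses TCP port 2049. The sketch below assumes bash (for its /dev/tcp redirection) and the coreutils timeout command are available. The function name check_tcp_port and the 3-second timeout are choices made for this example.

```shell
# Check whether a TCP port accepts connections.
# Usage: check_tcp_port <host> <port>
check_tcp_port() {
  # bash opens a TCP connection when redirecting to /dev/tcp/<host>/<port>.
  if timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
    echo "$1:$2 reachable"
  else
    echo "$1:$2 unreachable"
  fi
}

# Example, using the server address from the fstab snippet earlier:
# check_tcp_port 10.1.0.1 2049
```

A reachable result only shows that the port is open. It does not prove that the export itself is configured correctly.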

Known issues

Upgrade to Gitaly Cluster or disable caching if experiencing data loss

WARNING: From GitLab 13.0, using NFS for Git repositories is deprecated. In GitLab 14.0, support for NFS for Git repositories is scheduled to be removed. Upgrade to Gitaly Cluster as soon as possible.

Customers and users have reported data loss on high-traffic repositories when using NFS for Git repositories. For example, we have seen inconsistent updates after a push. The problem may be partially mitigated by adjusting caching using the following NFS client mount options:

| Setting | Description |
|---------|-------------|
| lookupcache=positive | Tells the NFS client to honor positive cache results but invalidates any negative cache results. Negative cache results cause problems with Git. Specifically, a git push can fail to register uniformly across all NFS clients. The negative cache causes the clients to 'remember' that the files did not exist previously. |
| actimeo=0 | Sets to zero the time that the NFS client caches files and directories before requesting fresh information from a server. |
| noac | Tells the NFS client not to cache file attributes and forces application writes to become synchronous so that local changes to a file become visible on the server immediately. |

WARNING: The actimeo=0 and noac options both result in a significant reduction in performance, possibly leading to timeouts. You may be able to avoid timeouts and data loss using actimeo=0 and lookupcache=positive without noac. However, we expect the performance reduction will still be significant. As noted above, we strongly recommend upgrading to Gitaly Cluster as soon as possible.

Avoid using AWS's Elastic File System (EFS)

GitLab strongly recommends against using AWS Elastic File System (EFS). Our support team will not be able to assist on performance issues related to file system access.

Customers and users have reported that AWS EFS does not perform well for GitLab's use-case. Workloads where many small files are written in a serialized manner, like Git, are not well-suited for EFS. EBS with an NFS server on top will perform much better.

If you do choose to use EFS, avoid storing GitLab log files (e.g. those in /var/log/gitlab) there because this will also affect performance. We recommend that the log files be stored on a local volume.

For more details on another person's experience with EFS, see this Commit Brooklyn 2019 video.

Avoid using CephFS and GlusterFS

GitLab strongly recommends against using CephFS and GlusterFS. These distributed file systems are not well-suited for GitLab's input/output access patterns because Git uses many small files, and the time needed for file access and file locks to propagate makes Git activity very slow.

Avoid using PostgreSQL with NFS

GitLab strongly recommends against running your PostgreSQL database across NFS. The GitLab support team will not be able to assist on performance issues related to this configuration.

Additionally, this configuration is specifically warned against in the PostgreSQL Documentation:

PostgreSQL does nothing special for NFS file systems, meaning it assumes NFS behaves exactly like locally-connected drives. If the client or server NFS implementation does not provide standard file system semantics, this can cause reliability problems. Specifically, delayed (asynchronous) writes to the NFS server can cause data corruption problems.

For supported database architecture, see our documentation about configuring a database for replication and failover.

Troubleshooting

Finding the requests that are being made to NFS

In case of NFS-related problems, it can be helpful to trace the filesystem requests that are being made by using perf:

sudo perf trace -e 'nfs4:*' -p $(pgrep -fd ',' 'puma|unicorn')

On Ubuntu 16.04, use:

sudo perf trace --no-syscalls --event 'nfs4:*' -p $(pgrep -fd ',' 'puma|unicorn')