---
stage: Systems
group: Geo
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
type: howto
---
WARNING:
This runbook is in [**Alpha**](../../../../policy/alpha-beta-support.md#alpha-features). For complete, production-ready documentation, see the
[disaster recovery documentation](../index.md).
# Disaster Recovery (Geo) promotion runbooks **(PREMIUM SELF)**
## Geo planned failover for a multi-node configuration
| Component | Configuration |
|-------------|-----------------|
| PostgreSQL | Omnibus-managed |
| Geo site | Multi-node |
| Secondaries | One |
This runbook guides you through a planned failover of a multi-node Geo site
with one secondary. The following [2000 user reference architecture](../../../../administration/reference_architectures/2k_users.md) is assumed:
```mermaid
graph TD
subgraph main[Geo deployment]
subgraph Primary[Primary site, multi-node]
Node_1[Rails node 1]
Node_2[Rails node 2]
Node_3[PostgreSQL node]
Node_4[Gitaly node]
Node_5[Redis node]
Node_6[Monitoring node]
end
subgraph Secondary[Secondary site, multi-node]
Node_7[Rails node 1]
Node_8[Rails node 2]
Node_9[PostgreSQL node]
Node_10[Gitaly node]
Node_11[Redis node]
Node_12[Monitoring node]
end
end
```
The load balancer node and optional NFS server are omitted for clarity.
This guide results in the following:
1. An offline primary.
1. A promoted secondary that is now the new primary.
What is not covered:
1. Re-adding the old **primary** as a secondary.
1. Adding a new secondary.
### Preparation
NOTE:
Before following any of these steps, make sure you have `root` access to the
**secondary** to promote it, because there is no automated way to
promote a Geo replica and perform a failover.
NOTE:
GitLab 13.9 through GitLab 14.3 are affected by a bug in which the Geo secondary site statuses appear to stop updating and become unhealthy. For more information, see [Geo Admin Area shows 'Unhealthy' after enabling Maintenance Mode](../../replication/troubleshooting.md#geo-admin-area-shows-unhealthy-after-enabling-maintenance-mode).
On the **secondary** site:
1. On the top bar, select **Main menu > Admin**.
1. On the left sidebar, select **Geo > Sites** to see its status.
Replicated objects (shown in green) should be close to 100%,
and there should be no failures (shown in red). If a large proportion of
objects aren't yet replicated (shown in gray), consider giving the site more
time to complete.
![Replication status](../../replication/img/geo_dashboard_v14_0.png)
If any objects are failing to replicate, this should be investigated before
scheduling the maintenance window. After a planned failover, anything that
failed to replicate is **lost**.
You can use the
[Geo status API](../../../../api/geo_nodes.md#retrieve-project-sync-or-verification-failures-that-occurred-on-the-current-node)
to review failed objects and the reasons for failure.
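For example, you can query it with `curl` from any machine that can reach the **secondary** site. The token and hostname below are placeholders:
```shell
# List replication failures recorded on the current (secondary) site.
curl --header "PRIVATE-TOKEN: <your_access_token>" \
  "https://<secondary_site_url>/api/v4/geo_nodes/current/failures"
```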
A common cause of replication failures is data that is missing on the
**primary** site - you can resolve these failures by restoring the data from backup,
or removing references to the missing data.
The maintenance window doesn't end until Geo replication and verification are
completely finished. To keep the window as short as possible, you should
ensure these processes are as close to 100% as possible during active use.
If the **secondary** site is still replicating data from the **primary** site,
follow these steps to avoid unnecessary data loss:
1. Until a [read-only mode](https://gitlab.com/gitlab-org/gitlab/-/issues/14609)
is implemented, you must manually prevent updates to the
**primary** site. Your **secondary** site still needs read-only
access to the **primary** site during the maintenance window:
1. At the scheduled time, using your cloud provider or your site's firewall, block
all HTTP, HTTPS and SSH traffic to/from the **primary** site, **except** for your IP and
the **secondary** site's IP.
For instance, you can run the following commands on the **primary** site:
```shell
sudo iptables -A INPUT -p tcp -s <secondary_site_ip> --destination-port 22 -j ACCEPT
sudo iptables -A INPUT -p tcp -s <your_ip> --destination-port 22 -j ACCEPT
sudo iptables -A INPUT -p tcp --destination-port 22 -j REJECT

sudo iptables -A INPUT -p tcp -s <secondary_site_ip> --destination-port 80 -j ACCEPT
sudo iptables -A INPUT -p tcp -s <your_ip> --destination-port 80 -j ACCEPT
sudo iptables -A INPUT -p tcp --destination-port 80 -j REJECT

sudo iptables -A INPUT -p tcp -s <secondary_site_ip> --destination-port 443 -j ACCEPT
sudo iptables -A INPUT -p tcp -s <your_ip> --destination-port 443 -j ACCEPT
sudo iptables -A INPUT -p tcp --destination-port 443 -j REJECT
```
From this point, users are unable to view their data or make changes on the
**primary** site. They are also unable to log in to the **secondary** site.
However, existing sessions continue to work for the remainder of the maintenance period, and
so public data is accessible throughout.
1. Verify the **primary** site is blocked to HTTP traffic by visiting it in a browser from
another IP address. The server should refuse the connection.
1. Verify the **primary** site is blocked to Git over SSH traffic by attempting to pull an
existing Git repository with an SSH remote URL. The server should refuse the
connection.
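For example, from a host whose IP address is not in the allowlist, both of the following should fail to connect (the hostname and project path are placeholders):
```shell
# HTTP/HTTPS check - expect a connection failure or timeout.
curl --connect-timeout 10 "https://<primary_site_url>"
# Git over SSH check - expect a connection failure.
git ls-remote "git@<primary_site_url>:<group>/<project>.git"
```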
1. On the **primary** site:
1. On the top bar, select **Main menu > Admin**.
1. On the left sidebar, select **Monitoring > Background Jobs** .
1. On the Sidekiq dashboard, select **Cron** .
1. Select `Disable All` to disable any non-Geo periodic background jobs.
1. Select `Enable` for the `geo_sidekiq_cron_config_worker` cron job.
This job re-enables several other cron jobs that are essential for planned
failover to complete successfully.
1. Finish replicating and verifying all data:
WARNING:
Not all data is automatically replicated. Read more about
[what is excluded](../planned_failover.md#not-all-data-is-automatically-replicated).
1. If you are manually replicating any
[data not managed by Geo](../../replication/datatypes.md#limitations-on-replicationverification),
trigger the final replication process now.
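As a purely hypothetical sketch, assuming the data is file-based, you still have SSH access to the **primary** site, and the paths and hostname below are placeholders, a final one-way sync might look like:
```shell
# Hypothetical final sync of data that Geo does not manage, run on the secondary site.
sudo rsync --archive --delete "root@<primary_site_fqdn>:<path_to_unmanaged_data>/" "<path_to_unmanaged_data>/"
```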
1. On the **primary** site:
1. On the top bar, select **Main menu > Admin**.
1. On the left sidebar, select **Monitoring > Background Jobs** .
1. On the Sidekiq dashboard, select **Queues**, and wait for all queues except
those with `geo` in the name to drop to 0.
These queues contain work that has been submitted by your users; failing over
before it is completed causes the work to be lost.
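If you prefer to watch the queues from a terminal, a minimal sketch (run on a Rails node of the **primary** site) is:
```shell
# Print the current size of every Sidekiq queue; all non-geo queues should reach 0.
sudo gitlab-rails runner 'Sidekiq::Queue.all.each { |q| puts "#{q.name}: #{q.size}" }'
```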
1. On the left sidebar, select **Geo > Sites** and wait for the
following conditions to be true of the **secondary** site you are failing over to:
- All replication meters reach 100% replicated, 0% failures.
- All verification meters reach 100% verified, 0% failures.
- Database replication lag is 0ms.
- The Geo log cursor is up to date (0 events behind).
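You can also check replication and verification progress from the command line on a Rails node of the **secondary** site:
```shell
# Summarizes replication and verification progress, database replication lag,
# and how far behind the Geo log cursor is.
sudo gitlab-rake geo:status
```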
1. On the **secondary** site:
1. On the top bar, select **Main menu > Admin**.
1. On the left sidebar, select **Monitoring > Background Jobs** .
1. On the Sidekiq dashboard, select **Queues**, and wait for all the `geo`
queues to drop to 0 queued and 0 running jobs.
1. [Run an integrity check](../../../raketasks/check.md) to verify the integrity
of CI artifacts, LFS objects, and uploads in file storage.
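For reference, the file integrity Rake tasks from that page include the following; run them on a Rails node of the **secondary** site:
```shell
# Verify CI artifacts, LFS objects, and uploads in file storage.
sudo gitlab-rake gitlab:artifacts:check
sudo gitlab-rake gitlab:lfs:check
sudo gitlab-rake gitlab:uploads:check
```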
At this point, your **secondary** site contains an up-to-date copy of everything the
**primary** site has, meaning nothing is lost when you fail over.
1. In this final step, you must permanently disable the **primary** site.
WARNING:
When the **primary** site goes offline, there may be data saved on the **primary** site
that has not been replicated to the **secondary** site. This data should be treated
as lost if you proceed.
NOTE:
If you plan to [update the **primary** domain DNS record](../index.md#step-4-optional-updating-the-primary-domain-dns-record),
you may wish to lower the TTL now to speed up propagation.
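To check the TTL currently served for the record, you can run, for example (the domain is a placeholder):
```shell
# The second column of the answer is the remaining TTL in seconds.
dig +noall +answer <primary_domain>
```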
When performing a failover, we want to avoid a split-brain situation where
writes can occur in two different GitLab instances. So to prepare for the
failover, you must disable the **primary** site:
- If you have SSH access to the **primary** site, stop and disable GitLab:
```shell
sudo gitlab-ctl stop
```
Prevent GitLab from starting up again if the server unexpectedly reboots:
```shell
sudo systemctl disable gitlab-runsvdir
```
NOTE:
(**CentOS only**) In CentOS 6 or older, there is no easy way to prevent GitLab from being
started if the machine reboots (see [Omnibus GitLab issue #3058](https://gitlab.com/gitlab-org/omnibus-gitlab/-/issues/3058)).
It may be safest to uninstall the GitLab package completely with `sudo yum remove gitlab-ee`.
NOTE:
(**Ubuntu 14.04 LTS**) If you are using an older version of Ubuntu
or any other distribution based on the Upstart init system, you can prevent GitLab
from starting if the machine reboots by running, as `root`,
`initctl stop gitlab-runsvdir && echo 'manual' > /etc/init/gitlab-runsvdir.override && initctl reload-configuration`.
- If you do not have SSH access to the **primary** site, take the machine offline and
prevent it from rebooting. Since there are many ways you may prefer to accomplish
this, we avoid a single recommendation. You may have to:
- Reconfigure the load balancers.
- Change DNS records (for example, point the **primary** DNS record to the
2021-11-17 15:10:28 +00:00
**secondary** site to stop using the **primary** site).
- Stop the virtual servers.
- Block traffic through a firewall.
- Revoke object storage permissions from the **primary** site.
- Physically disconnect a machine.
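Whichever method you choose, if you still have SSH access to the **primary** site, confirm that no GitLab services are running before you continue:
```shell
# All services should be reported as down, or the command should fail entirely
# if the machine is already offline.
sudo gitlab-ctl status
```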
### Promoting the **secondary** site running GitLab 14.5 and later
1. SSH to every Sidekiq, PostgreSQL, and Gitaly node in the **secondary** site and run one of the following commands:
- To promote the secondary site to primary:
```shell
sudo gitlab-ctl geo promote
```
- To promote the secondary site to primary **without any further confirmation**:
```shell
sudo gitlab-ctl geo promote --force
```
1. SSH into each Rails node on your **secondary** site and run one of the following commands:
- To promote the secondary site to primary:
```shell
sudo gitlab-ctl geo promote
```
- To promote the secondary site to primary **without any further confirmation**:
```shell
sudo gitlab-ctl geo promote --force
```
1. Verify you can connect to the newly promoted **primary** site using the URL used
previously for the **secondary** site.
1. If successful, the **secondary** site is now promoted to the **primary** site.
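A minimal smoke test, assuming the site is reachable at `<secondary_site_url>` (a placeholder), is:
```shell
# Expect an HTTP 200 response from the sign-in page of the newly promoted site.
curl --silent --output /dev/null --write-out "%{http_code}\n" "https://<secondary_site_url>/users/sign_in"
```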
### Promoting the **secondary** site running GitLab 14.4 and earlier
WARNING:
The `gitlab-ctl promote-to-primary-node` and `gitlab-ctl promote-db` commands are
deprecated in GitLab 14.5 and later, and [removed in GitLab 15.0](https://gitlab.com/gitlab-org/gitlab/-/issues/345207).
Use `gitlab-ctl geo promote` instead.
NOTE:
A new **secondary** should not be added at this time. If you want to add a new
**secondary**, do this after you have completed the entire process of promoting
the **secondary** to the **primary**.
WARNING:
If you encounter an `ActiveRecord::RecordInvalid: Validation failed: Name has already been taken` error during this process, read
[the troubleshooting advice](../../replication/troubleshooting.md#fixing-errors-during-a-failover-or-when-promoting-a-secondary-to-a-primary-site).
The `gitlab-ctl promote-to-primary-node` command cannot be used yet in
conjunction with multiple servers, as it can only
perform changes on a **secondary** with only a single machine. Instead, you must
do this manually.
WARNING:
In GitLab 13.2 and 13.3, promoting a secondary site to a primary while the
secondary is paused fails. Do not pause replication before promoting a
secondary. If the site is paused, be sure to resume before promoting. This
issue has been fixed in GitLab 13.4 and later.
WARNING:
If the secondary site [has been paused](../../../geo/index.md#pausing-and-resuming-replication), this performs
a point-in-time recovery to the last known state.
Data that was created on the primary while the secondary was paused is lost.
1. SSH in to the PostgreSQL node in the **secondary** and promote PostgreSQL separately:
```shell
sudo gitlab-ctl promote-db
```
In GitLab 12.8 and earlier, see [Message: `sudo: gitlab-pg-ctl: command not found`](../../replication/troubleshooting.md#message-sudo-gitlab-pg-ctl-command-not-found).
1. Edit `/etc/gitlab/gitlab.rb` on every machine in the **secondary** to
reflect its new status as **primary** by removing any lines that enabled the
`geo_secondary_role`:
```ruby
## In pre-11.5 documentation, the role was enabled as follows. Remove this line.
geo_secondary_role['enable'] = true
## In 11.5+ documentation, the role was enabled as follows. Remove this line.
roles ['geo_secondary_role']
```
After making these changes, [reconfigure GitLab](../../../restart_gitlab.md#omnibus-gitlab-reconfigure) on each
machine so the changes take effect.
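On an Omnibus installation, that means running the following on every machine in the site:
```shell
sudo gitlab-ctl reconfigure
```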
1. Promote the **secondary** to **primary** . SSH into a single Rails node
server and execute:
```shell
sudo gitlab-rake geo:set_secondary_as_primary
```
1. Verify you can connect to the newly promoted **primary** using the URL used
previously for the **secondary**.
1. Success! The **secondary** has now been promoted to **primary**.
### Next steps
To regain geographic redundancy as quickly as possible, you should
[add a new **secondary** site](../../setup/index.md). To
do that, you can re-add the old **primary** as a new secondary and bring it back
online.