Add latest changes from gitlab-org/gitlab@master
This commit is contained in:
parent
66ce6a78f6
commit
32d52eb6dd
13 changed files with 322 additions and 19 deletions
@@ -15,27 +15,15 @@ In June 2019, Mario de la Ossa hosted a [Deep Dive] on GitLab's [Elasticsearch i
[Google Slides]: https://docs.google.com/presentation/d/1H-pCzI_LNrgrL5pJAIQgvLX8Ji0-jIKOg1QeJQzChug/edit
[PDF]: https://gitlab.com/gitlab-org/create-stage/uploads/c5aa32b6b07476fa8b597004899ec538/Elasticsearch_Deep_Dive.pdf

## Initial installation on OS X
## Supported Versions

It is recommended to use the Docker image. After installing Docker, you can immediately spin up an instance with:
See [Version Requirements](../integration/elasticsearch.md#version-requirements).

```
docker run --name elastic56 -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:5.6.12
```

Developers making significant changes to Elasticsearch queries should test their features against all our supported versions.

and use `docker stop elastic56` and `docker start elastic56` to stop/start it.
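
To confirm the container is reachable, you can query the cluster health endpoint. A minimal Ruby sketch, assuming the default port mapping from the `docker run` command above:

```ruby
require 'json'
require 'net/http'

# Query the local Dockerized instance's cluster health endpoint.
health = JSON.parse(Net::HTTP.get(URI('http://localhost:9200/_cluster/health')))
puts "status: #{health['status']}, nodes: #{health['number_of_nodes']}"
```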

## Setting up development environment

### Installing on the host

We currently only support Elasticsearch [5.6 to 6.x](../integration/elasticsearch.md#version-requirements).

Version 5.6 is available on Homebrew and is the recommended version to use in order to test compatibility.

```
brew install elasticsearch@5.6
```

There is no need to install any plugins.
See the [Elasticsearch GDK setup instructions](https://gitlab.com/gitlab-org/gitlab-development-kit/blob/master/doc/howto/elasticsearch.md).

## Helpful rake tasks

BIN doc/development/img/reference_architecture.png (new file, 110 KiB)
BIN doc/development/img/sidekiq_longest_running_jobs.png (new file, 33 KiB)
BIN doc/development/img/sidekiq_most_time_consuming_jobs.png (new file, 33 KiB)
295 doc/development/scalability.md (new file)
@@ -0,0 +1,295 @@
# GitLab scalability

This section describes the current architecture of GitLab as it relates to
scalability and reliability.

## Reference Architecture Overview

![Reference Architecture Diagram](img/reference_architecture.png)

_[diagram source - GitLab employees only](https://docs.google.com/drawings/d/1RTGtuoUrE0bDT-9smoHbFruhEMI4Ys6uNrufe5IA-VI/edit)_

The diagram above shows a GitLab reference architecture scaled up for 50,000
users. We will discuss each component below.

## Components

### PostgreSQL

The PostgreSQL database holds all metadata for projects, issues, merge
requests, users, etc. The schema is managed by the Rails application
[db/schema.rb](https://gitlab.com/gitlab-org/gitlab/blob/master/db/schema.rb).

GitLab Web/API servers and Sidekiq nodes talk directly to the database via a
Rails object relational model (ORM). Most SQL queries are generated through this
ORM, although some custom SQL is also written for performance or for
exploiting advanced PostgreSQL features (e.g. recursive CTEs, LATERAL JOINs,
etc.).

The application is tightly coupled to the database schema. When the
application starts, Rails queries the database schema, caching the tables and
column types for the data requested. Because of this schema cache, dropping a
column or table while the application is running can produce 500 errors for the
user. This is why we have a [process for dropping columns and other
no-downtime changes](what_requires_downtime.md).
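
As a sketch of that process: the model is first told to ignore the column so the schema cache stops referencing it, and only then is the column dropped in a post-deployment migration. The model and column names here are hypothetical:

```ruby
# Step 1 (deployed first): the ORM stops reading and writing the column,
# so the cached schema no longer depends on it.
class User < ApplicationRecord
  self.ignored_columns = %w[legacy_flag] # hypothetical column
end

# Step 2 (post-deployment migration): drop the column once no running
# code references it.
class RemoveUsersLegacyFlag < ActiveRecord::Migration[5.2]
  def change
    remove_column :users, :legacy_flag, :boolean
  end
end
```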

#### Multi-tenancy

A single database is used to store all customer data. Each user can belong to
many groups or projects, and the access level (e.g. guest, developer,
maintainer, etc.) to groups and projects determines what users can see and
what they can access.

Users with admin access can access all projects and even impersonate
users.

#### Sharding and partitioning

The database is not divided up in any way; currently all data lives in
one database in many different tables. This works for simple
applications, but as the data set grows, it becomes more challenging to
maintain and support one database with tables with many rows.

There are two ways to deal with this:

- Partitioning. Split up table data locally, within one database.
- Sharding. Distribute data across multiple databases.

Partitioning is a built-in PostgreSQL feature and requires minimal changes
in the application. However, it [requires PostgreSQL
11](https://www.2ndquadrant.com/en/blog/partitioning-evolution-postgresql-11/).

A natural way to partition is to [partition tables by
dates](https://gitlab.com/groups/gitlab-org/-/epics/2023). For example,
the `events` and `audit_events` tables are natural candidates for this
kind of partitioning.
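
As an illustrative sketch (not an actual GitLab migration), declarative range partitioning on PostgreSQL 11 looks roughly like this; note that the primary key must include the partition key:

```ruby
class PartitionAuditEventsByDate < ActiveRecord::Migration[5.2]
  def up
    execute <<~SQL
      CREATE TABLE audit_events_part (
        id bigserial,
        author_id bigint NOT NULL,
        details text,
        created_at timestamptz NOT NULL,
        PRIMARY KEY (id, created_at)
      ) PARTITION BY RANGE (created_at);

      -- One child table per month; queries filtered by created_at only
      -- touch the relevant partitions.
      CREATE TABLE audit_events_2019_12 PARTITION OF audit_events_part
        FOR VALUES FROM ('2019-12-01') TO ('2020-01-01');
    SQL
  end

  def down
    execute 'DROP TABLE audit_events_part;'
  end
end
```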

Sharding is likely more difficult and will require significant changes
to the schema and application. For example, if we have to store projects
in many different databases, we immediately run into the question, "How
can we retrieve data across different projects?" One answer is to hide
the database from the application behind a data-access API, but this is
a significant amount of work.

There are solutions that may help abstract the sharding to some extent
from the application. For example, we will want to look at [Citus
Data](https://www.citusdata.com/product/community) closely. Citus Data
provides a Rails plugin that adds a [tenant ID to ActiveRecord
models](https://www.citusdata.com/blog/2017/01/05/easily-scale-out-multi-tenant-apps/).
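
For illustration, the `activerecord-multi-tenant` gem from Citus scopes every query to one tenant roughly like this (the tenant association here is a hypothetical mapping onto GitLab's data model):

```ruby
require 'activerecord-multi-tenant'

class Project < ApplicationRecord
  # Distribute rows by tenant; queries are routed to that tenant's shard.
  multi_tenant :namespace # hypothetical tenant for GitLab
end

# Inside the block, queries are automatically filtered by the tenant ID,
# letting Citus route them to a single shard.
MultiTenant.with(current_namespace) do
  Project.where(archived: false).count
end
```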

Sharding can also be done based on feature verticals. This is the
microservice approach to sharding, where each service represents a
bounded context and operates on its own service-specific database
cluster. In that model data wouldn't be distributed according to some
internal key (such as tenant IDs) but based on team and product
ownership. It shares a lot of challenges with traditional, data-oriented
sharding, however. For instance, joining data has to happen in the
application itself rather than on the query layer (although additional
layers like GraphQL might mitigate that) and it requires true
parallelism to run efficiently (i.e. a scatter-gather model to collect,
then zip up data records), which is a challenge in itself in Ruby-based
systems.

#### Database size

A recent [database checkup shows a breakdown of the table sizes on
GitLab.com](https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8022#master-1022016101-8).
Since `merge_request_diff_files` contains over 1 TB of data, we will want to
reduce/eliminate this table first. GitLab has support for [storing diffs in
object storage](../administration/merge_request_diffs.html), which we [will
want to do on
GitLab.com](https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7356).

#### High availability

There are several strategies to provide high availability and redundancy:

1. Write-ahead logs (WAL) streamed to object storage (e.g. S3, Google Cloud
   Storage).
1. Read replicas (hot backups).
1. Delayed replicas.

To restore a database to a point in time, a base backup needs to have
been taken prior to that incident. Once a database has been restored from
that backup, the database can apply the WAL logs in order until the
database has reached the target time.

On GitLab.com, Consul and Patroni work together to coordinate failovers with
the read replicas. [Omnibus ships with repmgr instead of
Consul](../administration/high_availability/database.md).

#### Load-balancing

GitLab EE has [application support for load balancing using read
replicas](../administration/database_load_balancing.md). This load
balancer does some smart things that are not traditionally available in
standard load balancers. For example, the application will only consider a
replica if its replication lag is low (e.g. WAL data behind by < 100
megabytes).

More [details are in a blog
post](https://about.gitlab.com/2017/10/02/scaling-the-gitlab-database/).
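
The idea behind the lag check can be sketched with the `pg` gem (the host name is illustrative, and the real load balancer performs a more elaborate check):

```ruby
require 'pg'

MAX_LAG_BYTES = 100 * 1024 * 1024 # the ~100 megabyte threshold above

# Ask a replica how far its replayed WAL trails what it has received.
replica = PG.connect(host: 'replica.example.com', dbname: 'gitlabhq_production')
lag_bytes = replica.exec(<<~SQL).getvalue(0, 0).to_i
  SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn())
SQL

puts lag_bytes < MAX_LAG_BYTES ? 'replica usable' : 'replica too far behind'
```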

### PgBouncer

As PostgreSQL forks a backend process for each connection, PostgreSQL has a
finite limit of connections that it can support, typically around 300 by
default. Without a connection pooler like PgBouncer, it's quite possible to
hit connection limits. Once the limits are reached, GitLab will generate
errors or slow down as it waits for a connection to be available.

#### High availability

PgBouncer is a single-threaded process. Under heavy traffic, PgBouncer can
saturate a single core, which can result in slower response times for
background job and/or Web requests. There are two ways to address this
limitation:

1. Run multiple PgBouncer instances.
1. Use a multi-threaded connection pooler (e.g.
   [Odyssey](https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7776)).

On some Linux systems, it's possible to run [multiple PgBouncer instances on
the same port](https://gitlab.com/gitlab-org/omnibus-gitlab/issues/4796).

On GitLab.com, we run multiple PgBouncer instances on different ports to
avoid saturating a single core.

In addition, the PgBouncer instances that communicate with the primary
and secondaries are set up a bit differently:

1. Multiple PgBouncer instances in different availability zones talk to the
   PostgreSQL primary.
1. Multiple PgBouncer processes are colocated with PostgreSQL read replicas.

For replicas, colocating is advantageous because it reduces network hops
and hence latency. However, for the primary, colocating is
disadvantageous because PgBouncer would become a single point of failure
and cause errors. When a failover occurs, one of two things could
happen:

- The primary disappears from the network.
- The primary becomes a replica.

In the first case, if PgBouncer is colocated with the primary, database
connections would time out or fail to connect, and downtime would
occur. Having multiple PgBouncer instances behind a load balancer
talking to the primary can mitigate this.

In the second case, existing connections to the newly demoted primary
(now a replica) may execute a write query, which would fail. During a
failover, it may be advantageous to shut down the PgBouncer talking to
the primary to ensure no more traffic arrives for it. The alternative
would be to make the application aware of the failover event and
terminate its connections gracefully.

### Redis

There are three ways Redis is used in GitLab:

- Queues. Sidekiq marshals jobs into JSON payloads.
- Persistent state. Session data, exclusive leases, etc.
- Cache. Repository data (e.g. branch and tag names), view partials, etc.

For GitLab instances running at scale, splitting Redis usage into
separate Redis clusters helps for two reasons:

- Each has different persistence requirements.
- Load isolation.

For example, the cache instance can behave like a least-recently-used
(LRU) cache by setting the `maxmemory` configuration option. That option
should not be set for the queues or persistent clusters because data
would be evicted from memory at random times. This would cause jobs to
be dropped on the floor, which would cause many problems (e.g. merges
not running, builds not updating, etc.).
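
A sketch of the difference in configuration, using the `redis` gem with illustrative host names (in production these settings would normally live in `redis.conf`):

```ruby
require 'redis'

# Cache instance: bounded memory, evict least-recently-used keys first.
cache = Redis.new(url: 'redis://redis-cache.internal:6379')
cache.config(:set, 'maxmemory', '4gb')
cache.config(:set, 'maxmemory-policy', 'allkeys-lru')

# Queue/persistent-state instance: never evict, so jobs, sessions, and
# leases cannot be silently dropped.
queues = Redis.new(url: 'redis://redis-queues.internal:6379')
queues.config(:set, 'maxmemory-policy', 'noeviction')
```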

Sidekiq also polls its queues quite frequently, and this activity can
slow down other queries. For this reason, having a dedicated Redis
cluster for Sidekiq can help improve performance and reduce load on the
Redis process.
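
Pointing Sidekiq at its own Redis is a small configuration change; a sketch with an illustrative URL:

```ruby
# config/initializers/sidekiq.rb
sidekiq_redis = { url: 'redis://redis-queues.internal:6379' }

Sidekiq.configure_server { |config| config.redis = sidekiq_redis }
Sidekiq.configure_client { |config| config.redis = sidekiq_redis }
```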

#### High availability/Risks

1. Single-core: Like PgBouncer, a single Redis process can only use one
   core. It does not support multi-threading.

1. Dumb secondaries: Redis secondaries (aka slaves) don't actually
   handle any load. Unlike PostgreSQL secondaries, they don't even serve
   read queries. They simply replicate data from the primary and take over
   only when the primary fails.

### Redis Sentinels

[Redis Sentinel](https://redis.io/topics/sentinel) provides high
availability for Redis by watching the primary. If multiple Sentinels
detect that the primary has gone away, the Sentinels perform an
election to determine a new leader.

#### Failure Modes

No leader: A Redis cluster can get into a state where there are no
primaries. For example, this can happen if Redis nodes are misconfigured
to follow the wrong node. Sometimes this requires forcing one node to
become a primary via the [`SLAVEOF NO ONE`
command](https://redis.io/commands/slaveof).
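
The promotion can be issued from `redis-cli` or any client; a sketch with the `redis` gem (the host name is illustrative):

```ruby
require 'redis'

# Promote a stuck node: stop replicating anyone and accept writes again.
node = Redis.new(host: 'redis-03.internal', port: 6379)
node.slaveof('no', 'one') # issues SLAVEOF NO ONE
```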

### Sidekiq

Sidekiq is a multi-threaded background job processing system used in
Ruby on Rails applications. In GitLab, Sidekiq performs the heavy
lifting of many activities, including:

1. Updating merge requests after a push
1. Sending e-mails
1. Updating user authorizations
1. Processing CI builds and pipelines

The full list of jobs can be found in the
[app/workers](https://gitlab.com/gitlab-org/gitlab/tree/master/app/workers)
and
[ee/app/workers](https://gitlab.com/gitlab-org/gitlab/tree/master/ee/app/workers)
directories in the GitLab code base.
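
A minimal sketch of the shape these workers take (the class and mailer here are hypothetical, not real GitLab workers):

```ruby
class ExampleNotificationWorker
  include ApplicationWorker # GitLab's wrapper around Sidekiq::Worker

  # Arguments must be simple JSON-serializable values (IDs rather than
  # objects), because Sidekiq marshals them into a JSON payload in Redis.
  def perform(user_id)
    user = User.find_by(id: user_id)
    return unless user

    Notify.example_email(user.id).deliver_now # hypothetical mailer
  end
end

# Enqueue asynchronously; a Sidekiq thread picks the job up from Redis.
ExampleNotificationWorker.perform_async(user.id)
```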

#### Runaway Queues

As jobs are added to the Sidekiq queue, Sidekiq worker threads need to
pull these jobs from the queue and finish them at a rate faster than
they are added. When an imbalance occurs (e.g. delays in the database,
slow jobs, etc.), Sidekiq queues can balloon and lead to runaway queues.

In recent months, many of these queues have ballooned due to delays in
PostgreSQL, PgBouncer, and Redis. For example, PgBouncer saturation can
cause jobs to wait a few seconds before obtaining a database connection,
which can cascade into a large slowdown. Optimizing these basic
interconnections comes first.

However, there are a number of strategies to ensure queues get drained
in a timely manner:

- Add more processing capacity. This can be done by spinning up more
  instances of Sidekiq or [Sidekiq Cluster](../administration/operations/extra_sidekiq_processes.md).
- Split jobs into smaller units of work. For example, `PostReceive`
  used to process each commit message in the push, but now it farms this
  out to `ProcessCommitWorker` (see the sketch after this list).
- Redistribute/gerrymander Sidekiq processes by queue
  types. Long-running jobs (e.g. relating to project import) can often
  squeeze out jobs that run fast (e.g. delivering e-mail). [This technique
  was used to optimize our existing Sidekiq deployment](https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7219#note_218019483).
- Optimize jobs. Eliminating unnecessary work, reducing network calls
  (e.g. SQL, Gitaly, etc.), and optimizing processor time can yield significant
  benefits.
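
A simplified sketch of the fan-out pattern from the second bullet; the real `PostReceive` worker does considerably more than this:

```ruby
class PostReceive
  include ApplicationWorker

  def perform(project_id, commit_ids)
    # Fan out one small, independently retryable job per commit instead
    # of processing the whole push inline.
    commit_ids.each do |commit_id|
      ProcessCommitWorker.perform_async(project_id, commit_id)
    end
  end
end

class ProcessCommitWorker
  include ApplicationWorker

  def perform(project_id, commit_id)
    # Process a single commit: cross-references, issue closures, etc.
  end
end
```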

From the Sidekiq logs, it's possible to see which jobs run the most
frequently and/or take the longest. For example, this Kibana
visualization shows the jobs that consume the most total time:

![Most time-consuming Sidekiq jobs](img/sidekiq_most_time_consuming_jobs.png)

_[visualization source - GitLab employees only](https://log.gitlab.net/goto/2c036582dfc3219eeaa49a76eab2564b)_

This visualization shows the jobs that had the longest durations:

![Longest running Sidekiq jobs](img/sidekiq_longest_running_jobs.png)

_[visualization source - GitLab employees only](https://log.gitlab.net/goto/494f6c8afb61d98c4ff264520d184416)_

@@ -377,6 +377,11 @@ This operation is safe as there's no code using the table just yet.
Dropping tables can be done safely using a post-deployment migration, but only
if the application no longer uses the table.

## Renaming Tables

Renaming tables requires downtime as an application may continue
using the old table name during/after a database migration.

## Adding Foreign Keys

Adding foreign keys usually works in 3 steps:

5 doc/user/instance_statistics/convdev.md (new file)
@@ -0,0 +1,5 @@
---
redirect_to: '../instance_statistics/dev_ops_score.md'
---

This document was moved to [another location](../instance_statistics/dev_ops_score.md).

@@ -1,5 +1,6 @@
# DevOps Score

> [Introduced](https://gitlab.com/gitlab-org/gitlab-foss/issues/30469) in GitLab 9.3.
> [Renamed from Conversational Development Index](https://gitlab.com/gitlab-org/gitlab/issues/20976) in GitLab 12.6.

NOTE: **Note:**

@@ -177,7 +177,7 @@ module QA
# The number of selectors should be able to be reduced after
# migration to the new spinner is complete.
# https://gitlab.com/groups/gitlab-org/-/epics/956
has_no_css?('.gl-spinner, .fa-spinner, .spinner', wait: Capybara.default_max_wait_time)
has_no_css?('.gl-spinner, .fa-spinner, .spinner', wait: QA::Support::Repeater::DEFAULT_MAX_WAIT_TIME)
end

def finished_loading_block?

@@ -155,6 +155,8 @@ module QA
def merge!
  click_element :merge_button if ready_to_merge?

  finished_loading?

  raise "Merge did not appear to be successful" unless merged?
end

@@ -31,6 +31,12 @@ module QA
  resource_web_url(api_post)
end

def reload!
  api_get

  self
end

def remove_via_api!
  api_delete
end

@@ -16,6 +16,7 @@ module QA

attribute :id
attribute :name
attribute :runners_token

def initialize
  @path = Runtime::Namespace.name

@@ -7,7 +7,7 @@ require 'rubygems'

# In newer Ruby, alias_method is not private, so we don't need __send__
singleton_class.__send__(:alias_method, :require_dependency, :require) # rubocop:disable GitlabSecurity/PublicSend
$:.unshift(File.expand_path('../lib', __dir__))
$:.unshift(File.expand_path('../../lib', __dir__))

require 'rspec_flaky/report'