Add latest changes from gitlab-org/gitlab@master
This commit is contained in:
parent
66ce6a78f6
commit
32d52eb6dd
13 changed files with 322 additions and 19 deletions
@@ -15,27 +15,15 @@ In June 2019, Mario de la Ossa hosted a [Deep Dive] on GitLab's [Elasticsearch i
[Google Slides]: https://docs.google.com/presentation/d/1H-pCzI_LNrgrL5pJAIQgvLX8Ji0-jIKOg1QeJQzChug/edit
[PDF]: https://gitlab.com/gitlab-org/create-stage/uploads/c5aa32b6b07476fa8b597004899ec538/Elasticsearch_Deep_Dive.pdf

## Initial installation on OS X
## Supported Versions

It is recommended to use the Docker image. After installing Docker, you can immediately spin up an instance with:
See [Version Requirements](../integration/elasticsearch.md#version-requirements).

```
docker run --name elastic56 -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:5.6.12
```

Developers making significant changes to Elasticsearch queries should test their features against all our supported versions.

and use `docker stop elastic56` and `docker start elastic56` to stop/start it.
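
To confirm the container is reachable, you can query the cluster health endpoint. A minimal Ruby sketch, assuming the default port mapping from the `docker run` command above:

```ruby
require 'json'
require 'net/http'

# Query the local Dockerized instance's cluster health endpoint.
health = JSON.parse(Net::HTTP.get(URI('http://localhost:9200/_cluster/health')))
puts "status: #{health['status']}, nodes: #{health['number_of_nodes']}"
```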

## Setting up development environment

### Installing on the host

We currently only support Elasticsearch [5.6 to 6.x](../integration/elasticsearch.md#version-requirements).

Version 5.6 is available on Homebrew and is the recommended version to use in order to test compatibility.

```
brew install elasticsearch@5.6
```

There is no need to install any plugins.
See the [Elasticsearch GDK setup instructions](https://gitlab.com/gitlab-org/gitlab-development-kit/blob/master/doc/howto/elasticsearch.md).

## Helpful rake tasks

BIN doc/development/img/reference_architecture.png (new file, 110 KiB)
BIN doc/development/img/sidekiq_longest_running_jobs.png (new file, 33 KiB)
BIN doc/development/img/sidekiq_most_time_consuming_jobs.png (new file, 33 KiB)
295 doc/development/scalability.md (new file)
@@ -0,0 +1,295 @@
# GitLab scalability

This section describes the current architecture of GitLab as it relates to
scalability and reliability.

## Reference Architecture Overview

![Reference Architecture Diagram](img/reference_architecture.png)

_[diagram source - GitLab employees only](https://docs.google.com/drawings/d/1RTGtuoUrE0bDT-9smoHbFruhEMI4Ys6uNrufe5IA-VI/edit)_

The diagram above shows a GitLab reference architecture scaled up for 50,000
users. We will discuss each component below.

## Components

### PostgreSQL

The PostgreSQL database holds all metadata for projects, issues, merge
requests, users, etc. The schema is managed by the Rails application
[db/schema.rb](https://gitlab.com/gitlab-org/gitlab/blob/master/db/schema.rb).

GitLab Web/API servers and Sidekiq nodes talk directly to the database via a
Rails object relational model (ORM). Most SQL queries are generated through this
ORM, although some custom SQL is also written for performance or for
exploiting advanced PostgreSQL features (e.g. recursive CTEs, LATERAL JOINs,
etc.).

The application is tightly coupled to the database schema. When the
application starts, Rails queries the database schema, caching the tables and
column types for the data requested. Because of this schema cache, dropping a
column or table while the application is running can produce 500 errors for the
user. This is why we have a [process for dropping columns and other
no-downtime changes](what_requires_downtime.md).
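
As a sketch of that process: the model is first told to ignore the column so the schema cache stops referencing it, and only then is the column dropped in a post-deployment migration. The model and column names here are hypothetical:

```ruby
# Step 1 (deployed first): the ORM stops reading and writing the column,
# so the cached schema no longer depends on it.
class User < ApplicationRecord
  self.ignored_columns = %w[legacy_flag] # hypothetical column
end

# Step 2 (post-deployment migration): drop the column once no running
# code references it.
class RemoveUsersLegacyFlag < ActiveRecord::Migration[5.2]
  def change
    remove_column :users, :legacy_flag, :boolean
  end
end
```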

#### Multi-tenancy

A single database is used to store all customer data. Each user can belong to
many groups or projects, and the access level (e.g. guest, developer,
maintainer, etc.) to groups and projects determines what users can see and
what they can access.

Users with admin access can access all projects and even impersonate
users.

#### Sharding and partitioning

The database is not divided up in any way; currently all data lives in
one database in many different tables. This works for simple
applications, but as the data set grows, it becomes more challenging to
maintain and support one database with tables with many rows.

There are two ways to deal with this:

- Partitioning. Split up table data locally, within one database.
- Sharding. Distribute data across multiple databases.

Partitioning is a built-in PostgreSQL feature and requires minimal changes
in the application. However, it [requires PostgreSQL
11](https://www.2ndquadrant.com/en/blog/partitioning-evolution-postgresql-11/).

A natural way to partition is to [partition tables by
dates](https://gitlab.com/groups/gitlab-org/-/epics/2023). For example,
the `events` and `audit_events` tables are natural candidates for this
kind of partitioning.
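
As an illustrative sketch (not an actual GitLab migration), declarative range partitioning on PostgreSQL 11 looks roughly like this; note that the primary key must include the partition key:

```ruby
class PartitionAuditEventsByDate < ActiveRecord::Migration[5.2]
  def up
    execute <<~SQL
      CREATE TABLE audit_events_part (
        id bigserial,
        author_id bigint NOT NULL,
        details text,
        created_at timestamptz NOT NULL,
        PRIMARY KEY (id, created_at)
      ) PARTITION BY RANGE (created_at);

      -- One child table per month; queries filtered by created_at only
      -- touch the relevant partitions.
      CREATE TABLE audit_events_2019_12 PARTITION OF audit_events_part
        FOR VALUES FROM ('2019-12-01') TO ('2020-01-01');
    SQL
  end

  def down
    execute 'DROP TABLE audit_events_part;'
  end
end
```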

Sharding is likely more difficult and will require significant changes
to the schema and application. For example, if we have to store projects
in many different databases, we immediately run into the question, "How
can we retrieve data across different projects?" One answer is to hide
the database from the application behind a data-access API, but this is
a significant amount of work.

There are solutions that may help abstract the sharding to some extent
from the application. For example, we will want to look at [Citus
Data](https://www.citusdata.com/product/community) closely. Citus Data
provides a Rails plugin that adds a [tenant ID to ActiveRecord
models](https://www.citusdata.com/blog/2017/01/05/easily-scale-out-multi-tenant-apps/).
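
For illustration, the `activerecord-multi-tenant` gem from Citus scopes every query to one tenant roughly like this (the tenant association here is a hypothetical mapping onto GitLab's data model):

```ruby
require 'activerecord-multi-tenant'

class Project < ApplicationRecord
  # Distribute rows by tenant; queries are routed to that tenant's shard.
  multi_tenant :namespace # hypothetical tenant for GitLab
end

# Inside the block, queries are automatically filtered by the tenant ID,
# letting Citus route them to a single shard.
MultiTenant.with(current_namespace) do
  Project.where(archived: false).count
end
```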

Sharding can also be done based on feature verticals. This is the
microservice approach to sharding, where each service represents a
bounded context and operates on its own service-specific database
cluster. In that model data wouldn't be distributed according to some
internal key (such as tenant IDs) but based on team and product
ownership. It shares a lot of challenges with traditional, data-oriented
sharding, however. For instance, joining data has to happen in the
application itself rather than on the query layer (although additional
layers like GraphQL might mitigate that) and it requires true
parallelism to run efficiently (i.e. a scatter-gather model to collect,
then zip up data records), which is a challenge in itself in Ruby-based
systems.

#### Database size

A recent [database checkup shows a breakdown of the table sizes on
GitLab.com](https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8022#master-1022016101-8).
Since `merge_request_diff_files` contains over 1 TB of data, we will want to
reduce/eliminate this table first. GitLab has support for [storing diffs in
object storage](../administration/merge_request_diffs.html), which we [will
want to do on
GitLab.com](https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7356).

#### High availability

There are several strategies to provide high availability and redundancy:

1. Write-ahead logs (WAL) streamed to object storage (e.g. S3, Google Cloud
   Storage).
1. Read replicas (hot backups).
1. Delayed replicas.

To restore a database to a point in time, a base backup needs to have
been taken prior to that incident. Once a database has been restored from
that backup, the database can apply the WAL logs in order until the
database has reached the target time.

On GitLab.com, Consul and Patroni work together to coordinate failovers with
the read replicas. [Omnibus ships with repmgr instead of
Consul](../administration/high_availability/database.md).

#### Load-balancing

GitLab EE has [application support for load balancing using read
replicas](../administration/database_load_balancing.md). This load
balancer does some smart things that are not traditionally available in
standard load balancers. For example, the application will only consider a
replica if its replication lag is low (e.g. WAL data behind by < 100
megabytes).

More [details are in a blog
post](https://about.gitlab.com/2017/10/02/scaling-the-gitlab-database/).
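
The idea behind the lag check can be sketched with the `pg` gem (the host name is illustrative, and the real load balancer performs a more elaborate check):

```ruby
require 'pg'

MAX_LAG_BYTES = 100 * 1024 * 1024 # the ~100 megabyte threshold above

# Ask a replica how far its replayed WAL trails what it has received.
replica = PG.connect(host: 'replica.example.com', dbname: 'gitlabhq_production')
lag_bytes = replica.exec(<<~SQL).getvalue(0, 0).to_i
  SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn())
SQL

puts lag_bytes < MAX_LAG_BYTES ? 'replica usable' : 'replica too far behind'
```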

### PgBouncer

As PostgreSQL forks a backend process for each connection, PostgreSQL has a
finite limit of connections that it can support, typically around 300 by
default. Without a connection pooler like PgBouncer, it's quite possible to
hit connection limits. Once the limits are reached, GitLab will generate
errors or slow down as it waits for a connection to be available.

#### High availability

PgBouncer is a single-threaded process. Under heavy traffic, PgBouncer can
saturate a single core, which can result in slower response times for
background job and/or Web requests. There are two ways to address this
limitation:

1. Run multiple PgBouncer instances.
1. Use a multi-threaded connection pooler (e.g.
   [Odyssey](https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7776)).

On some Linux systems, it's possible to run [multiple PgBouncer instances on
the same port](https://gitlab.com/gitlab-org/omnibus-gitlab/issues/4796).

On GitLab.com, we run multiple PgBouncer instances on different ports to
avoid saturating a single core.

In addition, the PgBouncer instances that communicate with the primary
and secondaries are set up a bit differently:

1. Multiple PgBouncer instances in different availability zones talk to the
   PostgreSQL primary.
1. Multiple PgBouncer processes are colocated with PostgreSQL read replicas.

For replicas, colocating is advantageous because it reduces network hops
and hence latency. However, for the primary, colocating is
disadvantageous because PgBouncer would become a single point of failure
and cause errors. When a failover occurs, one of two things could
happen:

- The primary disappears from the network.
- The primary becomes a replica.

In the first case, if PgBouncer is colocated with the primary, database
connections would time out or fail to connect, and downtime would
occur. Having multiple PgBouncer instances behind a load balancer
talking to the primary can mitigate this.

In the second case, existing connections to the newly demoted primary
(now a replica) may execute a write query, which would fail. During a
failover, it may be advantageous to shut down the PgBouncer talking to
the primary to ensure no more traffic arrives for it. The alternative
would be to make the application aware of the failover event and
terminate its connections gracefully.

### Redis

There are three ways Redis is used in GitLab:

- Queues. Sidekiq marshals jobs into JSON payloads.
- Persistent state. Session data, exclusive leases, etc.
- Cache. Repository data (e.g. branch and tag names), view partials, etc.

For GitLab instances running at scale, splitting Redis usage into
separate Redis clusters helps for two reasons:

- Each has different persistence requirements.
- Load isolation.

For example, the cache instance can behave like a least-recently-used
(LRU) cache by setting the `maxmemory` configuration option. That option
should not be set for the queues or persistent clusters because data
would be evicted from memory at random times. This would cause jobs to
be dropped on the floor, which would cause many problems (e.g. merges
not running, builds not updating, etc.).
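
A sketch of the difference in configuration, using the `redis` gem with illustrative host names (in production these settings would normally live in `redis.conf`):

```ruby
require 'redis'

# Cache instance: bounded memory, evict least-recently-used keys first.
cache = Redis.new(url: 'redis://redis-cache.internal:6379')
cache.config(:set, 'maxmemory', '4gb')
cache.config(:set, 'maxmemory-policy', 'allkeys-lru')

# Queue/persistent-state instance: never evict, so jobs, sessions, and
# leases cannot be silently dropped.
queues = Redis.new(url: 'redis://redis-queues.internal:6379')
queues.config(:set, 'maxmemory-policy', 'noeviction')
```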

Sidekiq also polls its queues quite frequently, and this activity can
slow down other queries. For this reason, having a dedicated Redis
cluster for Sidekiq can help improve performance and reduce load on the
Redis process.
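
Pointing Sidekiq at its own Redis is a small configuration change; a sketch with an illustrative URL:

```ruby
# config/initializers/sidekiq.rb
sidekiq_redis = { url: 'redis://redis-queues.internal:6379' }

Sidekiq.configure_server { |config| config.redis = sidekiq_redis }
Sidekiq.configure_client { |config| config.redis = sidekiq_redis }
```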

#### High availability/Risks

1. Single-core: Like PgBouncer, a single Redis process can only use one
   core. It does not support multi-threading.

1. Dumb secondaries: Redis secondaries (aka slaves) don't actually
   handle any load. Unlike PostgreSQL secondaries, they don't even serve
   read queries. They simply replicate data from the primary and take over
   only when the primary fails.

### Redis Sentinels

[Redis Sentinel](https://redis.io/topics/sentinel) provides high
availability for Redis by watching the primary. If multiple Sentinels
detect that the primary has gone away, the Sentinels perform an
election to determine a new leader.

#### Failure Modes

No leader: A Redis cluster can get into a state where there are no
primaries. For example, this can happen if Redis nodes are misconfigured
to follow the wrong node. Sometimes this requires forcing one node to
become a primary via the [`SLAVEOF NO ONE`
command](https://redis.io/commands/slaveof).
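
The promotion can be issued from `redis-cli` or any client; a sketch with the `redis` gem (the host name is illustrative):

```ruby
require 'redis'

# Promote a stuck node: stop replicating anyone and accept writes again.
node = Redis.new(host: 'redis-03.internal', port: 6379)
node.slaveof('no', 'one') # issues SLAVEOF NO ONE
```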

### Sidekiq

Sidekiq is a multi-threaded background job processing system used in
Ruby on Rails applications. In GitLab, Sidekiq performs the heavy
lifting of many activities, including:

1. Updating merge requests after a push
1. Sending e-mails
1. Updating user authorizations
1. Processing CI builds and pipelines

The full list of jobs can be found in the
[app/workers](https://gitlab.com/gitlab-org/gitlab/tree/master/app/workers)
and
[ee/app/workers](https://gitlab.com/gitlab-org/gitlab/tree/master/ee/app/workers)
directories in the GitLab code base.
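
A minimal sketch of the shape these workers take (the class and mailer here are hypothetical, not real GitLab workers):

```ruby
class ExampleNotificationWorker
  include ApplicationWorker # GitLab's wrapper around Sidekiq::Worker

  # Arguments must be simple JSON-serializable values (IDs rather than
  # objects), because Sidekiq marshals them into a JSON payload in Redis.
  def perform(user_id)
    user = User.find_by(id: user_id)
    return unless user

    Notify.example_email(user.id).deliver_now # hypothetical mailer
  end
end

# Enqueue asynchronously; a Sidekiq thread picks the job up from Redis.
ExampleNotificationWorker.perform_async(user.id)
```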

#### Runaway Queues

As jobs are added to the Sidekiq queue, Sidekiq worker threads need to
pull these jobs from the queue and finish them at a rate faster than
they are added. When an imbalance occurs (e.g. delays in the database,
slow jobs, etc.), Sidekiq queues can balloon and lead to runaway queues.

In recent months, many of these queues have ballooned due to delays in
PostgreSQL, PgBouncer, and Redis. For example, PgBouncer saturation can
cause jobs to wait a few seconds before obtaining a database connection,
which can cascade into a large slowdown. Optimizing these basic
interconnections comes first.

However, there are a number of strategies to ensure queues get drained
in a timely manner:

- Add more processing capacity. This can be done by spinning up more
  instances of Sidekiq or [Sidekiq Cluster](../administration/operations/extra_sidekiq_processes.md).
- Split jobs into smaller units of work. For example, `PostReceive`
  used to process each commit message in the push, but now it farms this
  out to `ProcessCommitWorker` (see the sketch after this list).
- Redistribute/gerrymander Sidekiq processes by queue
  types. Long-running jobs (e.g. relating to project import) can often
  squeeze out jobs that run fast (e.g. delivering e-mail). [This technique
  was used to optimize our existing Sidekiq deployment](https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7219#note_218019483).
- Optimize jobs. Eliminating unnecessary work, reducing network calls
  (e.g. SQL, Gitaly, etc.), and optimizing processor time can yield significant
  benefits.
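
A simplified sketch of the fan-out pattern from the second bullet; the real `PostReceive` worker does considerably more than this:

```ruby
class PostReceive
  include ApplicationWorker

  def perform(project_id, commit_ids)
    # Fan out one small, independently retryable job per commit instead
    # of processing the whole push inline.
    commit_ids.each do |commit_id|
      ProcessCommitWorker.perform_async(project_id, commit_id)
    end
  end
end

class ProcessCommitWorker
  include ApplicationWorker

  def perform(project_id, commit_id)
    # Process a single commit: cross-references, issue closures, etc.
  end
end
```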

From the Sidekiq logs, it's possible to see which jobs run the most
frequently and/or take the longest. For example, this Kibana
visualization shows the jobs that consume the most total time:

![Most time-consuming Sidekiq jobs](img/sidekiq_most_time_consuming_jobs.png)

_[visualization source - GitLab employees only](https://log.gitlab.net/goto/2c036582dfc3219eeaa49a76eab2564b)_

This visualization shows the jobs that had the longest durations:

![Longest running Sidekiq jobs](img/sidekiq_longest_running_jobs.png)

_[visualization source - GitLab employees only](https://log.gitlab.net/goto/494f6c8afb61d98c4ff264520d184416)_

@@ -377,6 +377,11 @@ This operation is safe as there's no code using the table just yet.
Dropping tables can be done safely using a post-deployment migration, but only
if the application no longer uses the table.

## Renaming Tables

Renaming tables requires downtime as an application may continue
using the old table name during/after a database migration.

## Adding Foreign Keys

Adding foreign keys usually works in 3 steps:

5 doc/user/instance_statistics/convdev.md (new file)
@@ -0,0 +1,5 @@
---
redirect_to: '../instance_statistics/dev_ops_score.md'
---

This document was moved to [another location](../instance_statistics/dev_ops_score.md).

@@ -1,5 +1,6 @@
# DevOps Score

> [Introduced](https://gitlab.com/gitlab-org/gitlab-foss/issues/30469) in GitLab 9.3.
> [Renamed from Conversational Development Index](https://gitlab.com/gitlab-org/gitlab/issues/20976) in GitLab 12.6.

NOTE: **Note:**

@@ -177,7 +177,7 @@ module QA
# The number of selectors should be able to be reduced after
# migration to the new spinner is complete.
# https://gitlab.com/groups/gitlab-org/-/epics/956
has_no_css?('.gl-spinner, .fa-spinner, .spinner', wait: Capybara.default_max_wait_time)
has_no_css?('.gl-spinner, .fa-spinner, .spinner', wait: QA::Support::Repeater::DEFAULT_MAX_WAIT_TIME)
end

def finished_loading_block?

@@ -155,6 +155,8 @@ module QA
def merge!
  click_element :merge_button if ready_to_merge?

  finished_loading?

  raise "Merge did not appear to be successful" unless merged?
end

@@ -31,6 +31,12 @@ module QA
  resource_web_url(api_post)
end

def reload!
  api_get

  self
end

def remove_via_api!
  api_delete
end

@@ -16,6 +16,7 @@ module QA

attribute :id
attribute :name
attribute :runners_token

def initialize
  @path = Runtime::Namespace.name

@@ -7,7 +7,7 @@ require 'rubygems'

# In newer Ruby, alias_method is not private, so we don't need __send__
singleton_class.__send__(:alias_method, :require_dependency, :require) # rubocop:disable GitlabSecurity/PublicSend
$:.unshift(File.expand_path('../lib', __dir__))
$:.unshift(File.expand_path('../../lib', __dir__))

require 'rspec_flaky/report'