diff --git a/doc/development/elasticsearch.md b/doc/development/elasticsearch.md index bd8a4e1c6d7..e3ec69b6489 100644 --- a/doc/development/elasticsearch.md +++ b/doc/development/elasticsearch.md @@ -15,27 +15,15 @@ In June 2019, Mario de la Ossa hosted a [Deep Dive] on GitLab's [Elasticsearch i [Google Slides]: https://docs.google.com/presentation/d/1H-pCzI_LNrgrL5pJAIQgvLX8Ji0-jIKOg1QeJQzChug/edit [PDF]: https://gitlab.com/gitlab-org/create-stage/uploads/c5aa32b6b07476fa8b597004899ec538/Elasticsearch_Deep_Dive.pdf -## Initial installation on OS X +## Supported Versions -It is recommended to use the Docker image. After installing docker you can immediately spin up an instance with +See [Version Requirements](../integration/elasticsearch.md#version-requirements). -``` -docker run --name elastic56 -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:5.6.12 -``` +Developers making significant changes to Elasticsearch queries should test their features against all our supported versions. -and use `docker stop elastic56` and `docker start elastic56` to stop/start it. +## Setting up development environment -### Installing on the host - -We currently only support Elasticsearch [5.6 to 6.x](../integration/elasticsearch.md#version-requirements) - -Version 5.6 is available on homebrew and is the recommended version to use in order to test compatibility. - -``` -brew install elasticsearch@5.6 -``` - -There is no need to install any plugins +See the [Elasticsearch GDK setup instructions](https://gitlab.com/gitlab-org/gitlab-development-kit/blob/master/doc/howto/elasticsearch.md) ## Helpful rake tasks diff --git a/doc/development/img/reference_architecture.png b/doc/development/img/reference_architecture.png new file mode 100644 index 00000000000..1414200d076 Binary files /dev/null and b/doc/development/img/reference_architecture.png differ diff --git a/doc/development/img/sidekiq_longest_running_jobs.png b/doc/development/img/sidekiq_longest_running_jobs.png new file mode 100644 index 00000000000..73f74842a2f Binary files /dev/null and b/doc/development/img/sidekiq_longest_running_jobs.png differ diff --git a/doc/development/img/sidekiq_most_time_consuming_jobs.png b/doc/development/img/sidekiq_most_time_consuming_jobs.png new file mode 100644 index 00000000000..73f74842a2f Binary files /dev/null and b/doc/development/img/sidekiq_most_time_consuming_jobs.png differ diff --git a/doc/development/scalability.md b/doc/development/scalability.md new file mode 100644 index 00000000000..70a4cab39e2 --- /dev/null +++ b/doc/development/scalability.md @@ -0,0 +1,295 @@ +# GitLab scalability + +This section describes the current architecture of GitLab as it relates to +scalability and reliability. + +## Reference Architecture Overview + +![Reference Architecture Diagram](img/reference_architecture.png) + +_[diagram source - GitLab employees only](https://docs.google.com/drawings/d/1RTGtuoUrE0bDT-9smoHbFruhEMI4Ys6uNrufe5IA-VI/edit)_ + +The diagram above shows a GitLab reference architecture scaled up for 50,000 +users. We will discuss each component below. + +## Components + +### PostgreSQL + +The PostgreSQL database holds all metadata for projects, issues, merge +requests, users, etc. The schema is managed by the Rails application +[db/schema.rb](https://gitlab.com/gitlab-org/gitlab/blob/master/db/schema.rb). + +GitLab Web/API servers and Sidekiq nodes talk directly to the database via a +Rails object relational model (ORM). 
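+For illustration, the two access styles look roughly like this; the models,
+tables, and queries below are examples rather than code taken from the GitLab
+code base:
+
+```ruby
+# Typical access goes through the ActiveRecord ORM, which generates the SQL.
+recent_issues = Issue.where(project_id: 42).order(created_at: :desc).limit(20)
+
+# Hand-written SQL is used where the ORM is a poor fit, for example a
+# recursive CTE that walks a namespace hierarchy.
+rows = ActiveRecord::Base.connection.select_all(<<~SQL)
+  WITH RECURSIVE namespace_tree AS (
+    SELECT id, parent_id FROM namespaces WHERE id = 1
+    UNION
+    SELECT n.id, n.parent_id
+    FROM namespaces n
+    INNER JOIN namespace_tree t ON n.parent_id = t.id
+  )
+  SELECT id FROM namespace_tree
+SQL
+```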
Most SQL queries are accessed via this +ORM, although some custom SQL is also written for performance or for +exploiting advanced PostgreSQL features (e.g. recursive CTEs, LATERAL JOINs, +etc.). + +The application has a tight coupling to the database schema. When the +application starts, Rails queries the database schema, caching the tables and +column types for the data requested. Because of this schema cache, dropping a +column or table while the application is running can produce 500 errors for the +user. This is why we have a [process for dropping columns and other +no-downtime changes](what_requires_downtime.md). + +#### Multi-tenancy + +A single database is used to store all customer data. Each user can belong to +many groups or projects, and the access level (e.g. guest, developer, +maintainer, etc.) to groups and projects determines what users can see and +what they can access. + +Users with admin access can access all projects and even impersonate +users. + +#### Sharding and partitioning + +The database is not divided up in any way; currently all data lives in +one database in many different tables. This works for simple +applications, but as the data set grows, it becomes more challenging to +maintain and support one database with many large tables. + +There are two ways to deal with this: + +- Partitioning. Split up table data locally within one database. +- Sharding. Distribute data across multiple databases. + +Partitioning is a built-in PostgreSQL feature and requires minimal changes +in the application. However, it [requires PostgreSQL +11](https://www.2ndquadrant.com/en/blog/partitioning-evolution-postgresql-11/). + +A natural way to partition is to [partition tables by +dates](https://gitlab.com/groups/gitlab-org/-/epics/2023). For example, +the `events` and `audit_events` tables are natural candidates for this +kind of partitioning (see the migration sketch at the end of this section). + +Sharding is likely more difficult and will require significant changes +to the schema and application. For example, if we have to store projects +in many different databases, we immediately run into the question, "How +can we retrieve data across different projects?" One answer to this is +to abstract data access into API calls that hide the database from +the application, but this is a significant amount of work. + +There are solutions that may help abstract the sharding to some extent +from the application. For example, we will want to look at [Citus +Data](https://www.citusdata.com/product/community) closely. Citus Data +provides a Rails plugin that adds a [tenant ID to ActiveRecord +models](https://www.citusdata.com/blog/2017/01/05/easily-scale-out-multi-tenant-apps/). + +Sharding can also be done based on feature verticals. This is the +microservice approach to sharding, where each service represents a +bounded context and operates on its own service-specific database +cluster. In that model, data wouldn't be distributed according to some +internal key (such as tenant IDs) but based on team and product +ownership. It shares a lot of challenges with traditional, data-oriented +sharding, however. For instance, joining data has to happen in the +application itself rather than on the query layer (although additional +layers like GraphQL might mitigate that), and it requires true +parallelism to run efficiently (i.e. a scatter-gather model to collect, +then zip up data records), which is a challenge in itself in Ruby-based +systems.
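+Returning to the date-partitioning idea above, here is a rough sketch of what
+a PostgreSQL 11 range partition could look like when created from a Rails
+migration. The table layout, names, and migration version are illustrative
+assumptions, not an implementation plan:
+
+```ruby
+# frozen_string_literal: true
+
+class PartitionAuditEventsByDate < ActiveRecord::Migration[5.2]
+  def up
+    execute <<~SQL
+      -- The parent table declares the partition key; on partitioned tables
+      -- the primary key must include that key.
+      CREATE TABLE audit_events_part (
+        id bigserial,
+        author_id integer NOT NULL,
+        created_at timestamptz NOT NULL,
+        PRIMARY KEY (id, created_at)
+      ) PARTITION BY RANGE (created_at);
+
+      -- One partition per month; new partitions have to be created ahead of
+      -- time, e.g. from a scheduled background job.
+      CREATE TABLE audit_events_part_201912 PARTITION OF audit_events_part
+        FOR VALUES FROM ('2019-12-01') TO ('2020-01-01');
+    SQL
+  end
+
+  def down
+    execute 'DROP TABLE audit_events_part;'
+  end
+end
+```
+
+The main operational cost of this approach is partition management: a
+partition must exist before rows for its date range arrive.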
+ +#### Database size + +A recent [database checkup shows a breakdown of the table sizes on +GitLab.com](https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8022#master-1022016101-8). +Since `merge_request_diff_files` contains over 1 TB of data, we will want to +reduce/eliminate this table first. GitLab has support for [storing diffs in +object storage](../administration/merge_request_diffs.md), which we [will +want to do on +GitLab.com](https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7356). + +#### High availability + +There are several strategies to provide high availability and redundancy: + +1. Write-ahead logs (WAL) streamed to object storage (e.g. S3, Google Cloud + Storage). +1. Read replicas (hot backups). +1. Delayed replicas. + +To restore a database to a point in time, a base backup needs to have +been taken prior to that incident. Once a database has been restored from +that backup, the database can apply the WAL in order until it has +reached the target time. + +On GitLab.com, Consul and Patroni work together to coordinate failovers with +the read replicas. [Omnibus ships with repmgr instead of +Patroni](../administration/high_availability/database.md). + +#### Load-balancing + +GitLab EE has [application support for load balancing using read +replicas](../administration/database_load_balancing.md). This load +balancer does some smart things that are not traditionally available in +standard load balancers. For example, the application will only consider a +replica if its replication lag is low (e.g. WAL data behind by < 100 +megabytes). + +More [details are in a blog +post](https://about.gitlab.com/2017/10/02/scaling-the-gitlab-database/). + +### PgBouncer + +As PostgreSQL forks a backend process for each connection, PostgreSQL has a +finite limit of connections that it can support, typically around 300 by +default. Without a connection pooler like PgBouncer, it's quite possible to +hit connection limits. Once the limits are reached, GitLab generates +errors or slows down as it waits for a connection to become available. + +#### High availability + +PgBouncer is a single-threaded process. Under heavy traffic, PgBouncer can +saturate a single core, which can result in slower response times for +background jobs and/or Web requests. There are two ways to address this +limitation: + +1. Run multiple PgBouncer instances. +1. Use a multi-threaded connection pooler (e.g. + [Odyssey](https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7776)). + +On some Linux systems, it's possible to run [multiple PgBouncer instances on +the same port](https://gitlab.com/gitlab-org/omnibus-gitlab/issues/4796). + +On GitLab.com, we run multiple PgBouncer instances on different ports to +avoid saturating a single core. + +In addition, the PgBouncer instances that communicate with the primary +and secondaries are set up a bit differently: + +1. Multiple PgBouncer instances in different availability zones talk to the + PostgreSQL primary. +1. Multiple PgBouncer processes are colocated with PostgreSQL read replicas. + +For replicas, colocating is advantageous because it reduces network hops +and hence latency. However, for the primary, colocating is +disadvantageous because PgBouncer would become a single point of failure +and cause errors. When a failover occurs, one of two things could +happen: + +- The primary disappears from the network. +- The primary becomes a replica.
+ +In the first case, if PgBouncer is colocated with the primary, database +connections would time out or fail to connect, and downtime would +occur. Having multiple PgBouncer instances in front of a load balancer +talking to the primary can mitigate this. + +In the second case, existing connections to the newly-demoted primary (now a +replica) may execute a write query, which would fail. During a failover, it may +be advantageous to shut down the PgBouncer talking to the primary to +ensure no more traffic arrives for it. The alternative would be to make +the application aware of the failover event and terminate its +connections gracefully. + +### Redis + +There are three ways Redis is used in GitLab: + +- Queues. Sidekiq marshals jobs into JSON payloads. +- Persistent state. Session data, exclusive leases, etc. +- Cache. Repository data (e.g. branch and tag names), view partials, etc. + +For GitLab instances running at scale, splitting Redis usage into +separate Redis clusters helps for two reasons: + +- Each has different persistence requirements. +- Load isolation. + +For example, the cache instance can behave like a least-recently used +(LRU) cache by setting the `maxmemory` configuration option. That option +should not be set for the queues or persistent clusters because data +would be evicted from memory at random times. This would cause jobs to +be dropped on the floor, which would cause many problems (e.g. merges +not running, builds not updating, etc.). + +Sidekiq also polls its queues quite frequently, and this activity can +slow down other queries. For this reason, having a dedicated Redis +cluster for Sidekiq can help improve performance and reduce load on the +Redis process. + +#### High availability/Risks + +1. Single-core: Like PgBouncer, a single Redis process can only use one +core. It does not support multi-threading. + +1. Dumb secondaries: Redis secondaries (aka slaves) don't actually +handle any load. Unlike PostgreSQL secondaries, they don't even serve +read queries. They simply replicate data from the primary and take over +only when the primary fails. + +### Redis Sentinels + +[Redis Sentinel](https://redis.io/topics/sentinel) provides high +availability for Redis by watching the primary. If multiple Sentinels +detect that the primary has gone away, the Sentinels perform an +election to determine a new leader. + +#### Failure Modes + +No leader: A Redis cluster can get into a mode where there are no +primaries. For example, this can happen if Redis nodes are misconfigured +to follow the wrong node. Sometimes this requires forcing one node to +become a primary via the [`SLAVEOF NO ONE` +command](https://redis.io/commands/slaveof). + +### Sidekiq + +Sidekiq is a multi-threaded background job processing system used in +Ruby on Rails applications. In GitLab, Sidekiq performs the heavy +lifting of many activities, including: + +1. Updating merge requests after a push +1. Sending e-mails +1. Updating user authorizations +1. Processing CI builds and pipelines + +The full list of jobs can be found in the +[app/workers](https://gitlab.com/gitlab-org/gitlab/tree/master/app/workers) +and +[ee/app/workers](https://gitlab.com/gitlab-org/gitlab/tree/master/ee/app/workers) +directories in the GitLab code base. + +#### Runaway Queues + +As jobs are added to the Sidekiq queue, Sidekiq worker threads need to +pull these jobs from the queue and finish them at a rate faster than +they are added. When an imbalance occurs (e.g.
delays in the database, +slow jobs, etc.), Sidekiq queues can balloon and lead to runaway queues. + +In recent months, many of these queues have ballooned due to delays in +PostgreSQL, PgBouncer, and Redis. For example, PgBouncer saturation can +cause jobs to wait a few seconds before obtaining a database connection, +which can cascade into a large slowdown. Optimizing these basic +interconnections comes first. + +However, there are a number of strategies to ensure queues get drained +in a timely manner: + +- Add more processing capacity. This can be done by spinning up more + instances of Sidekiq or [Sidekiq Cluster](../administration/operations/extra_sidekiq_processes.md). +- Split jobs into smaller units of work. For example, `PostReceive` + used to process each commit message in the push, but now it farms + this out to `ProcessCommitWorker`. +- Redistribute/gerrymander Sidekiq processes by queue + types. Long-running jobs (e.g. relating to project import) can often + squeeze out jobs that run fast (e.g. delivering e-mail). [This technique + was used to optimize our existing Sidekiq deployment](https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7219#note_218019483). +- Optimize jobs. Eliminating unnecessary work, reducing network calls + (e.g. SQL, Gitaly, etc.), and optimizing processor time can yield significant + benefits. + +From the Sidekiq logs, it's possible to see which jobs run the most +frequently and/or take the longest. For example, this Kibana +visualization shows the jobs that consume the most total time: + +![Most time-consuming Sidekiq jobs](img/sidekiq_most_time_consuming_jobs.png) + +_[visualization source - GitLab employees only](https://log.gitlab.net/goto/2c036582dfc3219eeaa49a76eab2564b)_ + +This one shows the jobs that had the longest durations: + +![Longest running Sidekiq jobs](img/sidekiq_longest_running_jobs.png) + +_[visualization source - GitLab employees only](https://log.gitlab.net/goto/494f6c8afb61d98c4ff264520d184416)_ diff --git a/doc/development/what_requires_downtime.md b/doc/development/what_requires_downtime.md index 841a05d8e61..a25d065f735 100644 --- a/doc/development/what_requires_downtime.md +++ b/doc/development/what_requires_downtime.md @@ -377,6 +377,11 @@ This operation is safe as there's no code using the table just yet. Dropping tables can be done safely using a post-deployment migration, but only if the application no longer uses the table. +## Renaming Tables + +Renaming tables requires downtime as an application may continue +using the old table name during/after a database migration. + ## Adding Foreign Keys Adding foreign keys usually works in 3 steps: diff --git a/doc/user/instance_statistics/convdev.md b/doc/user/instance_statistics/convdev.md new file mode 100644 index 00000000000..a1a4eca191c --- /dev/null +++ b/doc/user/instance_statistics/convdev.md @@ -0,0 +1,5 @@ +--- +redirect_to: '../instance_statistics/dev_ops_score.md' +--- + +This document was moved to [another location](../instance_statistics/dev_ops_score.md). diff --git a/doc/user/instance_statistics/dev_ops_score.md b/doc/user/instance_statistics/dev_ops_score.md index 68c5bb48c3c..a68f2a1e92a 100644 --- a/doc/user/instance_statistics/dev_ops_score.md +++ b/doc/user/instance_statistics/dev_ops_score.md @@ -1,5 +1,6 @@ # DevOps Score +> [Introduced](https://gitlab.com/gitlab-org/gitlab-foss/issues/30469) in GitLab 9.3. > [Renamed from Conversational Development Index](https://gitlab.com/gitlab-org/gitlab/issues/20976) in GitLab 12.6.
NOTE: **Note:** diff --git a/qa/qa/page/base.rb b/qa/qa/page/base.rb index a4c44f78ad4..f07d56e85c3 100644 --- a/qa/qa/page/base.rb +++ b/qa/qa/page/base.rb @@ -177,7 +177,7 @@ module QA # The number of selectors should be able to be reduced after # migration to the new spinner is complete. # https://gitlab.com/groups/gitlab-org/-/epics/956 - has_no_css?('.gl-spinner, .fa-spinner, .spinner', wait: Capybara.default_max_wait_time) + has_no_css?('.gl-spinner, .fa-spinner, .spinner', wait: QA::Support::Repeater::DEFAULT_MAX_WAIT_TIME) end def finished_loading_block? diff --git a/qa/qa/page/merge_request/show.rb b/qa/qa/page/merge_request/show.rb index ad5b3c97cb9..940b7f332c7 100644 --- a/qa/qa/page/merge_request/show.rb +++ b/qa/qa/page/merge_request/show.rb @@ -155,6 +155,8 @@ module QA def merge! click_element :merge_button if ready_to_merge? + finished_loading? + raise "Merge did not appear to be successful" unless merged? end diff --git a/qa/qa/resource/api_fabricator.rb b/qa/qa/resource/api_fabricator.rb index e6057433b55..3c06f139738 100644 --- a/qa/qa/resource/api_fabricator.rb +++ b/qa/qa/resource/api_fabricator.rb @@ -31,6 +31,12 @@ module QA resource_web_url(api_post) end + def reload! + api_get + + self + end + def remove_via_api! api_delete end diff --git a/qa/qa/resource/group.rb b/qa/qa/resource/group.rb index 0824512d238..a30bb8cbc77 100644 --- a/qa/qa/resource/group.rb +++ b/qa/qa/resource/group.rb @@ -16,6 +16,7 @@ module QA attribute :id attribute :name + attribute :runners_token def initialize @path = Runtime::Namespace.name diff --git a/scripts/flaky_examples/prune-old-flaky-examples b/scripts/flaky_examples/prune-old-flaky-examples index 7700b93438b..4df49c6d8fa 100755 --- a/scripts/flaky_examples/prune-old-flaky-examples +++ b/scripts/flaky_examples/prune-old-flaky-examples @@ -7,7 +7,7 @@ require 'rubygems' # In newer Ruby, alias_method is not private then we don't need __send__ singleton_class.__send__(:alias_method, :require_dependency, :require) # rubocop:disable GitlabSecurity/PublicSend -$:.unshift(File.expand_path('../lib', __dir__)) +$:.unshift(File.expand_path('../../lib', __dir__)) require 'rspec_flaky/report'