Merge branch '52767-more-chaos-for-gitlab' into 'master'
Add more chaos to GitLab

Closes #53362 and #52767

See merge request gitlab-org/gitlab-ce!22746

commit cd6450923f
5 changed files with 187 additions and 1 deletions
56
app/controllers/chaos_controller.rb
Normal file
@@ -0,0 +1,56 @@
# frozen_string_literal: true

class ChaosController < ActionController::Base
  before_action :validate_request

  def leakmem
    memory_mb = (params[:memory_mb]&.to_i || 100)
    duration_s = (params[:duration_s]&.to_i || 30).seconds

    start = Time.now
    retainer = []
    # Add `n` 1mb chunks of memory to the retainer array
    memory_mb.times { retainer << "x" * 1.megabyte }

    duration_taken = (Time.now - start).seconds
    Kernel.sleep duration_s - duration_taken if duration_s > duration_taken

    render text: "OK", content_type: 'text/plain'
  end

  def cpuspin
    duration_s = (params[:duration_s]&.to_i || 30).seconds
    end_time = Time.now + duration_s.seconds

    rand while Time.now < end_time

    render text: "OK", content_type: 'text/plain'
  end

  def sleep
    duration_s = (params[:duration_s]&.to_i || 30).seconds
    Kernel.sleep duration_s

    render text: "OK", content_type: 'text/plain'
  end

  def kill
    Process.kill("KILL", Process.pid)
  end

  private

  def validate_request
    secret = ENV['GITLAB_CHAOS_SECRET']
    # GITLAB_CHAOS_SECRET is required unless you're running in Development mode
    if !secret && !Rails.env.development?
      render text: "chaos misconfigured: please configure GITLAB_CHAOS_SECRET when using GITLAB_ENABLE_CHAOS_ENDPOINTS outside of a development environment", content_type: 'text/plain', status: 500
    end

    return unless secret

    unless request.headers["HTTP_X_CHAOS_SECRET"] == secret
      render text: "To experience chaos, please set X-Chaos-Secret header", content_type: 'text/plain', status: 401
    end
  end
end
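
The order of checks in `validate_request` is easy to misread, so here is a minimal sketch of the same decision table in plain Ruby, with no Rails dependency. The method name `chaos_gate` and its keyword arguments are hypothetical and not part of the commit; the return value stands in for the status the filter would render.

```ruby
# Hypothetical helper (not in the commit): mirrors the decision order of
# `validate_request`. Returns the HTTP status the filter would render, or
# nil when the request is allowed through to the action.
def chaos_gate(secret:, header:, development:)
  # No secret outside development: the endpoints are misconfigured.
  return 500 if secret.nil? && !development

  # No secret in development: requests pass without any header.
  return nil if secret.nil?

  # A secret is configured: the X-Chaos-Secret header must match it exactly.
  header == secret ? nil : 401
end

p chaos_gate(secret: nil, header: nil, development: false)
p chaos_gate(secret: nil, header: nil, development: true)
p chaos_gate(secret: 's3', header: 'wrong', development: true)
p chaos_gate(secret: 's3', header: 's3', development: false)
```

Under these assumptions, a configured secret is enforced in every environment, including development.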
5
changelogs/unreleased/52767-more-chaos-for-gitlab.yml
Normal file
@@ -0,0 +1,5 @@
---
title: Add endpoints for simulating certain failure modes in the application
merge_request: 22746
author:
type: other
@@ -82,6 +82,13 @@ Rails.application.routes.draw do

     draw :operations
     draw :instance_statistics
+
+    if ENV['GITLAB_ENABLE_CHAOS_ENDPOINTS']
+      get '/chaos/leakmem' => 'chaos#leakmem'
+      get '/chaos/cpuspin' => 'chaos#cpuspin'
+      get '/chaos/sleep' => 'chaos#sleep'
+      get '/chaos/kill' => 'chaos#kill'
+    end
   end

   concern :clusterable do
117
doc/development/chaos_endpoints.md
Normal file
@@ -0,0 +1,117 @@
# Generating chaos in a test GitLab instance

As [Werner Vogels](https://twitter.com/Werner), the CTO at Amazon Web Services, famously put it, **Everything fails, all the time**.

As a developer, it is as important to consider the failure modes in which your software will operate as it is to consider normal operation. Doing so can mean the difference between a minor hiccup leading to a scattering of `500` errors experienced by a tiny fraction of users and a full site outage that affects all users for an extended period.

To paraphrase [Tolstoy](https://en.wikipedia.org/wiki/Anna_Karenina_principle), _all happy servers are alike, but all failing servers are failing in their own way_. Luckily, there are ways we can attempt to simulate these failure modes, and the chaos endpoints are tools for assisting in this process.

Currently, there are four endpoints for simulating the following conditions:

- Slow requests.
- CPU-bound requests.
- Memory leaks.
- Unexpected process crashes.

## Enabling chaos endpoints

For obvious reasons, these endpoints are not enabled by default. They can be enabled by setting the `GITLAB_ENABLE_CHAOS_ENDPOINTS` environment variable to `1`. Because the routes are registered whenever the variable is set, any value (even `0`) enables them.

For example, if you're using the [GDK](https://gitlab.com/gitlab-org/gitlab-development-kit), this can be done with the following command:

```bash
GITLAB_ENABLE_CHAOS_ENDPOINTS=1 gdk run
```

## Securing the chaos endpoints

DANGER: **Danger:**
Secure access to the chaos endpoints using a secret token. This is recommended when enabling these endpoints locally and essential when running in a staging or other shared environment. You should not enable them in production unless you absolutely know what you're doing.

A secret token can be set through the `GITLAB_CHAOS_SECRET` environment variable. Outside a development environment, the endpoints respond with a `500` error until this variable is set. For example, when using the [GDK](https://gitlab.com/gitlab-org/gitlab-development-kit), this can be done with the following command:

```bash
GITLAB_ENABLE_CHAOS_ENDPOINTS=1 GITLAB_CHAOS_SECRET=secret gdk run
```

Replace `secret` with your own secret token.
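
If you need a token value, one option is to generate a random one. The following is a minimal sketch using Ruby's standard `securerandom` library; the helper name `generate_chaos_secret` and the 32-byte length are assumptions made for illustration, not requirements of the endpoints.

```ruby
require 'securerandom'

# Hypothetical helper: returns a random hex string that could be used as the
# value of GITLAB_CHAOS_SECRET. 32 random bytes become 64 hex characters.
def generate_chaos_secret(bytes = 32)
  SecureRandom.hex(bytes)
end

puts generate_chaos_secret
```

The printed value can then be passed as `GITLAB_CHAOS_SECRET` when starting the application and as the `X-Chaos-Secret` header when calling the endpoints.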

## Invoking chaos

Once you have enabled the chaos endpoints and restarted the application, you can start testing using the endpoints.
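
As an alternative to `curl`, the requests can be built from a script. The sketch below uses Ruby's standard `net/http` and only constructs the request, so it can be inspected without a running instance. The helper name `build_chaos_request`, its parameters, and the `http://localhost:3000` base URL are assumptions for illustration.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: builds (but does not send) a GET request for one of
# the chaos endpoints. `endpoint` is e.g. 'sleep' or 'leakmem'; `params`
# become the query string; `secret` fills the X-Chaos-Secret header.
def build_chaos_request(base_url, endpoint, params = {}, secret: nil)
  uri = URI.join(base_url, "/-/chaos/#{endpoint}")
  uri.query = URI.encode_www_form(params) unless params.empty?

  request = Net::HTTP::Get.new(uri)
  request['X-Chaos-Secret'] = secret if secret
  request
end

request = build_chaos_request('http://localhost:3000', 'sleep', { duration_s: 5 }, secret: 'secret')
puts request.path # prints "/-/chaos/sleep?duration_s=5"

# To send it against a running instance:
#   Net::HTTP.start(request.uri.host, request.uri.port) { |http| http.request(request) }
```

Building the query string with `URI.encode_www_form` avoids the shell quoting issues that `&` and `?` can cause on a command line.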

## Memory leaks

To simulate a memory leak in your application, use the `/-/chaos/leakmem` endpoint.

NOTE: **Note:**
The memory is not retained after the request finishes. Once the request has completed, the Ruby garbage collector will attempt to recover the memory.

```
GET /-/chaos/leakmem
GET /-/chaos/leakmem?memory_mb=1024
GET /-/chaos/leakmem?memory_mb=1024&duration_s=50
```

| Attribute    | Type    | Required | Description                                                                        |
| ------------ | ------- | -------- | ---------------------------------------------------------------------------------- |
| `memory_mb`  | integer | no       | How much memory, in MB, should be leaked. Defaults to 100MB.                       |
| `duration_s` | integer | no       | Minimum duration, in seconds, that the memory should be retained. Defaults to 30s. |

```bash
curl 'http://localhost:3000/-/chaos/leakmem?memory_mb=1024&duration_s=10' --header 'X-Chaos-Secret: secret'
```
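
The defaults in the table apply only when a parameter is absent. As a minimal sketch of why, the controller's `params[:x]&.to_i || default` idiom can be reproduced on a plain Hash; the helper name `int_param` is hypothetical, and string keys stand in for Rails request parameters.

```ruby
# Hypothetical helper reproducing the controller's
# `params[:x]&.to_i || default` idiom on a plain Hash of string values.
def int_param(params, key, default)
  # `&.` skips `to_i` when the key is missing, so `||` supplies the default.
  # A present value is always converted, and 0 is truthy in Ruby.
  params[key]&.to_i || default
end

p int_param({}, 'memory_mb', 100)                         # 100
p int_param({ 'memory_mb' => '1024' }, 'memory_mb', 100)  # 1024
p int_param({ 'memory_mb' => 'lots' }, 'memory_mb', 100)  # 0
p int_param({ 'memory_mb' => '' }, 'memory_mb', 100)      # 0
```

A non-numeric or empty value therefore becomes `0` and does not fall back to the default.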

## CPU spin

This endpoint attempts to fully utilise a single core, at 100%, for the given period.

Depending on your rack server setup, your request may time out after a predetermined period (normally 60 seconds).
If you're using Unicorn, this is done by killing the worker process.

```
GET /-/chaos/cpuspin
GET /-/chaos/cpuspin?duration_s=50
```

| Attribute    | Type    | Required | Description                                                            |
| ------------ | ------- | -------- | ---------------------------------------------------------------------- |
| `duration_s` | integer | no       | Duration, in seconds, that the core will be utilised. Defaults to 30s. |

```bash
curl 'http://localhost:3000/-/chaos/cpuspin?duration_s=60' --header 'X-Chaos-Secret: secret'
```

## Sleep

This endpoint is similar to the CPU Spin endpoint but simulates off-processor activity, such as network calls to backend services. It will sleep for a given duration.

As with the CPU Spin endpoint, this may lead to your request timing out if the duration exceeds the configured limit.

```
GET /-/chaos/sleep
GET /-/chaos/sleep?duration_s=50
```

| Attribute    | Type    | Required | Description                                                             |
| ------------ | ------- | -------- | ----------------------------------------------------------------------- |
| `duration_s` | integer | no       | Duration, in seconds, that the request will sleep for. Defaults to 30s. |

```bash
curl 'http://localhost:3000/-/chaos/sleep?duration_s=60' --header 'X-Chaos-Secret: secret'
```
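
To make the difference between the CPU spin and sleep endpoints concrete, the following minimal sketch compares wall-clock time with process CPU time for a busy loop and for `Kernel.sleep`. The `measure` and `spin` helpers are hypothetical names; `spin` copies the loop used by the CPU spin endpoint, and the 0.2 second duration is chosen only to keep the example quick.

```ruby
# Hypothetical helper: returns [wall_seconds, cpu_seconds] used by the block.
def measure
  wall_start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  cpu_start = Process.clock_gettime(Process::CLOCK_PROCESS_CPUTIME_ID)
  yield
  wall = Process.clock_gettime(Process::CLOCK_MONOTONIC) - wall_start
  cpu = Process.clock_gettime(Process::CLOCK_PROCESS_CPUTIME_ID) - cpu_start
  [wall, cpu]
end

# Busy loop: keeps one core occupied until the deadline passes.
def spin(seconds)
  end_time = Time.now + seconds
  rand while Time.now < end_time
end

spin_wall, spin_cpu = measure { spin(0.2) }
sleep_wall, sleep_cpu = measure { Kernel.sleep(0.2) }

printf("spin:  wall=%.2fs cpu=%.2fs\n", spin_wall, spin_cpu)
printf("sleep: wall=%.2fs cpu=%.2fs\n", sleep_wall, sleep_cpu)
```

Both calls take roughly the same wall-clock time, but only the busy loop should accumulate a comparable amount of CPU time.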

## Kill

This endpoint will simulate the unexpected death of a worker process using a `KILL` signal.

NOTE: **Note:**
Since this endpoint uses the `KILL` signal, the worker is not given a chance to clean up or shut down.

```
GET /-/chaos/kill
```

```bash
curl http://localhost:3000/-/chaos/kill --header 'X-Chaos-Secret: secret'
```
@@ -34,13 +34,14 @@ graphs/dashboards.

 ## Tooling

-GitLab provides built-in tools to aid the process of improving performance:
+GitLab provides built-in tools to help improve performance and availability:

 * [Profiling](profiling.md)
 * [Sherlock](profiling.md#sherlock)
 * [GitLab Performance Monitoring](../administration/monitoring/performance/index.md)
 * [Request Profiling](../administration/monitoring/performance/request_profiling.md)
 * [QueryRecoder](query_recorder.md) for preventing `N+1` regressions
+* [Chaos endpoints](chaos_endpoints.md) for testing failure scenarios. Intended mainly for testing availability.

 GitLab employees can use GitLab.com's performance monitoring systems located at
 <https://dashboards.gitlab.net>, this requires you to log in using your