stage | group | info |
---|---|---|
Create | Gitaly | To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments |
Monitoring Gitaly and Gitaly Cluster
You can use the available logs and Prometheus metrics to monitor Gitaly and Gitaly Cluster (Praefect).
Metric definitions are available:
- Directly from the Prometheus /metrics endpoint configured for Gitaly.
- Using Grafana Explore on a Grafana instance configured against Prometheus.
Monitor Gitaly rate limiting
Gitaly can be configured to limit requests based on:
- Concurrency of requests.
- A rate limit.
Monitor Gitaly request limiting with the gitaly_requests_dropped_total Prometheus metric. This metric provides a total count of requests dropped due to request limiting. The reason label indicates why a request was dropped:
- rate, due to rate limiting.
- max_size, because the concurrency queue size was reached.
- max_time, because the request exceeded the maximum queue wait time as configured in Gitaly.
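As a rough illustration of reading this metric, the following Python sketch sums gitaly_requests_dropped_total by its reason label from text scraped off a /metrics endpoint. The sample input is hypothetical: the grpc_method label, its values, and all numbers are assumptions made up for the example, and the parser only handles simple lines of the form shown.

```python
import re
from collections import defaultdict

# Matches simple exposition lines: name, optional {labels}, then a value.
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

# Hypothetical scrape output. The grpc_method label and every number here
# are assumptions for illustration, not values from a real Gitaly server.
SAMPLE = """\
gitaly_requests_dropped_total{grpc_method="A",reason="rate"} 3
gitaly_requests_dropped_total{grpc_method="B",reason="rate"} 2
gitaly_requests_dropped_total{grpc_method="A",reason="max_size"} 7
gitaly_requests_dropped_total{grpc_method="A",reason="max_time"} 1
some_other_metric 5
"""


def dropped_by_reason(text):
    """Sum gitaly_requests_dropped_total samples per value of the reason label."""
    totals = defaultdict(float)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if not match or match.group("name") != "gitaly_requests_dropped_total":
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        totals[labels.get("reason", "unknown")] += float(match.group("value"))
    return dict(totals)


print(dropped_by_reason(SAMPLE))
```

Because the metric is a running total, a single reading mostly shows which limit is being hit; comparing two readings taken some time apart shows how quickly requests are being dropped.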
Monitor Gitaly concurrency limiting
You can observe specific behavior of concurrency-queued requests using the Gitaly logs and Prometheus:
- In the Gitaly logs, look for the string (or structured log field) acquire_ms. Messages that have this field are reporting about the concurrency limiter.
- In Prometheus, look for the following metrics:
  - gitaly_concurrency_limiting_in_progress indicates how many concurrent requests are being processed.
  - gitaly_concurrency_limiting_queued indicates how many requests for an RPC for a given repository are waiting due to the concurrency limit being reached.
  - gitaly_concurrency_limiting_acquiring_seconds indicates how long a request has to wait due to concurrency limits before being processed.
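To make the log-based approach concrete, here is a minimal Python sketch that collects acquire_ms values from log lines. It assumes the logs are written as one JSON object per line; the sample entries, including the level and msg fields and all values, are hypothetical.

```python
import json


def acquire_waits(log_lines):
    """Return the acquire_ms value of every log entry that carries the field."""
    waits = []
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore lines that are not JSON
        if isinstance(entry, dict) and "acquire_ms" in entry:
            waits.append(float(entry["acquire_ms"]))
    return waits


# Hypothetical log lines; only the acquire_ms field name comes from the text above.
SAMPLE_LOGS = [
    '{"level":"info","msg":"example entry","acquire_ms":12}',
    '{"level":"info","msg":"example entry without the field"}',
    "plain text line",
    '{"level":"info","msg":"example entry","acquire_ms":250}',
]

waits = acquire_waits(SAMPLE_LOGS)
print("entries:", len(waits), "max wait (ms):", max(waits))
```

If the logs are plain text rather than JSON, searching for the literal string acquire_ms serves the same purpose.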
Monitor Gitaly cgroups
You can observe the status of control groups (cgroups) using Prometheus:
- gitaly_cgroups_memory_failed_total, a gauge for the total number of times the memory limit has been hit. This number resets each time a server is restarted.
- gitaly_cgroups_cpu_usage, a gauge that measures CPU usage per cgroup.
- gitaly_cgroup_procs_total, a gauge that measures the total number of processes Gitaly has spawned under the control of cgroups.
pack-objects cache
The following pack-objects cache metrics are available:
- gitaly_pack_objects_cache_enabled, a gauge set to 1 when the cache is enabled. Available labels: dir and max_age.
- gitaly_pack_objects_cache_lookups_total, a counter for cache lookups. Available label: result.
- gitaly_pack_objects_generated_bytes_total, a counter for the number of bytes written into the cache.
- gitaly_pack_objects_served_bytes_total, a counter for the number of bytes read from the cache.
- gitaly_streamcache_filestore_disk_usage_bytes, a gauge for the total size of cache files. Available label: dir.
- gitaly_streamcache_index_entries, a gauge for the number of entries in the cache. Available label: dir.
Some of these metrics start with gitaly_streamcache because they are generated by the streamcache internal library package in Gitaly.
Example:
gitaly_pack_objects_cache_enabled{dir="/var/opt/gitlab/git-data/repositories/+gitaly/PackObjectsCache",max_age="300"} 1
gitaly_pack_objects_cache_lookups_total{result="hit"} 2
gitaly_pack_objects_cache_lookups_total{result="miss"} 1
gitaly_pack_objects_generated_bytes_total 2.618649e+07
gitaly_pack_objects_served_bytes_total 7.855947e+07
gitaly_streamcache_filestore_disk_usage_bytes{dir="/var/opt/gitlab/git-data/repositories/+gitaly/PackObjectsCache"} 2.6200152e+07
gitaly_streamcache_filestore_removed_total{dir="/var/opt/gitlab/git-data/repositories/+gitaly/PackObjectsCache"} 1
gitaly_streamcache_index_entries{dir="/var/opt/gitlab/git-data/repositories/+gitaly/PackObjectsCache"} 1
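One way to read these numbers is to derive a hit ratio and a served-to-generated ratio from them. The Python sketch below does this for four of the sample lines above; the parsing helper is an illustrative assumption and only handles lines whose value follows the last space.

```python
# Four lines copied from the example output above.
EXAMPLE = """\
gitaly_pack_objects_cache_lookups_total{result="hit"} 2
gitaly_pack_objects_cache_lookups_total{result="miss"} 1
gitaly_pack_objects_generated_bytes_total 2.618649e+07
gitaly_pack_objects_served_bytes_total 7.855947e+07
"""


def parse_samples(text):
    """Map each series (name plus labels) to its float value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is whatever follows the last space on the line.
        series, value = line.rsplit(" ", 1)
        samples[series] = float(value)
    return samples


samples = parse_samples(EXAMPLE)
hits = samples['gitaly_pack_objects_cache_lookups_total{result="hit"}']
misses = samples['gitaly_pack_objects_cache_lookups_total{result="miss"}']
hit_ratio = hits / (hits + misses)
served_per_generated = (
    samples["gitaly_pack_objects_served_bytes_total"]
    / samples["gitaly_pack_objects_generated_bytes_total"]
)

print(f"hit ratio: {hit_ratio:.2f}")
print(f"bytes served per byte generated: {served_per_generated:.2f}")
```

In this sample, two of three lookups were hits, and about three times as many bytes were read from the cache as were written into it.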
Useful queries
The following are useful queries for monitoring Gitaly:
- Use the following Prometheus query to observe the type of connections Gitaly is serving in a production environment:

  sum(rate(gitaly_connections_total[5m])) by (type)

- Use the following Prometheus query to monitor the authentication behavior of your GitLab installation:

  sum(rate(gitaly_authentications_total[5m])) by (enforced, status)

  In a system where authentication is configured correctly and where you have live traffic, you see something like this:

  {enforced="true",status="ok"} 4424.985419441742

  There may also be other numbers with rate 0, but you only have to take note of the non-zero numbers.

  The only non-zero number should have enforced="true",status="ok". If you have other non-zero numbers, something is wrong in your configuration.

  The status="ok" number reflects your current request rate. In the example above, Gitaly is handling about 4000 requests per second.

- Use the following Prometheus query to observe the Git protocol versions being used in a production environment:

  sum(rate(gitaly_git_protocol_requests_total[1m])) by (grpc_method,git_protocol,grpc_service)
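The authentication check described above can be automated. The following Python sketch builds a URL for the Prometheus HTTP API instant-query endpoint (/api/v1/query) and flags any non-zero series other than enforced="true",status="ok". The base URL http://localhost:9090 and the sample result, including the status="denied" series, are assumptions for illustration; no network request is made.

```python
from urllib.parse import urlencode

AUTH_QUERY = "sum(rate(gitaly_authentications_total[5m])) by (enforced, status)"


def instant_query_url(base_url, query):
    """Build a URL for the Prometheus HTTP API instant-query endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': query})}"


def unexpected_auth_series(result):
    """Return the label sets of non-zero series other than enforced=true,status=ok.

    `result` is the data.result list of an instant-query response, where each
    item looks like {"metric": {...labels...}, "value": [timestamp, "number"]}.
    """
    unexpected = []
    for series in result:
        labels = series["metric"]
        rate = float(series["value"][1])
        healthy = labels.get("enforced") == "true" and labels.get("status") == "ok"
        if rate > 0 and not healthy:
            unexpected.append(labels)
    return unexpected


# Hypothetical response data. The first value is the one from the example
# above; the second series and its labels are made up for illustration.
SAMPLE_RESULT = [
    {"metric": {"enforced": "true", "status": "ok"}, "value": [0, "4424.985419441742"]},
    {"metric": {"enforced": "true", "status": "denied"}, "value": [0, "0"]},
]

# Assumed Prometheus address; adjust to your environment.
print(instant_query_url("http://localhost:9090", AUTH_QUERY))
print("unexpected series:", unexpected_auth_series(SAMPLE_RESULT))
```

An empty list means the only non-zero series is the expected one. Any label set in the list points at a configuration problem to investigate.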
Monitor Gitaly Cluster
To monitor Gitaly Cluster (Praefect), you can use these Prometheus metrics. There are two separate metrics endpoints from which metrics can be scraped:
- The default /metrics endpoint.
- /db_metrics, which contains metrics that require database queries.
Default Prometheus /metrics endpoint
The following metrics are available from the /metrics endpoint:
- gitaly_praefect_read_distribution, a counter to track distribution of reads. It has two labels:
  - virtual_storage.
  - storage.

  They reflect configuration defined for this instance of Praefect.
- gitaly_praefect_replication_latency_bucket, a histogram measuring the amount of time it takes for replication to complete after the replication job starts. Available in GitLab 12.10 and later.
- gitaly_praefect_replication_delay_bucket, a histogram measuring how much time passes between when the replication job is created and when it starts. Available in GitLab 12.10 and later.
- gitaly_praefect_node_latency_bucket, a histogram measuring the latency in Gitaly returning health check information to Praefect. This indicates Praefect connection saturation. Available in GitLab 12.10 and later.
To monitor strong consistency, you can use the following Prometheus metrics:
- gitaly_praefect_transactions_total, the number of transactions created and voted on.
- gitaly_praefect_subtransactions_per_transaction_total, the number of times nodes cast a vote for a single transaction. This can happen multiple times if multiple references are getting updated in a single transaction.
- gitaly_praefect_voters_per_transaction_total, the number of Gitaly nodes taking part in a transaction.
- gitaly_praefect_transactions_delay_seconds, the server-side delay introduced by waiting for the transaction to be committed.
- gitaly_hook_transaction_voting_delay_seconds, the client-side delay introduced by waiting for the transaction to be committed.
To monitor the number of repositories that have no healthy, up-to-date replicas:
gitaly_praefect_unavailable_repositories
To monitor repository verification, use the following Prometheus metrics:
- gitaly_praefect_verification_queue_depth, the total number of replicas pending verification. This metric is scraped from the database and is only available when Prometheus is scraping the database metrics.
- gitaly_praefect_verification_jobs_dequeued_total, the number of verification jobs picked up by the worker.
- gitaly_praefect_verification_jobs_completed_total, the number of verification jobs completed by the worker. The result label indicates the end result of the jobs:
  - valid indicates the expected replica existed on the storage.
  - invalid indicates the replica expected to exist did not exist on the storage.
  - error indicates the job failed and has to be retried.
- gitaly_praefect_stale_verification_leases_released_total, the number of stale verification leases released.
You can also monitor the Praefect logs.
Database metrics /db_metrics endpoint
Introduced in GitLab 14.5.
The following metrics are available from the /db_metrics endpoint:
- gitaly_praefect_unavailable_repositories, the number of repositories that have no healthy, up-to-date replicas.
- gitaly_praefect_read_only_repositories, the number of repositories in read-only mode in a virtual storage. This metric is available for backwards compatibility reasons. gitaly_praefect_unavailable_repositories is more accurate.
- gitaly_praefect_replication_queue_depth, the number of jobs in the replication queue.