Monitoring Gitaly and Gitaly Cluster

You can use the available logs and Prometheus metrics to monitor Gitaly and Gitaly Cluster (Praefect).

Metric definitions are available:

  • Directly from the Prometheus /metrics endpoint configured for Gitaly.
  • Using Grafana Explore on a Grafana instance configured against Prometheus.

Monitor Gitaly rate limiting

Gitaly can be configured to limit requests based on:

  • Concurrency of requests.
  • A rate limit.

Monitor Gitaly request limiting with the gitaly_requests_dropped_total Prometheus metric. This metric provides a total count of requests dropped due to request limiting. The reason label indicates why a request was dropped:

  • rate, due to rate limiting.
  • max_size, because the concurrency queue size was reached.
  • max_time, because the request exceeded the maximum queue wait time as configured in Gitaly.
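
As a sketch of how the reason label can be used, the following query breaks down the rate of dropped requests by cause. The 5-minute window is an arbitrary choice for illustration:

```promql
# Per-second rate of requests dropped by request limiting, split by the
# reason label (rate, max_size, or max_time).
sum(rate(gitaly_requests_dropped_total[5m])) by (reason)
```

A sustained non-zero rate for any reason suggests that clients are hitting the configured limits.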

Monitor Gitaly concurrency limiting

You can observe specific behavior of concurrency-queued requests using the Gitaly logs and Prometheus:

  • In the Gitaly logs, look for the string (or structured log field) acquire_ms. Messages that have this field report on the concurrency limiter.
  • In Prometheus, look for the following metrics:
    • gitaly_concurrency_limiting_in_progress indicates how many concurrent requests are being processed.
    • gitaly_concurrency_limiting_queued indicates how many requests for an RPC for a given repository are waiting due to the concurrency limit being reached.
    • gitaly_concurrency_limiting_acquiring_seconds indicates how long a request has to wait due to concurrency limits before being processed.
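
The following queries are a minimal sketch of how these metrics might be used. They assume that the first two metrics are gauges and that gitaly_concurrency_limiting_acquiring_seconds is exported as a histogram (so a _bucket series exists). Check the /metrics endpoint of your Gitaly version before relying on them. Run each query separately:

```promql
# Requests currently being processed under the concurrency limiter.
sum(gitaly_concurrency_limiting_in_progress)

# Requests currently waiting because the concurrency limit was reached.
sum(gitaly_concurrency_limiting_queued)

# ASSUMPTION: the metric is a histogram with an le label.
# Approximate 95th percentile of the time requests wait in the queue.
histogram_quantile(0.95, sum(rate(gitaly_concurrency_limiting_acquiring_seconds_bucket[5m])) by (le))
```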

Monitor Gitaly cgroups

You can observe the status of control groups (cgroups) using Prometheus:

  • gitaly_cgroups_reclaim_attempts_total, a gauge for the total number of times there has been a memory reclaim attempt. This number resets each time a server is restarted.
  • gitaly_cgroups_cpu_usage, a gauge that measures CPU usage per cgroup.
  • gitaly_cgroup_procs_total, a gauge that measures the total number of processes Gitaly has spawned under the control of cgroups.
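
As an illustration, the following queries surface the busiest cgroups and the overall process count. The choice of five series is arbitrary, and the queries assume each cgroup is exported as its own time series. Run each query separately:

```promql
# The five cgroups with the highest current CPU usage.
topk(5, gitaly_cgroups_cpu_usage)

# Total number of processes Gitaly has spawned under cgroup control.
sum(gitaly_cgroup_procs_total)
```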

pack-objects cache

The following pack-objects cache metrics are available:

  • gitaly_pack_objects_cache_enabled, a gauge set to 1 when the cache is enabled. Available labels: dir and max_age.
  • gitaly_pack_objects_cache_lookups_total, a counter for cache lookups. Available label: result.
  • gitaly_pack_objects_generated_bytes_total, a counter for the number of bytes written into the cache.
  • gitaly_pack_objects_served_bytes_total, a counter for the number of bytes read from the cache.
  • gitaly_streamcache_filestore_disk_usage_bytes, a gauge for the total size of cache files. Available label: dir.
  • gitaly_streamcache_index_entries, a gauge for the number of entries in the cache. Available label: dir.

Some of these metrics start with gitaly_streamcache because they are generated by the streamcache internal library package in Gitaly.

Example:

gitaly_pack_objects_cache_enabled{dir="/var/opt/gitlab/git-data/repositories/+gitaly/PackObjectsCache",max_age="300"} 1
gitaly_pack_objects_cache_lookups_total{result="hit"} 2
gitaly_pack_objects_cache_lookups_total{result="miss"} 1
gitaly_pack_objects_generated_bytes_total 2.618649e+07
gitaly_pack_objects_served_bytes_total 7.855947e+07
gitaly_streamcache_filestore_disk_usage_bytes{dir="/var/opt/gitlab/git-data/repositories/+gitaly/PackObjectsCache"} 2.6200152e+07
gitaly_streamcache_filestore_removed_total{dir="/var/opt/gitlab/git-data/repositories/+gitaly/PackObjectsCache"} 1
gitaly_streamcache_index_entries{dir="/var/opt/gitlab/git-data/repositories/+gitaly/PackObjectsCache"} 1
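
To judge how effective the cache is, you can derive ratios from these counters. The following queries are a sketch. The 5-minute window is an arbitrary choice. Run each query separately:

```promql
# Fraction of cache lookups that were hits over the last 5 minutes.
sum(rate(gitaly_pack_objects_cache_lookups_total{result="hit"}[5m]))
/
sum(rate(gitaly_pack_objects_cache_lookups_total[5m]))

# Bytes served from the cache for each byte written into it.
sum(rate(gitaly_pack_objects_served_bytes_total[5m]))
/
sum(rate(gitaly_pack_objects_generated_bytes_total[5m]))
```

Applied to the cumulative totals in the example above (rather than to 5-minute rates), the hit ratio is 2 / (2 + 1) ≈ 0.67, and the served-to-generated ratio is 7.855947e+07 / 2.618649e+07 = 3.0.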

Useful queries

The following are useful queries for monitoring Gitaly:

  • Use the following Prometheus query to observe the type of connections Gitaly is serving in a production environment:

    sum(rate(gitaly_connections_total[5m])) by (type)
    
  • Use the following Prometheus query to monitor the authentication behavior of your GitLab installation:

    sum(rate(gitaly_authentications_total[5m])) by (enforced, status)
    

    In a system where authentication is configured correctly and where you have live traffic, you see something like this:

    {enforced="true",status="ok"}  4424.985419441742
    

    There may also be other numbers with rate 0, but you only have to take note of the non-zero numbers.

    The only non-zero number should have enforced="true",status="ok". If you have other non-zero numbers, something is wrong in your configuration.

    The status="ok" number reflects your current request rate. In the example above, Gitaly is handling about 4,400 requests per second.

  • Use the following Prometheus query to observe the Git protocol versions being used in a production environment:

    sum(rate(gitaly_git_protocol_requests_total[1m])) by (grpc_method,git_protocol,grpc_service)
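
Building on the authentication query above, the following sketch returns only the label combinations that should stay at zero in a correctly configured system. An empty result is the expected outcome. The label values come from the example output above. Treat the query as a starting point rather than a ready-made alert:

```promql
# Authentication traffic that is either not enforced or not successful.
# The "or" operator unions both selections; "> 0" hides idle series.
(
  sum by (enforced, status) (rate(gitaly_authentications_total{enforced!="true"}[5m]))
  or
  sum by (enforced, status) (rate(gitaly_authentications_total{status!="ok"}[5m]))
) > 0
```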
    

Monitor Gitaly Cluster

To monitor Gitaly Cluster (Praefect), you can use the following Prometheus metrics. Metrics can be scraped from two separate endpoints:

  • The default /metrics endpoint.
  • /db_metrics, which contains metrics that require database queries.

Default Prometheus /metrics endpoint

The following metrics are available from the /metrics endpoint:

  • gitaly_praefect_read_distribution, a counter to track distribution of reads. It has two labels:

    • virtual_storage.
    • storage.

    They reflect configuration defined for this instance of Praefect.

  • gitaly_praefect_replication_latency_bucket, a histogram measuring the amount of time it takes for replication to complete after the replication job starts. Available in GitLab 12.10 and later.

  • gitaly_praefect_replication_delay_bucket, a histogram measuring how much time passes between when the replication job is created and when it starts. Available in GitLab 12.10 and later.

  • gitaly_praefect_node_latency_bucket, a histogram measuring the latency in Gitaly returning health check information to Praefect. This indicates Praefect connection saturation. Available in GitLab 12.10 and later.

  • gitaly_praefect_connections_total, the total number of connections to Praefect. Introduced in GitLab 14.7.
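
The following queries sketch how these metrics might be used. The 5-minute window and the 95th percentile are arbitrary choices for illustration. Run each query separately:

```promql
# Read rate per Gitaly storage, to check how evenly reads are distributed.
sum(rate(gitaly_praefect_read_distribution[5m])) by (virtual_storage, storage)

# Approximate 95th percentile of the time a replication job takes to
# complete after it starts.
histogram_quantile(0.95, sum(rate(gitaly_praefect_replication_latency_bucket[5m])) by (le))

# Approximate 95th percentile of the time a replication job waits between
# being created and starting.
histogram_quantile(0.95, sum(rate(gitaly_praefect_replication_delay_bucket[5m])) by (le))
```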

To monitor strong consistency, you can use the following Prometheus metrics:

  • gitaly_praefect_transactions_total, the number of transactions created and voted on.
  • gitaly_praefect_subtransactions_per_transaction_total, the number of times nodes cast a vote for a single transaction. This can happen multiple times if multiple references are getting updated in a single transaction.
  • gitaly_praefect_voters_per_transaction_total, the number of Gitaly nodes taking part in a transaction.
  • gitaly_praefect_transactions_delay_seconds, the server-side delay introduced by waiting for the transaction to be committed.
  • gitaly_hook_transaction_voting_delay_seconds, the client-side delay introduced by waiting for the transaction to be committed.

To monitor the number of repositories that have no healthy, up-to-date replicas:

  • gitaly_praefect_unavailable_repositories

To monitor repository verification, use the following Prometheus metrics:

  • gitaly_praefect_verification_queue_depth, the total number of replicas pending verification. This metric is scraped from the database and is only available when Prometheus is scraping the database metrics.
  • gitaly_praefect_verification_jobs_dequeued_total, the number of verification jobs picked up by the worker.
  • gitaly_praefect_verification_jobs_completed_total, the number of verification jobs completed by the worker. The result label indicates the end result of the jobs:
    • valid indicates the expected replica existed on the storage.
    • invalid indicates the replica expected to exist did not exist on the storage.
    • error indicates the job failed and has to be retried.
  • gitaly_praefect_stale_verification_leases_released_total, the number of stale verification leases released.
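
As a sketch, the following queries track verification outcomes and backlog. The first assumes the completed-jobs metric is a counter, as its _total suffix suggests. The second requires Prometheus to scrape the database metrics. Run each query separately:

```promql
# Rate of completed verification jobs, split by outcome
# (valid, invalid, or error).
sum(rate(gitaly_praefect_verification_jobs_completed_total[5m])) by (result)

# Replicas currently pending verification.
sum(gitaly_praefect_verification_queue_depth)
```

A growing share of invalid or error results is worth investigating in the Praefect logs.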

You can also monitor the Praefect logs.

Database metrics /db_metrics endpoint

Introduced in GitLab 14.5.

The following metrics are available from the /db_metrics endpoint:

  • gitaly_praefect_unavailable_repositories, the number of repositories that have no healthy, up-to-date replicas.
  • gitaly_praefect_read_only_repositories, the number of repositories in read-only mode in a virtual storage. This metric is available for backwards compatibility reasons. gitaly_praefect_unavailable_repositories is more accurate.
  • gitaly_praefect_replication_queue_depth, the number of jobs in the replication queue.
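
The following queries are a minimal sketch of how the database metrics might be used. They assume Prometheus is configured to scrape the /db_metrics endpoint. Run each query separately:

```promql
# Returns a result only when at least one repository has no healthy,
# up-to-date replica. An empty result is the healthy state.
sum(gitaly_praefect_unavailable_repositories) > 0

# Number of jobs currently waiting in the replication queue.
sum(gitaly_praefect_replication_queue_depth)
```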