331 lines
15 KiB
Markdown
331 lines
15 KiB
Markdown
---
|
|
stage: Manage
|
|
group: Optimize
|
|
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
|
|
---
|
|
|
|
# Aggregated Value Stream Analytics
|
|
|
|
> - [Introduced](https://gitlab.com/gitlab-org/gitlab/-/issues/335391) in GitLab 14.7.
|
|
|
|
DISCLAIMER:
|
|
This page contains information related to upcoming products, features, and functionality.
|
|
It is important to note that the information presented is for informational purposes only.
|
|
Please do not rely on this information for purchasing or planning purposes.
|
|
As with all projects, the items mentioned on this page are subject to change or delay.
|
|
The development, release, and timing of any products, features, or functionality remain at the
|
|
sole discretion of GitLab Inc.
|
|
|
|
This page provides a high-level overview of the aggregated backend for
|
|
Value Stream Analytics (VSA).
|
|
|
|
## Current Status
|
|
|
|
As of 14.8 the aggregated VSA backend is used only in the `gitlab-org` group, for testing purposes
|
|
. We plan to gradually roll it out in the next major release (15.0) for the rest of the groups.
|
|
|
|
## Motivation
|
|
|
|
The aggregated backend aims to solve the performance limitations of the VSA feature and set it up
|
|
for long-term growth.
|
|
|
|
Our main database is not prepared for analytical workloads. Executing long-running queries can
|
|
affect the reliability of the application. For large groups, the current
|
|
implementation (old backend) is slow and, in some cases, doesn't even load due to the configured
|
|
statement timeout (15s).
|
|
|
|
The database queries in the old backend use the core domain models directly through
|
|
`IssuableFinders` classes: ([MergeRequestsFinder](https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/finders/merge_requests_finder.rb) and [IssuesFinder](https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/finders/issues_finder.rb)).
|
|
With the requested change of the [date range filters](https://gitlab.com/groups/gitlab-org/-/epics/6046),
|
|
this approach was no longer viable from the performance point of view.
|
|
|
|
Benefits of the aggregated VSA backend:
|
|
|
|
- Simpler database queries (fewer JOINs).
|
|
- Faster aggregations, only a single table is accessed.
|
|
- Possibility to introduce further aggregations for improving the first page load time.
|
|
- Better performance for large groups (with many sub-groups, projects, issues and, merge requests).
|
|
- Ready for database decomposition. The VSA related database tables could live in a separate
|
|
database with a minimal development effort.
|
|
- Ready for keyset pagination which can be useful for exporting the data.
|
|
- Possibility to implement more complex event definitions.
|
|
- For example, the start event can be two timestamp columns where the earliest value would be
|
|
used by the system.
|
|
- Example: `MIN(issues.created_at, issues.updated_at)`
|
|
|
|
## How does Value Stream Analytics work?
|
|
|
|
Value Stream Analytics calculates the duration between two timestamp columns or timestamp
|
|
expressions and runs various aggregations on the data.
|
|
|
|
Examples:
|
|
|
|
- Duration between the Merge Request creation time and Merge Request merge time.
|
|
- Duration between the Issue creation time and Issue close time.
|
|
|
|
This duration is exposed in various ways:
|
|
|
|
- Aggregation: median, average
|
|
- Listing: list the duration for individual Merge Request and Issue records
|
|
|
|
Apart from the durations, we expose the record count within a stage.
|
|
|
|
### Stages
|
|
|
|
A stage represents an event pair (start and end events) with additional metadata, such as the name
|
|
of the stage. Stages are configurable by the user within the pairing rules defined in the backend.
|
|
|
|
**Example stage: Code Review**
|
|
|
|
- Start event identifier: Merge Request creation time
|
|
- Start event column: uses the `merge_requests.created_at` timestamp column.
|
|
- End event identifier: Merge Request merge time
|
|
- End event column: uses the `merge_request_metrics.merged_at` timestamp column.
|
|
- Stage event hash ID: a calculated hash for the pair of start and end event identifiers.
|
|
- If two stages have the same configuration of start and end events, then their stage event hash
|
|
IDs are identical.
|
|
- The stage event hash ID is later used to store the aggregated data in partitioned database tables.
|
|
|
|
### Value streams
|
|
|
|
Value streams are container objects for the stages. There can be multiple value streams per group
|
|
or project focusing on different aspects of the Dev Ops lifecycle.
|
|
|
|
### Example configuration
|
|
|
|
![vsa object hierarchy example](img/object_hierarchy_example_V14_10.png)
|
|
|
|
In this example, there are two independent value streams set up for two teams that are using
|
|
different development workflows within the `Test Group` (top-level namespace).
|
|
|
|
The first value stream uses standard timestamp-based events for defining the stages. The second
|
|
value stream uses label events.
|
|
|
|
Each value stream and stage item from the example will be persisted in the database. Notice that
|
|
the `Deployment` stage is identical for both value streams; that means that the underlying
|
|
`stage_event_hash_id` is the same for both stages. The `stage_event_hash_id` reduces
|
|
the amount of data the backend collects and plays a vital role in database partitioning.
|
|
|
|
We expect value streams and stages to be rarely changed. When stages (start and end events) are
|
|
changed, the aggregated data gets stale. This is fixed by the periodical aggregation occurring
|
|
every day.
|
|
|
|
### Feature availability
|
|
|
|
The aggregated VSA feature is available on the group and project level however, the aggregated
|
|
backend is only available for Premium and Ultimate customers due to data storage and data
|
|
computation costs. Storing de-normalized, aggregated data requires significant disk space.
|
|
|
|
## Aggregated value stream analytics architecture
|
|
|
|
The main idea behind the aggregated VSA backend is separation: VSA database tables and queries do
|
|
not use the core domain models directly (Issue, MergeRequest). This allows us to scale and
|
|
optimize VSA independently from the other parts of the application.
|
|
|
|
The architecture consists of two main mechanisms:
|
|
|
|
- Periodical data collection and loading (happens in the background).
|
|
- Querying the collected data (invoked by the user).
|
|
|
|
### Data loading
|
|
|
|
The aggregated nature of VSA comes from the periodical data loading. The system queries the core
|
|
domain models to collect the stage and timestamp data. This data is periodically inserted into the
|
|
VSA database tables.
|
|
|
|
High-level overview for each top-level namespace with Premium or Ultimate license:
|
|
|
|
1. Load all stages in the group.
|
|
1. Iterate over the issues and merge requests records.
|
|
1. Based on the stage configurations (start and end event identifiers) collect the timestamp data.
|
|
1. `INSERT` or `UPDATE` the data into the VSA database tables.
|
|
|
|
The data loading is implemented within the [`Analytics::CycleAnalytics::DataLoaderService`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/services/analytics/cycle_analytics/data_loader_service.rb)
|
|
class. There are groups containing a lot of data, so to avoid overloading the primary database,
|
|
the service performs operations in batches and enforces strict application limits:
|
|
|
|
- Load records in batches.
|
|
- Insert records in batches.
|
|
- Stop processing when a limit is reached, schedule a background job to continue the processing
|
|
later.
|
|
- Continue processing data from a specific point.
|
|
|
|
As of GitLab 14.7, the data loading is done manually. Once the feature is ready, the service will
|
|
be invoked periodically by the system via a cron job (this part is not implemented yet).
|
|
|
|
#### Record iteration
|
|
|
|
The batched iteration is implemented with the
|
|
[efficient IN operator](../database/efficient_in_operator_queries.md). The background job scans
|
|
all issues and merge request records in the group hierarchy ordered by the `updated_at` and the
|
|
`id` columns. For already aggregated groups, the `DataLoaderService` continues the aggregation
|
|
from a specific point which saves time.
|
|
|
|
Collecting the timestamp data happens on every iteration. The `DataLoaderService` determines which
|
|
stage events are configured within the group hierarchy and builds a query that selects the
|
|
required timestamps. The stage record knows which events are configured and the events know how to
|
|
select the timestamp columns.
|
|
|
|
Example for collected stage events: merge request merged, merge request created, merge request
|
|
closed
|
|
|
|
Generated SQL query for loading the timestamps:
|
|
|
|
```sql
|
|
SELECT
|
|
-- the list of columns depends on the configured stages
|
|
"merge_request_metrics"."merged_at",
|
|
"merge_requests"."created_at",
|
|
"merge_request_metrics"."latest_closed_at"
|
|
FROM "merge_requests"
|
|
LEFT OUTER JOIN "merge_request_metrics" ON "merge_request_metrics"."merge_request_id" = "merge_requests"."id"
|
|
WHERE "merge_requests"."id" IN (1, 2, 3, 4) -- ids are coming from the batching query
|
|
```
|
|
|
|
The `merged_at` column is located in a separate table (`merge_request_metrics`). The
|
|
`Gitlab::Analytics::CycleAnalytics::StagEvents::MergeRequestMerged` class adds itself to a scope
|
|
for loading the timestamp data without affecting the number of rows (uses `LEFT JOIN`). This
|
|
behavior is implemented for each `StageEvent` class with the `include_in` method.
|
|
|
|
The data collection query works on the event level. It extracts the event timestamps from the
|
|
stages and ensures that we don't collect the same data multiple times. The events mentioned above
|
|
could come from the following stage configuration:
|
|
|
|
- merge request created - merge request merged
|
|
- merge request created - merge request closed
|
|
|
|
Other combinations might be also possible, but we prevent the ones that make no sense, for example:
|
|
|
|
- merge request merged - merge request created
|
|
|
|
Creation time always happens first, so this stage always reports negative duration.
|
|
|
|
#### Data scope
|
|
|
|
The data collection scans and processes all issues and merge requests records in the group
|
|
hierarchy, starting from the top-level group. This means that if a group only has one value stream
|
|
in a sub-group, we nevertheless collect data of all issues and merge requests in the hierarchy of
|
|
this group. This aims to simplify the data collection mechanism. Moreover, data research shows
|
|
that most group hierarchies have their stages configured on the top level.
|
|
|
|
During the data collection process, the collected timestamp data is transformed into rows. For
|
|
each configured stage, if the start event timestamp is present, the system inserts or updates one
|
|
event record. This allows us to determine the upper limit of the inserted rows per group by
|
|
counting all issues and merge requests and multiplying the sum by the stage count.
|
|
|
|
#### Data consistency concerns
|
|
|
|
Due to the async nature of the data collection, data consistency issues are bound to happen. This
|
|
is a trade-off that makes the query performance significantly faster. We think that for analytical
|
|
workload a slight lag in the data is acceptable.
|
|
|
|
Before the rollout we plan to implement some indicators on the VSA page that shows the most
|
|
recent backend activities. For example, indicators that show the last data collection timestamp
|
|
and the last consistency check timestamp.
|
|
|
|
#### Database structure
|
|
|
|
VSA collects data for the following domain models: `Issue` and `MergeRequest`. To keep the
|
|
aggregated data separated, we use two additional database tables:
|
|
|
|
- `analytics_cycle_analytics_issue_stage_events`
|
|
- `analytics_cycle_analytics_merge_request_stage_events`
|
|
|
|
Both tables are hash partitioned by the `stage_event_hash_id`. Each table uses 32 partitions. It's
|
|
an arbitrary number and it could be changed. Important is to keep the partitions under 100GB in
|
|
size (which gives the feature a lot of headroom).
|
|
|
|
|Column|Description|
|
|
|-|-|
|
|
|`stage_event_hash_id`|partitioning key|
|
|
|`merge_request_id` or `issue_id`|reference to the domain record (Issuable)|
|
|
|`group_id`|reference to the group (de-normalization)|
|
|
|`project_id`|reference to the project|
|
|
|`milestone_id`|duplicated data from the domain record table|
|
|
|`author_id`|duplicated data from the domain record table|
|
|
|`state_id`|duplicated data from the domain record table|
|
|
|`start_event_timestamp`|timestamp derived from the stage configuration|
|
|
|`end_event_timestamp`|timestamp derived from the stage configuration|
|
|
|
|
With accordance to the data separation requirements, the table doesn't have any foreign keys. The
|
|
consistency is ensured by a background job (eventually consistent).
|
|
|
|
### Data querying
|
|
|
|
The base query always includes the following filters:
|
|
|
|
- `stage_event_hash_id` - partition key
|
|
- `project_id` or `group_id` - depending if it's a project or group level query
|
|
- `end_event_timestamp` - date range filter (last 30 days)
|
|
|
|
Example: Selecting review stage duration for the GitLab project
|
|
|
|
```sql
|
|
SELECT end_event_timestamp - start_event_timestamp
|
|
FROM analytics_cycle_analytics_merge_request_stage_events
|
|
WHERE
|
|
stage_event_hash_id = 16 AND -- hits a specific partition
|
|
project_id = 278964 AND
|
|
end_event_timestamp > '2022-01-01' AND end_event_timestamp < '2022-01-30'
|
|
```
|
|
|
|
#### Query generation
|
|
|
|
The query backend is hidden behind the same interface that the old backend implementation uses.
|
|
Thanks to this, we can easily switch between the old and new query backends.
|
|
|
|
- `DataCollector`: entrypoint for querying VSA data
|
|
- `BaseQueryBuilder`: provides the base `ActiveRecord` scope (filters are applied here).
|
|
- `average`: average aggregation.
|
|
- `median`: median aggregation.
|
|
- `count`: row counting.
|
|
- `records`: list of issue or merge request records.
|
|
|
|
#### Filters
|
|
|
|
VSA supports various filters on the base query. Most of the filters require no additional JOINs:
|
|
|
|
|Filter name|Description|
|
|
|-|-|
|
|
|`milestone_title`|The backend translates it to `milestone_id` filter|
|
|
|`author_username`|The backend translates it to `author_id` filter|
|
|
|`project_ids`|Only used on the group-level|
|
|
|
|
Exceptions: these filters are applied on other tables which means we `JOIN` them.
|
|
|
|
|Filter name|Description|
|
|
|-|-|
|
|
|`label_name`|Array filter, using the `label_links` table|
|
|
|`assignee_username`|Array filter, using the `*_assignees` table|
|
|
|
|
To fully decompose the database, the required ID values would need to be replicated in the VSA
|
|
database tables. This change could be implemented using array columns.
|
|
|
|
### Endpoints
|
|
|
|
The feature uses private JSON APIs for delivering the data to the frontend. On the first page load
|
|
, the following requests are invoked:
|
|
|
|
- Initial HTML page load which is mostly empty. Some configuration data is exposed via `data`
|
|
attributes.
|
|
- `value_streams` - Load the available value streams for the given group.
|
|
- `stages` - Load the stages for the currently selected value stream.
|
|
- `median` - For each stage, request the median duration.
|
|
- `count` - For each stage, request the number of items in the stage (this is a
|
|
[limit count](../merge_request_performance_guidelines.md#badge-counters), maximum 1000 rows).
|
|
- `average_duration_chart` - Data for the duration chart.
|
|
- `summary`, `time_summary` - Top-level aggregations, most of the metrics are using different APIs/
|
|
finders and not invoking the aggregated backend.
|
|
|
|
When clicking on a specific stage, the `records` endpoint is invoked, which returns the related
|
|
records (paginated) for the chosen stage in a specific order.
|
|
|
|
### Database decomposition
|
|
|
|
By separating the query logic from the main application code, the feature is ready for database
|
|
decomposition. If we decide that VSA requires a separate database instance, then moving the
|
|
aggregated tables can be accomplished with little effort.
|
|
|
|
A different database technology could also be used to further improve the performance of the
|
|
feature, for example [Timescale DB](https://www.timescale.com).
|