Merge branch 'docs-label-filtering' into 'master'
Add development documentation on label filtering Closes #49651 See merge request gitlab-org/gitlab-ce!30741
This commit is contained in:
commit
0d2b5bff8c
|
@ -113,6 +113,10 @@ description: 'Learn how to contribute to GitLab.'
|
||||||
- [Database helper modules](database_helpers.md)
|
- [Database helper modules](database_helpers.md)
|
||||||
- [Code comments](code_comments.md)
|
- [Code comments](code_comments.md)
|
||||||
|
|
||||||
|
## Case studies
|
||||||
|
|
||||||
|
- [Database case study: Filtering by label](filtering_by_label.md)
|
||||||
|
|
||||||
## Integration guides
|
## Integration guides
|
||||||
|
|
||||||
- [Jira Connect app](integrations/jira_connect.md)
|
- [Jira Connect app](integrations/jira_connect.md)
|
||||||
|
|
|
@ -0,0 +1,166 @@
|
||||||
|
# Filtering by label
|
||||||
|
|
||||||
|
## Introduction
|
||||||
|
|
||||||
|
GitLab has [labels](../user/project/labels.md) that can be assigned to issues,
|
||||||
|
merge requests, and epics. Labels on those objects are a many-to-many relation
|
||||||
|
through the polymorphic `label_links` table.
|
||||||
|
|
||||||
|
To filter these objects by multiple labels - for instance, 'all open
|
||||||
|
issues with the label ~Plan and the label ~backend' - we generate a
|
||||||
|
query containing a `GROUP BY` clause. In a simple form, this looks like:
|
||||||
|
|
||||||
|
```sql
|
||||||
|
SELECT
|
||||||
|
issues.*
|
||||||
|
FROM
|
||||||
|
issues
|
||||||
|
INNER JOIN label_links ON label_links.target_id = issues.id
|
||||||
|
AND label_links.target_type = 'Issue'
|
||||||
|
INNER JOIN labels ON labels.id = label_links.label_id
|
||||||
|
WHERE
|
||||||
|
issues.project_id = 13083
|
||||||
|
AND (issues.state IN ('opened'))
|
||||||
|
AND labels.title IN ('Plan',
|
||||||
|
'backend')
|
||||||
|
GROUP BY
|
||||||
|
issues.id
|
||||||
|
HAVING (COUNT(DISTINCT labels.title) = 2)
|
||||||
|
ORDER BY
|
||||||
|
issues.updated_at DESC,
|
||||||
|
issues.id DESC
|
||||||
|
LIMIT 20 OFFSET 0
|
||||||
|
```
|
||||||
|
|
||||||
|
In particular, note that:
|
||||||
|
|
||||||
|
1. We `GROUP BY issues.id` so that we can ...
|
||||||
|
2. Use the `HAVING (COUNT(DISTINCT labels.title) = 2)` condition to ensure that
|
||||||
|
all matched issues have both labels.
|
||||||
|
|
||||||
|
This is more complicated than is ideal. It makes the query construction more
|
||||||
|
prone to errors (such as
|
||||||
|
[gitlab-org/gitlab-ce#15557](https://gitlab.com/gitlab-org/gitlab-ce/issues/15557)).
|
||||||
|
|
||||||
|
## Attempt A: WHERE EXISTS
|
||||||
|
|
||||||
|
### Attempt A1: use multiple subqueries with WHERE EXISTS
|
||||||
|
|
||||||
|
In
|
||||||
|
[gitlab-org/gitlab-ce#37137](https://gitlab.com/gitlab-org/gitlab-ce/issues/37137)
|
||||||
|
and its associated merge request
|
||||||
|
[gitlab-org/gitlab-ce!14022](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14022),
|
||||||
|
we tried to replace the `GROUP BY` with multiple uses of `WHERE EXISTS`. For the
|
||||||
|
example above, this would give:
|
||||||
|
|
||||||
|
```sql
|
||||||
|
WHERE (EXISTS (
|
||||||
|
SELECT
|
||||||
|
TRUE
|
||||||
|
FROM
|
||||||
|
label_links
|
||||||
|
INNER JOIN labels ON labels.id = label_links.label_id
|
||||||
|
WHERE
|
||||||
|
labels.title = 'Plan'
|
||||||
|
AND target_type = 'Issue'
|
||||||
|
AND target_id = issues.id))
|
||||||
|
AND (EXISTS (
|
||||||
|
SELECT
|
||||||
|
TRUE
|
||||||
|
FROM
|
||||||
|
label_links
|
||||||
|
INNER JOIN labels ON labels.id = label_links.label_id
|
||||||
|
WHERE
|
||||||
|
labels.title = 'backend'
|
||||||
|
AND target_type = 'Issue'
|
||||||
|
AND target_id = issues.id))
|
||||||
|
```
|
||||||
|
|
||||||
|
While this worked without schema changes, and did improve readability somewhat,
|
||||||
|
it did not improve query performance.
|
||||||
|
|
||||||
|
## Attempt B: Denormalize using an array column
|
||||||
|
|
||||||
|
Having [removed MySQL support in GitLab
|
||||||
|
12.1](https://about.gitlab.com/2019/06/27/removing-mysql-support/), using
|
||||||
|
[Postgres's arrays](https://www.postgresql.org/docs/9.6/arrays.html) became more
|
||||||
|
tractable as we didn't have to support two databases. We discussed denormalizing
|
||||||
|
the `label_links` table for querying in
|
||||||
|
[gitlab-org/gitlab-ce#49651](https://gitlab.com/gitlab-org/gitlab-ce/issues/49651),
|
||||||
|
with two options: label IDs and titles.
|
||||||
|
|
||||||
|
We can think of both of those as array columns on `issues`, `merge_requests`,
|
||||||
|
and `epics`: `issues.label_ids` would be an array column of label IDs, and
|
||||||
|
`issues.label_titles` would be an array of label titles.
|
||||||
|
|
||||||
|
These array columns can be complemented with [GIN
|
||||||
|
indexes](https://www.postgresql.org/docs/9.6/gin-intro.html) to improve
|
||||||
|
matching.
|
||||||
|
|
||||||
|
### Attempt B1: store label IDs for each object
|
||||||
|
|
||||||
|
This has some strong advantages over titles:
|
||||||
|
|
||||||
|
1. Unless a label is deleted, or a project is moved, we never need to
|
||||||
|
bulk-update the denormalized column.
|
||||||
|
2. It uses less storage than the titles.
|
||||||
|
|
||||||
|
Unfortunately, our application design makes this hard. If we were able to query
|
||||||
|
just by label ID easily, we wouldn't need the `INNER JOIN labels` in the initial
|
||||||
|
query at the start of this document. GitLab allows users to filter by label
|
||||||
|
title across projects and even across groups, so a filter by the label ~Plan may
|
||||||
|
include labels with multiple distinct IDs.
|
||||||
|
|
||||||
|
We do not want users to have to know about the different IDs, which means that
|
||||||
|
given this data set:
|
||||||
|
|
||||||
|
| Project | ~Plan label ID | ~backend label ID |
|
||||||
|
| --- | --- | --- |
|
||||||
|
| A | 11 | 12 |
|
||||||
|
| B | 21 | 22 |
|
||||||
|
| C | 31 | 32 |
|
||||||
|
|
||||||
|
We would need something like:
|
||||||
|
|
||||||
|
```sql
|
||||||
|
WHERE
|
||||||
|
label_ids @> ARRAY[11, 12]
|
||||||
|
OR label_ids @> ARRAY[21, 22]
|
||||||
|
OR label_ids @> ARRAY[31, 32]
|
||||||
|
```
|
||||||
|
|
||||||
|
This can get even more complicated when we consider that in some cases, there
|
||||||
|
might be two ~backend labels - with different IDs - that could apply to the same
|
||||||
|
object, so the number of combinations would balloon further.
|
||||||
|
|
||||||
|
### Attempt B2: store label titles for each object
|
||||||
|
|
||||||
|
From the perspective of updating the labelable object, this is the worst
|
||||||
|
option. We have to bulk update the objects when:
|
||||||
|
|
||||||
|
1. The objects are moved from one project to another.
|
||||||
|
1. The project is moved from one group to another.
|
||||||
|
1. The label is renamed.
|
||||||
|
1. The label is deleted.
|
||||||
|
|
||||||
|
It also uses much more storage. Querying is simple, though:
|
||||||
|
|
||||||
|
```sql
|
||||||
|
WHERE
|
||||||
|
label_titles @> ARRAY['Plan', 'backend']
|
||||||
|
```
|
||||||
|
|
||||||
|
And our [tests in
|
||||||
|
gitlab-org/gitlab-ce#49651](https://gitlab.com/gitlab-org/gitlab-ce/issues/49651#note_188777346)
|
||||||
|
showed that this could be fast.
|
||||||
|
|
||||||
|
However, at present, the disadvantages outweigh the advantages.
|
||||||
|
|
||||||
|
## Conclusion
|
||||||
|
|
||||||
|
We have yet to find a method that is demonstratably better than the current
|
||||||
|
method, when considering:
|
||||||
|
|
||||||
|
1. Query performance.
|
||||||
|
1. Readability.
|
||||||
|
1. Ease of maintaining schema consistency.
|
Loading…
Reference in New Issue