# Pseudonymizer **(ULTIMATE)**

> [Introduced](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/5532) in [GitLab Ultimate][ee] 11.1.

Because GitLab's database hosts sensitive information, using it unfiltered for analytics
implies high security requirements. To help alleviate this constraint, the Pseudonymizer
service exports GitLab's data in a pseudonymized form.

CAUTION: **Warning:**
This process is not impervious. If the source data is available, a user can
correlate it with the pseudonymized version.

The Pseudonymizer currently uses `HMAC(SHA256)` to mutate fields that shouldn't
be exported as plain text. This ensures that:

- The end user of the data source cannot infer or revert the pseudonymized fields.
- Referential integrity is maintained, because identical input values always map
  to the same pseudonymized value.
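
The two properties above can be illustrated with a minimal Ruby sketch. It is not the Pseudonymizer's actual implementation: the `pseudonymize` helper, the secret, and the sample rows are hypothetical, and only Ruby's standard `OpenSSL::HMAC` API is assumed.

```ruby
require 'openssl'

# Hypothetical secret for illustration only; this sketch does not reflect
# how the Pseudonymizer itself manages its key.
SECRET = 'example-secret'

# Returns the hex-encoded HMAC-SHA256 digest of a field value.
def pseudonymize(value, secret = SECRET)
  OpenSSL::HMAC.hexdigest(OpenSSL::Digest.new('SHA256'), secret, value.to_s)
end

# Hypothetical rows from two tables that reference the same email address.
users  = [{ id: 1, email: 'alice@example.com' }]
events = [{ author_email: 'alice@example.com', action: 'push' }]

pseudo_users  = users.map { |u| u.merge(email: pseudonymize(u[:email])) }
pseudo_events = events.map { |e| e.merge(author_email: pseudonymize(e[:author_email])) }

# The plain-text value is gone, but the two rows can still be joined
# on the pseudonymized column.
puts pseudo_users.first[:email] == pseudo_events.first[:author_email]
```

Because the digest is deterministic for a given key, joins on pseudonymized columns keep working. The same determinism is why someone who also has the source data may be able to correlate records, as the warning above notes.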
## Configuration

To configure the Pseudonymizer, you need to:

- Provide a manifest file that describes which fields should be included or
  pseudonymized ([example `manifest.yml` file](https://gitlab.com/gitlab-org/gitlab/tree/master/config/pseudonymizer.yml)).
  A default manifest is provided with the GitLab installation. A relative file path is resolved from the Rails root.
  Alternatively, you can use an absolute file path.
- Use an object storage and specify the connection parameters in the `pseudonymizer.upload.connection` configuration option.

**For Omnibus installations:**

1. Edit `/etc/gitlab/gitlab.rb` and add the following lines, replacing the
   placeholder values with your own:

   ```ruby
   gitlab_rails['pseudonymizer_manifest'] = 'config/pseudonymizer.yml'
   gitlab_rails['pseudonymizer_upload_remote_directory'] = 'gitlab-elt' # bucket name
   gitlab_rails['pseudonymizer_upload_connection'] = {
     'provider' => 'AWS',
     'region' => 'eu-central-1',
     'aws_access_key_id' => 'AWS_ACCESS_KEY_ID',
     'aws_secret_access_key' => 'AWS_SECRET_ACCESS_KEY'
   }
   ```

   NOTE: **Note:**
   If you are using AWS IAM profiles, be sure to omit the `aws_access_key_id` and
   `aws_secret_access_key` key/value pairs.

   ```ruby
   gitlab_rails['pseudonymizer_upload_connection'] = {
     'provider' => 'AWS',
     'region' => 'eu-central-1',
     'use_iam_profile' => true
   }
   ```

1. Save the file and [reconfigure GitLab](restart_gitlab.md#omnibus-gitlab-reconfigure)
   for the changes to take effect.

---

**For installations from source:**

1. Edit `/home/git/gitlab/config/gitlab.yml` and add or amend the following
   lines:

   ```yaml
   pseudonymizer:
     manifest: config/pseudonymizer.yml
     upload:
       remote_directory: 'gitlab-elt' # bucket name
       connection:
         provider: AWS
         aws_access_key_id: AWS_ACCESS_KEY_ID
         aws_secret_access_key: AWS_SECRET_ACCESS_KEY
         region: eu-central-1
   ```

1. Save the file and [restart GitLab](restart_gitlab.md#installations-from-source)
   for the changes to take effect.
## Usage

When you run the Pseudonymizer, you can optionally set the following environment variables:

- `PSEUDONYMIZER_OUTPUT_DIR` - where to store the output CSV files (defaults to `/tmp`)
- `PSEUDONYMIZER_BATCH` - the batch size when querying the database (defaults to `100000`)

```shell
## Omnibus
sudo gitlab-rake gitlab:db:pseudonymizer

## Source
sudo -u git -H bundle exec rake gitlab:db:pseudonymizer RAILS_ENV=production
```
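
As a rough illustration of the defaults listed above, the following Ruby sketch resolves both variables from an environment hash. It is not the rake task's actual code: the `pseudonymizer_settings` helper, its return shape, and the override values are hypothetical. Only the variable names and default values come from this page.

```ruby
# Hypothetical helper: resolves the documented variables, falling back to
# the documented defaults (`/tmp` and `100000`) when a variable is unset.
def pseudonymizer_settings(env = ENV)
  {
    output_dir: env.fetch('PSEUDONYMIZER_OUTPUT_DIR', '/tmp'),
    # Integer() raises ArgumentError on non-numeric input instead of
    # silently returning 0, as String#to_i would.
    batch_size: Integer(env.fetch('PSEUDONYMIZER_BATCH', '100000'))
  }
end

# No variables set: the documented defaults apply.
p pseudonymizer_settings({})

# Both variables overridden (illustrative values only).
p pseudonymizer_settings(
  'PSEUDONYMIZER_OUTPUT_DIR' => '/var/tmp/pseudonymizer',
  'PSEUDONYMIZER_BATCH' => '50000'
)
```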

The task produces CSV files that can be very large, so make sure that the
`PSEUDONYMIZER_OUTPUT_DIR` directory has sufficient free space. As a rule of thumb, at least
10% of the database size is recommended.

After the Pseudonymizer has run, the output CSV files should be uploaded to the
configured object storage and deleted from the local disk.

[ee]: https://about.gitlab.com/pricing/