---
stage: Enablement
group: Distribution
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
---

# Pseudonymizer **(ULTIMATE)**

Your GitLab database contains sensitive information. To protect sensitive information
when you run analytics on your database, you can use the Pseudonymizer service, which:

1. Uses `HMAC(SHA256)` to mutate fields containing sensitive information.
1. Preserves references (referential integrity) between fields.
1. Exports your GitLab data, scrubbed of sensitive material.

WARNING:
If the source data is available, users can compare and correlate the scrubbed data
with the original.

To generate a pseudonymized data set:

1. [Configure Pseudonymizer](#configure-pseudonymizer) fields and output location.
1. [Enable Pseudonymizer data collection](#enable-pseudonymizer-data-collection).
1. Optional. [Generate a data set manually](#generate-data-set-manually).
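
The `HMAC(SHA256)` mutation described in the introduction can be illustrated with a
minimal Ruby sketch. It is not the Pseudonymizer's actual implementation: the `SECRET`
value, the `pseudonymize` helper, and the sample records are hypothetical. It only shows
why a keyed hash keeps references intact: the same input always produces the same digest.

```ruby
require 'openssl'

# Hypothetical secret for illustration only; the real service manages its own key.
SECRET = 'example-secret-key'

# Hypothetical helper: replace a sensitive value with its HMAC-SHA256 hex digest.
def pseudonymize(value, secret: SECRET)
  OpenSSL::HMAC.hexdigest('SHA256', secret, value.to_s)
end

# Two sample tables that reference the same email address.
users  = [{ id: 1, email: 'alice@example.com' }]
issues = [{ id: 10, author_email: 'alice@example.com' }]

scrubbed_users  = users.map  { |u| u.merge(email: pseudonymize(u[:email])) }
scrubbed_issues = issues.map { |i| i.merge(author_email: pseudonymize(i[:author_email])) }

# Identical inputs map to identical digests, so the rows can still be joined.
puts scrubbed_users.first[:email] == scrubbed_issues.first[:author_email]
```

Because the digest is deterministic for a given key, analytics queries can still join the
scrubbed tables on the mutated columns without seeing the original values.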

## Configure Pseudonymizer

To use the Pseudonymizer, configure both the fields you want to pseudonymize and the location to
store the scrubbed data:

1. **Create a manifest file**: This file describes the fields to include or pseudonymize.
   - **Default manifest** - GitLab provides a default manifest in your GitLab installation
     ([example `manifest.yml` file](https://gitlab.com/gitlab-org/gitlab/-/blob/master/config/pseudonymizer.yml)).
     To use the example manifest file, use the `config/pseudonymizer.yml` relative path
     when you configure connection parameters.
   - **Custom manifest** - To use a custom manifest file, use the absolute path to
     the file when you configure the connection parameters.
1. **Configure connection parameters**: In the configuration file for your type of
   GitLab installation, specify the [object storage](object_storage.md)
   connection parameters (`pseudonymizer.upload.connection`).

**For Omnibus installations:**

1. Edit `/etc/gitlab/gitlab.rb` and add the following lines, replacing the example
   values with your own:

   ```ruby
   gitlab_rails['pseudonymizer_manifest'] = 'config/pseudonymizer.yml'
   gitlab_rails['pseudonymizer_upload_remote_directory'] = 'gitlab-elt' # bucket name
   gitlab_rails['pseudonymizer_upload_connection'] = {
     'provider' => 'AWS',
     'region' => 'eu-central-1',
     'aws_access_key_id' => 'AWS_ACCESS_KEY_ID',
     'aws_secret_access_key' => 'AWS_SECRET_ACCESS_KEY'
   }
   ```

   If you are using AWS IAM profiles, omit the AWS access key and secret access key
   pairs and set `use_iam_profile` to `true`:

   ```ruby
   gitlab_rails['pseudonymizer_upload_connection'] = {
     'provider' => 'AWS',
     'region' => 'eu-central-1',
     'use_iam_profile' => true
   }
   ```

1. Save the file and [reconfigure GitLab](restart_gitlab.md#omnibus-gitlab-reconfigure)
   for the changes to take effect.

---

**For installations from source:**

1. Edit `/home/git/gitlab/config/gitlab.yml` and add or amend the following
   lines:

   ```yaml
   pseudonymizer:
     manifest: config/pseudonymizer.yml
     upload:
       remote_directory: 'gitlab-elt' # bucket name
       connection:
         provider: AWS
         aws_access_key_id: AWS_ACCESS_KEY_ID
         aws_secret_access_key: AWS_SECRET_ACCESS_KEY
         region: eu-central-1
   ```

1. Save the file and [restart GitLab](restart_gitlab.md#installations-from-source)
   for the changes to take effect.
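
Before you reconfigure or restart, it can help to sanity-check the connection settings.
The following is a minimal sketch, not part of GitLab: `connection_problems` is a
hypothetical helper, and it assumes the AWS settings shown in the examples above
(`provider`, `region`, and either `use_iam_profile` or both access key settings).

```ruby
# Hypothetical helper, not a GitLab API: returns a list of problems found in a
# connection hash shaped like the AWS examples in this section.
def connection_problems(connection)
  problems = []

  # Assumption: the examples always set a provider and a region.
  %w[provider region].each do |key|
    problems << "missing '#{key}'" if connection[key].to_s.empty?
  end

  # Either an IAM profile or both static credentials must be present.
  static_keys = %w[aws_access_key_id aws_secret_access_key]
  has_static = static_keys.all? { |key| !connection[key].to_s.empty? }
  unless connection['use_iam_profile'] == true || has_static
    problems << "set 'use_iam_profile' => true or both AWS key settings"
  end

  problems
end

iam_connection = {
  'provider' => 'AWS',
  'region' => 'eu-central-1',
  'use_iam_profile' => true
}
puts connection_problems(iam_connection).empty?
puts connection_problems({ 'provider' => 'AWS' }).inspect
```

An empty list means the hash matches the shape of the documented examples. It does not
confirm that the credentials or bucket are valid.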

## Enable Pseudonymizer data collection

To enable data collection:

1. On the top bar, select **Menu > Admin**.
1. On the left sidebar, select **Settings > Metrics and Profiling**, then expand
   **Pseudonymizer data collection**.
1. Select **Enable Pseudonymizer data collection**.
1. Select **Save changes**.

## Generate data set manually

You can also run the Pseudonymizer manually:

1. Set these environment variables:
   - `PSEUDONYMIZER_OUTPUT_DIR` - Where to store the output CSV files. Defaults to `/tmp`.
     These commands produce CSV files that can be quite large. Make sure the directory
     can store a file at least 10% of the size of your database.
   - `PSEUDONYMIZER_BATCH` - The batch size when querying the database. Defaults to `100000`.
1. Run the command appropriate for your installation:
   - **Omnibus GitLab**:
     `sudo gitlab-rake gitlab:db:pseudonymizer`
   - **Installations from source**:
     `sudo -u git -H bundle exec rake gitlab:db:pseudonymizer RAILS_ENV=production`

After you run the command, upload the output CSV files to your configured object
storage. After the upload completes, delete the output files from the local disk.
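
The upload-then-delete order matters: a file should be removed only after its upload
succeeds. The following sketch shows one way to enforce that order. `upload_and_clean`
and the uploader callable are hypothetical, and the sketch assumes the output files end
in `.csv` and sit directly in the output directory. Adjust the glob pattern if your
output differs, and replace the stand-in uploader with a call to your object storage client.

```ruby
require 'tmpdir'

# Hypothetical helper: passes each CSV file in output_dir to the uploader, and
# deletes a file only after its upload returns without raising.
def upload_and_clean(output_dir, uploader)
  Dir.glob(File.join(output_dir, '*.csv')).sort.map do |path|
    uploader.call(path) # an exception here leaves the file on disk
    File.delete(path)
    File.basename(path)
  end
end

# Usage with a stand-in uploader that only records the file names it receives.
Dir.mktmpdir do |dir|
  File.write(File.join(dir, 'users.csv'), "id,email\n")
  sent = []
  done = upload_and_clean(dir, ->(path) { sent << File.basename(path) })
  puts done.inspect
  puts Dir.glob(File.join(dir, '*.csv')).empty?
end
```

If an upload raises an error, processing stops and the remaining files stay on disk, so
you can retry without losing data.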

## Related topics

- [Using object storage with GitLab](object_storage.md).