kotovalexarian-likes-gitlab/gitlab-org--gitlab-foss

GitLab Bot df40cd1c38 Add latest changes from gitlab-org/gitlab@master

2020-11-19 21:09:07 +00:00


stage	group	info
none	unassigned	To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#designated-technical-writers

Pseudonymizer (ULTIMATE)

Introduced in GitLab Ultimate 11.1.

As GitLab's database hosts sensitive information, using it unfiltered for analytics implies high security requirements. To help alleviate this constraint, the Pseudonymizer service is used to export GitLab's data in a pseudonymized way.

WARNING: This process is not impervious. If the source data is available, a user can correlate it with the pseudonymized version.

The Pseudonymizer currently uses HMAC(SHA256) to mutate fields that shouldn't be textually exported. This ensures that:

The end user of the data source cannot infer or reverse the pseudonymized fields.
Referential integrity is maintained, because the same input value always maps to the same pseudonymized value.
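
The two properties above can be sketched with Ruby's standard `openssl` library. This is a minimal illustration, not the Pseudonymizer's actual code: the `pseudonymize` helper, the `SECRET_KEY` value, and the sample rows are all hypothetical.

```ruby
require 'openssl'

# Hypothetical secret, for illustration only. A real key would be long,
# random, and kept private.
SECRET_KEY = 'example-secret-key'

# Returns the hex-encoded HMAC-SHA256 digest of a value (64 hex characters).
def pseudonymize(value, key = SECRET_KEY)
  OpenSSL::HMAC.hexdigest(OpenSSL::Digest.new('SHA256'), key, value.to_s)
end

# Hypothetical sample rows from two related tables.
users  = [{ id: 1 }, { id: 2 }]
issues = [{ id: 10, author_id: 1 }, { id: 11, author_id: 2 }]

pseudo_users  = users.map { |u| { id: pseudonymize(u[:id]) } }
pseudo_issues = issues.map do |i|
  { id: pseudonymize(i[:id]), author_id: pseudonymize(i[:author_id]) }
end

# The foreign key still matches the pseudonymized primary key, so joins
# on the exported data keep working.
puts pseudo_issues[0][:author_id] == pseudo_users[0][:id]
```

Because the digest is keyed, someone holding only the exported values cannot recompute them from guessed inputs without the key. As the warning above notes, someone who also holds the source data can still correlate the two.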

Configuration

To configure the Pseudonymizer, you need to:

Provide a manifest file that describes which fields should be included or pseudonymized (example manifest.yml file). A default manifest is provided with the GitLab installation, using a relative file path that resolves from the Rails root. Alternatively, you can use an absolute file path.
Use object storage and specify the connection parameters in the pseudonymizer.upload.connection configuration option.
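
To make the role of the manifest concrete, here is a rough sketch of how a manifest could drive an export. The manifest layout (`tables`, `include`, `pseudonymize`), the `export_row` helper, and the key are assumptions made for this example. Consult the manifest shipped with your GitLab installation for the real format.

```ruby
require 'openssl'
require 'yaml'

# Hypothetical manifest layout, for illustration only.
MANIFEST = YAML.safe_load(<<~MANIFEST_YAML)
  tables:
    users:
      include:
        - id
        - created_at
        - email
      pseudonymize:
        - id
        - email
MANIFEST_YAML

# Keeps only the included fields of a row and hashes the ones marked
# for pseudonymization. Fields absent from the manifest are dropped.
def export_row(table, row, manifest, key)
  rules = manifest.fetch('tables').fetch(table)
  hashed = rules.fetch('pseudonymize', [])
  rules.fetch('include').each_with_object({}) do |field, out|
    value = row[field]
    out[field] =
      if hashed.include?(field) && !value.nil?
        OpenSSL::HMAC.hexdigest(OpenSSL::Digest.new('SHA256'), key, value.to_s)
      else
        value
      end
  end
end

row = {
  'id' => 1,
  'email' => 'alice@example.com',
  'created_at' => '2020-11-19',
  'encrypted_password' => 'not-exported'
}
exported = export_row('users', row, MANIFEST, 'example-secret-key')
puts exported.keys.inspect
```

In this sketch, `encrypted_password` never reaches the output because the manifest does not list it, while `created_at` is exported as plain text.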

For Omnibus installations:

Edit /etc/gitlab/gitlab.rb and add the following lines, replacing the values with your own:

gitlab_rails['pseudonymizer_manifest'] = 'config/pseudonymizer.yml'
gitlab_rails['pseudonymizer_upload_remote_directory'] = 'gitlab-elt' # bucket name
gitlab_rails['pseudonymizer_upload_connection'] = {
  'provider' => 'AWS',
  'region' => 'eu-central-1',
  'aws_access_key_id' => 'AWS_ACCESS_KEY_ID',
  'aws_secret_access_key' => 'AWS_SECRET_ACCESS_KEY'
}

If you are using AWS IAM profiles, be sure to omit the AWS access key and secret access key/value pairs.

gitlab_rails['pseudonymizer_upload_connection'] = {
  'provider' => 'AWS',
  'region' => 'eu-central-1',
  'use_iam_profile' => true
}

Save the file and reconfigure GitLab for the changes to take effect.

For installations from source:

Edit /home/git/gitlab/config/gitlab.yml and add or amend the following lines:

pseudonymizer:
  manifest: config/pseudonymizer.yml
  upload:
    remote_directory: 'gitlab-elt' # bucket name
    connection:
      provider: AWS
      aws_access_key_id: AWS_ACCESS_KEY_ID
      aws_secret_access_key: AWS_SECRET_ACCESS_KEY
      region: eu-central-1

Save the file and restart GitLab for the changes to take effect.

Usage

When you run the Pseudonymizer, you can optionally set the following environment variables:

PSEUDONYMIZER_OUTPUT_DIR - where to store the output CSV files (defaults to /tmp)
PSEUDONYMIZER_BATCH - the batch size when querying the DB (defaults to 100000)

## Omnibus
sudo gitlab-rake gitlab:db:pseudonymizer

## Source
sudo -u git -H bundle exec rake gitlab:db:pseudonymizer RAILS_ENV=production
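
If you launch the task from a script, the two variables can be passed as an environment hash. The sketch below only builds and prints the invocation. The `pseudonymizer_env` helper and the output path are hypothetical, and the defaults mirror the ones listed above.

```ruby
# Builds the environment overrides for a run. Environment values must be
# strings, so the batch size is converted.
def pseudonymizer_env(output_dir: '/tmp', batch: 100_000)
  {
    'PSEUDONYMIZER_OUTPUT_DIR' => output_dir,
    'PSEUDONYMIZER_BATCH' => batch.to_s
  }
end

# Hypothetical output directory on a volume with enough free space.
env = pseudonymizer_env(output_dir: '/var/opt/gitlab/pseudonymizer', batch: 50_000)
command = ['gitlab-rake', 'gitlab:db:pseudonymizer']

# Print the equivalent shell invocation instead of executing it.
# On an Omnibus host you could run it as root with: system(env, *command)
puts (env.map { |name, value| "#{name}=#{value}" } + command).join(' ')
```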

This produces some CSV files that might be very large, so make sure the PSEUDONYMIZER_OUTPUT_DIR has sufficient space. As a rule of thumb, at least 10% of the database size is recommended.
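
As a worked example of the 10% rule of thumb, the following sketch computes the minimum free space for a given database size. The `minimum_output_bytes` helper and the 250 GiB figure are made up for illustration.

```ruby
GIB = 1024**3

# Rounds up to a whole byte using integer arithmetic, which avoids
# floating-point rounding surprises.
def minimum_output_bytes(database_bytes, percent: 10)
  (database_bytes * percent + 99) / 100
end

db_size = 250 * GIB # hypothetical database size
needed = minimum_output_bytes(db_size)
puts format('%.1f GiB', needed.to_f / GIB) # prints "25.0 GiB"
```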

After the pseudonymizer has run, the output CSV files should be uploaded to the configured object storage and deleted from the local disk.
