gitlab-org--gitlab-foss/app/services/projects/import_service.rb

# frozen_string_literal: true

module Projects
  class ImportService < BaseService
    include Gitlab::ShellAdapter

    Error = Class.new(StandardError)

    # Returns true if this importer is supposed to perform its work in the
    # background.
    #
    # This method will only return `true` if async importing is explicitly
    # supported by an importer class (`Gitlab::GithubImport::ParallelImporter`
    # for example).
    def async?
      has_importer? && !!importer_class.try(:async?)
    end

    def execute
      add_repository_to_project

      download_lfs_objects

      import_data

      success
    rescue Gitlab::UrlBlocker::BlockedUrlError => e
      Gitlab::Sentry.track_acceptable_exception(e, extra: { project_path: project.full_path, importer: project.import_type })

      error(s_("ImportProjects|Error importing repository %{project_safe_import_url} into %{project_full_path} - %{message}") % { project_safe_import_url: project.safe_import_url, project_full_path: project.full_path, message: e.message })
    rescue => e
      message = Projects::ImportErrorFilter.filter_message(e.message)

      Gitlab::Sentry.track_acceptable_exception(e, extra: { project_path: project.full_path, importer: project.import_type })

      error(s_("ImportProjects|Error importing repository %{project_safe_import_url} into %{project_full_path} - %{message}") % { project_safe_import_url: project.safe_import_url, project_full_path: project.full_path, message: message })
    end

    private

    def add_repository_to_project
      if project.external_import? && !unknown_url?
        begin
          Gitlab::UrlBlocker.validate!(project.import_url, ports: Project::VALID_IMPORT_PORTS)
        rescue Gitlab::UrlBlocker::BlockedUrlError => e
          raise e, s_("ImportProjects|Blocked import URL: %{message}") % { message: e.message }
        end
      end

      # We should skip the repository for a GitHub import or GitLab project import,
      # because these importers fetch the project repositories for us.
      return if importer_imports_repository?

      if unknown_url?
        # In this case, we only want to import issues, not a repository.
        create_repository
      elsif !project.repository_exists?
        import_repository
      end
    end

    def create_repository
      unless project.create_repository
        raise Error, s_('ImportProjects|The repository could not be created.')
      end
    end

    def import_repository
      begin
        refmap = importer_class.try(:refmap) if has_importer?

        if refmap
          project.ensure_repository
          project.repository.fetch_as_mirror(project.import_url, refmap: refmap)
        else
          gitlab_shell.import_project_repository(project)
        end
      rescue Gitlab::Shell::Error => e
        # Expire cache to prevent scenarios such as:
        # 1. First import failed, but the repo was imported successfully, so +exists?+ returns true
        # 2. Retried import, repo is broken or not imported but +exists?+ still returns true
        project.repository.expire_content_cache if project.repository_exists?

        raise Error, e.message
      end
    end

    def download_lfs_objects
      # In this case, we only want to import issues
      return if unknown_url?

      # If it has its own repository importer, it has to implement its own LFS import download
      return if importer_imports_repository?

      return unless project.lfs_enabled?

      result = Projects::LfsPointers::LfsImportService.new(project).execute

      if result[:status] == :error
        # To avoid aborting the import process, we only log the failure
        # instead of raising an error.
        Gitlab::AppLogger.error("The Lfs import process failed. #{result[:message]}")
      end
    end

    def import_data
      return unless has_importer?

      project.repository.expire_content_cache unless project.gitlab_project_import?

      unless importer.execute
        raise Error, s_('ImportProjects|The remote data could not be imported.')
      end
    end

    def importer_class
      @importer_class ||= Gitlab::ImportSources.importer(project.import_type)
    end

    def has_importer?
      Gitlab::ImportSources.importer_names.include?(project.import_type)
    end

    def importer
      importer_class.new(project)
    end

    def unknown_url?
      project.import_url == Project::UNKNOWN_IMPORT_URL
    end

    def importer_imports_repository?
      has_importer? && importer_class.try(:imports_repository?)
    end
  end
end
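The service above never checks for a specific importer type. It probes each importer class for optional class-level hooks (`async?`, `imports_repository?`, `refmap`) and treats a missing hook as "no". A minimal sketch of that contract follows, assuming two hypothetical importer classes and a plain-Ruby `probe` helper standing in for ActiveSupport's `try`; none of these names exist in GitLab, and the refspec string is only an illustrative value.

```ruby
# Hypothetical importer classes, used only to illustrate the optional
# class-level hooks that the service probes; they are not part of GitLab.
class SketchParallelImporter
  def self.async?
    true
  end

  def self.imports_repository?
    true
  end
end

class SketchMirrorImporter
  # Illustrative value only; the real refmap format is not shown in the source.
  def self.refmap
    ['+refs/heads/*:refs/heads/*']
  end
end

# Plain-Ruby stand-in for ActiveSupport's `try`: call the hook only when the
# class defines it, otherwise return nil.
def probe(importer_class, hook)
  importer_class.public_send(hook) if importer_class.respond_to?(hook)
end

# Mirrors the shape of `async?` above: a missing hook is coerced to false.
def sketch_async?(importer_class)
  !!probe(importer_class, :async?)
end

puts sketch_async?(SketchParallelImporter)
puts sketch_async?(SketchMirrorImporter)
puts probe(SketchMirrorImporter, :refmap).inspect
```

Because a missing hook yields `nil`, a new importer only has to define the hooks it cares about, which is the point of preferring polymorphism over type checks.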
Enable more frozen string in app/services/**/*.rb Partially addresses #47424. 2018-07-17 16:50:37 +00:00			`# frozen_string_literal: true`

Extract Projects::ImportService service from RepositoryImportWorker 2016-01-21 18:09:32 +00:00			`module Projects`
			`class ImportService < BaseService`
			`include Gitlab::ShellAdapter`

Enable and autocorrect the CustomErrorClass cop 2017-03-01 11:00:37 +00:00			`Error = Class.new(StandardError)`
Extract Projects::ImportService service from RepositoryImportWorker 2016-01-21 18:09:32 +00:00
Rewrite the GitHub importer from scratch

Prior to this MR there were two GitHub related importers:

* Github::Import: the main importer used for GitHub projects
* Gitlab::GithubImport: importer that's somewhat confusingly used for importing Gitea projects (apparently they have a compatible API)

This MR renames the Gitea importer to Gitlab::LegacyGithubImport and introduces a new GitHub importer in the Gitlab::GithubImport namespace. This new GitHub importer uses Sidekiq for importing multiple resources in parallel, though it also has the ability to import data sequentially should this be necessary.

The new code is spread across the following directories:

* lib/gitlab/github_import: this directory contains most of the importer code such as the classes used for importing resources.
* app/workers/gitlab/github_import: this directory contains the Sidekiq workers, most of which simply use the code from the directory above.
* app/workers/concerns/gitlab/github_import: this directory provides a few modules that are included in every GitHub importer worker.

== Stages

The import work is divided into separate stages, with each stage importing a specific set of data. Stages will schedule the work that needs to be performed, followed by scheduling a job for the "AdvanceStageWorker" worker. This worker will periodically check if all work is completed and schedule the next stage if this is the case. If work is not yet completed this worker will reschedule itself. Using this approach we don't have to block threads by calling `sleep()`, as doing so for large projects could block the thread from doing any work for many hours.

== Retrying Work

Workers will reschedule themselves whenever necessary. For example, hitting the GitHub API's rate limit will result in jobs rescheduling themselves. These jobs are not processed until the rate limit has been reset.

== User Lookups

Part of the importing process involves looking up user details in the GitHub API so we can map them to GitLab users. The old importer used an in-memory cache, but this obviously doesn't work when the work is spread across different threads. The new importer uses a Redis cache and makes sure we only perform API/database calls if absolutely necessary. Frequently used keys are refreshed, and lookup misses are also cached; removing the need for performing API/database calls if we know we don't have the data we're looking for.

== Performance & Models

The new importer in various places uses raw INSERT statements (as generated by `Gitlab::Database.bulk_insert`) instead of using Rails models. This allows us to bypass any validations and callbacks, drastically reducing the number of SQL queries and Gitaly RPC calls necessary to import projects. To ensure the code produces valid data the corresponding tests check if the produced rows are valid according to the model validation rules.

2017-10-13 16:50:36 +00:00			`# Returns true if this importer is supposed to perform its work in the`
			`# background.`
			`#`
			``# This method will only return `true` if async importing is explicitly``
			``# supported by an importer class (`Gitlab::GithubImport::ParallelImporter` ``
			`# for example).`
			`def async?`
Prefer polymorphism over specific type checks in Import service 2017-11-15 12:27:37 +00:00			`has_importer? && !!importer_class.try(:async?)`
Rewrite the GitHub importer from scratch 2017-10-13 16:50:36 +00:00			`end`

Extract Projects::ImportService service from RepositoryImportWorker 2016-01-21 18:09:32 +00:00			`def execute`
Prefer polymorphism over specific type checks in Import service 2017-11-15 12:27:37 +00:00			`add_repository_to_project`
Extract Projects::ImportService service from RepositoryImportWorker 2016-01-21 18:09:32 +00:00
Support LFS objects when creating a project by import 2018-06-06 16:42:18 +00:00			`download_lfs_objects`

Extract Projects::ImportService service from RepositoryImportWorker 2016-01-21 18:09:32 +00:00			`import_data`

			`success`
Fix path disclosure on Project Import 2018-12-05 13:31:43 +00:00			`rescue Gitlab::UrlBlocker::BlockedUrlError => e`
			`Gitlab::Sentry.track_acceptable_exception(e, extra: { project_path: project.full_path, importer: project.import_type })`

Externalize strings detected by rubocop-i18n - Externalize strings in milestones_helper - Externalize strings in app/services - Update PO file 2019-04-15 12:25:48 +00:00			`error(s_("ImportProjects|Error importing repository %{project_safe_import_url} into %{project_full_path} - %{message}") % { project_safe_import_url: project.safe_import_url, project_full_path: project.full_path, message: e.message })`
Fix path disclosure on Project Import 2018-12-05 13:31:43 +00:00			`rescue => e`
			`message = Projects::ImportErrorFilter.filter_message(e.message)`

			`Gitlab::Sentry.track_acceptable_exception(e, extra: { project_path: project.full_path, importer: project.import_type })`

Externalize strings detected by rubocop-i18n - Externalize strings in milestones_helper - Externalize strings in app/services - Update PO file 2019-04-15 12:25:48 +00:00			`error(s_("ImportProjects|Error importing repository %{project_safe_import_url} into %{project_full_path} - %{message}") % { project_safe_import_url: project.safe_import_url, project_full_path: project.full_path, message: message })`
Extract Projects::ImportService service from RepositoryImportWorker 2016-01-21 18:09:32 +00:00			`end`

			`private`

adapted current services stuff to use new project import, plus fixes a few issues, updated routes, etc... 2016-06-14 18:32:19 +00:00			`def add_repository_to_project`
Prefer polymorphism over specific type checks in Import service 2017-11-15 12:27:37 +00:00			`if project.external_import? && !unknown_url?`
Raise more descriptive errors when URLs are blocked 2018-03-28 17:27:16 +00:00			`begin`
Add validation to webhook and service URLs to ensure they are not blocked because of SSRF 2018-06-01 11:43:53 +00:00			`Gitlab::UrlBlocker.validate!(project.import_url, ports: Project::VALID_IMPORT_PORTS)`
Raise more descriptive errors when URLs are blocked 2018-03-28 17:27:16 +00:00			`rescue Gitlab::UrlBlocker::BlockedUrlError => e`
Externalize strings detected by rubocop-i18n - Externalize strings in milestones_helper - Externalize strings in app/services - Update PO file 2019-04-15 12:25:48 +00:00			`raise e, s_("ImportProjects|Blocked import URL: %{message}") % { message: e.message }`
Raise more descriptive errors when URLs are blocked 2018-03-28 17:27:16 +00:00			`end`
Prefer polymorphism over specific type checks in Import service 2017-11-15 12:27:37 +00:00			`end`

			`# We should skip the repository for a GitHub import or GitLab project import,`
			`# because these importers fetch the project repositories for us.`
Support LFS objects when creating a project by import 2018-06-06 16:42:18 +00:00			`return if importer_imports_repository?`
Prefer polymorphism over specific type checks in Import service 2017-11-15 12:27:37 +00:00
adapted current services stuff to use new project import, plus fixes a few issues, updated routes, etc... 2016-06-14 18:32:19 +00:00			`if unknown_url?`
			`# In this case, we only want to import issues, not a repository.`
			`create_repository`
Check if repository already exists before trying to re-import it 2016-10-19 12:21:27 +00:00			`elsif !project.repository_exists?`
Refactoring Projects::ImportService 2017-04-03 18:48:09 +00:00			`import_repository`
adapted current services stuff to use new project import, plus fixes a few issues, updated routes, etc... 2016-06-14 18:32:19 +00:00			`end`
			`end`

Extract Projects::ImportService service from RepositoryImportWorker 2016-01-21 18:09:32 +00:00			`def create_repository`
			`unless project.create_repository`
Externalize strings detected by rubocop-i18n - Externalize strings in milestones_helper - Externalize strings in app/services - Update PO file 2019-04-15 12:25:48 +00:00			`raise Error, s_('ImportProjects|The repository could not be created.')`
Extract Projects::ImportService service from RepositoryImportWorker 2016-01-21 18:09:32 +00:00			`end`
			`end`

			`def import_repository`
			`begin`
Rename fetch_refs to refmap 2017-11-23 15:51:55 +00:00			`refmap = importer_class.try(:refmap) if has_importer?`
Clean up repository fetch and mirror methods 2017-11-15 15:46:08 +00:00
Rename fetch_refs to refmap 2017-11-23 15:51:55 +00:00			`if refmap`
Clean up repository fetch and mirror methods 2017-11-15 15:46:08 +00:00			`project.ensure_repository`
Rename fetch_refs to refmap 2017-11-23 15:51:55 +00:00			`project.repository.fetch_as_mirror(project.import_url, refmap: refmap)`
Refactoring Projects::ImportService 2017-04-03 18:48:09 +00:00			`else`
Refactor use of Shell.import_repository for Wikis The previous behavior would pass in a list of parameters to Shell, but we can improve this by using the WikiFormatter and Project models to give us the same information. 2019-01-17 08:35:40 +00:00			`gitlab_shell.import_project_repository(project)`
Refactoring Projects::ImportService 2017-04-03 18:48:09 +00:00			`end`
Migrate add_remote, remove_remote, fetch_internal_remote to gitaly 2018-07-19 18:40:36 +00:00			`rescue Gitlab::Shell::Error => e`
fix broken repo 500 errors in UI and added relevant specs 2016-09-23 07:42:07 +00:00			`# Expire cache to prevent scenarios such as:`
			`# 1. First import failed, but the repo was imported successfully, so +exists?+ returns true`
			`# 2. Retried import, repo is broken or not imported but +exists?+ still returns true`
Refactoring Projects::ImportService 2017-04-03 18:48:09 +00:00			`project.repository.expire_content_cache if project.repository_exists?`
fix broken repo 500 errors in UI and added relevant specs 2016-09-23 07:42:07 +00:00
Refactoring Projects::ImportService 2017-04-03 18:48:09 +00:00			`raise Error, e.message`
Extract Projects::ImportService service from RepositoryImportWorker 2016-01-21 18:09:32 +00:00			`end`
			`end`

Support LFS objects when creating a project by import 2018-06-06 16:42:18 +00:00			`def download_lfs_objects`
			`# In this case, we only want to import issues`
			`return if unknown_url?`

			`# If it has its own repository importer, it has to implement its own LFS import download`
			`return if importer_imports_repository?`

			`return unless project.lfs_enabled?`

Refactored LfsImportService and ImportService In order to make `LfsImportService` more reusable, we need to extract the logic inside `ImportService` and encapsulate it into the service. 2019-04-30 08:21:21 +00:00			`result = Projects::LfsPointers::LfsImportService.new(project).execute`
Support LFS objects when creating a project by import 2018-06-06 16:42:18 +00:00
Refactored LfsImportService and ImportService In order to make `LfsImportService` more reusable, we need to extract the logic inside `ImportService` and encapsulate it into the service. 2019-04-30 08:21:21 +00:00			`if result[:status] == :error`
			`# To avoid aborting the import process, we only log the failure`
			`# instead of raising an error.`
			`Gitlab::AppLogger.error("The Lfs import process failed. #{result[:message]}")`
Support LFS objects when creating a project by import 2018-06-06 16:42:18 +00:00			`end`
			`end`

Extract Projects::ImportService service from RepositoryImportWorker 2016-01-21 18:09:32 +00:00			`def import_data`
			`return unless has_importer?`

Refactoring Projects::ImportService 2017-04-03 18:48:09 +00:00			`project.repository.expire_content_cache unless project.gitlab_project_import?`
Flush repository cache before import project data GitHub Pull Requests importer handle with the repository while importing data, we need to make sure that the cached values are valid. 2016-04-04 22:35:39 +00:00
Extract Projects::ImportService service from RepositoryImportWorker 2016-01-21 18:09:32 +00:00			`unless importer.execute`
Externalize strings detected by rubocop-i18n - Externalize strings in milestones_helper - Externalize strings in app/services - Update PO file 2019-04-15 12:25:48 +00:00			`raise Error, s_('ImportProjects|The remote data could not be imported.')`
Extract Projects::ImportService service from RepositoryImportWorker 2016-01-21 18:09:32 +00:00			`end`
			`end`

Replace old GH importer with the parallel importer 2017-10-18 19:46:05 +00:00			`def importer_class`
Prefer polymorphism over specific type checks in Import service 2017-11-15 12:27:37 +00:00			`@importer_class ||= Gitlab::ImportSources.importer(project.import_type)`
Replace old GH importer with the parallel importer 2017-10-18 19:46:05 +00:00			`end`

Extract Projects::ImportService service from RepositoryImportWorker 2016-01-21 18:09:32 +00:00			`def has_importer?`
Improve Gitlab::ImportSources Signed-off-by: Rémy Coutable <remy@rymai.me> 2016-12-16 08:15:30 +00:00			`Gitlab::ImportSources.importer_names.include?(project.import_type)`
Extract Projects::ImportService service from RepositoryImportWorker 2016-01-21 18:09:32 +00:00			`end`

			`def importer`
Rewrite the GitHub importer from scratch 2017-10-13 16:50:36 +00:00			`importer_class.new(project)`
			`end`

Extract Projects::ImportService service from RepositoryImportWorker 2016-01-21 18:09:32 +00:00			`def unknown_url?`
			`project.import_url == Project::UNKNOWN_IMPORT_URL`
			`end`
Support LFS objects when creating a project by import 2018-06-06 16:42:18 +00:00
			`def importer_imports_repository?`
			`has_importer? && importer_class.try(:imports_repository?)`
			`end`
Extract Projects::ImportService service from RepositoryImportWorker 2016-01-21 18:09:32 +00:00			`end`
			`end`
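
The "User Lookups" note in the GitHub importer commit message says that lookup misses are cached as well as hits, so a user known to be absent is not fetched again. A minimal sketch of that idea follows, assuming an in-memory Hash in place of Redis; the class name, the `remote_calls` counter, and the sample directory are hypothetical and do not reflect the importer's real code.

```ruby
# In-memory stand-in for the Redis cache described in the commit message.
# The class and its methods are hypothetical, for illustration only.
class SketchLookupCache
  attr_reader :remote_calls

  def initialize
    @store = {}
    @remote_calls = 0
  end

  # Returns the cached value for +key+, calling the block only on the first
  # lookup. A nil result is stored too, so a known miss is never re-fetched.
  def fetch(key)
    return @store[key] if @store.key?(key)

    @remote_calls += 1
    @store[key] = yield(key)
  end
end

cache = SketchLookupCache.new
directory = { 'alice' => 42 } # pretend remote API: login => user ID

cache.fetch('alice') { |login| directory[login] }
cache.fetch('ghost') { |login| directory[login] } # miss, cached as nil
cache.fetch('ghost') { |login| directory[login] } # served from the cache
puts cache.remote_calls
```

Checking `key?` rather than the stored value is what distinguishes "never looked up" from "looked up and absent". A real Redis-backed cache cannot store `nil` directly and would need some sentinel value for the same purpose.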