18 Commits

Author SHA1 Message Date
John Cai 6c35fb59b7 Add GitDeduplicationService for deduplication housekeeping
GitDeduplicationService performs idempotent operations on deduplicated
projects.
2019-05-21 13:34:31 -07:00
Jan Provaznik d25239ee0b Use git_garbage_collect_worker to run pack_refs
PackRefs is not an expensive Gitaly call - we want to
call it more often (than as part of full `gc`) because
it helps to keep the number of refs files small - too many
refs files may be a problem for deployments with
slow storage.
2019-05-02 21:41:05 +00:00
Zeger-Jan van de Weg e03602e09d
Ensure pool participants are linked before GC
In theory the case could happen that the initial linking of the pool
fails and so do all the retries that Sidekiq performs. This could lead
to data loss.

To prevent that case, linking is done before Git's GC too. This makes
sure that case doesn't happen.
2019-01-14 16:09:47 +01:00
Zeger-Jan van de Weg 896c0bdbfb
Allow public forks to be deduplicated
When a project is forked, the new repository used to be a deep copy of everything
stored on disk by leveraging `git clone`. This works well, and makes isolation
between repositories easy. However, the clone is at the start 100% the same as the
origin repository. And in the case of the objects in the object directory, this
is almost always going to be a lot of duplication.

Object Pools are a way to create a third repository that essentially only exists
for its 'objects' subdirectory. This third repository's object directory will be
set as alternate location for objects. This means that in the case an object is
missing in the local repository, git will look in another location. This other
location is the object pool repository.

When Git performs garbage collection, it's smart enough to check the
alternate location. When objects are duplicated, it will allow git to
throw one copy away. The copy removed is the one in the local repository, while the pool
remains as is.
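
The lookup and cleanup described above can be sketched as follows. This is a
toy model, not how Git or Gitaly implement it: it handles loose objects only,
assumes absolute paths in `objects/info/alternates`, and the function names
(`find_object`, `drop_local_duplicates`) are hypothetical.

```python
import os


def object_path(objects_dir, oid):
    # Loose objects live at <objects>/<first 2 hex chars>/<remaining chars>.
    return os.path.join(objects_dir, oid[:2], oid[2:])


def read_alternates(objects_dir):
    # Each non-empty line of objects/info/alternates names another objects dir.
    path = os.path.join(objects_dir, "info", "alternates")
    if not os.path.exists(path):
        return []
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


def find_object(objects_dir, oid):
    # Look in the local object directory first, then in each alternate.
    local = object_path(objects_dir, oid)
    if os.path.exists(local):
        return local
    for alt in read_alternates(objects_dir):
        candidate = object_path(alt, oid)
        if os.path.exists(candidate):
            return candidate
    return None


def drop_local_duplicates(objects_dir):
    # Toy stand-in for the GC step: remove local loose objects that an
    # alternate (the pool) already holds. The pool itself is never touched.
    removed = []
    alternates = read_alternates(objects_dir)
    for prefix in sorted(os.listdir(objects_dir)):
        prefix_dir = os.path.join(objects_dir, prefix)
        if len(prefix) != 2 or not os.path.isdir(prefix_dir):
            continue  # skip "info", "pack" and anything else
        for rest in sorted(os.listdir(prefix_dir)):
            oid = prefix + rest
            if any(os.path.exists(object_path(alt, oid)) for alt in alternates):
                os.remove(os.path.join(prefix_dir, rest))
                removed.append(oid)
    return removed
```

After `drop_local_duplicates` runs, `find_object` still resolves the same
object IDs, only now through the pool's object directory.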

These pools have an origin location, which for now will always be a
repository that itself is not a fork. When the root of a fork network is
forked by a user, the fork still clones the full repository. The pool
repository is then created asynchronously.

Either one of these processes can be done earlier than the other. To
handle this race condition, the Join ObjectPool operation is
idempotent. Given it's idempotent, we can schedule it twice, with the
same effect.
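
A minimal sketch of what idempotent means here, assuming the join amounts to
recording the pool in the repository's `objects/info/alternates` file. The
function name `link_to_pool` is hypothetical and this is not the Gitaly
implementation.

```python
import os


def link_to_pool(repo_objects_dir, pool_objects_dir):
    """Record the pool as an alternate; return True only if the file changed."""
    info_dir = os.path.join(repo_objects_dir, "info")
    os.makedirs(info_dir, exist_ok=True)
    alternates = os.path.join(info_dir, "alternates")

    existing = []
    if os.path.exists(alternates):
        with open(alternates) as f:
            existing = [line.strip() for line in f if line.strip()]

    # Already linked: a second (or third) run leaves the file untouched.
    if pool_objects_dir in existing:
        return False

    with open(alternates, "w") as f:
        f.write("\n".join(existing + [pool_objects_dir]) + "\n")
    return True
```

Calling `link_to_pool` twice with the same arguments leaves a single entry in
the file, so it does not matter which of the two processes schedules it first.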

To accommodate the holding of state, two migrations have been added.
1. Added a state column to the pool_repositories table. This column is
managed by the state machine, allowing for hooks on transitions.
2. pool_repositories now has a source_project_id. This column is
convenient to have for multiple reasons: it has a unique index allowing
the database to handle race conditions when creating a new record. Also,
it's nice to know who the host is, as that's a short link to the fork
network's root.
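
How a unique index lets the database settle the creation race can be sketched
with SQLite. The table and column names come from the list above; the index
name, the `'none'` state value, and the helper names are placeholders chosen
for illustration, not the actual migration.

```python
import sqlite3


def open_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pool_repositories ("
        "id INTEGER PRIMARY KEY, source_project_id INTEGER, state TEXT)"
    )
    # The unique index is what turns a duplicate insert into an error.
    conn.execute(
        "CREATE UNIQUE INDEX idx_pool_source_project "
        "ON pool_repositories (source_project_id)"
    )
    return conn


def find_or_create_pool(conn, source_project_id):
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO pool_repositories (source_project_id, state) "
                "VALUES (?, 'none')",  # 'none' is a placeholder state
                (source_project_id,),
            )
    except sqlite3.IntegrityError:
        # Another caller created the record first; fall through and read it.
        pass
    row = conn.execute(
        "SELECT id FROM pool_repositories WHERE source_project_id = ?",
        (source_project_id,),
    ).fetchone()
    return row[0]
```

Both callers end up with the same record ID, and the table never holds two
pools for one source project.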

Object pools are only available for public projects that use hashed
storage, and only when forking from the root of the fork network. (That is,
the project being forked from isn't itself a fork.)

In this commit message I use both ObjectPool and Pool repositories,
which are alike, but different from each other. ObjectPool refers to
whatever is on the disk stored and managed by Gitaly. PoolRepository is
the record in the database.
2018-12-07 19:18:37 +01:00
Stan Hu 0c1eebe24c Fix ArgumentError in GitGarbageCollectWorker Sidekiq job
When the Gitaly call failed, the exception handling failed
because `method` is expected to have a parameter.

Closes #49096
2018-07-10 15:11:10 -07:00
gfyoung dfbe5ce435 Enable frozen string literals for app/workers/*.rb 2018-06-27 07:23:28 +00:00
Zeger-Jan van de Weg 0e2577229d
Move GC RPCs to mandatory
Closes https://gitlab.com/gitlab-org/gitaly/issues/354
2018-06-13 16:36:43 +02:00
Kim Carlbäcker cc9468e4fa Move GC/Repack to OptOut 2018-06-06 14:28:03 +00:00
Jacob Vosmaer (GitLab) c43e18fc49 Remove some easy cases of 'path_to_repo' use 2018-03-28 09:21:32 +00:00
Stan Hu 885998c220 Release libgit2 cache and open file descriptors after `git gc` run
Relates to #21879
2018-03-03 22:21:50 -08:00
Mario de la Ossa eaada9d706 use Gitlab::UserSettings directly as a singleton instead of including/extending it 2018-02-02 18:39:55 +00:00
Douwe Maan 0b15570e49 Add ApplicationWorker and make every worker include it 2017-12-05 11:59:39 +01:00
Tiago Botelho 39298575a8 Adds exclusive lease to Git garbage collect worker. 2017-09-07 18:52:04 +01:00
Kim "BKC" Carlbäcker 05f90b861f Migrate GitGarbageCollectWorker to Gitaly 2017-07-28 17:49:22 +02:00
Jacob Vosmaer 6bcc52a536 Refine Git garbage collection 2016-11-04 14:30:11 +01:00
Yorick Peterse 97731760d7
Re-organize queues to use for Sidekiq
Dumping too many jobs in the same queue (e.g. the "default" queue) is a
dangerous setup. Jobs that take a long time to process can effectively
block any other work from being performed given there are enough of
these jobs.

Furthermore it becomes harder to monitor the jobs as a single queue
could contain jobs for different workers. In such a setup the only
reliable way of getting counts per job is to iterate over all jobs in a
queue, which is a rather time consuming process.
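
The monitoring cost can be illustrated with a small in-memory model. This is
only a sketch and does not use Sidekiq or Redis; the job layout and the
`PipelineWorker` name are made up for the example.

```python
from collections import Counter, deque


def counts_in_shared_queue(queue):
    # One shared queue: every job has to be inspected to count per worker.
    return Counter(job["worker"] for job in queue)


def counts_in_separate_queues(queues):
    # One queue per worker: the count is simply each queue's length.
    return {name: len(jobs) for name, jobs in queues.items()}


shared = deque([
    {"worker": "GitGarbageCollectWorker", "args": [1]},
    {"worker": "PipelineWorker", "args": [7]},
    {"worker": "GitGarbageCollectWorker", "args": [2]},
])

separate = {
    "GitGarbageCollectWorker": deque([[1], [2]]),
    "PipelineWorker": deque([[7]]),
}
```

Both functions report the same counts, but the first one walks the whole
queue while the second never looks at an individual job.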

By using separate queues for various workers we have better control over
throughput, we can add weight to queues, and we can monitor queues
better. Some workers still use the same queue whenever their work is
related. For example, the various CI pipeline workers use the same
"pipeline" queue.

This commit includes a Rails migration that moves Sidekiq jobs from the
old queues to the new ones. This migration also takes care of doing the
inverse if ever needed. This does require downtime as otherwise new jobs
could be scheduled in the old queues after this migration completes.

This commit also includes an RSpec test that blacklists the use of the
"default" queue and ensures cron workers use the "cronjob" queue.

Fixes gitlab-org/gitlab-ce#23370
2016-10-21 18:17:07 +02:00
Stan Hu 0d4b1bb752 Refresh branch cache after `git gc`
Possible workaround for #15392
2016-07-13 06:49:58 -07:00
Stan Hu 3dc6bf2b71 Expire the branch cache after `git gc` runs
Due to a stale NFS cache, it's possible that a branch lookup fails
while `git gc` is running and causes missing branches in merge requests.

Possible workaround for #15392
2016-07-12 05:42:19 -07:00