Note: this is the reverse of migrating two separate disconnected directories to git annex.
I have a git annex repo for all my media that has grown to 57866 files and git operations are getting slow, especially on external spinning hard drives, so I decided to split it into separate repositories.
This is how I did it, with some help from #git-annex. Suppose the old big repo is at ~/oldrepo:
# Create a new repo for photos only
mkdir ~/photos
cd ~/photos
git init
git annex init laptop
# Hardlink all the annexed data from the old repo
cp -rl ~/oldrepo/.git/annex/objects .git/annex/
# Regenerate the git annex metadata
git annex fsck --fast
# Also split the repo on the usb key
cd /media/usbkey
git clone ~/photos
cd photos
git annex init usbkey
cp -rl ../oldrepo/.git/annex/objects .git/annex/
git annex fsck --fast
# Connect the annexes as remotes of each other
git remote add laptop ~/photos
cd ~/photos
git remote add usbkey /media/usbkey/photos
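The cp -rl steps only save space if they really produce hard links, which requires the old and new repositories to sit on the same filesystem. A minimal sketch of a check, assuming a hypothetical helper named is_hardlinked and throwaway files rather than a real annex:

```shell
# Hypothetical helper: succeed only if both paths refer to the same inode,
# i.e. they are hard links to one file rather than separate copies.
is_hardlinked() {
    [ "$1" -ef "$2" ]
}

# Demonstration on throwaway files (not a real annex).
tmp=$(mktemp -d)
echo data > "$tmp/original"
ln "$tmp/original" "$tmp/linked"    # hard link, as cp -l would create
cp "$tmp/original" "$tmp/copied"    # independent copy

is_hardlinked "$tmp/original" "$tmp/linked" && echo "linked: shares an inode"
is_hardlinked "$tmp/original" "$tmp/copied" || echo "copied: separate inode"
rm -rf "$tmp"
```

In the real setup the two arguments would be one object file under ~/oldrepo/.git/annex/objects and its counterpart under ~/photos/.git/annex/objects.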
At this point, I went through all repos doing standard cleanup:
# Remove unneeded hard links
git annex unused
git annex dropunused --force 1-12345
# Sync
git annex sync
To make sure nothing is missing, I used git annex find --not --in=here to see if, for example, the usbkey that should have everything could be missing something.
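With tens of thousands of files, the raw output of that check can be long. Here is a small sketch that turns it into a count and an exit status. The function name report_missing is hypothetical, and it only assumes that the command prints one path per line:

```shell
# Hypothetical helper: read paths from stdin, list them, print a total,
# and return non-zero if at least one path was read.
report_missing() {
    count=0
    while IFS= read -r path; do
        [ -n "$path" ] || continue          # skip blank lines
        printf 'missing: %s\n' "$path"
        count=$((count + 1))
    done
    printf '%d file(s) missing\n' "$count"
    [ "$count" -eq 0 ]
}

# Intended use, run inside the repository being checked:
#   git annex find --not --in=here | report_missing
```

The non-zero exit status makes it easy to use the check in a script that walks over every repository in turn.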
Update: Antoine Beaupré pointed me to this tip about Repositories with large number of files which I will try next time one of my repositories grows enough to hit a performance issue.
This document was originally written by Enrico Zini and added to this wiki by anarcat.
Indeed, it would be nice if there were an easy way to split a git-annex repository into smaller ones, with those smaller ones also obtaining all the git-annex branch availability/metadata information about the files they inherit. The situation comes up quite frequently whenever one wants to modularize bigger repositories. The simplest use case is to turn a specific subdirectory into a git/git-annex submodule. Is there a way or recipe to easily accomplish this while also moving all the git-annex branch metadata? The original repository should also get those files removed from its git tree.
One possible way we see is to clone the original repository, remove all other files, move the subdirectory's files "up" the needed number of directories, then rewrite git history to forget them and use git annex forget. But that wouldn't "forget" information about the files which are not in the current tree, so it would also require some manual trimming of the git-annex branch before git annex forget. But maybe there is a better way?
This is a simple way to split a repository, but the resulting split git repository will be larger than is really necessary.
When you dropunused all the hard links that are not present in the repository, git-annex will commit a log to the git-annex branch saying "I don't have this content" for each of them. That seems unnecessary, since it probably does not have an earlier log saying it contained the content that was hard linked into it. Perhaps git-annex could be improved to not record that unnecessarily, but that's what it does currently.

So I suggest running git annex forget after the dropunused or at some later point. That will delete all traces of those log files from the git-annex branch.
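Putting the cleanup steps and this suggestion together, the sequence might look like the sketch below. The names run and cleanup_split_repo are hypothetical, and the script only prints the commands unless DRY_RUN=0 is set. Two assumptions that go beyond the text above: the dropunused argument defaults to the special value all (pass a range such as 1-12345 instead if preferred), and forget is given --force, which as far as I know git-annex wants before it actually rewrites the branch. Check the man pages before relying on either.

```shell
# Print the command when DRY_RUN is unset or 1; execute it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper: cleanup for one freshly split repository.
# $1 is the dropunused selection (assumed default: "all").
cleanup_split_repo() {
    selection=${1:-all}
    run git annex unused                          # number the unused objects
    run git annex dropunused --force "$selection" # drop the extra hard links
    run git annex forget --force                  # prune the resulting logs
    run git annex sync                            # publish the rewritten branch
}

# Dry run: shows the four commands without touching any repository.
cleanup_split_repo 1-12345
```

Running forget before sync means the trimmed git-annex branch is what gets pushed to the other remotes.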