Just as git does not scale well with large files, it can also become painful to work with when you have a large number of files. Below are things I have found to minimise the pain.
Using version 4 index files
During operations which affect the index, git writes an entirely new index out to .git/index.lock and then renames it over .git/index. With a large number of files, this index file can be quite large and take several seconds to write every time you manipulate the index!
This can be mitigated by switching the index to version 4, which uses path compression to reduce the file size:
git update-index --index-version 4
NOTE: The git documentation warns that this version may not be supported by other git implementations like JGit and libgit2.
Personally, I saw a reduction from 516MB to 206MB (40% of original size) and got a much more responsive git!
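If you want to check what you have, the index version is stored right after the 4-byte DIRC signature at the start of the file, and the size is easy to compare (a quick sketch, assuming xxd is installed):
ls -lh .git/index              # compare the size before and after the conversion
xxd -s 4 -l 4 .git/index       # 0000 0004 here means a version 4 index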
It may also be worth doing the same to git-annex's index:
GIT_INDEX_FILE=.git/annex/index git update-index --index-version 4
Though I didn't gain as much here: 89MB down to 86MB (96% of original size).
Packing
As I have gc disabled:
git config gc.auto 0
so that I control when it is run, I ended up with a lot of loose objects, which also slow git down. Using
git count-objects
to tell me how many loose objects I have; when I reach a threshold (~25000), I pack those loose objects and clean things up:
git repack -d
git gc
git prune
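If you want to automate the check, something like this can run from cron or a hook (a rough sketch; 25000 is just the threshold I use):
#!/bin/sh
# repack and clean up only once the loose-object count gets large
loose=$(git count-objects -v | awk '/^count:/ {print $2}')
if [ "$loose" -gt 25000 ]; then
    git repack -d
    git gc
    git prune
fi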
File count per directory
If it takes a long time to list the files in a directory, naturally, git(-annex) will be affected by this bottleneck.
You can avoid this by keeping the number of files in a directory to between 5000 and 20000 (depending on the filesystem and its settings).
fpart can be a very useful tool to achieve this.
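For example, to split a flat pile of files into batches of at most 10000 before adding them, something roughly like this works (a sketch; ./incoming and the batch names are placeholders, and the xargs/mv flags are the GNU variants):
fpart -f 10000 -o /tmp/chunk ./incoming     # writes file lists /tmp/chunk.0, /tmp/chunk.1, ...
i=0
for list in /tmp/chunk.*; do
    mkdir -p "batch$i"
    xargs -d '\n' -a "$list" mv -t "batch$i/"
    i=$((i+1))
done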
This sort of usage was discussed in Handling a large number of files and "git annex sync" synced after 8 hours. -- CandyAngel
Forget tracking information
In addition to keeping track of where files are, git-annex keeps a log of where files used to be. This log also takes up space and can slow down certain operations.
You can use the git-annex-forget command to drop historical location tracking info for files.
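For example, something like this should prune the historical location logs (and, with --drop-dead, references to repositories that have been marked dead):
git annex forget --drop-dead --force
Other clones should pick up the rewritten git-annex branch the next time they sync.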
Note: this was discussed in scalability with lots of files. -- anarcat
Splitting the index (git update-index --split-index) doesn't work for me at all. While it may initially reduce the size of .git/index, making a commit inflates it back to its original size anyway.
I thought it might be some interaction with the v4 index and its compression mechanics, but it does the same if I set it to a v3 index. For (manufactured) example:
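Roughly this sequence, with the actual sizes omitted:
git update-index --split-index
ls -la .git/index .git/sharedindex.*   # index shrinks, the sharedindex file holds the bulk
touch newfile && git add newfile
git commit -m test
ls -la .git/index .git/sharedindex.*   # index is back to roughly its original size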
@umeboshi: Odd that you report your machine freezes during commits... I find the exact opposite: waiting for a long time with no load at all.
My current setup for my sorting annex (where I import all the files off old HDDs) is to have the HDD plugged into my home server (Atom 330 @ 1.6GHz) and import the files into a cloned (empty) annex. Doing so for 1.1M files (latest HDD) is a long wait, because 80% of the time is spent waiting for something to happen (with no load on the machine). Once that is done, the HDD is transferred to my desktop, where the annex is "joined" to the others and the files are sorted in a dedicated VM[1], where commit times are reasonable.
[1] Fully virtualising my desktop is possibly the best thing I've ever done, in terms of setup. Locking up any VM affects none of the others (which is handy, as I discovered an issue that causes X to almost hardlock whenever libvo is used..).
I have been playing with tracking a large number of URLs for about a month now. Having already been disappointed by how git performs when there is a very large number of files in the annex, I tested making multiple annexes. I found that splitting the URLs into multiple annexes increased performance, but at the cost of extra housekeeping, duplicated URLs, and more work needed to keep track of the URLs. Part of the duplication and tracking problem was mitigated by using a dumb remote, such as rsync or directory, where a very large number of objects can be stored. The dumb remotes perform very well; however, each annex needed to be synced regularly with the dumb remote.
I found the dumb remote to be great for multiple annexes. I have noticed that a person can create a new annex, extract a tarball of symlinks into the repo, then
git commit
the links. Subsequently, executing
git-annex fsck --from dummy
would set up the tracking info, which was pretty useful. However, I found that by the time I got to over fifty annexes, the overall performance was far worse than just storing the URLs and file paths in a PostgreSQL database. In fact, the URLs are already being stored and accessed from such a database, but I had the desire to access the URLs from multiple machines, which is a bit more difficult with a centralized database.
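Spelled out, that per-annex setup is roughly the following (a sketch; the paths and the remote name "dummy" are placeholders, and an rsync special remote works the same way as the directory type):
git init new-annex && cd new-annex
git annex init
git annex initremote dummy type=directory directory=/srv/annex-objects encryption=none
tar xf /path/to/symlinks.tar           # the tree of annexed symlinks
git add . && git commit -m 'import symlink tree'
git annex fsck --from dummy            # records which objects dummy already has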
After reading the tips and pages discussing splitting the files into multiple directories and changing the index version, I decided to try a single annex to hold the URLs. Over the New Year's weekend, I wrote a script that generates RSS files to use with importfeed to add URLs to this annex. I have noticed that when using
git commit
the load average of the host was in the mid twenties and persisted for hours, until I had to kill the processes to be able to use the machine again (I would really like to know if there is a git config setting that would keep the load down, so the machine can be used during a commit). I gave up on
git-annex sync
this morning, since it was taking longer than I was able to sit in the coffee shop and wait for it (~3 hrs). I came back to the office and started
git gc
which has been running for ~1hr.
When making the larger annex, I decided to use the hexadecimal value of uuid5(url) for each filename, and to use the two high nybbles and the two low nybbles for a two-stage directory structure, following the advice from CandyAngel. When my URLs are organized in this manner, I still need access to the database to perform the appropriate
git-annex get
which impairs the use of this strategy, but I'm still testing and playing around. I suspended adding URLs to this annex until I get at least one sync performed.
The URL annex itself is not very big, and I am guessing the average file size to be close to 500K. The large number of URLs seems to be a problem I have yet to solve. I wanted to experiment with this to further the idea of the public git-annex repositories, which seem to be a useful idea, even though the utility of this idea is very limited at the moment.
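For what it's worth, the path scheme described above comes down to something like this (a bash sketch; the URL is just an example):
url='http://example.com/some/file.pdf'
id=$(python3 -c 'import sys, uuid; print(uuid.uuid5(uuid.NAMESPACE_URL, sys.argv[1]).hex)' "$url")
path="${id:0:2}/${id: -2}/$id"         # two high nybbles / two low nybbles / full hex name
echo "$path"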
If writing the index file becomes the bottleneck, turning on split index mode might help as well. See git-update-index's man page.
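In recent enough git this can also be enabled via config, with the same caveat as the comment above about it not always helping:
git config core.splitIndex true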