How to remove binary large objects (BLOB) from git history and keep them available afterwards.
Disclaimer: This is not a set of step-by-step instructions covering every possible case, but rather an overview of the tools you can use to solve this problem. Consult the git documentation and related documentation if your case deviates from this example.
Public/Shared Repos
This is the more general approach, since the tooling listed for private repos is generally not suitable for keeping your repository-related data publicly available.
- Find out which binary files you have committed in your repo
- Command-line tools like ncdu help in finding large files.
- For the purpose of this example, let's say you found data/my_training_set.bin
- Back up the BLOB outside the repo
zip ../data.zip data/my_training_set.bin
- Check the BLOB's history
tig data/my_training_set.bin
- Remove the file from your history
- 4.1 In a small repository
git filter-branch --tree-filter 'rm -f data/my_training_set.bin'
- filter-branch performs poorly on larger numbers of commits
- 4.2 In a large repository
- Find the commit ID where data/my_training_set.bin was introduced, e.g. with tig data/my_training_set.bin
git rebase -i <commit-ID>~1
- e.g.
git rebase -i 272e7241548d564c3b13f15865cc5fb3c8058e82~1
- In the editor that opens, change pick to edit on the line of your commit ID, then save and close the editor.
git rm data/my_training_set.bin
git commit --amend
git rebase --continue
- Repeat 4.2 if you committed changes to the BLOB throughout your history.
- Distribute your new history
git push --force
- In case of an error while pushing, make sure your repository settings in GitLab allow you to force push.
- Distribute your data.zip
- The university offers a few options, albeit limited ones, for distributing large files publicly:
- faubox
- wwwcip.cs.fau.de/~<your_idm_username>, which points to ~/.www/ in your cip user home, if you have access to the cip pools. You can also add symlinks to /proj/ciptmp/<your_idm_username>/ for more space, but beware that there is no backup for the ciptmp.
- If neither of these is suitable for you, ask your affiliated chair to provide a platform to distribute binary files.
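
ncdu only inspects the files in your current checkout, so it cannot see BLOBs that were deleted in a later commit, and it cannot confirm that a rewrite really removed a file. As a complement to the steps above, here is a minimal sketch that lists the largest BLOBs reachable from any branch or tag. The function name largest_blobs is made up for this example; it relies only on git rev-list, git cat-file, awk, sort and head.

```shell
#!/bin/sh
# largest_blobs [N]: print the N largest blobs (default 10) reachable from
# any ref, one per line as "<size in bytes> <path>", biggest first.
# The function name is hypothetical; run it inside a git repository.
largest_blobs() {
    # rev-list prints "<object id> <path>" for every tree and blob in history;
    # cat-file looks up type and size and passes the path through as %(rest).
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
        awk '$1 == "blob" { sub(/^blob /, ""); print }' |
        sort -n -r |
        head -n "${1:-10}"
}
```

Calling largest_blobs 5 before removing the file helps to find candidates, and calling it again afterwards shows whether data/my_training_set.bin is really gone. Keep in mind that git filter-branch saves the old history under refs/original/, so the file keeps showing up in this listing until those backup refs are deleted.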
Private Repos
In addition to the steps for public repos, git annex provides useful functionality for managing binary data in a git repository. See annex documentation.
The webdav remote can be used to integrate faubox as your annex storage. See annex webdav documentation and faubox webdav documentation.
However, git annex would need a GitLab-like platform with annex support to be suitable for public repositories. FAU currently offers no such platform.
Further Thought
- Place all BLOBs in a ./bindata directory in your repository, add /bindata to your .gitignore and place symlinks like ln -s ../bindata/my_training_set.bin data/my_training_set.bin (the link target is resolved relative to the data directory) so you don't have to change your paths. To update your binary files, just extract a new data.zip into bindata.
- If you have previously committed and subsequently deleted BLOBs, they are still part of your git history, but won't show up in a current checkout. Make sure to also properly remove those BLOBs from your history.
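
The bindata layout from the first point can be sketched as a small helper. This is only an illustration under a few assumptions: the function name link_bindata is made up, it has to be run from the repository root, and paths must be plain relative paths such as data/my_training_set.bin (no leading ./ or ..).

```shell
#!/bin/sh
# link_bindata FILE...: for each FILE (relative to the repository root),
# create a symlink at that path pointing to bindata/<basename of FILE>.
# Hypothetical helper; run it from the repository root.
link_bindata() {
    mkdir -p bindata
    # Add /bindata to .gitignore only if that exact line is not there yet.
    grep -q -x -F '/bindata' .gitignore 2>/dev/null || echo '/bindata' >> .gitignore
    for path in "$@"; do
        if [ -e "$path" ] || [ -L "$path" ]; then
            echo "skipping $path: already exists" >&2
            continue
        fi
        dir=$(dirname "$path")
        mkdir -p "$dir"
        # A relative link target is resolved from the directory that holds
        # the link, so climb back up to the repository root first.
        case "$dir" in
            .) up=. ;;
            *) up=$(printf '%s' "$dir" | sed 's|[^/][^/]*|..|g') ;;
        esac
        ln -s "$up/bindata/$(basename "$path")" "$path"
    done
}
```

For example, link_bindata data/my_training_set.bin creates data/my_training_set.bin as a link to ../bindata/my_training_set.bin. The link can be committed, while the content of bindata stays out of the repository.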