Dolt is for tabular data. It's like SQLite but with branching and versioning at the DB level. DVC is file-based: it saves large files, directories, etc. to one of the supported storage backends - S3, GCP, Azure, etc. It's more like git-lfs in that sense.
Another difference is that for DVC, data versioning itself is (perhaps surprisingly) just one of the fundamental layers needed to provide holistic ML experiment tracking and versioning. So DVC also has a layer to describe an ML project, run it, and capture and version its inputs/outputs. In that sense DVC becomes a more opinionated / higher-level tool, if that makes sense.
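To make that concrete, here's a rough sketch of what that project-description layer looks like from the CLI (bucket name, script, and file paths are invented for illustration):

    dvc init
    dvc remote add -d storage s3://my-bucket/dvc-cache   # where versioned data lives
    dvc add data/raw.csv                                  # version a dataset
    dvc stage add -n train -d train.py -d data/raw.csv -o model.pkl python train.py
    dvc repro                                             # run the pipeline and capture outputs
    dvc push                                              # upload data and model to the remote

The dvc.yaml/dvc.lock files it writes are what get committed to git, so the repo records which code, data, and outputs belong together.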
* if you forget and do `git add` instead of `git annex add`, everything is fine, but you've now spoilt the nice thing that git annex does of de-duping files (git annex only stores one copy of identical files - see the sketch after this list)
* for our use case (which I'm sure is the wrong way of doing things) it's possible to overwrite the single copy of a file that git annex stores, which rather spoils the point of the thing. I do think it's down to the way we use it, though, so not specifically a git annex problem
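To illustrate the de-duping point in the first bullet: identical content ends up as a single object under .git/annex/objects (file names here are made up):

    cp big.bin copy-of-big.bin
    git annex add big.bin copy-of-big.bin   # both become symlinks keyed by content hash
    readlink big.bin copy-of-big.bin        # same target: one object stored once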
The _great_ thing about git annex is it can be self-hosted. For various reasons we can't put our source data in one of the systems that uses git-lfs.
We've got about 800 GB of data in git annex and I've been happy with it despite the limitations.
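In case it's useful to anyone, self-hosting can be as simple as an ordinary ssh git remote plus an rsync special remote for the file contents (host and path below are placeholders):

    git annex initremote nas type=rsync rsyncurl=user@nas:/srv/annex encryption=none
    git annex copy --to nas   # push the large file contents to our own box
    git annex sync            # sync the git side with your normal git remotes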
If you configure annex.largefiles, git add should work with the annex. I start with something like
git annex config --set annex.largefiles 'largerthan=1kb and not (mimeencoding=us-ascii or mimeencoding=utf-8)'
> By default, git-annex add adds all files to the annex (except dotfiles), and git add adds files to git (unless they were added to the annex previously). When annex.largefiles is configured, both git annex add and git add will add matching large files to the annex, and the other files to git. —https://git-annex.branchable.com/git-annex/
Note that git add will add large files unlocked, though, since (as far as I understand) it's assumed, for safety, that you're still modifying them:
> If you use git add to add a file to the annex, it will be added in unlocked form from the beginning. This allows workflows where a file starts out unlocked, is modified as necessary, and is locked once it reaches its final version. —https://git-annex.branchable.com/git-annex-unlock/
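So, with annex.largefiles set as above, a plausible workflow (file names are just examples) is:

    git add notes.txt scan.tiff   # notes.txt (small text) goes to git, scan.tiff to the annex, unlocked
    git annex lock scan.tiff      # swap the unlocked copy for a read-only symlink
    git commit -m "add scan"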
Yes, it definitely serves a valid use case; I feel like someone should try to bring some competition there. A modern equivalent with fewer gotchas, maybe in Rust/Go, maybe using a FUSE mount and content-defined chunking (borg/restic/...-style), would be amazing.
I'd love to see a well-supported git-lfs-compatible client/proxy (so you could more easily move backends) that could run on top of S3/object storage. Yes, and written in a modern language like golang/rust for performance / parallelism. There are some node.js and various other git-lfs proxies out there, but none well enough maintained that I could count on them being around and working in another 5 years. git-annex at least has been around for a while, even though it has its issues.
Huggingface uses git-lfs for large datasets with good success. git-lfs on GitHub gets very pricey at higher volumes of data. I'd love the affordability of object storage, just with a better git blob storage interface that will be around in the future.
Most of these systems do their own hash calculations and are not interchangeable with each other. I feel like git-lfs has the momentum in data science at the moment, but it needs some better options for people who want a low-cost storage option that they can control.
Huggingface is great, but it's one more service to onboard if you're in an enterprise. And data privacy/retention/governance means that many people would like their data to reside on their own infrastructure.
If AWS were to give us a low cost git-lfs hosted service on top of S3 it would be very popular.
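For what it's worth, the client side of moving backends already exists: git-lfs can be pointed at any server that speaks the LFS API via lfs.url (the endpoint below is a placeholder); what's missing is a trustworthy, low-cost server to point it at:

    git config -f .lfsconfig lfs.url https://lfs.example.com/my-org/my-repo
    git add .lfsconfig   # commit it so clones use the same endpoint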
If anyone knows of some good alternatives, please let us know!
I work with a lot of uncompressed structured binary files so I finally broke down and wrote my own system based on the Restic chunker: https://github.com/akbarnes/dupver
It's pretty basic, but it works for me and will hopefully inspire someone to make a "real" data VCS based on content-defined chunking.
It lives in this weird wiki that seems to be read-only most of the time. I don't think it's alive. Its use of hard links also causes too many problems, of the silent corruption variety.
Ikiwiki’s definitely a bit weird, but I’ve been experimenting with git-annex recently and it worked fine every time I commented. Seems like it’s chugging along: https://git-annex.branchable.com/recentchanges/
When does it use hard links? As far as I remember it used symlinks unless you used something like annex.hardlink (described in the man page: https://git-annex.branchable.com/git-annex/)
Symlinks are just as problematic honestly: an app writing to one will change the object in the persistent "immutable" storage. The way the "check out" feature works is also weird, causing a change in the shared version history.
> Symlinks are just as problematic honestly: an app writing to one will change the object in the persistent "immutable" storage.
Well, anything stored by git-annex has read-only file permissions. Apps will follow the symlink, yes, but they will fail to write to the location if they try.
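A quick illustration of that (path and hash abbreviated; exact output will differ):

    git annex add data.bin
    ls -l data.bin           # data.bin -> .git/annex/objects/.../SHA256E-s...--....bin
    echo junk >> data.bin    # refused with "Permission denied": the annexed object is read-only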
> The way the "check out" feature works is also weird, causing a change in the shared version history.
Unlocking a file changes it from a symlink to a git-annex pointer file from git's perspective (git-annex accomplishes this via git's smudge filter interface), but you don't have to commit the unlock. You can unlock, modify locally, re-lock, and commit the new changed version in one go. It's also nice that you can commit the unlocking action itself if you want a file to be unlocked in all clones of the repository, so whether to commit the unlock just depends on your use case.
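The unlock-without-committing-it flow might look roughly like this (file name invented; re-adding before locking stores the new content in the annex):

    git annex unlock model.bin    # the symlink becomes a regular, writable file
    # ...modify model.bin in place...
    git annex add model.bin       # annex the new version of the content
    git annex lock model.bin      # back to a read-only symlink
    git commit -m "update model"  # one commit; the unlock itself never lands in history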
[https://www.datalad.org/]