Dolt is for tabular data. It's like SQLite but with branching and versioning at the DB level. DVC is file-based: it saves large files, directories, etc. to one of the supported storage backends - S3, GCP, Azure, etc. It's more like git-lfs in that sense.
Another difference is that for DVC, data versioning itself is (perhaps surprisingly) just one of the fundamental layers needed to provide holistic ML experiment tracking and versioning. So DVC also has a layer to describe an ML project, run it, and capture and version its inputs/outputs. In that sense DVC becomes a more opinionated / higher-level tool, if that makes sense.
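To make that concrete, here's a rough sketch of what that project-description layer looks like from the CLI (bucket name, script, and file paths are invented for illustration):

    dvc init
    dvc remote add -d storage s3://my-bucket/dvc-cache   # where versioned data lives
    dvc add data/raw.csv                                  # version a dataset
    dvc stage add -n train -d train.py -d data/raw.csv -o model.pkl python train.py
    dvc repro                                             # run the pipeline and capture outputs
    dvc push                                              # upload data and model to the remote

The dvc.yaml/dvc.lock files it writes are what get committed to git, so the repo records which code, data, and outputs belong together.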
* if you forget and do `git add` instead of `git annex add`, everything is fine, but you've now spoilt the nice thing that git annex does of de-duping files (git annex only stores one copy of identical files - see the sketch after this list)
* for our use case (which I'm sure is the wrong way of doing things) it's possible to overwrite the single copy of a file that git annex stores, which rather spoils the point of the thing. I do think it's down to the way we use it, though, so not specifically a git annex problem
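To illustrate the de-duping point in the first bullet: identical content ends up as a single object under .git/annex/objects (file names here are made up):

    cp big.bin copy-of-big.bin
    git annex add big.bin copy-of-big.bin   # both become symlinks keyed by content hash
    readlink big.bin copy-of-big.bin        # same target: one object stored once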
The _great_ thing about git annex is it can be self-hosted. For various reasons we can't put our source data in one of the systems that uses git-lfs.
We've got about 800 GB of data in git annex and I've been happy with it despite the limitations.
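In case it's useful to anyone, self-hosting can be as simple as an ordinary ssh git remote plus an rsync special remote for the file contents (host and path below are placeholders):

    git annex initremote nas type=rsync rsyncurl=user@nas:/srv/annex encryption=none
    git annex copy --to nas   # push the large file contents to our own box
    git annex sync            # sync the git side with your normal git remotes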
If you configure annex.largefiles, git add should work with the annex. I start with something like
git annex config --set annex.largefiles 'largerthan=1kb and not (mimeencoding=us-ascii or mimeencoding=utf-8)'
> By default, git-annex add adds all files to the annex (except dotfiles), and git add adds files to git (unless they were added to the annex previously). When annex.largefiles is configured, both git annex add and git add will add matching large files to the annex, and the other files to git. —https://git-annex.branchable.com/git-annex/
Note that git add will add large files unlocked, though, since (as far as I understand) it's assumed, for safety, that you're still modifying them:
> If you use git add to add a file to the annex, it will be added in unlocked form from the beginning. This allows workflows where a file starts out unlocked, is modified as necessary, and is locked once it reaches its final version. —https://git-annex.branchable.com/git-annex-unlock/
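So, with annex.largefiles set as above, a plausible workflow (file names are just examples) is:

    git add notes.txt scan.tiff   # notes.txt (small text) goes to git, scan.tiff to the annex, unlocked
    git annex lock scan.tiff      # swap the unlocked copy for a read-only symlink
    git commit -m "add scan"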
Yes, it definitely serves a valid use case; I feel like someone should try to bring some competition there. A modern equivalent with fewer gotchas, maybe in Rust/Go, maybe using a FUSE mount and content-defined chunking (borg/restic/...-style), would be amazing.
I'd love to see a well-supported git-lfs-compatible client/proxy (so you could more easily move backends) that could run on top of S3/object storage. Yes, and written in a modern language like golang/rust for performance / parallelism. There are some node.js and various other git-lfs proxies out there, but none well enough maintained that I could count on them being around and working in another 5 years. git-annex at least has been around for a while, even though it has its issues.
Huggingface uses git-lfs for large datasets with good success. git-lfs on GitHub gets very pricey at higher volumes of data. I'd love the affordability of object storage, just with a better git blob storage interface that will be around in the future.
Most of these systems do their own hash calculations and are not interchangeable with each other. I feel like git-lfs has the momentum in data science at the moment, but it needs some better options for people who want a low-cost storage option that they can control.
Huggingface is great, but it's one more service to onboard if you're in an enterprise. And data privacy/retention/governance means that many people would like their data to reside on their own infrastructure.
If AWS were to give us a low cost git-lfs hosted service on top of S3 it would be very popular.
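For what it's worth, the client side of moving backends already exists: git-lfs can be pointed at any server that speaks the LFS API via lfs.url (the endpoint below is a placeholder); what's missing is a trustworthy, low-cost server to point it at:

    git config -f .lfsconfig lfs.url https://lfs.example.com/my-org/my-repo
    git add .lfsconfig   # commit it so clones use the same endpoint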
If anyone knows of some good alternatives, please let us know!
I work with a lot of uncompressed structured binary files so I finally broke down and wrote my own system based on the Restic chunker: https://github.com/akbarnes/dupver
It's pretty basic, but it works for me and will hopefully inspire someone to make a "real" data VCS based on content-defined chunking.
It lives in this weird wiki that seems to be read-only most of the time. I don't think it's alive. Its use of hard links also causes too many problems, of the silent corruption variety.
Ikiwiki’s definitely a bit weird, but I’ve been experimenting with git-annex recently and it worked fine every time I commented. Seems like it’s chugging along: https://git-annex.branchable.com/recentchanges/
When does it use hard links? As far as I remember it used symlinks unless you used something like annex.hardlink (described in the man page: https://git-annex.branchable.com/git-annex/)
Symlinks are just as problematic honestly: an app writing to one will change the object in the persistent "immutable" storage. The way the "check out" feature works is also weird, causing a change in the shared version history.
> Symlinks are just as problematic honestly: an app writing to one will change the object in the persistent "immutable" storage.
Well, anything stored by git-annex has read-only file permissions. Apps will follow the symlink, yes, but they will fail to write to the location if they try.
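A quick illustration of that (path and hash abbreviated; exact output will differ):

    git annex add data.bin
    ls -l data.bin           # data.bin -> .git/annex/objects/.../SHA256E-s...--....bin
    echo junk >> data.bin    # refused with "Permission denied": the annexed object is read-only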
> The way the "check out" feature works is also weird, causing a change in the shared version history.
Unlocking a file changes it from a symlink to a git-annex pointer file from git's perspective (git-annex accomplishes this via git's smudge filter interface), but you don't have to commit the unlock. You can unlock, modify locally, re-lock, and commit the new changed version in one go. It's also nice that you can commit the unlocking action itself if you want a file to be unlocked in all clones of the repository, so whether to commit the unlock just depends on your use case.
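The unlock-without-committing-it flow might look roughly like this (file name invented; re-adding before locking stores the new content in the annex):

    git annex unlock model.bin    # the symlink becomes a regular, writable file
    # ...modify model.bin in place...
    git annex add model.bin       # annex the new version of the content
    git annex lock model.bin      # back to a read-only symlink
    git commit -m "update model"  # one commit; the unlock itself never lands in history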
[https://www.datalad.org/]