Is there anywhere I can read more about Tarsnap's features/architecture? You're obviously a bright guy, and you say you've been working on it for over 2 years, so I'm just curious about what exactly you built ...
For example, what advantages does tarsnap have over some bash scripts I write in a few hours that give me off-site encrypted backups with the help of GPG and rsync? (That's pretty close to the system I use now.) I'm sure there are some, but I just don't see them enumerated on tarsnap.com ...
For example, what advantages does tarsnap have over some bash scripts I write in a few hours that give me off-site encrypted backups with the help of GPG and rsync?
It's hard to say without knowing exactly how your scripts work, but I'd guess that one big advantage tarsnap has is that it works with a snapshotted model of backups.
As it happens, tarsnap's snapshots work via reference counting -- fortunately, it works better for tarsnap than it does for garbage collection. (Reference counting breaks if you have circular references; this is a problem for garbage collection, but not for tarsnap.)
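To make the idea concrete, here is a minimal sketch of reference-counted snapshots over shared blocks. It is not tarsnap's actual code or data format: the `SnapshotStore` class, the in-memory `blocks` dictionary standing in for the server, and the use of SHA-256 as a block identifier are all assumptions made for illustration. It does show why cycles cannot occur: snapshots point at blocks, and blocks point at nothing.

```python
import hashlib
from collections import Counter


class SnapshotStore:
    """Toy store (hypothetical): snapshots share blocks, blocks reference nothing."""

    def __init__(self):
        self.blocks = {}             # block id -> bytes (stand-in for the server)
        self.refcounts = Counter()   # block id -> number of snapshots using it
        self.snapshots = {}          # snapshot name -> ordered list of block ids

    def create(self, name, chunks):
        if name in self.snapshots:
            raise ValueError(f"snapshot {name!r} already exists")
        ids = []
        for chunk in chunks:
            block_id = hashlib.sha256(chunk).hexdigest()
            # Only store a block the first time any snapshot needs it.
            self.blocks.setdefault(block_id, chunk)
            ids.append(block_id)
        # Count each block once per snapshot, even if it repeats inside it.
        for block_id in set(ids):
            self.refcounts[block_id] += 1
        self.snapshots[name] = ids

    def delete(self, name):
        for block_id in set(self.snapshots.pop(name)):
            self.refcounts[block_id] -= 1
            if self.refcounts[block_id] == 0:
                # No remaining snapshot uses this block, so it can be freed.
                del self.refcounts[block_id]
                del self.blocks[block_id]
```

Deleting one snapshot frees only the blocks that no other snapshot still uses, so any snapshot can be removed independently of the others.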
Why did you choose to do reference counting instead of more sophisticated techniques?
I can imagine that reference counting is much easier to implement and that the drawbacks are unimportant in your domain. However, I'd still like to read about the reasons for your decision, since you will have thought about the issue much longer and more clearly than I have.
Oh, now I understand. Methods such as reachability analysis require reading lots of memory locations; but for tarsnap, "memory locations" are blocks of data stored remotely, so this gets expensive (and slow) very quickly. With reference counting, the counts can be stored locally and no extraneous data needs to be transferred to or from the server.
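A rough sketch of that cost difference, under stated assumptions: `CountingServer` is a hypothetical stand-in for remote storage that just counts round trips, and the refcount version assumes the client already has the deleted snapshot's block list on hand. Neither function is taken from tarsnap.

```python
class CountingServer:
    """Hypothetical remote store that counts how many reads it serves."""

    def __init__(self, snapshots):
        self.snapshots = snapshots   # snapshot name -> set of block ids
        self.reads = 0

    def fetch_block_list(self, name):
        self.reads += 1
        return self.snapshots[name]


def garbage_by_reachability(server, names, deleted):
    """Find freeable blocks by re-reading every surviving snapshot remotely."""
    live = set()
    for name in names:
        if name != deleted:
            live |= server.fetch_block_list(name)
    return server.fetch_block_list(deleted) - live


def garbage_by_refcount(refcounts, deleted_blocks):
    """Find freeable blocks using only a locally stored count table."""
    garbage = set()
    for block_id in deleted_blocks:
        refcounts[block_id] -= 1
        if refcounts[block_id] == 0:
            garbage.add(block_id)
            del refcounts[block_id]
    return garbage
```

Both functions return the same set of freeable blocks, but the reachability version needs one remote read per snapshot, so its cost grows with the number of snapshots kept, while the refcount version touches only local state.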
I guess there's a precedent there in the fact that Unix file systems use reference counting for file links, presumably for the same reasons. Considering how well-architected tarsnap seems to be, I suspect you've taken steps to avoid losing data via inconsistencies in the reference counter?
... you've taken steps to avoid losing data via inconsistencies in the reference counter?
There are some sanity checks built in, but in the extreme case what you're asking for is impossible to guarantee. Reference counts are managed on the client side, and the client has the keys necessary to delete blocks from the server; if the client is functioning correctly, it won't get the reference counts wrong, but if the client is malfunctioning then it could go berserk and delete blocks without even looking at the reference counts.
I have taken care of the obvious issues, though -- as long as the OS implements fsync() properly, there's no way that tarsnap or the client system crashing will result in corruption.
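For readers wondering what relying on fsync() looks like in practice, here is one common pattern for updating a local state file so that a crash leaves either the old version or the new one, never a half-written mix. This is a generic sketch, not tarsnap's implementation: the JSON format and the names `save_counts_durably` and `load_counts` are assumptions, and the directory fsync at the end is POSIX-specific.

```python
import json
import os
import tempfile


def save_counts_durably(path, counts):
    """Replace `path` with new contents atomically and durably (POSIX)."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory so the rename
    # below stays on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(counts, f)
            f.flush()
            os.fsync(f.fileno())     # force file contents to stable storage
        os.replace(tmp_path, path)   # atomic rename over the old file
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Sync the directory so the rename itself survives a crash.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


def load_counts(path):
    with open(path) as f:
        return json.load(f)
```

All of this depends on the OS actually honoring fsync(); if it reports success before the data is on stable storage, no amount of care in the client helps.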