Point taken. I have had xfs_check run out of memory because $reasons, and this has reminded me of that. But then we also had Lustre, so, you know, by comparison XFS is a paragon of stability. (We also had clustered XFS...)
But mostly it happens because the fileservers don't have UPSes. (The pipeline tools are almost exclusively COW, and backups are tested many times a week.)
I think we paid for XFS support from either Red Hat or SGI, but I can't remember; I left that place a year or so ago.
I've never had the balls to run ZFS on large (100 TB+) arrays. Last time I tried, the way slab handling was translated from Solaris caused many problems (but that was more than 4 years ago). Plus the support story is a bit odd: you either go with Oracle (fuck that) or one of the OpenZFS lot.
To get the best performance/stability you ideally need to let ZFS do everything, instead of letting the enclosure do the RAID 6 (4 * 14-disk RAID 6 with 4 spares); a sketch of that layout follows below. This, of course, is a break from the norm and to be treated with the utmost suspicion.
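For illustration, here's roughly what that layout looks like when ZFS owns the disks directly: 4 raidz2 vdevs of 14 disks each plus 4 hot spares, raidz2 being ZFS's RAID 6 analogue. The pool name and device names are hypothetical; in practice you'd use /dev/disk/by-id paths.

    # Hypothetical devices; 4 x 14-disk raidz2 plus 4 hot spares.
    zpool create tank \
      raidz2 sdb sdc sdd sde sdf sdg sdh sdi sdj sdk sdl sdm sdn sdo \
      raidz2 sdp sdq sdr sds sdt sdu sdv sdw sdx sdy sdz sdaa sdab sdac \
      raidz2 sdad sdae sdaf sdag sdah sdai sdaj sdak sdal sdam sdan sdao sdap sdaq \
      raidz2 sdar sdas sdat sdau sdav sdaw sdax sday sdaz sdba sdbb sdbc sdbd sdbe \
      spare sdbf sdbg sdbh sdbi

The point of handing ZFS the raw disks is that its checksums can then tell which copy of a block is good and heal from the redundancy, which a hardware RAID layer hides from it.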
Have you seen the GPFS RAID replacement? That's quite sexy.
I run ZFS at home, because it works and has bitrot checking.
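The bitrot checking only earns its keep if you scrub regularly; a minimal sketch, assuming a pool called tank:

    # Walk every block in the pool and verify checksums; with redundancy
    # (mirror/raidz) ZFS repairs bad copies from a good one as it goes.
    zpool scrub tank
    # Afterwards, look for CKSUM errors and "scrub repaired N" in the output.
    zpool status -v tank

Most setups just fire that from a weekly or monthly cron job.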
1. filesystem is corrupted, once again
2. try to repair it
3. oh right, the repair tool cannot replay the journal
4. try to mount it
5. after 3 hours of nothing, admit that the journal-replay code triggered on mount really cannot deal with corruption either
6. reboot the server to get the filesystem unstuck
7. rerun repair, this time throwing away the journal (rough commands sketched below)
8. look at the empty filesystem with everything in lost+found
9. restore from backup
The team I'm on only runs 1,400 servers, yet this happens regularly.
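For anyone who hasn't had the pleasure: assuming the XFS flavour of this story (device and mountpoint names hypothetical), steps 2-7 boil down to roughly:

    # steps 2-3: repair refuses while the log is dirty and tells you
    # to mount the filesystem so the kernel replays it
    xfs_repair /dev/sdX
    # steps 4-6: the mount that is supposed to replay the log chokes on
    # the corruption instead, and only a reboot gets the device back
    mount /dev/sdX /mnt/data
    # step 7: zero (throw away) the dirty log and repair anyway; whatever
    # the lost transactions referenced ends up in lost+found
    xfs_repair -L /dev/sdX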