Hello everyone. I'm going to build a new PC soon and I'm trying to maximize its reliability as much as I can. I'm using Debian Bookworm. I have a 1TB M.2 SSD to boot from and a 4TB SATA SSD for storage. My goal is for the computer to last at least 10 years. It's for personal use and work: playing games, making games, programming, drawing, 3D modelling, etc.
I've been reading up on filesystems, and it seems like the best ones for preserving data through corruption or a power outage are BTRFS and ZFS. However, I've also read they have stability issues, unlike Ext4. It seems like a tradeoff, then?
I've read that most of BTRFS's stability issues come from trying to do RAID5/6 on it, which I'll never do. Is everything else good enough? ZFS's stability issues seem to mostly come from it having out-of-tree kernel modules, but how much of a problem is this in real-life use?
So far I've been thinking of using BTRFS for the boot drive and ZFS for the storage drive. But maybe it's better to use BTRFS for both? I'll of course keep backups, but I'd still like to deal with stuff breaking as little as possible.
This might be controversial here. But if reliability is your biggest concern, you really can't go wrong with:
A proper hardware RAID controller
You want something with patrol read, supercapacitor- or battery-backed cache/NVRAM, and a fast enough chipset/memory to keep up with the underlying drives.
LVM with snapshots
Ext4 or XFS
A basic UPS that you can monitor with NUT to safely shut down your system during an outage.
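For the UPS item, here is a rough Python sketch of the kind of decision NUT makes for you. It assumes the "key: value" text that NUT's upsc client prints, and the standard ups.status flags OL (on line power), OB (on battery) and LB (low battery). The function names and the 30% threshold are made up for illustration. In practice NUT's own upsmon daemon handles the shutdown, so treat this as an explanation and not something to deploy.

```python
def parse_upsc(output: str) -> dict[str, str]:
    """Parse upsc-style 'key: value' lines into a dict."""
    values = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            values[key.strip()] = value.strip()
    return values


def should_shut_down(values: dict[str, str], min_charge: int = 30) -> bool:
    """True when the UPS is on battery and either flagged low or under min_charge.

    min_charge is a hypothetical threshold, not a NUT default.
    """
    flags = values.get("ups.status", "").split()
    if "OB" not in flags:
        return False  # still on line power (or status unknown)
    if "LB" in flags:
        return True   # the UPS itself reports low battery
    try:
        charge = float(values.get("battery.charge", ""))
    except ValueError:
        return False  # assumed policy: no charge reading, wait for the LB flag
    return charge < min_charge


# Sample text shaped like upsc output (values invented for the example).
sample = """battery.charge: 24
ups.status: OB DISCHRG
"""
print(should_shut_down(parse_upsc(sample)))  # prints True
```

The point is just that "on battery" alone isn't a reason to shut down; you want "on battery and running out".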
I would probably stick with ext4 for boot and XFS for data. They are both super reliable, and both are usually close to tied for general-purpose performance on modern kernels.
That's what we do in enterprise land. Keep it simple. Use discrete hardware/software components that do one thing and do it well.
I had decade-old servers with similar setups that were installed with Ubuntu 8.04 and upgraded all the way through 18.04 with minimal issues (the GRUB2 migration being one of the bigger pains). Granted, they went through plenty of hard drives. But some even got increased capacity along the way (you just replace the drives one at a time and let the RAID rebuild in between).
Edit to add: The only gotcha you really have to worry about is properly aligning the filesystem to the underlying RAID geometry (if the RAID controller doesn't expose it to the OS for you). But that's more important with striping.
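To make the alignment arithmetic concrete, here is a small Python sketch. It assumes you already know the controller's chunk (strip) size and the number of data-bearing disks (total disks minus parity disks). The function name is hypothetical; the stride/stripe_width and su/sw options are the ones documented for mkfs.ext4 and mkfs.xfs. The target device is left off on purpose.

```python
def raid_alignment(chunk_kib: int, data_disks: int, block_kib: int = 4) -> dict[str, str]:
    """Work out mkfs alignment options for a striped array.

    chunk_kib:  per-disk chunk (strip) size in KiB
    data_disks: disks holding data, e.g. n-1 for RAID 5, n-2 for RAID 6
    block_kib:  filesystem block size in KiB (4 is the usual default)
    """
    if chunk_kib % block_kib != 0:
        raise ValueError("chunk size must be a multiple of the block size")
    stride = chunk_kib // block_kib      # filesystem blocks per chunk
    stripe_width = stride * data_disks   # filesystem blocks per full stripe
    return {
        "ext4": f"mkfs.ext4 -E stride={stride},stripe_width={stripe_width}",
        "xfs": f"mkfs.xfs -d su={chunk_kib}k,sw={data_disks}",
    }


# Example: 6-disk RAID 6 with a 64 KiB chunk, so 4 data disks.
for fs, cmd in raid_alignment(chunk_kib=64, data_disks=4).items():
    print(fs, "->", cmd)
# ext4 -> mkfs.ext4 -E stride=16,stripe_width=64
# xfs -> mkfs.xfs -d su=64k,sw=4
```

With a plain mirror there is no striping, so none of this applies.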
How many hardware RAID controllers have you had fail? I have had zero of 800 fail. And even if one did, the RAID metadata is stored on the last block of each drive. Pop in new card, select import, done.
I am sorry that you had to personally experience data loss from one specific hardware failure. I will amend the post to indicate that a proper hardware RAID controller should use the SNIA Common RAID Disk Data Format (DDF). Even mdadm can read it in the event of a controller failure.
Any mid- to high-tier MegaRAID card should support it. I have successfully pulled disks directly from a PERC 5 and imported them to a PERC 8 without issues due to the standardized format.
ZFS is great too if you have the know-how to maintain it properly. It's extremely flexible and extremely powerful. But like most technologies, it comes with its own set of tradeoffs. It isn't the most performant out of the box, and it has a lot of knobs to turn. And no filesystem, regardless of how resilient it is, will ever be as resilient to power failures as a battery/supercapacitor-backed path to NVRAM.
To put it simply, ZFS is sufficiently complex to be much more prone to operator error.
For someone with the limited background knowledge that the OP seems to have on filesystem choices, it definitely wouldn't be the easiest or fastest choice for putting together a reliable and performant system.
If it works for you personally, there's nothing wrong with that.
Or if you want to trade anecdotes, the only volume I've ever lost was on a TrueNAS appliance after a power failure, and even iXsystems' paid support was unable to assist. I ended up having to rebuild and copy from an off-site snapshot.