So just getting around to checking my logs on my backup server, and it says that I have a permanently damaged file that’s un-repairable.
How is this even possible on a raidz2 volume where each member shows zero problems and no dead drives? Isn’t that the whole point of raidz2, so that if one (er, two) drives have a problem the data is still recoverable? How can I figure out why this happened and why it was unrecoverable, and most importantly, how do I prevent it in the future?
It’s only my backup server and the original file is still A-OK, but I’m really concerned here!
zpool status -v:
3-2-1-backup@BackupServer:~$ sudo zpool status -v
  pool: data_pool3
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 06:59:59 with 1 errors on Sun Nov 12 07:24:00 2023
config:

        NAME                        STATE     READ WRITE CKSUM
        data_pool3                  ONLINE       0     0     0
          raidz2-0                  ONLINE       0     0     0
            wwn-0x5000ccaxxxxxxxx1  ONLINE       0     0     0
            wwn-0x5000ccaxxxxxxxx2  ONLINE       0     0     0
            wwn-0x5000ccaxxxxxxxx3  ONLINE       0     0     0
            wwn-0x5000ccaxxxxxxxx4  ONLINE       0     0     0
            wwn-0x5000ccaxxxxxxxx5  ONLINE       0     0     0
            wwn-0x5000ccaxxxxxxxx6  ONLINE       0     0     0
            wwn-0x5000ccaxxxxxxxx7  ONLINE       0     0     0
            wwn-0x5000ccaxxxxxxxx8  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        data_pool3/(redacted)/(redacted)@backup_script:/Documentaries/(redacted)
Does it have ECC memory?
This is my backup server, so no. Primary does.
That might be the culprit.
Well, two steps forward, one step back. The scrub I ran yesterday at least showed some errors, but I’m having trouble identifying what the actual problem is. I think I’ll sleep on it and form a new plan in the morning.
Controller failure? RAM failure? Dmesg shows absolutely nothing, no panics, no anything, so I’m not thinking it’s RAM. Hmmmm… maybe I’ll run memtest after I get some sleep.
3-2-1-backup@BackupServer:~$ sudo zpool status -vx
  pool: data_pool3
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 40K in 07:07:07 with 4 errors on Tue Nov 28 22:39:33 2023
config:

        NAME                 STATE     READ WRITE CKSUM
        data_pool3           ONLINE       0     0     0
          raidz2-0           ONLINE       0     0     0
            wwn-0x5000ccax1  ONLINE       0     0     8
            wwn-0x5000ccax2  ONLINE       0     0    10
            wwn-0x5000ccax3  ONLINE       0     0     8
            wwn-0x5000ccax4  ONLINE       0     0     8
            wwn-0x5000ccax5  ONLINE       0     0     8
            wwn-0x5000ccax6  ONLINE       0     0     8
            wwn-0x5000ccax7  ONLINE       0     0     8
            wwn-0x5000ccax8  ONLINE       0     0     8

errors: Permanent errors have been detected in the following files:

        data_pool3/(redacted)/downloads@backup_script-2023-11-28-0901:/(redacted).mkv
        data_pool3/(redacted)@backup_script-2023-11-28-2001:/ISOs/Ubuntu/23.10/ubuntu-23.10.1-desktop-amd64.iso
        data_pool3/(redacted)@backup_script-2023-11-07-0901:/(redacted).mkv
Hey wow, even though my problem is getting worse (maybe), an actual honest-to-god ISO showed up in the problem file list!
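One thing that stands out in the second status output is that every disk now has a nonzero CKSUM count, and the counts are nearly identical. Here is a rough Python sketch for pulling those counts out of saved zpool status text so they can be compared from scrub to scrub. The function names and the sample text are made up for illustration, the "wwn-" prefix check is just an assumption that matches this pool, and large counts that zpool abbreviates (like 1.2K) are not handled. Errors showing up on all members at once are often read as a hint that something shared (RAM, controller, cabling, power) is involved rather than eight disks failing together, but that is a rule of thumb, not a diagnosis.

```python
import re

# Sample config table, copied from the second status output in this thread.
SAMPLE = """\
        NAME                 STATE     READ WRITE CKSUM
        data_pool3           ONLINE       0     0     0
          raidz2-0           ONLINE       0     0     0
            wwn-0x5000ccax1  ONLINE       0     0     8
            wwn-0x5000ccax2  ONLINE       0     0    10
            wwn-0x5000ccax3  ONLINE       0     0     8
            wwn-0x5000ccax4  ONLINE       0     0     8
            wwn-0x5000ccax5  ONLINE       0     0     8
            wwn-0x5000ccax6  ONLINE       0     0     8
            wwn-0x5000ccax7  ONLINE       0     0     8
            wwn-0x5000ccax8  ONLINE       0     0     8
"""

# One config row: name, state, then READ / WRITE / CKSUM as plain integers.
ROW = re.compile(
    r"^\s*(\S+)\s+(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"\s+(\d+)\s+(\d+)\s+(\d+)\s*$"
)


def cksum_by_device(status_text):
    """Return {name: cksum_count} for every row of the config table."""
    counts = {}
    for line in status_text.splitlines():
        match = ROW.match(line)
        if match:
            name, _state, _read, _write, cksum = match.groups()
            counts[name] = int(cksum)
    return counts


def disk_counts(counts):
    """Keep only the leaf disks (assumption: they are listed by wwn-* id)."""
    return {n: c for n, c in counts.items() if n.startswith("wwn-")}


def all_disks_affected(counts):
    """True when every disk in the table has at least one checksum error."""
    disks = disk_counts(counts)
    return bool(disks) and all(c > 0 for c in disks.values())


if __name__ == "__main__":
    counts = cksum_by_device(SAMPLE)
    for name, cksum in disk_counts(counts).items():
        print(name, cksum)
    print("every disk affected:", all_disks_affected(counts))
```

To run it against the real pool, save the output first (for example sudo zpool status -v data_pool3 > status.txt) and pass the file contents to cksum_by_device instead of SAMPLE.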