Recently, two SSDs in a ZFS pool on my TrueNAS server failed one after the other. This of course freezes access to the affected files and some of the services, since no more write operations are performed on the pool and only data still held in the ARC cache can be read.
Since a lot of services were running on the server at the time, I wanted to avoid a complete reboot, so I looked for a way to get the pool working again without restarting the machine.
The procedure described here was carried out on TrueNAS SCALE Cobia, but it should also work natively on Ubuntu or any other Linux distribution with ZFS.
Attention: We are “operating” directly on the heart of ZFS here. We sometimes have to bypass the safety mechanisms that ZFS normally enforces, which can also lead to data loss. A backup of the data is essential!
In my case, both devices in the RAID-Z1 pool had failed with too many errors: some checksum errors, some read/write errors.
A “normal” reset of the error counters and cleanup of the pool status with

zpool clear <POOLNAME>

therefore didn't work.
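Which devices are affected, and how many read, write, and checksum errors each of them has accumulated, can be seen in the verbose status output; a quick check before any further steps (with <POOLNAME> as a placeholder):

zpool status -v <POOLNAME>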
ZFS also makes it possible to take devices of a pool offline and bring them back online (note that zpool offline and zpool online operate on a device within the pool, not on the pool as a whole). But this attempt using

zpool offline <POOLNAME> <DEVICE>

as well as

zpool online <POOLNAME> <DEVICE>

was not successful either.
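For completeness: zpool offline also knows a -t flag that takes a device offline only until the next reboot, which can be useful for careful experiments; a sketch:

zpool offline -t <POOLNAME> <DEVICE>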
So I physically removed both SSDs from the system (fortunately they were hot-swappable) and pushed them back into their slots so that the system could detect them again.
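If the controller does not detect the reseated drives automatically, a manual rescan of the SCSI host can be triggered under Linux; a sketch, assuming the HBA is host0 (check /sys/class/scsi_host/ for the correct entry):

echo "- - -" > /sys/class/scsi_host/host0/scan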
Calling

zpool status

now showed the pool as “offline”. I then tried to bring the devices back online with

zpool online <POOLNAME> <DEVICE>

but that didn't work either: an error message stated that the pool had uncorrectable errors.
Now comes the dangerous part: for the pool to function correctly again without a restart, we have to put it back into an “error-free” state. There are a few command line parameters for this, some documented, some less so: -F rewinds the pool to the last consistent transaction group, -X extends this to an extreme rewind, and -n only checks whether such a rewind would be possible. Essentially, this completely resets the error counters for I/O errors as well as checksum errors:
zpool clear -nFX <POOLNAME>
However, this does not automatically bring the pool back online in the system; we now have to do this manually:
zpool online <POOLNAME> <DEVICE>
The pool was now listed as online again.
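To double-check that ZFS itself no longer reports any problems, the status filter for unhealthy pools is handy; if everything went well, it should report that all pools are healthy:

zpool status -x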
However, some services were still hanging, and both the TrueNAS Kubernetes system and the Docker payload data, such as configuration files, were stored on the failed pool.
To fix this, we now need to restart the TrueNAS middleware with the command

service middlewared restart

This may take a few minutes. It also restarts all other services, such as SMB or NFS, which may have been frozen or stopped working properly because of the faulty pool.
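Since TrueNAS SCALE is based on Debian, the same restart should also be possible directly via systemd; a sketch (assuming the unit is named middlewared, as on current SCALE versions):

systemctl restart middlewared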
Kubernetes should now work again after a few minutes. In my case, all containers and pods were automatically recreated, which briefly increased the system load.
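Whether the pods are actually coming back up can be checked with the k3s client that SCALE ships; a sketch (k3s bundles kubectl, and -A lists all namespaces):

k3s kubectl get pods -A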
The affected pool should now be subjected to a scrub as quickly as possible with the command

zpool scrub <POOLNAME>

in order to at least partially rule out hardware errors. If I/O errors occur again, either the SSDs themselves or the HBA controller are defective.
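The progress of the scrub shows up in the status output; a simple way to follow it is to refresh the status once a minute:

watch -n 60 zpool status <POOLNAME>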
But it could also be a slowly dying power supply that no longer delivers enough power under peak load, causing the SSDs to drop out for short periods.
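To narrow this down further, it is worth looking at the SMART data of the affected SSDs with smartmontools; a sketch, where /dev/sdX is a placeholder for the respective device:

smartctl -a /dev/sdX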