Home Server Part 2: Do as I say, not as I do
As I continued building and testing my new home server, I quickly discovered the importance of choosing the right hardware for the job. While I initially wanted to try out the new and affordable AMD Ryzen platform, my decision came back to haunt me. In this second part of my home server journey, I will share the challenges I faced with my hardware selection and the errors that persisted despite various troubleshooting attempts.
Operating System Choice: TrueNAS
I decided to use TrueNAS as the operating system for my new server because of its ZFS support and its user-friendly web interface for easy management. Additionally, TrueNAS can run virtual machines (using bhyve) and offers jails. And jails are pretty cool! They are a form of lightweight virtualization used in FreeBSD systems to create isolated environments for running applications. It’s almost like the old-school version of containerization. Jails make it easy to compartmentalize services and enhance security, especially when these services are made available outside of your home network.
One of the key reasons for choosing TrueNAS though was its stability. TrueNAS is based on FreeBSD, a Unix-like operating system known for its robustness and reliability. I wanted a system that wouldn’t require frequent troubleshooting or maintenance, as I prefer to spend my time using the server for its intended purposes rather than fixing issues.
Hardware Selection: The Temptation of AMD Ryzen
When it came to choosing hardware, TrueNAS had a hardware guide that recommended specific components for optimal stability. They suggested using Intel processors with well-established chipsets, ECC memory, and server-grade motherboards, typically from manufacturers like Supermicro.
However, the lure of AMD’s new Ryzen processors was too strong to resist. The Ryzen CPUs offered impressive performance at a relatively affordable price, and I was eager to explore this new platform. Normally, I would recommend using enterprise-grade hardware for servers because of its durability and suitability for 24/7 operation. Server-grade mainboards often come with great features like IPMI for remote management, redundant networking, and redundant power supplies.
Also, I would always recommend using ECC RAM for critical data storage systems. ECC (Error-Correcting Code) memory can detect and correct single-bit errors, reducing the chances of data corruption. But in this case, I decided to be frugal (it’s a home server after all!) and opted for consumer-grade hardware instead. Compared to the old machine, this would still be a great step forward in every direction.
My hardware configuration included:
- AMD Ryzen 5 3400G processor
- ASRock X570 Pro4 motherboard with BIOS version 3.20 (AMD AGESA Combo-AM4 V2 1.0.8.0 Patch A)
- 32GB G.Skill Fortis DDR4-2400 non-ECC memory
- Hard drives: 3x 8000GB WD Red Plus WD80EFAX, and an older Samsung HD502HJ 500GB for testing
- An old Crucial M4 SATA SSD as the boot device (which was problem-free)
Building and Testing the Server: Here Come the Problems
I built the server like a regular desktop computer, and the installation of TrueNAS went smoothly. I created a new pool of hard drives with a much larger total capacity than before, and I started transferring my data from the old server to the new one.
However, after a few hours of transferring over 1TB of data, I encountered a persistent error that disrupted the process. The error message was as follows:
(ada0:ahcich5:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 68e4 0a 40 1c 00 00 01 00 00
(ada0:ahcich5:0:0:0): CAM status: CCB request was invalid
(ada0:ahcich5:0:0:0): Error 22, Unretryable error
(ada0:ahcich5:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 f8 93 4b 40 01 00 00 01 00 00
(ada0:ahcich5:0:0:0): CAM status: CCB request was invalid
(ada0:ahcich5:0:0:0): Error 22, Unretryable error
(ada0:ahcich5:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 48 cf b8 40 01 00 00 01 00 00
(ada0:ahcich5:0:0:0): CAM status: CCB request was invalid
(ada0:ahcich5:0:0:0): Error 22, Unretryable error
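To see at a glance which drive and SATA channel are affected, and how often, messages like the ones above can be tallied. The following is only a minimal sketch: the function name count_cam_errors is made up for illustration, and it assumes the log lines keep the "(device:channel:...)" prefix shown above.

```shell
# Tally "Unretryable error" lines per device and channel (e.g. "ada0:ahcich5").
# Reads log text from stdin and prints one "device:channel count" line per pair.
count_cam_errors() {
    awk '/Unretryable error/ {
            # Find the "(ada0:ahcich5:" part and keep only "ada0:ahcich5".
            if (match($0, /\([a-z]+[0-9]+:[a-z]+[0-9]+:/)) {
                dev = substr($0, RSTART + 1, RLENGTH - 2)
                count[dev]++
            }
         }
         END { for (d in count) print d, count[d] }' | sort
}
```

On the server itself this could be fed from the kernel message buffer, for example with: dmesg | count_cam_errors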
I tried various troubleshooting steps to isolate the issue. I ruled out the operating system, as the problem occurred on FreeNAS 11.3-U5 as well as on TrueNAS CORE 12.0-RELEASE. I checked the memory by running multiple passes of memtest86 (without any errors). I changed SATA cables, trying multiple new ones and one from another working machine.
The error still occurred with all 3 new hard drives and with an older one that I know is working fine. Despite mounting the drives as securely as possible to eliminate vibrations as another potential cause, the errors persisted. SMART readings for all drives were fine, and I found it very unlikely that 4 drives were failing at the same time. To be sure, I even had one of the new drives run a long SMART test, which completed without errors as well.
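For reference, SMART checks like these can be scripted with smartctl from smartmontools. This is a rough sketch under a few assumptions: the function name smart_health_all is hypothetical, the device names are examples, and it relies on ATA drives reporting "PASSED" in the overall health line of smartctl -H.

```shell
# Print the overall SMART health verdict for each device given as an argument.
# Returns non-zero if any device does not report PASSED.
smart_health_all() {
    rc=0
    for dev in "$@"; do
        if smartctl -H "$dev" | grep -q 'PASSED'; then
            echo "$dev: PASSED"
        else
            echo "$dev: needs a closer look"
            rc=1
        fi
    done
    return "$rc"
}
```

A call could look like: smart_health_all /dev/ada0 /dev/ada1 /dev/ada2. A long self-test is started separately with smartctl -t long /dev/ada0, and its result shows up later in smartctl -l selftest /dev/ada0.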
I swapped the PSU, including the cables, with one from another working machine. I went through the BIOS many times and tried different explicit and AUTO settings for the onboard SATA controller. All power saving features were off, and the controller was definitely running in AHCI mode.
After days of trial and error, I was still stuck.
Seeking Help
I reached out to the TrueNAS and FreeBSD communities for assistance, but the responses were very limited and didn’t provide a definitive solution. Some users pointed to potential issues with the old FreeBSD versions used, or recommended BIOS updates for the still new processor generation. To rule out operating system related problems, I tested the server with an up-to-date FreeBSD as well as Debian Linux. Both produced similar errors, indicating that the issue was not specific to TrueNAS.
At this point, I was pretty sure that the issue was related to the mainboard and its SATA controller. However, upgrading the BIOS and even replacing the mainboard with a new one did not completely resolve the errors. While the frequency of errors decreased, they still occurred under substantial write load. Further benchmarking with fio showed that around 800GB of sequential writes combined with random 4k IOPS induced the write failures.
These two simple commands were enough to break my new, shiny server:
fio --rw=write --name=test --size=800G --filename=testfile
fio --rw=randwrite --name=IOPS-write --bs=4k --direct=1 --filename=iopstest --size=800G --numjobs=4 --iodepth=32 --refill_buffers --group_reporting --runtime=60 --time_based
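If the two workloads are meant to hit the disks at the same time, they can be started side by side. The wrapper below is only a sketch under that assumption: the function name run_combined_load and its two parameters (target directory and size) are made up for illustration, while the fio options are the ones from the commands above.

```shell
# Start the sequential and the random 4k write job in parallel in one directory.
# $1 = target directory on the pool, $2 = size per job (default 800G, as above).
# Returns zero only if both fio runs exit successfully.
run_combined_load() {
    dir=$1
    size=${2:-800G}
    seq_rc=0
    rand_rc=0
    fio --rw=write --name=test --size="$size" --filename="$dir/testfile" &
    seq_pid=$!
    fio --rw=randwrite --name=IOPS-write --bs=4k --direct=1 \
        --filename="$dir/iopstest" --size="$size" --numjobs=4 --iodepth=32 \
        --refill_buffers --group_reporting --runtime=60 --time_based &
    rand_pid=$!
    # Collect both exit codes so a failure in either job is noticed.
    wait "$seq_pid" || seq_rc=$?
    wait "$rand_pid" || rand_rc=$?
    [ "$seq_rc" -eq 0 ] && [ "$rand_rc" -eq 0 ]
}
```

Watching the kernel log in a second terminal while this runs makes it easy to see when the first write error appears.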
Is the Platform the Issue?
After these extensive hardware swaps and testing, I concluded that the problem was most likely related to the AMD Ryzen platform, specifically the motherboard’s SATA controller.
I reached out to ASRock, the manufacturer of the motherboard. They informed me that they do not officially support anything other than Windows, which could be a contributing factor to the issue.
So while the AMD Ryzen platform may offer excellent performance for the price, my experience highlights the importance of adhering to the recommended hardware configurations for critical data storage systems. Choosing enterprise-grade components, such as Intel processors with supported chipsets and server-grade motherboards, might indeed provide greater stability and reduce the risk of encountering hardware-related issues, especially when using an operating system like TrueNAS/FreeBSD that is specifically targeted at this type of hardware.
Apparently, I had to learn the hard way that sometimes it’s best to just follow the recommended guidelines, even if the allure of new and affordable technology is tempting. But there was only one way to find out if they were truly right. Get the good server stuff and try it.