ZFS has served me well at home and at work for the past ~20 years, but I’m starting to hit scaling limits with single nodes. In 2024 my office will likely deploy our first 4-5 node Ceph cluster, and I’d like to prepare for that day in my homelab. To that end, I’ve assembled a hardware template for a 4-node system below and would welcome any thoughts or recommendations on the design (or anything else).
Ideally, if we come up with a useful cost-effective hardware template, others in the community will be able to use it to build out their own clusters.
Priorities
- Minimize upfront hardware costs
- Minimize recurring operational costs (i.e., power consumption)
- Performance (my only use case: streaming 100-200GB files to one client at 80-120 MB/sec; largely a WORM workload. Goal is to saturate a single 1 GbE link.)
Current Hardware
- Data HDDs: 24x 8TB SATA, Seagate BarraCuda (ST8000DM008)
- Data HDDs: 24x 12TB SATA, Western Digital Red Pro NAS (WD121KFBX)
- Data HDDs: 24x 16TB SATA, Western Digital Gold Enterprise (WD161KRYZ)
- Data HDDs: 24x 20TB SATA, Seagate Exos X20 (ST20000NM007D)
- Chassis: 4x 24-bay 4U Hotswap NAS Case w/ 6x 6Gbps SFF-8087 backplanes, Innovision (S46524)
- Motherboard: 4x Supermicro X11SSL-F
- CPU: 4x Intel Xeon E3-1230 v6 (4c/8t) @ 3.50GHz
- RAM: 4x 64GiB (4x16 kit) Supermicro DDR4-2400 VLP ECC UDIMMs
- HBA: 4x LSI 9201-16i 6Gbps 16-lane + 4x Supermicro AOC-USAS2-L8i 6Gbps 8-lane
- OS SSDs: 8x 250GB SATA, Samsung 870 EVO (MZ-77E250B) (2x mirror per chassis)
Current Plan
Make 4x OSD nodes using the above HDDs, chassis, motherboards, CPUs, RAM, HBAs, and SSDs. Distribute the HDDs such that 6x of each drive goes into each 24-bay chassis like so: 6x 8TB, 6x 12TB, 6x 16TB, and 6x 20TB. This would ensure that each node is equally sized.
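As a quick sanity check on the layout above, here is a back-of-envelope capacity calculation (raw TB only; usable capacity after replication or erasure coding will be much lower):

```python
# Raw capacity per node for the proposed mixed-drive layout:
# 6 each of 8, 12, 16, and 20 TB drives per 24-bay chassis.
drives_per_node = {8: 6, 12: 6, 16: 6, 20: 6}  # TB -> count

node_raw_tb = sum(size * count for size, count in drives_per_node.items())
cluster_raw_tb = node_raw_tb * 4  # 4 identical OSD nodes

print(f"{node_raw_tb} TB raw per node, {cluster_raw_tb} TB raw cluster-wide")
```

With identical drive mixes in every chassis, each node contributes the same raw capacity, which keeps CRUSH weights balanced across hosts.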
Buy 4x 10GbE PCIe cards, 1 for each node + switch + cables.
I have not yet spec’d out the Monitor and Manager Nodes and was considering running them on the same hardware as the OSD nodes. Thoughts on this are welcome.
Questions
- Is the above hardware capable of fully saturating a 1 GbE link to one client? Use case: streaming a single 100-200GB file to a specialty piece of lab equipment at a rock-steady rate above 80 MB/sec, without any buffer underruns. This client has some truly awful firmware and can crash if its buffers run empty, so I’m trying to design for a constant 80+ MB/sec. Read behavior is largely linear: at most 10-15 jumps throughout each file, as 10-20GB sections are streamed in sequentially.
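A rough feasibility check on this question, using assumed (not measured) numbers for link efficiency and HDD sequential throughput:

```python
# Back-of-envelope: can 1 GbE plus HDD-backed OSDs hold the 80 MB/s floor?
# The 94% efficiency figure and the 150 MB/s HDD number are assumptions.
link_mb_s = 1000 * 0.94 / 8   # ~117.5 MB/s usable after TCP/IP framing overhead
required_mb_s = 80            # the client's hard floor
hdd_seq_mb_s = 150            # conservative sequential read for one SATA HDD

# Largely-linear reads mean even a single drive out-runs the floor, and the
# link, not the disks, is the ceiling -- so the risk is latency spikes
# (recovery, scrubbing), not raw bandwidth.
assert hdd_seq_mb_s > required_mb_s
assert link_mb_s > required_mb_s
print(f"usable link ~{link_mb_s:.0f} MB/s vs required {required_mb_s} MB/s")
```

The headroom between ~117 MB/s and 80 MB/s is modest, so client-side readahead/buffering (and keeping recovery traffic throttled) matters more than peak throughput here.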
- Are these CPUs (4 cores/8 threads at 3.5 GHz) enough to handle 24 OSD daemons per chassis?
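Applying the commonly cited rule of thumb of roughly 0.5-1 core per HDD-backed OSD (an assumption; real load depends heavily on replication vs. erasure coding and on recovery activity):

```python
# Rough CPU sizing against the ~0.5-1 core per HDD OSD rule of thumb.
cores, threads = 4, 8
osds_per_node = 24

need_low = 0.5 * osds_per_node   # 12 cores
need_high = 1.0 * osds_per_node  # 24 cores

print(f"have {threads} threads; rule of thumb wants {need_low:.0f}-{need_high:.0f} cores")
```

Even counting hyperthreads generously, 8 threads falls below the low end of that range for 24 OSDs, which matches the concern raised at the end of this post.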
- Is 64GiB per chassis enough for 24 OSD daemons per chassis?
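For context, BlueStore's default `osd_memory_target` is 4 GiB per OSD, so a quick budget sketch looks like this:

```python
# RAM budget sketch for 24 OSDs in a 64 GiB node.
ram_gib = 64
osds = 24
default_target_gib = 4             # Ceph's default osd_memory_target

wanted_gib = osds * default_target_gib   # 96 GiB at defaults
per_osd_gib = ram_gib / osds             # ~2.7 GiB actually available per OSD

# Dropping osd_memory_target to 2 GiB (often cited as a floor for HDD OSDs)
# leaves some headroom for the OS, mons/mgrs, and recovery spikes.
leftover_gib = ram_gib - osds * 2
print(f"defaults want {wanted_gib} GiB; have {per_osd_gib:.1f} GiB/OSD; "
      f"at 2 GiB targets, {leftover_gib} GiB left over")
```

So 64 GiB is workable but tight: it requires lowering `osd_memory_target` below the default and leaves little slack if mon/mgr daemons are colocated on the same nodes.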
- I’d like this design to be able to withstand the failure of any 12 drives anywhere in the cluster. It’s not clear to me how I’d specify that from a CRUSH failure-domain perspective. Guidance here is welcome.
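One way to frame this question: in the worst case, the failing drives land exactly on the OSDs holding one placement group's copies, so the number of *arbitrary* drive failures a scheme survives is set by per-PG redundancy, not by cluster size. A small sketch of that reasoning:

```python
# Minimum simultaneous OSD failures that can destroy a PG, worst case
# (i.e., the failures happen to hit exactly that PG's OSDs).
def min_failures_to_lose_pg(replicas=None, m=None):
    if replicas is not None:
        return replicas   # replicated pool: lose all copies
    return m + 1          # EC k+m pool: survives any m lost shards

# Surviving *any* 12 drives cluster-wide would need this number > 12,
# i.e. 13x replication or m >= 12 -- impractical. The usual compromise is
# crush-failure-domain=host, sized to survive whole-host loss instead.
assert min_failures_to_lose_pg(replicas=3) == 3
assert min_failures_to_lose_pg(m=2) == 3
print("replica-3 and EC k+2 both die at 3 adversarial failures")
```

In practice, with 4 hosts and `crush-failure-domain=host`, the tractable guarantee is "survive any N whole hosts" (e.g. replica-3 or an EC profile with m=2 survives 2 hosts, i.e. up to 48 drives on those hosts), rather than "any 12 drives scattered arbitrarily."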
In your experience, what have you found to be the bare minimum? A 4c/8t CPU at 3.5 GHz does indeed sound a bit undersized for 24 HDD-based OSDs, so I’d be curious to read what others are running.