Hyperconverged Infrastructure in Your Home Lab: Proxmox + Ceph
Hyperconverged infrastructure (HCI) is one of those enterprise concepts that sounds way too serious for a home lab. Three or more nodes, shared distributed storage, automatic failover — this is data center stuff. But Proxmox makes it genuinely accessible, and if you have the hardware, it's one of the most rewarding builds you can do.
The question isn't whether HCI is cool. It is. The question is whether it makes sense for your situation, because it's a significant step up in complexity, cost, and power consumption from a single-node setup.
What Hyperconverged Actually Means
In a traditional setup, you have separate compute servers and separate storage systems. Your VMs run on the compute nodes and store their data on a SAN or NAS over the network. Two different systems, two different management planes.
Hyperconverged infrastructure combines compute and storage on the same nodes. Every node runs VMs and contributes its disks to a shared storage pool. The storage is distributed and replicated across all nodes, so if one node dies, the data is still available on the others and VMs can restart elsewhere.
In the Proxmox world, this means:
- Compute: Proxmox VE runs KVM virtual machines and LXC containers on each node
- Storage: Ceph runs on the same nodes, pooling local disks into a distributed, replicated storage cluster
- Networking: A dedicated network connects the nodes for both Ceph replication traffic and VM live migration
The result: a cluster where you can lose an entire node — pull the power cord — and your VMs come back up on the surviving nodes with no data loss. That's the pitch.
Minimum Hardware Requirements
HCI has a hard floor: three nodes. Ceph needs a minimum of three to keep data available when one fails. The monitors use a quorum (majority) model, so a three-node cluster can lose one node and keep operating, and with replication factor 3, losing a node still leaves two good copies of your data. You can technically run Ceph on fewer nodes, but you lose the fault tolerance that's the whole point.
Per Node (Minimum Viable)
| Component | Minimum | Recommended |
|---|---|---|
| CPU | 4 cores / 8 threads | 8+ cores / 16+ threads |
| RAM | 32 GB | 64 GB+ |
| Boot drive | 1x 256 GB SSD | 1x 512 GB NVMe |
| Ceph OSD drives | 2x HDD or SSD | 3-4x SSD or mix |
| Ceph WAL/DB | (optional) | 1x NVMe for WAL/DB |
| Network | 1 Gbps (minimum 2 NICs) | 10 Gbps dedicated Ceph network |
Why So Much RAM?
Ceph is hungry. Each OSD (Object Storage Daemon — one per disk) wants 4-8 GB of RAM by default. With 3 OSDs per node, that's 12-24 GB just for Ceph. Your VMs need the rest. 32 GB is tight. 64 GB is comfortable. 128 GB lets you breathe.
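If RAM is tight, the per-OSD memory budget is tunable. A minimal sketch using Ceph's centralized config (osd_memory_target defaults to 4 GiB; lowering it trades cache hit rate and recovery speed for headroom):
# Check the current per-OSD memory target (bytes; default is 4 GiB)
ceph config get osd osd_memory_target
# Drop it to 2 GiB per OSD on RAM-constrained nodes
ceph config set osd osd_memory_target 2147483648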
Three-Node Cluster Total
For a budget build with used enterprise hardware:
| Item | Per Node | 3x Total |
|---|---|---|
| Used Dell R730xd / HP DL380 Gen9 | $200-400 | $600-1,200 |
| 64 GB RAM (upgrade if needed) | $50-100 | $150-300 |
| 3x 1 TB SATA SSD (Ceph) | $150-250 | $450-750 |
| 1x 256 GB NVMe (boot) | $30 | $90 |
| 10 GbE NIC (Mellanox ConnectX-3) | $15-25 | $45-75 |
| 10 GbE switch (used) | — | $50-150 |
| Total | — | $1,385-2,565 |
That's a real range. You can build a functional three-node HCI cluster for under $1,500 if you're patient with used hardware. But you can also spend $3,000+ easily if you go with newer gear or more storage.
Network Requirements
Networking is where HCI builds succeed or fail. Ceph replicates every write to multiple nodes across the network. If your network is slow, your storage is slow. Period.
The Minimum: Two Networks
You need at least two separate networks:
- Management/Public network: Carries regular traffic — VM network access, Proxmox web UI, API calls. Your normal LAN.
- Ceph/Cluster network: Carries Ceph replication traffic and OSD heartbeats. This should be separate and ideally faster.
┌─────────────────────────┐
│        LAN Switch       │  (1 Gbps management)
└────┬───────┬───────┬────┘
     │       │       │
  ┌──┴──┐ ┌──┴──┐ ┌──┴──┐
  │Node1│ │Node2│ │Node3│
  └──┬──┘ └──┬──┘ └──┬──┘
     │       │       │
┌────┴───────┴───────┴────┐
│       Ceph Switch       │  (10 Gbps storage)
└─────────────────────────┘
1 Gbps: It Works, Barely
You can run Ceph over 1 Gbps. It functions. But your maximum sequential write throughput is capped at around 100 MB/s, and with replication factor 3, every write generates 3x the network traffic. You'll hit the network ceiling before your SSDs break a sweat.
For HDDs, 1 Gbps is usually fine — the disks are the bottleneck, not the network. For SSDs, it's painful.
10 Gbps: The Sweet Spot
10 GbE is where Ceph starts to feel right. Used Mellanox ConnectX-3 cards are $15-25 each on eBay, and a used 10 GbE switch (Mikrotik CRS305 or similar) runs $50-150. For the price of a pizza dinner, you've removed your biggest bottleneck.
# Verify 10 GbE link speed
ethtool enp4s0 | grep Speed
# Speed: 10000Mb/s
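Link speed is only half the story; it's worth measuring actual node-to-node throughput on the storage network. A quick sketch with iperf3 (assuming it's installed on both nodes, using the Ceph-network addresses configured later in this guide):
# On one node, start an iperf3 server
iperf3 -s
# From another node, run a 30-second test across the Ceph network
iperf3 -c 10.10.10.101 -t 30
A healthy 10 GbE link usually lands somewhere around 9.4 Gbit/s; results far below that tend to point at cabling, transceivers, or an MTU mismatch.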
25/40 Gbps: Overkill (But Fun)
ConnectX-4 and ConnectX-5 cards for 25 GbE are getting cheap on the used market. If you're running all-NVMe Ceph, the extra bandwidth helps. For most home labs, 10 GbE is plenty.
Setting Up the Proxmox Cluster
Step 1: Install Proxmox on All Three Nodes
Install Proxmox VE on each node from the ISO. Use the NVMe or SSD as the boot drive. During installation:
- Set a unique hostname for each node (pve1, pve2, pve3)
- Assign management IPs on the same subnet (e.g., 192.168.1.101-103)
- Use the same DNS settings on all nodes
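With hostnames and management IPs decided, it also helps to make every node resolvable from every other node. A minimal /etc/hosts sketch (the home.lab domain is just an example; use whatever you entered during the Proxmox install):
# /etc/hosts -- identical on all three nodes
192.168.1.101  pve1.home.lab  pve1
192.168.1.102  pve2.home.lab  pve2
192.168.1.103  pve3.home.lab  pve3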
Step 2: Configure Networking
On each node, set up the Ceph network. Assuming your second NIC is enp4s0:
# /etc/network/interfaces (add to existing config)
auto enp4s0
iface enp4s0 inet static
address 10.10.10.101/24 # .101 for node 1, .102 for node 2, etc.
mtu 9000 # Jumbo frames for better throughput
# Apply and verify
ifreload -a
ping -M do -s 8972 10.10.10.102 # Test jumbo frames between nodes
Enable jumbo frames (MTU 9000) on the Ceph network and the switch. This reduces CPU overhead and improves throughput. Make sure the switch supports jumbo frames and has them enabled on the relevant ports.
Step 3: Create the Cluster
On the first node (pve1):
pvecm create homelab-cluster
On the second and third nodes:
# On pve2
pvecm add 192.168.1.101
# On pve3
pvecm add 192.168.1.101
Verify:
pvecm status
# Should show 3 nodes, quorate
Step 4: Install Ceph
Proxmox has Ceph integration built in. From the web UI or command line on each node:
pveceph install --repository no-subscription
Step 5: Initialize Ceph
On the first node:
pveceph init --network 10.10.10.0/24
This tells Ceph to use the 10.10.10.0/24 network for cluster/replication traffic, keeping it off your management LAN.
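The generated configuration lands in /etc/pve/ceph.conf, which Proxmox syncs to every node in the cluster; it's worth a quick look to confirm the 10.10.10.0/24 cluster network was picked up:
# The Ceph config lives on the clustered /etc/pve filesystem, synced to all nodes
cat /etc/pve/ceph.conf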
Step 6: Create Monitors and Managers
Ceph needs monitors (MONs) to maintain cluster state and managers (MGRs) for metrics and management functions. Create one of each on every node:
# On each node (or from the web UI)
pveceph mon create
pveceph mgr create
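Once all three monitors exist, a quick status check should show them in quorum:
# Cluster overview: health, monitor quorum, manager status
ceph -s
# Just the monitor map and quorum membership
ceph mon stat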
Step 7: Create OSDs
An OSD (Object Storage Daemon) manages one physical disk. Create one OSD per data disk on each node:
# On each node, for each disk
pveceph osd create /dev/sda
pveceph osd create /dev/sdb
pveceph osd create /dev/sdc
If you have a fast NVMe, you can use it as a shared WAL/DB device (write-ahead log and metadata database), which dramatically improves performance for HDD-backed OSDs:
pveceph osd create /dev/sda --db_dev /dev/nvme0n1
pveceph osd create /dev/sdb --db_dev /dev/nvme0n1
pveceph osd create /dev/sdc --db_dev /dev/nvme0n1
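After creating OSDs on all three nodes, it's worth confirming they all came up and that Ceph sees the expected capacity:
# Every OSD should show "up" with a non-zero weight, grouped under its host
ceph osd tree
# Raw capacity and per-pool usage
ceph df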
Step 8: Create a Storage Pool
# Create a replicated pool with size 3 (data on 3 nodes) and min_size 2 (survives 1 failure)
pveceph pool create vm-storage --size 3 --min_size 2 --pg_autoscale_mode on
This pool is now available in Proxmox as a storage target for VM disks and container volumes.
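Depending on your Proxmox version and whether you created the pool from the web UI, the matching storage entry may be added automatically. If it isn't, here's a sketch of registering it by hand (the storage ID vm-storage is just a name):
# Register the Ceph pool as RBD storage for VM disks and container volumes
pvesm add rbd vm-storage --pool vm-storage --content images,rootdir
# Confirm it appears and reports capacity
pvesm status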
Storage Tiering
Not all data needs the same performance. Ceph supports multiple pools with different backing hardware, which lets you tier your storage.
Example Tiering Strategy
| Tier | Hardware | Use Case | Approximate Cost/TB |
|---|---|---|---|
| Fast | NVMe SSDs | Database VMs, active workloads | $80-120/TB |
| Standard | SATA SSDs | General VM storage, containers | $50-80/TB |
| Cold | HDDs with NVMe WAL/DB | Media, backups, archives | $20-30/TB |
# Create CRUSH rules that pin a pool to a device class
# (Ceph auto-detects SSD vs HDD; verify with: ceph osd crush tree --show-shadow)
ceph osd crush rule create-replicated ssd default host ssd
ceph osd crush rule create-replicated hdd default host hdd
# Pool using only SSDs
pveceph pool create fast-storage --size 3 --min_size 2 --crush_rule ssd
# Pool using only HDDs
pveceph pool create bulk-storage --size 3 --min_size 2 --crush_rule hdd
In practice, most home labs use a single tier. Mixing tiers adds complexity, and the benefit only matters if you have genuinely different performance requirements. If all your disks are SSDs, just make one pool.
When HCI Makes Sense (And When It Doesn't)
HCI Makes Sense When
- You need high availability: You run services that genuinely can't tolerate downtime, and you want automatic failover.
- You already have multiple servers: If you have three machines sitting around, clustering them is a great use of existing hardware.
- You want to learn enterprise concepts: HCI, distributed storage, clustering, and failover are valuable skills. Building it at home is the best way to learn.
- You're consolidating: Three old machines might be better as a cluster than three standalone boxes.
HCI Does NOT Make Sense When
- You have one server: A single Proxmox node with local ZFS storage will outperform a three-node Ceph cluster in raw I/O every time. Ceph adds latency (network round trips for replication). If you don't need multi-node failover, don't add the complexity.
- Your budget is tight: $1,500+ for hardware, plus ongoing power costs for three machines. A single well-specced node costs $500-800 and does 90% of what most home labs need.
- Your power bill matters: Three servers running 24/7 consume 300-600 watts. A single server uses 80-150 watts. That's an extra $150-400/year in electricity depending on your rates.
- You want simplicity: A single Proxmox node with ZFS is dramatically simpler to manage. No quorum issues, no Ceph degraded states, no cluster networking. It just works.
Honest Performance Comparison
| Metric | Single Node (ZFS) | 3-Node HCI (Ceph) |
|---|---|---|
| Sequential read | 500+ MB/s (local) | 200-400 MB/s (network) |
| Sequential write | 400+ MB/s (local) | 100-300 MB/s (network) |
| Random 4K IOPS | 50,000+ (NVMe) | 10,000-30,000 |
| Latency | <1 ms (local) | 1-5 ms (network) |
| Fault tolerance | Disk-level only (with ZFS mirror/RAIDZ) | Survives a full node failure |
| Capacity scaling | Limited by chassis | Add nodes |
Ceph is slower than local storage. Always. The network adds latency to every I/O operation. You build HCI for resilience and scalability, not raw performance.
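If you'd rather measure your own cluster than trust a table, Ceph ships a simple benchmark tool. A sketch against the vm-storage pool created earlier (it writes real objects, so run it before the pool holds production VMs):
# 30-second write benchmark; keep the objects so the read test has something to read
rados bench -p vm-storage 30 write --no-cleanup
# Sequential read benchmark, then remove the benchmark objects
rados bench -p vm-storage 30 seq
rados -p vm-storage cleanup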
Operating Your Cluster
Monitoring Health
# Overall cluster status
pveceph status
# Detailed OSD status
ceph osd tree
# Check for degraded objects
ceph health detail
# Watch cluster events in real time
ceph -w
Ceph will tell you when things are wrong. A healthy cluster shows HEALTH_OK. Common warnings:
- HEALTH_WARN: Something needs attention but data is safe
- HEALTH_ERR: Data redundancy is compromised, fix immediately
Handling Node Failures
When a node goes down:
- Ceph detects the missing OSDs (within ~30 seconds)
- The cluster marks those OSDs as down
- After a timeout (default 10 minutes), it marks them out and starts rebalancing data to maintain the replication factor (this timeout is tunable, as shown after these lists)
- Proxmox HA (if configured) restarts affected VMs on surviving nodes, as shown in the sketch just below
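That last step only happens for VMs you've actually placed under HA management. A minimal sketch for a single VM (VM ID 100 is just an example):
# Put VM 100 under HA management so it restarts on a surviving node after a failure
ha-manager add vm:100 --state started
# Check HA resources and the current quorum/master state
ha-manager status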
When the node comes back:
- OSDs rejoin the cluster
- Ceph rebalances data back (this is called "backfilling")
- Everything returns to normal
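The 10-minute rebalance delay is the monitor option mon_osd_down_out_interval. A sketch of checking it and, if your nodes tend to take longer to come back, extending it:
# Current delay (seconds) before a down OSD is marked out and rebalancing begins
ceph config get mon mon_osd_down_out_interval
# Extend it to 30 minutes for slow-to-reboot nodes
ceph config set mon mon_osd_down_out_interval 1800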
Maintenance: Taking a Node Offline
Before rebooting or maintaining a node, tell Ceph to not panic:
# Set noout flag (prevents Ceph from rebalancing during maintenance)
ceph osd set noout
# Do your maintenance (reboot, upgrade, etc.)
reboot
# When the node is back, unset the flag
ceph osd unset noout
This prevents unnecessary data movement during planned maintenance.
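During a rolling upgrade, it's also worth confirming the flag is actually set and that the cluster has fully recovered before moving on to the next node:
# The flags line should include "noout" while maintenance is in progress
ceph osd dump | grep flags
# Wait for HEALTH_OK (or at least no degraded PGs) before touching the next node
ceph -s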
Cost Analysis: Is It Worth It?
Let's be brutally honest with the numbers.
Three-Node HCI Cluster
| Cost | Amount |
|---|---|
| Hardware (used enterprise) | $1,500-2,500 |
| 10 GbE switch + NICs | $100-200 |
| Power (300W avg, $0.12/kWh) | ~$315/year |
| Total first year | $1,900-3,000 |
| Ongoing annual (power) | ~$315/year |
Single Node Alternative
| Cost | Amount |
|---|---|
| Hardware (used enterprise or new mini PC) | $400-800 |
| Power (100W avg, $0.12/kWh) | ~$105/year |
| Total first year | $500-900 |
| Ongoing annual (power) | ~$105/year |
The HCI cluster costs 3-4x more in hardware and 3x more in power. You get fault tolerance and the ability to do live migrations, rolling upgrades, and capacity scaling. Whether that's worth it depends entirely on whether you need those things or just want to learn about them.
If the answer is "I want to learn" — that's a perfectly valid reason. Building and operating a Ceph cluster teaches you more about distributed systems than any course or certification. Just go in knowing it's an investment in education, not a practical necessity for running Pi-hole and Jellyfin.
Final Thoughts
Hyperconverged infrastructure in a home lab is a step into real infrastructure engineering. Proxmox and Ceph make it accessible — no VMware licenses, no proprietary hardware, no vendor lock-in. But accessible doesn't mean simple. You're running a distributed storage system, and distributed systems have failure modes that single-node setups don't.
Start with the basics: three nodes, a dedicated Ceph network (10 GbE if you can), all-SSD storage. Get the cluster healthy, create some VMs, practice failover. Then add complexity — tiered storage, HA groups, Proxmox Backup Server integration.
And if you read all of this and think "that's way more than I need" — that's a valid conclusion. A single Proxmox node with ZFS and good backups handles most home lab workloads beautifully. HCI is there when you're ready for it.