r/platform_engineering • u/erezroz • 13d ago
Balancing Capacity Forecasting Against Performance Risk in Overcommitted Infrastructure
We’ve been evaluating workload right-sizing behavior in heavily overcommitted OpenStack environments running on Platform9.
One tension stood out operationally:
From a pure MSP revenue perspective, aggressive overcommit ratios can make VM downsizing feel counterintuitive.
But oversized workloads also make capacity forecasting much less predictable when multiple tenants spike simultaneously.
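To make the forecasting point concrete, here is a rough sketch of the arithmetic, not anything taken from the project. The function names and the example numbers are made up for illustration. The idea is that allocated vCPUs set the ceiling on simultaneous demand, so oversized VMs widen the gap between what you normally observe and what could land on the host at once.

```python
def overcommit_ratio(allocated_vcpus, physical_cores):
    """Allocated vCPUs per physical core on a host or aggregate."""
    return allocated_vcpus / physical_cores


def demand_band(vms):
    """Return (observed_peak_cores, allocation_ceiling_cores).

    vms is a list of (allocated_vcpus, observed_peak_fraction) pairs,
    where observed_peak_fraction is in the range 0.0 to 1.0.
    """
    observed = sum(vcpus * peak for vcpus, peak in vms)
    ceiling = sum(vcpus for vcpus, _ in vms)
    return observed, ceiling


# Hypothetical host: 32 physical cores, 16 VMs with 8 vCPUs each,
# every VM peaking at 25% of its allocation.
vms = [(8, 0.25)] * 16
ratio = overcommit_ratio(sum(v for v, _ in vms), 32)
observed, ceiling = demand_band(vms)
```

In this made-up case the observed peaks already add up to the full 32 cores, while the allocation ceiling is 128. Downsizing those VMs does not change the observed number, but it shrinks the ceiling, which narrows the range a forecast has to cover.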
To better understand the operational boundary, I added a background rightsizing engine to a Day-2 operations platform I’ve been building around Platform9/OpenStack.
Instead of reacting to short spikes, it analyzes a rolling 30-day window and classifies workloads as:
- idle
- over_provisioned
- under_provisioned
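For anyone who wants the classification idea in code, here is a minimal sketch. It is not the engine from the repo. The thresholds, the choice of p95, and the hourly sample count are all assumptions picked for illustration. Using a high percentile over the whole window, rather than the maximum, is one way to keep short spikes from flipping the label.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def classify(cpu_pct, mem_pct, idle_max=5.0, low_max=30.0, high_min=85.0):
    """Classify one VM from utilization samples covering the rolling window.

    cpu_pct and mem_pct are lists of utilization percentages (0 to 100).
    All threshold defaults are hypothetical. Returns None when no
    recommendation applies.
    """
    cpu_p95 = percentile(cpu_pct, 95)
    mem_p95 = percentile(mem_pct, 95)
    if cpu_p95 >= high_min or mem_p95 >= high_min:
        return "under_provisioned"
    if cpu_p95 <= idle_max:
        return "idle"
    if cpu_p95 <= low_max and mem_p95 <= low_max:
        return "over_provisioned"
    return None


# 30 days of hourly samples = 720 points. A 20-hour spike to 95% CPU
# is under 5% of the window, so it does not move the p95.
spiky_cpu = [20.0] * 700 + [95.0] * 20
label = classify(spiky_cpu, [25.0] * 720)
```

Checking under_provisioned first is deliberate in this sketch: a VM that is starved on memory but quiet on CPU should not be flagged for downsizing.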
The more interesting part ended up being the operational workflow rather than the recommendation itself:
- snooze states
- suppression windows
- avoiding alert fatigue
- tenant-specific pricing deltas
- tracking recommendations as lifecycle objects instead of alerts
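A rough sketch of what "lifecycle object instead of alert" can look like, assuming an in-memory dict as the store. The class, state names, and `upsert` helper are hypothetical and not the project's schema. The key point is that each analysis run updates one record per (VM, kind) instead of emitting a fresh alert, and a snoozed record stays hidden until its window expires.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class State(Enum):
    OPEN = "open"
    SNOOZED = "snoozed"
    APPLIED = "applied"
    DISMISSED = "dismissed"


@dataclass
class Recommendation:
    vm_id: str
    kind: str  # e.g. "idle", "over_provisioned", "under_provisioned"
    state: State = State.OPEN
    snoozed_until: Optional[datetime] = None

    def snooze(self, now, days):
        """Hide this recommendation for the given number of days."""
        self.state = State.SNOOZED
        self.snoozed_until = now + timedelta(days=days)

    def is_visible(self, now):
        """Reopen an expired snooze, then report whether to surface it."""
        expired = self.snoozed_until is not None and now >= self.snoozed_until
        if self.state is State.SNOOZED and expired:
            self.state = State.OPEN
            self.snoozed_until = None
        return self.state is State.OPEN


def upsert(store, vm_id, kind):
    """Return the existing record for (vm_id, kind), creating it if needed."""
    key = (vm_id, kind)
    if key not in store:
        store[key] = Recommendation(vm_id, kind)
    return store[key]


store = {}
now = datetime(2024, 1, 1)
rec = upsert(store, "vm-1", "over_provisioned")
rec.snooze(now, days=7)
```

Because a later run calls `upsert` with the same key, it gets the snoozed record back rather than creating a new one, which is what keeps the same finding from re-alerting every cycle.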
One thing we noticed:
Under-provisioned detection may actually be more operationally valuable than cost optimization in highly overcommitted clusters.
Curious how other teams handle balancing:
- overcommit ratios
- forecasting confidence
- tenant performance isolation
- rightsizing recommendations
- alert fatigue
Especially in MSP/multi-tenant OpenStack environments.
Project reference:
https://github.com/erezrozenbaum/pf9-mngt
u/cailenletigre 9d ago
This is just AI slop. The image and the code.