r/platform_engineering 13d ago

Balancing Capacity Forecasting Against Performance Risk in Overcommitted Infrastructure

We’ve been evaluating workload right-sizing behavior in heavily overcommitted OpenStack environments running on Platform9.

One thing that became interesting operationally:

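From a pure MSP revenue perspective, aggressive overcommit ratios can make VM downsizing feel counterintuitive: tenants are typically billed on allocated size, so shrinking a VM lowers the bill while freeing little physical capacity on a host that is already overcommitted.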

But oversized workloads also make capacity forecasting much less predictable when multiple tenants spike simultaneously.
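
To make the forecasting problem concrete, here is a minimal sketch of the arithmetic. The function names, host size, and utilization figures are all hypothetical and only illustrate the idea; they are not taken from any real cluster or from the project.

```python
def overcommit_ratio(allocated_vcpus: int, physical_cores: int) -> float:
    """Allocated vCPUs per physical core on a host."""
    return allocated_vcpus / physical_cores


def simultaneous_peak_load(vm_vcpus, peak_utilization, physical_cores):
    """Fraction of physical cores demanded if every VM hits its peak at once.

    vm_vcpus: vCPU count per VM
    peak_utilization: per-VM peak utilization as a fraction (0.0 to 1.0)
    A result above 1.0 means the host cannot serve the combined peak.
    """
    demanded = sum(v * u for v, u in zip(vm_vcpus, peak_utilization))
    return demanded / physical_cores


# Hypothetical host: 32 physical cores, 16 VMs with 8 vCPUs each.
vms = [8] * 16
print(overcommit_ratio(sum(vms), 32))                # 4.0
print(simultaneous_peak_load(vms, [0.5] * 16, 32))   # 2.0
```

With oversized VMs the allocated number says very little about real demand, so the only useful forecast input is the observed peak per VM, and that is exactly what gets noisy when tenants spike together.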

To better understand the operational boundary, I added a background rightsizing engine into a Day-2 operations platform I’ve been building around Platform9/OpenStack.

Instead of reacting to short spikes, it analyzes a rolling 30-day window and classifies workloads as:

  • idle
  • over_provisioned
  • under_provisioned
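
A rough sketch of what a window-based classifier could look like. This is not the pf9-mngt code: the thresholds, the choice of the 95th percentile of CPU utilization, and the extra `right_sized` and `insufficient_data` labels are assumptions made for illustration.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(days=30)   # rolling window from the post
IDLE_P95 = 5.0                # assumed threshold, percent CPU
OVER_P95 = 30.0               # assumed threshold, percent CPU
UNDER_P95 = 90.0              # assumed threshold, percent CPU


def p95(values):
    """Nearest-rank 95th percentile, using integer math for the rank."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def classify(samples, now):
    """Classify a workload from (timestamp, cpu_percent) samples.

    Only samples inside the rolling window count. Using p95 rather than
    the maximum means a single short spike does not change the label.
    """
    recent = [cpu for ts, cpu in samples if now - ts <= WINDOW]
    if not recent:
        return "insufficient_data"
    level = p95(recent)
    if level < IDLE_P95:
        return "idle"
    if level < OVER_P95:
        return "over_provisioned"
    if level > UNDER_P95:
        return "under_provisioned"
    return "right_sized"
```

A real engine would presumably also look at memory and I/O, and require a minimum number of samples before emitting anything.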

The more interesting part ended up being the operational workflow rather than the recommendation itself:

  • snooze states
  • suppression windows
  • avoiding alert fatigue
  • tenant-specific pricing deltas
  • tracking recommendations as lifecycle objects instead of alerts
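
To illustrate the "lifecycle object instead of alert" idea, here is a minimal sketch. The state names, fields, and methods are hypothetical and are not the project's actual data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class State(Enum):
    OPEN = "open"
    SNOOZED = "snoozed"
    APPLIED = "applied"
    DISMISSED = "dismissed"


@dataclass
class Recommendation:
    vm_id: str
    classification: str            # e.g. "over_provisioned"
    monthly_price_delta: float     # tenant-specific; negative lowers the bill
    state: State = State.OPEN
    snoozed_until: Optional[datetime] = None

    def snooze(self, now: datetime, days: int) -> None:
        """Hide the recommendation until the snooze period ends."""
        self.state = State.SNOOZED
        self.snoozed_until = now + timedelta(days=days)

    def is_actionable(self, now: datetime) -> bool:
        """True if open, or if a snooze has expired. Does not change state."""
        if self.state is State.OPEN:
            return True
        if self.state is State.SNOOZED and self.snoozed_until is not None:
            return now >= self.snoozed_until
        return False


rec = Recommendation("vm-1", "over_provisioned", -12.5)
start = datetime(2024, 1, 1)
rec.snooze(start, days=7)
print(rec.is_actionable(start + timedelta(days=3)))   # False
print(rec.is_actionable(start + timedelta(days=7)))   # True
```

Because the object persists, a snoozed or dismissed recommendation is not re-raised on every evaluation run, which is where most of the alert-fatigue reduction would come from.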

One thing we noticed:
In highly overcommitted clusters, detecting under-provisioned workloads may actually be more operationally valuable than cost optimization.

Curious how other teams handle balancing:

  • overcommit ratios
  • forecasting confidence
  • tenant performance isolation
  • rightsizing recommendations
  • alert fatigue

Especially in MSP/multi-tenant OpenStack environments.

Project reference:
https://github.com/erezrozenbaum/pf9-mngt

u/cailenletigre 9d ago

This is just AI slop. The image and the code.