r/Network • u/JimmyDry2 • 16d ago
Text alert fatigue is making our monitoring system less trustworthy
our monitoring setup technically catches issues but the amount of noisy alerts has become overwhelming. minor spikes, temporary disconnects and duplicated notifications constantly trigger incidents that nobody reacts to anymore. the worst part is that real problems now get buried under all the noise. we spent months tuning thresholds and dependencies but every adjustment seems to create another edge case somewhere else. looking for ways to simplify alerting logic while still keeping proper visibility into infrastructure health.
u/chickibumbum_byomde 16d ago
This happens when monitoring goes from “useful signal” to “constant activity.” once people stop trusting alerts...forget about it. even good monitoring becomes redundant and ineffective because real issues get lost in the noise.
the fix usually isn’t more complicated logic, it’s fewer, more meaningful alerts. less is more. many teams eventually realize they should alert on actual impact and sustained problems, not every short spike or transient event. If it’s unclear what action an alert calls for, or the issue usually resolves itself, it probably shouldn’t page anyone.
most healthy monitoring setups are actually quieter than people expect. They still collect lots of data, but they alert on far less of it.
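A minimal sketch of the "sustained problems, not short spikes" idea, assuming a metric sampled at a fixed interval. The class name, the threshold of 90 and the 3-sample requirement are all made up for illustration and not taken from any particular tool. It notifies once when the condition has held for N consecutive samples and stays quiet for single spikes:

```python
from collections import deque


class SustainedAlert:
    """Fire only when a metric stays above a threshold for N consecutive samples.

    Hypothetical helper for illustration; real tools expose this as a
    "for"/"duration"/"delay" setting on the alert rule.
    """

    def __init__(self, threshold, required_samples):
        self.threshold = threshold
        self.required_samples = required_samples
        # Remember only the last N breach/no-breach results.
        self.recent = deque(maxlen=required_samples)
        self.firing = False

    def observe(self, value):
        """Record one sample; return True only on the transition into firing."""
        self.recent.append(value > self.threshold)
        sustained = (
            len(self.recent) == self.required_samples and all(self.recent)
        )
        should_notify = sustained and not self.firing
        self.firing = sustained
        return should_notify


def notifications(samples, threshold=90, required_samples=3):
    """Return the sample indexes at which a notification would be sent."""
    alert = SustainedAlert(threshold, required_samples)
    return [i for i, value in enumerate(samples) if alert.observe(value)]
```

With `notifications([50, 95, 50, 95, 96, 97, 98, 50, 95])` the isolated spikes are ignored and only the run of four high samples produces a notification. Because it only notifies on the transition into the firing state, an ongoing problem does not re-notify on every sample.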
u/XxTh3g04txX 15d ago
SolarWinds NPM for almost 20 years. 4000 nodes. It's supported one of the larger hospitals in NorCal.
very easy to write, tune, and mute alerts via filters.
u/Dmelvin 15d ago
It honestly sounds like you're at the point where redesigning your alerting makes sense.
I'd go back to the drawing board and determine what sensor logic you want, draw it out in a flowchart, check for things that may conflict or cause false positives, then tear all of your existing logic out and replace it.
u/NPMGuru 14d ago
the duplicate notification problem is usually a correlation issue. One root cause firing 10 alerts. static thresholds make it worse because every transient spike looks like an incident.
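Rough sketch of what correlation means here, assuming you have a simple child-to-parent dependency map. The node names and the `parents` dict are hypothetical, and real tools handle this with their own dependency or topology features. Each down node is attributed to the highest consecutive ancestor that is also down, and you notify once per group instead of once per node:

```python
def root_cause(node, parents, down):
    """Walk up the dependency chain while the parent is also down."""
    cause = node
    seen = {node}  # guard against a cyclic dependency map
    while parents.get(cause) in down and parents[cause] not in seen:
        cause = parents[cause]
        seen.add(cause)
    return cause


def correlate(down_nodes, parents):
    """Group down nodes under their root cause; one notification per group."""
    down = set(down_nodes)
    groups = {}
    for node in down_nodes:
        groups.setdefault(root_cause(node, parents, down), []).append(node)
    return groups


# Hypothetical topology: two access switches behind one core switch.
parents = {
    "sw1": "core",
    "sw2": "core",
    "srv1": "sw1",
    "srv2": "sw1",
    "srv3": "sw2",
}
groups = correlate(["sw1", "srv1", "srv2", "srv3"], parents)
```

Here four down nodes collapse into two notifications: one for `sw1` (covering `srv1` and `srv2`) and one for `srv3`, whose parent `sw2` is still up.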
I recommend continuous synthetic monitoring with baseline-aware alerting. way fewer false positives. You can use Obkio for that.
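For anyone wondering what baseline-aware means in practice, here is a generic sketch using a rolling mean and standard deviation. This is not how Obkio or any specific product implements it, and the window size, sigma multiplier and `min_spread` floor are assumptions picked for illustration:

```python
from collections import deque
from statistics import mean, pstdev


class BaselineDetector:
    """Flag values that deviate strongly from a rolling baseline.

    Illustrative only: compares each new value to the mean of recent
    normal values instead of to a fixed static threshold.
    """

    def __init__(self, window=20, sigmas=3.0, min_samples=5, min_spread=1.0):
        self.history = deque(maxlen=window)
        self.sigmas = sigmas
        self.min_samples = min_samples
        # Floor on the spread so a perfectly flat baseline does not
        # turn every tiny wobble into an anomaly.
        self.min_spread = min_spread

    def is_anomalous(self, value):
        anomalous = False
        if len(self.history) >= self.min_samples:
            baseline = mean(self.history)
            spread = max(pstdev(self.history), self.min_spread)
            anomalous = abs(value - baseline) > self.sigmas * spread
        # Design choice: anomalous values are kept out of the baseline so
        # an ongoing outage does not become the new "normal".
        if not anomalous:
            self.history.append(value)
        return anomalous


detector = BaselineDetector(window=10, sigmas=3.0, min_samples=5)
latencies_ms = [20, 21, 19, 20, 22, 21, 80, 21]
flags = [detector.is_anomalous(v) for v in latencies_ms]
```

Only the 80 ms sample gets flagged. One tradeoff of keeping anomalies out of the baseline is that a permanent shift in normal behavior keeps alerting until the baseline is reset, so in practice you would likely combine this with a sustained-duration check.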
u/Alfred20367 16d ago
we faced a very similar problem and eventually realized the issue was too much customization layered over time. PRTG helped because the default sensor logic and dependency handling were much cleaner than what we had built manually before. once we standardized monitoring templates and alert thresholds, false positives dropped significantly and operators started trusting alerts again. having all monitoring in one interface also made root cause analysis much faster during incidents.