5

Overnight, our networking dept patched some systems, which unexpectedly left a connected system unable to work. That system was our alerting layer, which didn't/couldn't send out the alerts (phone calls, Teams messages, emails, etc.) saying that alerting wasn't working.

This morning when networking came in, they saw the issue (our backup alerting system was sending emails all night long).

Instead of "Oh no, maybe we should have a process in place to verify patching X systems doesn't degrade Y systems", the various teams are dog-piling on alerting (my responsibility). VPs are now getting involved. They are saying things like "There should have been a monitoring system to monitor the alerting!!!". Which there is: the email backup alerting. There must be dozens of messages in the team chat all pointing the finger and saying 'alerting should have worked', even though *those server clusters were all down*. My boss tried to chime in with common sense, saying "If our infrastructure team can't guarantee 100% uptime on the clusters, then this will happen again. The issue happened once in the 5+ years we've been using this framework. We can spend time and money creating yet another monitoring system, which could fail too, or accept the reality that sometimes things break. We fix it and do what is reasonable so the issue doesn't happen again. In my opinion, paying for another solution isn't feasible in this situation."

Team chat is silent right now, but my spidey sense is tingling.

Comments
  • 4
    external ping
  • 4
    Alerting should not be an internal system. Neither should the incident response team's ticketing, messaging and knowledge management system, for that matter.
  • 4
    Good luck, but yes, a simple external monitor is good, even if it's just a "hey, I couldn't contact monitoring for an hour, have a look" that gets sent to just you/your department (like backups, actually).
  • 3
    @lorentz for sure.
  • 1
    UPDATE

    Senior leaders decided to do nothing.

    I suggested to my boss (who was already contemplating it) that we allow a 'cooling off' period and then suggest a somewhat Rube Goldberg solution (outside ping). Once they had time to actually think, they decided it was a one-time fluke that didn't warrant the complexity, time, and, most importantly, money.

    #winning
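
For anyone curious what the "external ping" / "couldn't contact monitoring for an hour" idea from the comments could look like, here is a minimal sketch, not what anyone in this thread actually runs. It assumes a small script living on a host outside the affected clusters. The class name `ExternalWatchdog`, the `probe` callable, and the one-hour threshold are all hypothetical, chosen only for illustration; how you actually reach your alerting layer and how you deliver the warning are left to you.

```python
from datetime import datetime, timedelta
from typing import Callable, Optional


class ExternalWatchdog:
    """Hypothetical watchdog meant to run OUTSIDE the monitored clusters."""

    def __init__(self, probe: Callable[[], bool], max_silence: timedelta):
        self.probe = probe              # assumed: returns True if alerting answered
        self.max_silence = max_silence  # e.g. one hour, as suggested in the comments
        self.last_ok: Optional[datetime] = None

    def check(self, now: datetime) -> Optional[str]:
        """Probe once; return a warning if alerting has been silent too long."""
        try:
            ok = self.probe()
        except Exception:
            ok = False  # a probe that raises counts as "unreachable"

        if self.last_ok is None:
            self.last_ok = now  # start the clock on the very first check

        if ok:
            self.last_ok = now
            return None

        silent_for = now - self.last_ok
        if silent_for >= self.max_silence:
            return f"Could not contact alerting for {silent_for}; have a look."
        return None
```

You would call `check(datetime.now())` on a schedule (cron or similar) and send any returned message over a channel that does not depend on the same clusters, otherwise the watchdog goes down with the thing it watches.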