When systems crash unexpectedly and users complain that the network is slow, the next thing admins often hear is a request for historical statistics that might take half a day to baseline, according to SolarWinds vice president of product strategy Craig McDonald.
In a guest post on the Carahsoft Community blog, McDonald offers a few insights on how software monitoring can ensure that systems are optimised and the main mission uninterrupted at such times. It’s all about paying attention to the fundamental concepts of system monitoring.
“If you’ve been part of a federal IT team for longer than 15 minutes, this is ‘situation normal’,” he says. “[Yet] the answer to these challenges lies in monitoring your environment effectively.”
This means that users should know what to look for and where, and how to retrieve it without affecting the monitored system. They also need to know where to store the values, which thresholds indicate a problem, and how to alert the right stakeholders at the right time. Step one is to familiarise the team with the correct terminology, he says.
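Those fundamentals can be sketched in a few lines of code. The sketch below is illustrative only and not SolarWinds' implementation: the `MetricMonitor` class name, the CPU metric, and the threshold value are all assumptions, but the shape follows the steps McDonald lists: collect a value, store it, compare it against a threshold, and raise an alert on a breach.

```python
import time
from collections import deque

class MetricMonitor:
    """Minimal sketch: collect, store, threshold-check, alert."""

    def __init__(self, name, threshold, history_size=100):
        self.name = name
        self.threshold = threshold
        # Where the values are stored: a bounded in-memory history.
        self.history = deque(maxlen=history_size)

    def record(self, value):
        """Store a reading; return an alert message if it breaches the threshold."""
        self.history.append((time.time(), value))
        if value > self.threshold:
            return f"ALERT: {self.name} = {value} exceeds threshold {self.threshold}"
        return None

monitor = MetricMonitor("cpu_percent", threshold=90)
print(monitor.record(45))  # below threshold: None
print(monitor.record(97))  # breach: alert message
```

In a real deployment the history would land in a time-series store and the alert would route to a notification channel rather than a return value, but the collect/store/threshold/alert loop is the same.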
Only then can users address the ‘how’.
McDonald writes: “There are various monitoring techniques, from classic pinging and using the Simple Network Management Protocol (SNMP) to vendor-specific methods. Additionally, some offerings use agents for monitoring while others use agentless technology. None of these are right or wrong; it’s important to choose based on your own system and agency demands.”
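As a concrete example of the agentless end of that spectrum, the sketch below does a simple TCP reachability check. This is a stand-in for classic pinging, not any vendor's method: a true ICMP ping needs raw sockets or an external tool, and SNMP polling would use a dedicated library; the host and port here are illustrative.

```python
import socket

def tcp_check(host: str, port: int = 443, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # Agentless probe: nothing is installed on the target; we only
        # observe whether it accepts a connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(tcp_check("example.com"))  # True if reachable from this machine
```

An agent-based approach would instead run software on the monitored host and push richer metrics outward, which is exactly the deployment trade-off McDonald describes.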
The key considerations, he says, are ease of deployment, configuration, and maintenance; flexibility; availability of the collected data to external systems and to other modules within the solution; and intelligent filtering of alert noise.
“On the one hand, you want to be alerted when an issue occurs. On the other hand, you don’t want to create alert rules capable of drowning you in noise and ultimately masking real issues. Machine learning shows promise in solving this problem,” McDonald adds.