Archive for June 2014
A while back I wrote a quick article about network management and the tools you can rely on. As you can imagine, network management is quite a broad topic, and definitely one I could talk about till I turn blue. So I figured I would throw out some other interesting considerations when dealing with network monitoring.
- The location of the NMS, or more specifically the device that performs the device polling. When you get into larger networks and more robust network monitoring applications (Nagios and SolarWinds, for example) you can expand into a distributed model. That distributed model sounds awesome at first, but there can be some interesting caveats with this approach.
- The location of that distributed poller. Did I say that twice? Why is the location important, though? Well, any health statistics gathered by the poller will be reported in the NMS from the perspective of the poller. At that point you must ask yourself: is that information valuable, and does it provide the level of monitoring you are expecting to capture?
- Running a network monitoring system in a distributed model creates additional overhead and requires more upkeep simply to monitor and maintain the NMS itself. On top of verifying the health of the network/systems/services in your environment, you also need to account for the health of the distributed pollers.
- Are you putting an additional load on the monitored devices? Believe it or not, monitoring a device via SNMP can actually create issues. A device might be susceptible to a software caveat/bug when certain processes are monitored or polled. When you poll a device, it must process and respond to a number of different SNMP Get requests, and that work is handled by the device's control plane. You definitely want to make sure the device's other, more important functions/processes are not hindered while it processes SNMP data.
- Consider the additional load of monitored services. In my first post I mentioned monitoring tools such as Syslog, NetFlow & IP SLAs. Depending on the environment, it may be worth considering the amount of traffic those tools can generate. Sure, from a dozen devices it might not add up to much, but what if you had hundreds of devices? At that point syslog messages from a few hundred devices might become a bit overwhelming. In some cases management-related traffic might need to be marked down in QoS policies to keep it from affecting the production traffic of your network.
- Polling intervals. This is an easy one to overlook but an extremely important factor to take into account. It may take some time to find that sweet spot, but:
- Poll a device too aggressively and you risk creating false alarms.
- Poll a device too infrequently and you risk missing important events in the network and/or network outages.
- A ‘very aggressive’ polling interval may also crash the NMS application itself. Remember, at the end of the day these are simple applications running some type of database on the back end, with limitations.
- Understand the data that the NMS is presenting to you. Anyone can open a web browser and look at a nice fancy graph, but what does that fancy graph really tell us? Let’s say your NMS calculates latency and response time. Well, how does it do that? Is it polling a particular OID, or is it simply pinging the device and looking at the response time? And if so, how often does it issue that ping? The same can be asked of an interface utilization graph: is it simply grabbing the RxLoad/TxLoad statistics from the specific interfaces at a set point, or are other tools like NetFlow taken into account? The more you know about how your NMS works, the easier and quicker it will be for you to interpret, analyze, and diagnose the issues presented to you via your NMS.
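To make the utilization-graph question a bit more concrete, here is a minimal Python sketch of one common way an NMS might build that graph: sample an interface octet counter (for example IF-MIB's `ifInOctets` or the 64-bit `ifHCInOctets`) at each poll and turn the difference between two samples into an average utilization. This is an assumption for illustration, not a description of how Nagios, SolarWinds, or any specific product works. The function names, the 1 Gbps link speed, and the traffic pattern are all hypothetical. It also shows why the polling interval matters: the same 30-second burst looks very different at a 300-second interval than at a 30-second one.

```python
def utilization_percent(octets_start, octets_end, interval_s, link_bps, counter_bits=64):
    """Average utilization (%) over one polling interval from two octet-counter samples."""
    if interval_s <= 0 or link_bps <= 0:
        raise ValueError("interval_s and link_bps must be positive")
    # The modulo handles a single counter wrap (relevant for 32-bit counters).
    delta_octets = (octets_end - octets_start) % (2 ** counter_bits)
    return delta_octets * 8 * 100 / (interval_s * link_bps)


def simulate_counter(bytes_per_second):
    """Build the cumulative counter value at each second (index 0 is the start)."""
    counter = [0]
    for b in bytes_per_second:
        counter.append(counter[-1] + b)
    return counter


def poll(counter, interval_s, link_bps):
    """Utilization values an NMS would graph when polling every interval_s seconds."""
    last_start = len(counter) - 1 - interval_s
    return [
        utilization_percent(counter[t], counter[t + interval_s], interval_s, link_bps)
        for t in range(0, last_start + 1, interval_s)
    ]


# Hypothetical traffic on a 1 Gbps link: idle for 270 s, then a 30 s burst at line rate.
LINK_BPS = 1_000_000_000
counter = simulate_counter([0] * 270 + [LINK_BPS // 8] * 30)

print(poll(counter, 300, LINK_BPS))      # [10.0]  -> the burst is averaged away
print(max(poll(counter, 30, LINK_BPS)))  # 100.0   -> the burst is clearly visible
```

With a 300-second interval the graph shows a calm 10%, even though the link was saturated for 30 seconds. A shorter interval exposes the burst, at the cost of ten times as many polls for the device and the NMS to process, which is exactly the trade-off described in the polling interval points above.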