Archive for the ‘Network Management’ Category
Configuration management is, without a doubt, one of those things we all do, but to what end? Usually configuration management begins, and sometimes ends, with backing up and storing configuration files, but how much further can or should we go with it?
- Backing up and storing configuration files – As mentioned before, this is step one. Keeping a copy of your device configuration is one of the easiest and quickest ways to restore a device back into service after a physical failure.
That one task, as simple as it sounds, is the heart of good configuration management. From that original point we gain the ability to perform many other tasks:
- Configuration comparison – Having the ability to compare two of your previously saved configurations can prove to be an invaluable resource. Let’s say you have dozens or hundreds of remote “cookie-cutter” devices out there and a single device is acting up; comparing its configuration to a known good configuration can save you a lot of troubleshooting time, especially when you consider that many configuration managers highlight discrepancies, making possible errors stand out like a sore thumb.
- Configuration / Version archival – This point follows suit with comparing configurations, because in most cases you will want to compare configurations from the same device at different points in time. That makes it easy to pinpoint any recent configuration changes that could be causing the issue at hand.
- Reacting to configuration changes – Changes are a way of life; if nothing is changing then something is wrong. The tougher part is finding a way to manage those changes:
- How do you know when a change occurs? – Many configuration managers will react to a device-generated syslog message or SNMP trap. Do you have this configured, and are you made aware of changes?
- What do you do when a change occurs? – When this mysterious change occurs, how do you react? Do you acknowledge some type of alert or review some daily/weekly report? Are there checks and balances to ensure the altered configuration has been backed up after the change, or to verify that the change was committed to memory?
- Can you correlate network events with changes? – Do you have any capability of correlating network faults with configuration changes? There are definitely a few of those ‘eye in the sky’ type NMS platforms that are capable of linking device-level events to any corresponding configuration events on that device. Now, if only we could get a more abstract ability to monitor device groups holistically and link events across them. In my opinion event correlation exists to some degree in many NMS products, but there is definitely a lot of room for improvement.
- Do you enforce a baseline or set configuration standard? – This one, in my mind, is where many tools have lackluster performance. When you find yourself in a situation with many nodes, you definitely want some ‘checks and balances’ out there. How upsetting is it when you find out that the issue you have been troubleshooting is all because a recent firewall change or routing policy change never got deployed to a particular device? If you can enforce some type of configuration ‘compliance’ or ‘baseline’, I guarantee you will solve many headaches before they occur.
- Software Management – While this might seem outside the scope of configuration management, keeping your software versions in sync can be quite important and instrumental in keeping configurations in sync:
- Software versions – Different software versions may implement different command syntax, which can make configuration and troubleshooting tedious. Different versions may also introduce different software bugs/caveats into your environment.
- Software licenses – Depending on the vendor and the device, if the incorrect license is applied or the incorrect software train is installed on the device, some features may not even be configurable. Nothing is more upsetting than trying to configure a device and finding out that a simple (non-technical) license issue is the root cause.
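To make the configuration comparison idea from the list above a bit more concrete, here is a minimal sketch using Python's standard `difflib` module. The function name `diff_configs`, the labels, and the sample configuration lines are all made up for illustration; a real configuration manager will likely do more (ignoring timestamps, handling vendor-specific ordering, and so on).

```python
import difflib


def diff_configs(known_good: str, suspect: str,
                 good_name: str = "known-good",
                 suspect_name: str = "suspect") -> list[str]:
    """Return unified-diff lines between two configuration texts.

    An empty list means the two configurations are line-for-line identical.
    """
    return list(difflib.unified_diff(
        known_good.splitlines(),
        suspect.splitlines(),
        fromfile=good_name,
        tofile=suspect_name,
        lineterm="",  # inputs were split without newlines, so add none back
    ))


if __name__ == "__main__":
    # Hypothetical "cookie-cutter" branch configurations.
    good = "hostname branch-01\nntp server 10.0.0.1\nlogging host 10.0.0.5\n"
    odd = "hostname branch-02\nntp server 10.0.0.1\n"
    for line in diff_configs(good, odd):
        print(line)
```

Lines prefixed with `-` exist only in the known good configuration and lines prefixed with `+` exist only in the suspect one, which is roughly the "sore thumb" highlighting a configuration manager gives you.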
So, there we have some groundwork for configuration management. Do you have other considerations in regards to configuration management?
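As one possible starting point for the baseline question raised above, a very simple compliance check can just verify that every required line is present in a saved configuration. This is only a sketch: the function name `missing_baseline_lines` and the sample baseline are hypothetical, and it ignores things a real compliance tool would care about, such as which section of the configuration a line sits under.

```python
def missing_baseline_lines(config_text: str, baseline: list[str]) -> list[str]:
    """Return the baseline lines that do not appear in the configuration.

    Matching is exact after stripping leading/trailing whitespace, so
    indentation differences are ignored but any other difference counts.
    """
    present = {line.strip() for line in config_text.splitlines()}
    return [required for required in baseline if required.strip() not in present]


if __name__ == "__main__":
    # Hypothetical baseline and device configuration.
    baseline = ["ntp server 10.0.0.1", "logging host 10.0.0.5"]
    config = "hostname branch-07\n ntp server 10.0.0.1\n"
    for line in missing_baseline_lines(config, baseline):
        print("missing:", line)
```

Running a check like this against every saved configuration after each backup is one way to catch the "change never got deployed to a particular device" situation before it turns into a troubleshooting session.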
A while back I wrote a quick article about network management and tools you can rely on. As you can imagine, network management is quite a broad topic and definitely one I can talk about till I turn blue. So I figured I would throw out some other interesting considerations when dealing with network monitoring.
- The location of the NMS or more particularly the device that performs the device polling. When you get into larger networks and more robust network monitoring applications (Nagios & SolarWinds for example) you can expand into a distributed model. That distributed model sounds awesome at first but there could be some interesting caveats with this approach.
- The location of that distributed poller (did I say that twice?). Why is the location important, though? Any health statistics gathered by the poller will be reported in the NMS from the perspective of the poller. At that point you must ask yourself whether that information is valuable, and whether it provides the level of monitoring you are expecting to capture.
- Running a network monitoring system in a distributed model creates additional overhead and requires more upkeep simply to monitor and maintain the NMS itself. On top of verifying the health of the network/systems/services in your environment, you also need to account for the health of the distributed pollers.
- Are you putting an additional load on the monitored devices? Believe it or not, monitoring a device via SNMP can actually create issues. A device might be susceptible to a software caveat/bug when certain processes are monitored or polled. When polling a device, that monitored device must process and respond to a number of different SNMP Get requests. You definitely want to make sure the device’s other, more important functions/processes are not hindered as the device processes SNMP data from the process/control-plane perspective.
- Consider the additional load of monitored services. In my first post I mentioned monitoring tools such as Syslog, NetFlow & IP SLAs. Depending on the environment, it may be worth considering the amount of traffic those tools can generate. Sure, from a dozen devices it might not add up to much, but what if you had hundreds of devices? At that point syslog messages from a few hundred devices might become a bit overwhelming. In some cases management-related traffic might need to be marked down in QoS policies to avoid it affecting the production traffic of your network.
- Polling intervals: this is an easy one to overlook but an extremely important factor to take into account. It may take some time to find that sweet spot, but:
- Poll a device too aggressively and you risk creating false alarms.
- Poll a device too infrequently and you risk missing important events in the network and/or network outages.
- A ‘very aggressive’ polling interval may also crash the NMS application itself. Remember, at the end of the day these are simple applications running some type of database on the back-end, with limitations.
- Understand the data that the NMS is presenting to you. Anyone can open a web browser and look at a nice fancy graph, but what does that fancy graph really tell us? Let’s say your NMS calculates latency and response time; how does it do that? Is it polling a particular OID, or is it simply pinging the device and looking at the response time, and if so, how often does it issue that ping? The same can be asked of an interface utilization graph: is it simply grabbing the RxLoad/TxLoad statistics from the specific interfaces at a set point, or are other tools like NetFlow taken into account? The more you know about how your NMS works, the easier and quicker it will be for you to interpret, analyze, and diagnose issues presented to you via your NMS.
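As an illustration of the "understand the data" point, one common way an interface utilization figure can be derived is from two samples of an SNMP octet counter (for example `ifHCInOctets` from the IF-MIB) taken one polling interval apart, divided by the interface speed (`ifSpeed`, in bits per second). The sketch below assumes that approach; the function name and sample numbers are made up, and your NMS may well calculate the graph differently, which is exactly why it is worth checking.

```python
def utilization_percent(octets_start: int, octets_end: int,
                        interval_seconds: float, if_speed_bps: int,
                        counter_bits: int = 64) -> float:
    """Average utilization (percent) over one polling interval.

    octets_start / octets_end: octet counter values at the two polls.
    counter_bits: 64 for ifHCInOctets-style counters, 32 for ifInOctets.
    The modulo handles a single counter wrap between the two polls.
    """
    if interval_seconds <= 0 or if_speed_bps <= 0:
        raise ValueError("interval and interface speed must be positive")
    delta_octets = (octets_end - octets_start) % (2 ** counter_bits)
    bits_transferred = delta_octets * 8
    return bits_transferred / (interval_seconds * if_speed_bps) * 100


if __name__ == "__main__":
    # Hypothetical samples: 3.75 GB received in 300 s on a 1 Gbps link.
    print(utilization_percent(0, 3_750_000_000, 300, 1_000_000_000))
```

Note that the result is an average over the whole polling interval, so a long interval will smooth out short bursts. That ties directly back to the polling interval trade-offs mentioned earlier.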
I’ve worked with many different network engineering departments at many different companies, and I must say one of the biggest trends I see is that management capabilities are almost always lacking, and usually it is due to one of the following reasons:
1. A complete lack of management tools. While this is usually the rarest issue out there, there are some places that don’t rely on or even have any type of network management tools, and you see some type of Excel spreadsheet or network share containing copies of device configurations. Now, there is nothing wrong with this, especially if you are a really small environment; however, it is definitely not ideal for larger environments and should be avoided.
2. Outdated network management tools. This is only somewhat better than not having any type of management tools. That is, relying on tools that have been EoL for years, to the point you either need to maintain the network management application yourself or worry about it failing. As with any type of network device, the network management software needs to evolve with the network; as more and more technologies are rolled out to the network, you need to ensure the management of those technologies scales just as well.
3. Too many network management applications. You wouldn’t think this is a bad thing, but it can be very easy to get carried away with network management. For example, look at Cisco: they practically have a flavor of ‘Prime’ for everything (CX modules, wireless networks, wired networks, voice/video), which in itself can get overwhelming. Usually on top of those platforms are additional platforms for configuration or performance management (whether it be SolarWinds, PRTG, or WhatsUpGold), and your management turns out to be very decentralized, sometimes leading to confusion and in some cases causing companies to purchase duplicate licensing that they don’t need.
4. Not knowing what to actually monitor. Granted, efficient management techniques come with time and experience. To be honest, the first time many people set up any type of NMS they are instantly ‘wowed’ by the sheer amount of information they get by default (typically historical performance information, NetFlow stats, configuration management), so much so that they do not realize what they don’t see until they find themselves in a troubleshooting situation or outage and begin wishing they had just a little bit more information. For example, look at SolarWinds NPM: only recently did it start adding support for viewing routing tables and routing neighbors; in the past custom pollers would have to be set up to see this type of information. However, you still need to rely on custom pollers to pull specific MIBs for FHRP status, which in my mind is just as important as monitoring a routing protocol.
Now, we do have a very large arsenal of tools to choose from when designing our network management environment and it can be intimidating at first, but the important thing is to understand what we ‘should look at’ depending on the situation we are attempting to troubleshoot. A few great tools are:
- Historical performance records – These are always great, since those types of tools will passively (and automatically) establish a baseline for us, allowing us to quickly determine if a network device or segment is experiencing any abnormal performance.
- Syslog/traps – Remember, syslogs and traps are basically the equivalent of error logs in the Windows Event Viewer and are able to quickly tell us if the router is experiencing any type of issue. Of course, logging needs to be properly configured and possibly filtered to ensure the logs give us the information we need to see quickly, without having to sift through thousands of events!
- NetFlow – NetFlow data is an amazing resource, especially when teamed up with NBAR. Together these can quickly tell us what traffic types and patterns are going through our router. So let’s say a particular remote site is experiencing performance issues: NetFlow can easily tell us if some specific traffic is over-utilizing the bandwidth or flooding the interface.
- Configuration management – While this one is a given for any large network, it can also be used to quickly identify any network changes that could be causing a negative impact, and pretty much all of the configuration management tools out there today include the ability to automatically compare previous configuration sets and highlight the differences.
- Software management – You might not consider this one at first, but knowing what type and version of software is running in your network is very important, especially if you are unlucky enough to stumble upon a bug within the software. In those events you want to be able to quickly identify what other devices in your network will be affected by this software bug, and you will in turn want a simple and manageable way to upgrade and replace that software.
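To illustrate the software management point, here is a rough sketch of the kind of lookup you would want once a bug is announced: given an inventory of device names and the software versions they run, list the devices on an affected version. The inventory, device names, and version strings below are entirely hypothetical, and in practice the inventory would come from your NMS rather than a hand-written dictionary.

```python
from collections import defaultdict


def devices_by_version(inventory: dict[str, str]) -> dict[str, list[str]]:
    """Group device names by the software version they are running."""
    groups: dict[str, list[str]] = defaultdict(list)
    for device, version in inventory.items():
        groups[version].append(device)
    return {version: sorted(devices) for version, devices in groups.items()}


def affected_devices(inventory: dict[str, str],
                     affected_versions: set[str]) -> list[str]:
    """Return the devices running any of the affected versions, sorted by name."""
    return sorted(device for device, version in inventory.items()
                  if version in affected_versions)


if __name__ == "__main__":
    # Hypothetical inventory: device name -> software version string.
    inventory = {
        "branch-01": "15.2(4)M6",
        "branch-02": "15.2(4)M6",
        "core-01": "15.4(3)M2",
    }
    print(devices_by_version(inventory))
    print(affected_devices(inventory, {"15.2(4)M6"}))
```

The grouping view is also a quick way to see how many different versions are actually running, which is usually the first step toward getting them back in sync.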