In my previous article we reviewed the overall topic of management interfaces for Network Taps and Network Packet Brokers. In this chapter we will focus on fault management. Subsequent chapters will cover the other management topics, including configuration, accounting, performance monitoring, security, and remote access.
Fault Management Scope
Fault management in network devices covers detecting problems with the device or its functionality, identifying their cause, and resolving them. In my discussion I will focus on the interfaces and methods used to perform fault management, primarily from the network device perspective, but I will also discuss the role of an EMS or NMS where it is needed for a given function.
It is generally accepted that the fault management topic for network equipment includes the following capabilities:
- Detection – which includes the ability to record and report fault events
- Correlation and aggregation – identifying which faults are related to the same cause
- Diagnosis and Isolation – how we identify the cause of the issue
- Restoration – recovering from the condition so the device can return to normal service
- Resolution – addressing the cause of the issue to limit or prevent recurrence
Fault Causes
In Taps and NPBs, the source of the fault could be any of the following:
- Device hardware issue – including processors, memory, power, cooling, etc.
- Network cabling issue – including breakout cables
- Network issue – traffic flow, traffic rate, traffic type
- Device software issue – firmware, OS, driver, or application
- Configuration issue – an error in the device or network configuration
Fault Detection
In my experience there are three main ways that faults may be detected:
- By the software running on the device – which detects the cause or symptom of the fault and reports it
- By a management system (NMS or EMS) – which through analysis of telemetry from one or more devices or applications in the network identifies a fault
- By a user or system using the network – which identifies a loss of connectivity, loss of service or reduction in service quality that affects the user/system
The latter two methods are out of scope for what the Tap or NPB can do on-board, but they do indicate that faults can occur that are not detected locally. This drives the requirement for the Diagnosis and Restoration capabilities discussed later, even in the absence of an on-board reported issue.
Considering the first detection method, once the device detects an issue, we then must consider how to record and report the issue to the device operator.
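To make this first detection method concrete, here is a minimal sketch in Python of an on-device polling loop that watches link state and raises a fault event on a down transition. The helpers read_link_state and report_event are hypothetical placeholders for the device's driver layer and its fault recording/reporting subsystem.

```python
import time

# Hypothetical placeholders: a real device would read link state from its
# hardware drivers and hand events to the fault recording/reporting subsystem.
def read_link_state(port: str) -> bool:
    """Return True if the port currently has link (placeholder)."""
    return True

def report_event(severity: str, message: str) -> None:
    """Hand a fault event to the recording/reporting subsystem (placeholder)."""
    print(f"{severity}: {message}")

def detect_link_faults(ports, poll_interval_s: float = 1.0) -> None:
    """Poll link state and report a fault only on a state transition."""
    last_state = {port: True for port in ports}
    while True:
        for port in ports:
            up = read_link_state(port)
            if last_state[port] and not up:
                report_event("major", f"Loss of link on {port}")
            elif not last_state[port] and up:
                report_event("info", f"Link restored on {port}")
            last_state[port] = up
        time.sleep(poll_interval_s)
```

Reporting only on transitions, rather than on every poll, is one simple way to avoid flooding the fault log with duplicates of the same condition.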
Fault Recording – Fault recording is the storage of, and the ability to retrieve, the fault history on the device. While many individual protocols are used, there are two general methods utilized in network equipment:
- Log – A log file containing a running record of each fault event or alarm state change (see my write-up on management interfaces for a discussion of events vs. alarms). This is usually stored in syslog but can also be a log stored in a device-specific file format or within a database table.
Syslog (RFC 5424) is preferred since it is a standard and well understood interface. It is recommended that the facility and severity fields be used as accurately as possible to identify the fault event. Syslog can be used to drive user interface presentation of the events (command line, GUI, or shell interface), serves as a storage mechanism, and can relay fault events to other systems such as an EMS/NMS.
The syslog RFC defines a basic mechanism for transmission, but there are a number of RFCs that describe more reliable and secure mechanisms, including:
- RFC 3195 – Reliable Delivery for syslog
- RFC 5425 – TLS Transport Mapping for Syslog
- RFC 5426 – Transmission of Syslog Messages over UDP
- RFC 5848 – Signed Syslog Messages
- RFC 6587 – Transmission of Syslog Messages over TCP
Syslog-ng – an implementation of syslog that I recommend because it incorporates many of these extensions.
- Database – Since a standard log file such as syslog does not provide the ability to augment an event (see the aggregation, correlation, and resolution topics below), I generally recommend that a fault database be provided as the chief method of interacting with the device fault list. A database not only stores the events but is also responsible for the following (a minimal schema sketch follows this list):
- recording changes to an event over time (a historical view of a fault event)
- mapping multiple events to an alarm state
- recording subsequent updates to an event as it is processed by the device, by users, or by other systems over time (such as acknowledging the event, indicating it is resolved or annotating the issue)
- mapping the event to other related events or to the root cause (correlation/aggregation)
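As a rough illustration of such a fault database, here is a minimal sketch using SQLite from Python. The table and column names are my own illustration rather than any standard schema, and it assumes the device can host a small embedded database.

```python
import sqlite3

# Illustrative schema: one row per fault event, plus a table of follow-up
# updates (acknowledgements, annotations, resolution notes) over time.
SCHEMA = """
CREATE TABLE IF NOT EXISTS fault_event (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    raised_at     TEXT    NOT NULL,          -- ISO-8601 timestamp
    source        TEXT    NOT NULL,          -- component, port, or flow
    severity      TEXT    NOT NULL,          -- critical/major/minor/warning
    message       TEXT    NOT NULL,
    alarm_state   TEXT    DEFAULT 'active',  -- active / cleared (for alarms)
    acknowledged  INTEGER DEFAULT 0,         -- set when a user acknowledges
    root_cause_id INTEGER,                   -- correlation to the causal event
    FOREIGN KEY (root_cause_id) REFERENCES fault_event (id)
);
CREATE TABLE IF NOT EXISTS fault_update (
    id        INTEGER PRIMARY KEY AUTOINCREMENT,
    event_id  INTEGER NOT NULL,
    noted_at  TEXT    NOT NULL,
    note      TEXT    NOT NULL,              -- acknowledgement, annotation, resolution
    FOREIGN KEY (event_id) REFERENCES fault_event (id)
);
"""

def open_fault_db(path: str = "faults.db") -> sqlite3.Connection:
    """Open (and create if needed) the on-device fault database."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

def record_event(conn, raised_at, source, severity, message, root_cause_id=None):
    """Insert a fault event and return its id for later updates/correlation."""
    cur = conn.execute(
        "INSERT INTO fault_event (raised_at, source, severity, message, root_cause_id) "
        "VALUES (?, ?, ?, ?, ?)",
        (raised_at, source, severity, message, root_cause_id),
    )
    conn.commit()
    return cur.lastrowid
```

The root_cause_id column and the fault_update table are what distinguish this from a flat log file: they allow the device (or a user) to annotate, acknowledge, and correlate events after they are first recorded.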
Fault Reporting
There are a significant number of accepted management interfaces to report fault events, not to mention the dozens of general-purpose database access methods, hence I’m only going to outline a few that are or have been in common usage:
- SNMP – Simple Network Management Protocol is a widely used and supported interface for monitoring and in some cases configuration of network devices. For Taps and NPBs I generally recommend supporting SNMPv2c and SNMPv3, supporting the standard system and networking MIBs, and using SNMP traps for a fault reporting channel (most devices should support multiple trap destinations). I typically don’t recommend using SNMP as a primary configuration interface, although some products in the industry use SNMP for that purpose. See the wiki page for a listing of the applicable RFCs.
- Syslog – probably the simplest and most widely used fault logging interface. Syslog can also be used for other log types including configuration change logs, user activity logs, debug logs, etc. See the description of syslog above in the fault recording section. Syslog, and its recommended extensions implemented in Syslog-ng, provides a simple interface for filtering, forwarding, and monitoring events. Syslog can be carried over UDP, but there are extensions that add encryption and congestion control (TLS over TCP). A minimal forwarding sketch appears after this list.
- X.25 – X.25 was commonly used in telephony products prior to widespread reliable internet networks. It provided a communication network for monitoring and controlling network equipment. It is no longer in mainstream use and is not recommended for modern products.
- NETCONF – Network Configuration Protocol is a modern network configuration protocol suitable for Network Taps and NPBs. I recommend using NETCONF both for fault notifications and for syncing the fault and alarm state tables to EMS and NMS systems.
- CORBA – Common Object Request Broker Architecture is an object-oriented architecture which enables cross-platform communication. While I have used CORBA in some advanced device management solutions, I do not recommend CORBA for the small network device market (Network Taps and NPBs) due to the overhead and complexity, which are not required for these device types.
- CMIP – Common Management Information Protocol is another cross-vendor standard for managing complex network elements. This protocol was meant to be a successor to SNMP but proved to be more complex than was needed for most applications. I do not recommend CMIP for Taps and NPBs.
- RMON – Remote Network Monitoring is a standards-based set of monitoring MIBs carried over SNMP that focuses on flow- and segment-based monitoring (the core SNMP MIBs are optimized more for device-based monitoring). For simple devices such as Taps and NPBs I recommend sticking with standard SNMP or, if RMON is used, only implementing the basic fault functionality (alarms, events, history). (Note: for flow monitoring I recommend IPFIX and sFlow, which will be discussed in a separate series of papers.)
- Email/SMTP – Simple Mail Transfer Protocol is probably the easiest remote notification mechanism to implement. The device sends an email to one or more users each time a significant fault event occurs. This avoids the need for specialized SNMP, NETCONF, or even syslog managers/servers and utilizes a user's inbox as a reporting and recording system. The emails can also be parsed by machines as an input to a fault EMS/NMS. If the network has an EMS or NMS that is already receiving syslog, SNMP, NETCONF, or another fault protocol, then those systems will typically support their own SMTP interface as well.
- REST – Representational State Transfer is a style of software architecture based on principles describing how networked resources are defined and addressed. In practice this usually means an HTTP(S) API that an EMS/NMS or a script can use to query the device's fault event and alarm tables.
- CLI – Command Line Interface is the typical line-by-line text interface supported by most devices. In some devices fault messages may be echoed to the command line of some or all users, or a specific command can be issued to list, query, or activate/deactivate notifications of fault events. This is probably the simplest interface and can be used by humans or machines. Its interactive nature makes it ideal for human interaction with the device.
- GUI – Graphical User Interface is, as the name implies, an interface that uses graphics, icons, buttons, labels, etc. In modern devices of this type (Taps and NPBs) it is usually provided as a Web GUI accessed through a browser.
- Other interfaces – There are many other standard protocols, both modern and historical that are in-use across the industry. My general recommendation is to stick to one or two of the most common interfaces used by your customers, such as Syslog, SNMP and NETCONF.
- Proprietary interfaces – I recommend against creating or utilizing non-standard, proprietary fault reporting interfaces. While this might be the easiest to implement in some devices, it will be difficult to get any EMS or NMS vendor to support the interface, and it will make it difficult for your customers to easily deploy and monitor your device in their networks. Some customers will prohibit the use of devices in their networks that do not have at least basic fault management interfaces connected to their monitoring systems.
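To illustrate the syslog reporting path recommended above, here is a minimal sketch using Python's standard logging library to forward fault events to a remote syslog collector. The collector address, facility, and message prefix are assumptions that a real device would make configurable.

```python
import logging
import logging.handlers

def make_fault_logger(collector=("192.0.2.10", 514)) -> logging.Logger:
    """Create a logger that forwards fault events to a syslog collector over UDP."""
    handler = logging.handlers.SysLogHandler(
        address=collector,
        facility=logging.handlers.SysLogHandler.LOG_DAEMON,  # assumed facility
    )
    handler.setFormatter(logging.Formatter("npb-fault: %(message)s"))
    logger = logging.getLogger("npb.fault")
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger

# Usage: map device fault severities onto syslog severities so the collector
# and any EMS/NMS can filter on them accurately.
faults = make_fault_logger()
faults.error("port 1/3: loss of signal")        # forwarded at syslog 'error'
faults.warning("port 1/7: utilization at 92%")  # forwarded at syslog 'warning'
```

The same logger could be given additional handlers (local file, TLS transport, etc.) without changing the code that raises the events.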
Fault Correlation and Aggregation
An important capability of a fault management system, even for simple devices, is to correlate and aggregate fault events in a way that reduces the flood of symptomatic events and assists the user in identifying the cause of the fault.
Consider a case where the CPU of a device is overloaded. This may manifest itself as faults indicating dropped packets, blocked flows, loss of monitoring interfaces, failure of statistics recording, failure to respond to user interfaces, etc. Within this cacophony of faults, the device should also be reporting a CPU overload event, which is the actual cause. It is important that the system provide fault data to the user so that the user can discern the sequence and timing of events to aid in correlation. In an ideal system the secondary faults would either be correlated to the causal fault by the system itself or, at a minimum, the faults that occurred at the same time would be grouped together as "probable" related events. In this example the system should also have raised fault and warning events prior to the overload event, warning the user that CPU load was near capacity (e.g., 90% utilized, 95% utilized, 100% utilized).
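As a rough sketch of those escalating pre-overload warnings, the bands below map utilization to the severity convention described later in this section; the threshold values and messages are illustrative choices, not a standard.

```python
# Illustrative utilization bands: the highest threshold crossed wins.
CPU_SEVERITY_BANDS = [
    (100.0, "critical", "CPU utilization at 100% - overload"),
    (95.0,  "major",    "CPU utilization above 95% - overload imminent"),
    (90.0,  "minor",    "CPU utilization above 90% - nearing capacity"),
]

def classify_cpu_utilization(percent: float):
    """Return (severity, message) for the highest band crossed, or None if healthy."""
    for threshold, severity, message in CPU_SEVERITY_BANDS:
        if percent >= threshold:
            return severity, message
    return None

# e.g. classify_cpu_utilization(96.0) -> ("major", "CPU utilization above 95% - overload imminent")
```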
Best practices for fault correlation and aggregation include:
- Accurate timestamp on all events (and when multiple devices are involved, utilize a synchronized clock across devices)
- Accurate and clear naming of devices and components
- Distinguish fault events from other types of events (info, debug, trace, user activity, etc.)
- Accurate use of event severity – i.e. not everything is critical. Follow industry standards for fault severity classifications: “Fatal, Error, Warning” or “Critical, Major, Minor, Warning”. I prefer this convention:
- Critical – a service-affecting error that affects the whole unit or most of the unit's functionality, e.g., memory exhaustion, power failure, etc.
- Major – a service-affecting error that affects a large subset of the unit's functionality (multiple interfaces or flows in Taps and NPBs), that affects some traffic types, or that affects a major required subsystem such as IPFIX or the user interfaces
- Minor – a non-service-affecting issue that may indicate a future issue could occur, such as high memory or CPU usage, or traffic on a port nearing its capacity limit
- Warning – a non-service-affecting issue that probably should be investigated but is not currently causing, and is not likely to cause, a service disruption in the near term. An example might be a warning that software is out of date or that a port was left in a partially configured state.
- Group repeated events from the same issue under a single item, particularly if these events are generated rapidly. This consolidates the data the user needs to visualize and can itself be used to raise or lower the severity of an event (for example, if a packet loss event occurs more than a prescribed number of times per hour, say 10, on an interface, raise an event indicating the packet flow is unreliable).
- Group related fault types together (in the example, all of the CPU fault warnings should be grouped together so the user can see the trend toward overload).
- Provide an alarm representation rather than just an event list. An alarm state table would show the CPU overload state while many of the symptoms we listed above are not states but are transient events and would not appear in the table.
- Where possible, include rules, or allow the user to provide rules, for common recurring behaviors where correlation can simplify the display and the response (for example, if a packet drop message indicates "packets dropped due to unavailable CPU cycles", then it should be possible to map that event to the corresponding CPU overload state automatically). A small sketch of these grouping and correlation rules follows this list.
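Here is the promised sketch of the two grouping rules, assuming each fault event arrives as a simple dictionary with time, source, and message fields; the window, repeat limit, and field names are illustrative.

```python
from collections import defaultdict, deque
from datetime import timedelta

REPEAT_WINDOW = timedelta(hours=1)  # assumed grouping window
REPEAT_LIMIT = 10                   # assumed repeat threshold

_recent = defaultdict(deque)  # (source, message) -> timestamps of recent occurrences

def aggregate(event):
    """Group repeated events and escalate when the repeat limit is exceeded."""
    key = (event["source"], event["message"])
    times = _recent[key]
    times.append(event["time"])
    while times and event["time"] - times[0] > REPEAT_WINDOW:
        times.popleft()  # drop occurrences outside the window
    if len(times) > REPEAT_LIMIT:
        return {
            "time": event["time"],
            "source": event["source"],
            "severity": "major",
            "message": f"{event['message']} repeated {len(times)} times in the last hour",
        }
    return event

def correlate(event):
    """Map a known symptom message to its causal alarm state."""
    if "dropped due to unavailable CPU cycles" in event["message"]:
        event["root_cause"] = "cpu-overload"
    return event
```

In a real device these rules would be table-driven or user-configurable rather than hard-coded, but the flow is the same: deduplicate first, then attach the probable cause.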
Diagnosis and Isolation
Once the user has been notified of a fault by the device (a fault event or alarm), or by an outside system or user experiencing trouble with a service that utilizes the device, the next step is to diagnose the issue and determine the cause. To support this phase of fault management the device must provide several basic interfaces for data collection and analysis by the user. Here is a partial list (a combined diagnostic query is sketched after it):
- Fault display – as discussed above, but including fault history. If a given port or flow is having an issue, then listing the history of fault events for that port may indicate a cause, such as intermittent packet loss, unusually high traffic rates, etc.
- Non-fault logs – The diagnosis system should be able to display other types of activity logs since these may also aid in identifying the cause of a fault or service failure. For example, perhaps a recent configuration change was made that affected the service. The diagnostic/status display should show not only any fault conditions but also the history of fault and non-fault activity affecting the port/flow/interface in question (e.g., configuration change records).
- Statistics display – statistics, both historic and real-time, for a port, a flow, or even a specific traffic type within a flow can provide significant insight when trying to determine why a service is not working. A table or graph should be provided to aid the user in visualizing the statistics over time and in identifying any abnormal metrics or changes that could result in new issues.
- Test mode – the ability to place a port, flow or interface into a loopback mode in order to test if the issue is with the underlying hardware. The system should also be able to self-generate a traffic test stream so basic traffic flow can be verified while in this mode.
- Relocate – the ability to shift the traffic to an alternate port/interface to determine if the issue is related to the port itself
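Here is the combined diagnostic query mentioned above, sketched in Python. Every helper it calls is a hypothetical placeholder for whatever fault-history, activity-log, and statistics stores the device actually exposes.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical placeholders for the device's data stores.
def get_fault_history(port, since):
    return []  # fault events and alarm changes for the port

def get_activity_log(port, since):
    return []  # non-fault activity, e.g. configuration change records

def get_port_statistics(port, since):
    return []  # historic counters for trend analysis

def diagnose_port(port: str, lookback_hours: int = 24) -> dict:
    """Collect fault history, activity, and statistics for one port in a single view."""
    since = datetime.now(timezone.utc) - timedelta(hours=lookback_hours)
    return {
        "port": port,
        "faults": get_fault_history(port, since),        # fault display + history
        "activity": get_activity_log(port, since),       # non-fault logs
        "statistics": get_port_statistics(port, since),  # statistics display input
    }
```

Presenting these three views together, over the same time window, is what lets a user line up a fault against the configuration change or traffic shift that caused it.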
Restoration
There are two types of restoration – temporary and permanent.
- Temporary restoration – may include resetting a port, rebooting the device, temporarily restricting maximum capacity, reversing a service-impacting configuration change, stopping other processes, simplifying the flow filtering, etc. All of these, and similar actions, allow the user to restore the service, but they typically will not permanently address the problem. If the device is properly configured and connected and temporary restoration was still needed, then the issue may be a device defect or a device limitation. For a defect, the issue will not be resolved until a fix is issued by the vendor. For a device limitation, the desired functionality may simply not be supported by the device, or may require additional feature development and an upgrade from the device vendor.
- Permanent restoration – if the device or flow was improperly configured, then correcting that configuration is a permanent restoration. Cabling issues (reseating or replacing the cable, or using the proper cable type) and other external hardware issues, once resolved, are also considered permanent restorations.
Resolution
Resolution of an issue is a permanent fix. The specific fix depends on the cause of the issue. Here are some examples and their resolution:
- Hardware issue – replace the affected hardware component (e.g., power supply, fan, network interface, CPU, memory, etc.) – contact the vendor to determine which components can be replaced in the field and which must be returned for replacement or repair
- Cabling issue – replace or repair the cable or cable connection
- Capacity overload – reconfigure the network or network service to limit traffic, utilize a higher-capacity transport/interface, or split the traffic into multiple flows.
- Software defect – apply the most up-to-date software and any fixes (patches) provided by the product vendor. If this does not resolve the issue, then report the issue to the vendor so it can be addressed in a future update. The vendor may suggest an alternate solution as a temporary or permanent resolution.
- Configuration issue – correct the configuration issue in the device (or in the network connected to the device). This may not always be obvious, so pay attention to the stated capabilities and options that are supported by the vendor including expected/known limitations. If there is an issue with the documentation, procedures or specification of the product that led to the configuration issue, then report those to the vendor for resolution.
- Unsupported functionality – this is a special case of configuration issue where the user is configuring the device to do something that is not supported, including configuring a capability that exceeds the specification or capacity of the device. In these cases, the temporary solution is to configure the device and network to limit the traffic, traffic type, or flow/routing to remain within the supported device envelope. Notify the vendor of the feature request so the limitation can be better understood and the vendor can communicate whether it will be resolved in a future release or an upgraded product. In some cases the vendor may have configuration or licensing options to enable additional features and/or capacity on an existing device.