Network Taps are a crucial part of the tool kit used by engineers responsible for network security, performance, and capacity management. Taps are robust devices that rarely exhibit issues, but as with any network equipment, it may occasionally be necessary to troubleshoot them. In some cases this is part of a multi-layered process of elimination, but in certain circumstances the process might begin with the Tap itself. The following procedures are presented with the Datacom Systems SINGLEstream aggregation Taps in mind, but the general principles are applicable to all brands of Taps.
We will begin with a general overview of Copper Aggregation Taps – which provides the context required for troubleshooting.
Fiber Taps will be discussed in the next article of this two-part series.
How Do Copper Taps Work?*
Copper Taps are Layer 2 devices. Each Tap has two “Network Ports” which comprise the Tap assembly (multi-link Taps have multiple assemblies, which operate independently of one another).
When the Tap is powered and inserted into a network link, each link endpoint device independently establishes link with the Tap. Once the two endpoint devices have established link, traffic passes through the Tap, which is invisible to the network, and copies are sent to one or more Monitor Ports. In the event of power loss to the Tap, a passive relay assembly moves into position, providing a passive bypass that allows traffic to continue flowing on the link. In such instances, the link is momentarily interrupted: the two endpoints must re-establish link with one another, as the links they had established with the Tap have been lost. When power is restored, the interruption occurs again, as the endpoints must now re-establish link with the Tap. The duration of the interruption upon power down or power up is a function of the time required for the ports to re-establish link: approximately 3.2 seconds on power down and 3 seconds on power up.
*(Datacom Systems CTP-1000 Taps use a slightly different design, but the general principles remain the same. A special, truly passive copper Tap is also offered for 10/100 traffic only; contact Support for technical details.)
Symptoms Indicating That Troubleshooting Is Required
- Link has “bounced” for unknown reasons
- Link is down and not passing traffic
- Link is running very slowly and appears to be a bottleneck for traffic
- Monitor ports are not sending out data
- Monitor ports are showing only broadcast or link negotiation traffic
- Monitor ports are sending data but monitoring tool indicates that some traffic (packets) is missing/dropped
Steps to Troubleshooting
*(the Tap assemblies are always able to carry all traffic on the tapped link at up to 100% utilization – the issues surrounding dropped packets relate to the process of capturing and monitoring copies of the link data.)
It’s also important to note that, even when equipped with robust internal components, a conventional PC/server with 1G capture interfaces is not capable of capturing traffic at high utilization levels for a sustained period.
- Check to ensure that both external power supplies are securely attached, and verify that the Power LEDs on the front of the Tap are both illuminated. The power supplies are load sharing and redundant, so units will operate with one or both power supplies attached. Momentary power outages to the Tap will cause the link to bounce.
- Verify the integrity of the cables connecting the Tap to the endpoint devices and the monitoring tools. Many problems can quickly be identified and eliminated by starting at the Physical Layer.
- Check the port status of the two endpoint devices. If either endpoint is down, the link will not pass traffic, even if the other endpoint still has a valid link with the Tap.
- If the link is running slowly, the likely cause is a duplex mismatch. Duplex mismatch occurs when two devices connected by Ethernet fail to properly negotiate their connection. Ethernet can run at different speeds and at half duplex or full duplex. Each of the two ports of the Tap assembly negotiates separately, to the highest speed available on the endpoint device port it faces. If one endpoint of a link is capable of 10/100 only, but the other endpoint is capable of 10/100/1000, the two sides will therefore negotiate different speeds. In that case, the Tap ports and both endpoints must all be set to 100 Mbps Full Duplex. Otherwise, the link will default to 10 Mbps Half Duplex, resulting in packet loss and severe performance impact.
- Datacom SINGLEstream aggregation Taps are shipped with default settings. The Tap assembly ports are always set to Auto, and these two ports always copy or send data to each other. This is mandatory, as these ports are in-line and must send data between the endpoints. The remaining “Any-to-Any” ports are typically used as Monitor ports, which send data to attached monitoring tools. By default, the “Any-to-Any” ports have no data copied to them from the Tap assembly. The user must decide whether to use aggregated or non-aggregated output of the data copies from the tapped link(s), and decide which ports will be used as Monitor ports. The CLI commands to accomplish this are in the SINGLEstream user documentation. Assistance is also offered by Datacom Systems support.
- If the Monitor ports have not been configured to send data, but a monitoring tool has been attached, then that tool will see only the link negotiation traffic and handshakes that are occurring between the monitor port of the Tap and the capture NIC of the monitoring tool.
- If some of the traffic from the tapped link is visible but some packets appear to be missing, then there are multiple possible causes. Note: Taps are not selective – they will not prioritize or send one type of traffic while dropping other types (unless hardware based filtering is being used – which is outside the scope of this article.)
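As a rough illustration of the speed negotiation check in the list above, the following Python sketch models each Tap assembly port negotiating separately with the endpoint it faces. It is a simplified model, not a query against a real Tap or switch: the function names, the `TAP_SPEEDS` tuple, and the idea of passing in each endpoint's supported speeds by hand are all assumptions made for illustration.

```python
# Simplified, hypothetical model of per-side speed negotiation through a
# copper Tap. Speeds are in Mbps. Nothing here talks to real hardware.

TAP_SPEEDS = (10, 100, 1000)  # assumption: Tap ports left on Auto, 10/100/1000


def negotiated_speed(tap_speeds, endpoint_speeds):
    """Return the highest speed both sides support, or None if none match."""
    common = set(tap_speeds) & set(endpoint_speeds)
    return max(common) if common else None


def check_link_sides(endpoint_a_speeds, endpoint_b_speeds, tap_speeds=TAP_SPEEDS):
    """Negotiate each side separately and report whether the speeds agree.

    Returns (speed_side_a, speed_side_b, sides_match).
    """
    side_a = negotiated_speed(tap_speeds, endpoint_a_speeds)
    side_b = negotiated_speed(tap_speeds, endpoint_b_speeds)
    return side_a, side_b, side_a == side_b


if __name__ == "__main__":
    # Endpoint A is 10/100 only; endpoint B is 10/100/1000.
    side_a, side_b, ok = check_link_sides((10, 100), (10, 100, 1000))
    if not ok:
        print(f"Side A negotiated {side_a} Mbps, side B negotiated {side_b} Mbps:")
        print("set the Tap ports and both endpoints to a common speed and duplex.")
```

In this model a 10/100 endpoint on one side and a 10/100/1000 endpoint on the other produce different speeds on the two sides, which is the condition that calls for fixing all ports to 100 Mbps Full Duplex.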
Likely causes of dropped traffic are all related to potential oversubscription in the Tap or in the monitoring tool itself.
Gigabit Ethernet links operate at full duplex. In other words, a 1G link operating at maximum capacity carries 2G of traffic: 1G in each direction. If the aggregate utilization of the tapped link exceeds 50%, and the output to the Monitor port(s) is aggregated, then a volume of traffic greater than 1G will be passing through the switching chipset of the Tap. Brief utilization spikes are accommodated by a buffer on the chipset, but any sustained utilization exceeding 50% will result in random packet loss before the data copies reach the tool.
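The 50% threshold follows from simple arithmetic, which the short Python sketch below works through. It assumes a 1G tapped link and a single 1G Monitor port receiving aggregated output; the function names and the use of per-direction utilization fractions (0.0 to 1.0) are illustrative choices, not part of any Datacom tool. It also ignores the chipset buffer, so it describes sustained load only.

```python
# Hypothetical helper: does aggregated output exceed the Monitor port rate?

LINK_RATE_GBPS = 1.0     # rate of the tapped link in each direction
MONITOR_RATE_GBPS = 1.0  # assumption: aggregated output goes to one 1G port


def aggregated_load_gbps(util_a_to_b, util_b_to_a, link_rate_gbps=LINK_RATE_GBPS):
    """Combined volume (Gbps) of both directions of a full-duplex link.

    Utilizations are fractions of the per-direction link rate (0.0 to 1.0).
    """
    for util in (util_a_to_b, util_b_to_a):
        if not 0.0 <= util <= 1.0:
            raise ValueError("utilization must be between 0.0 and 1.0")
    return (util_a_to_b + util_b_to_a) * link_rate_gbps


def is_oversubscribed(util_a_to_b, util_b_to_a, monitor_rate_gbps=MONITOR_RATE_GBPS):
    """True if sustained aggregated traffic exceeds the Monitor port rate."""
    return aggregated_load_gbps(util_a_to_b, util_b_to_a) > monitor_rate_gbps


if __name__ == "__main__":
    # 60% utilization in each direction: 1.2G of copies into a 1G Monitor port.
    print(aggregated_load_gbps(0.6, 0.6), is_oversubscribed(0.6, 0.6))
    # 30% and 40%: 0.7G combined, which fits.
    print(aggregated_load_gbps(0.3, 0.4), is_oversubscribed(0.3, 0.4))
```

An average of 50% across the two directions corresponds to exactly 1G of combined traffic, so anything sustained above that cannot fit through a 1G aggregated output.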
If examination of historical utilization records for the link suggests that it is routinely exceeding 1G in overall utilization, then the first step is to change the Tap configuration from aggregated to non-aggregated output. This eliminates the possibility of dropped packets within the Tap.* It will then be necessary to use a monitoring tool with dual capture NICs and hardware robust enough to capture on both NICs simultaneously and to handle higher throughput of longer duration. It is also necessary to use software that can recombine the two captured data streams for post-capture display. In this instance the PC/server hardware being used as the capture device should be equipped with multiple CPUs and a hard drive with a fast write-to-disk speed.
Independent consultant Chris Greer of Packet Pioneer (https://packetpioneer.com/) said this in a 2017 interview:
“The ubiquitous use of Wireshark on a laptop translates to a high degree of packet loss which can significantly decrease the accuracy of any analytics run on that data. In a 1 Gb data stream, I had 82 percent packet loss,” Greer says. “One gig went out and I was only able to capture 18 percent of the packets. As I turned down the volume, I had to go down to about 50 Mb per second on my laptop to capture all the packets. Fifty megabits per second, are you kidding me? We should never be using a laptop in a data center to capture packets. We’re going to drop packets and that’s going to affect our analysis. If we just take Wireshark and install it on a laptop, and put it on a link, we can’t capture forever,” Greer says. “We can ring buffer, we can get captures, to the limit that the laptop can capture. But today we need long-term capture because problems are intermittent. We need to be able to catch it in the act. I think that’s the most difficult thing, just being there, on the link, when the problem is occurring. How many times have you heard of a problem, someone complains of something, by the time you run out there to analyze it, the problem’s gone? For me, an organization has flown me up and they’ve had an issue. They fly me up and the problem disappears. I leave and the problem comes back. Long-term, stream-to-disk packet capture is what we need.”*
*(Datacom Systems does not sell or recommend specific brands or models of stream to disk data capture storage solutions, but it’s crucial for all users to recognize where the potential bottlenecks for data capture might appear in visibility solutions.)
For those interested in capturing from multiple 1G copper links, with the intention of aggregating data copies of asynchronous or dynamically load balanced traffic occurring in a network, Datacom Systems offers the new G Series multi-link copper 1G Taps, all of which offer a minimum of four SFP+ monitor ports capable of supporting 10G capture tools. This can be a cost-effective way of accommodating increased network speeds with an existing set of monitoring tools. https://www.datacomsystems.com/singlestream-g-series-link-aggregation-taps/