Virtual LAN, or VLAN, technology can fail for various reasons. VLAN connectivity problems can stem from faulty physical connections, Layer 2 data link configuration errors or problems with the Layer 3 routed network configuration.
Here are some steps to take when troubleshooting VLAN connectivity issues.
Basic physical connectivity must exist for the network to function. Typical problems include broken wires or optical cables, dust or dirt on optical connectors, bad connectors, interference from electrical systems or pinched cables.
Many of these problems manifest themselves as unidirectional links, where packets go in one direction but not the other. Network devices can frequently detect unidirectional links, making them easier to diagnose with simple commands, like show interface. Admins need to check the interface status and error counters in the output to identify the specific type of problem.
On slow-speed Ethernet links, check the duplex setting. Both sides of a link must be configured for the same duplex setting -- auto, full or half -- and speed. A duplex mismatch can work at low packet rates but fail at higher packet rates, so don't rely on a simple ping test. An interface that shows late collisions is in half-duplex and is communicating with a full-duplex interface. A full-duplex interface shows runt frames if the connected device is in half-duplex. The recommended setting for most devices is auto.
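The counter rules above can be sketched as a small check. This is a minimal illustration, not a parser for any vendor's output: the InterfaceCounters class and its field names are hypothetical, and the values are assumed to have already been read from show interface on the device.

```python
from dataclasses import dataclass


@dataclass
class InterfaceCounters:
    """Hypothetical subset of the counters an admin reads from show interface."""
    name: str
    duplex: str              # "full" or "half", as reported by the device
    late_collisions: int = 0
    runts: int = 0


def duplex_hint(c: InterfaceCounters) -> str:
    """Return a hint about a possible duplex mismatch on this interface."""
    # Late collisions on a half-duplex port suggest the peer is full-duplex.
    if c.duplex == "half" and c.late_collisions > 0:
        return f"{c.name}: late collisions on half duplex; peer may be full duplex"
    # Runt frames on a full-duplex port suggest the peer is half-duplex.
    if c.duplex == "full" and c.runts > 0:
        return f"{c.name}: runts on full duplex; peer may be half duplex"
    return f"{c.name}: no duplex mismatch indicators"


if __name__ == "__main__":
    print(duplex_hint(InterfaceCounters("Gi0/1", "half", late_collisions=12)))
    print(duplex_hint(InterfaceCounters("Gi0/2", "full", runts=3)))
```

A clean result here doesn't prove the link is healthy. It only means these two indicators are absent.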
Incorrect VLAN configuration
The most common data link errors are an incorrectly configured VLAN ID on a port or a missing voice VLAN ID on ports that connect IP phones. The link looks good and packet counters increment, but there's no connectivity. In this case, admins should run a simple check of the configuration.
On trunking links, admins need to set the native VLAN, which tells the switch which VLAN to use for any frame that doesn't carry a VLAN ID. This ID must match on both ends of a trunk and is normally consistent across the entire network, so admins need only perform a simple configuration check.
Switch-to-switch links often use trunking to pass multiple VLANs over a single link. The permitted VLAN list must match on both ends of the link. A mismatch can result in isolated instances of a VLAN. Connectivity works for some endpoints and not for others. Here, run simple configuration checks on the switch trunk interfaces.
The above configuration checks are ideal places to apply configuration validation automation. These checks don't need to apply changes -- they simply need to highlight potential problems to the networking staff.
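As a rough sketch of such a read-only validation, the following compares the two ends of a trunk and reports mismatches without changing anything. The TrunkEnd model, switch names and VLAN numbers are all hypothetical. In practice the data would come from whatever inventory or configuration collection tool is already in place.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class TrunkEnd:
    """Hypothetical model of one end of a switch-to-switch trunk."""
    switch: str
    interface: str
    native_vlan: int
    allowed_vlans: FrozenSet[int]


def check_trunk(a: TrunkEnd, b: TrunkEnd) -> List[str]:
    """Return human-readable findings; an empty list means both ends agree."""
    findings = []
    if a.native_vlan != b.native_vlan:
        findings.append(
            f"native VLAN mismatch: {a.switch} uses {a.native_vlan}, "
            f"{b.switch} uses {b.native_vlan}"
        )
    # VLANs permitted on only one end are isolated across this link.
    only_a = sorted(a.allowed_vlans - b.allowed_vlans)
    only_b = sorted(b.allowed_vlans - a.allowed_vlans)
    if only_a:
        findings.append(f"VLANs allowed only on {a.switch}: {only_a}")
    if only_b:
        findings.append(f"VLANs allowed only on {b.switch}: {only_b}")
    return findings


if __name__ == "__main__":
    sw1 = TrunkEnd("sw1", "Gi1/0/1", 1, frozenset({10, 20, 30}))
    sw2 = TrunkEnd("sw2", "Gi1/0/1", 99, frozenset({10, 20}))
    for finding in check_trunk(sw1, sw2):
        print(finding)
```

The function only reports. Deciding which end is correct is left to the networking staff.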
Forwarding loops in a switched network
Switched networks traditionally rely on Spanning Tree Protocol (STP) to prevent forwarding loops. But, in some cases, loops occur even when STP is running. A loop forwards Ethernet frames around the same path over and over at high speed, consuming interface bandwidth and switch CPU cycles. It quickly causes a network to become so congested it ceases to function. Unfortunately, because the CPUs and network links are saturated, it is usually impossible to use the network itself to diagnose the problem.
To troubleshoot, admins should break the network into successively smaller domains to identify the loop's location. Divide the network in the middle, and identify which half contains the loop. Admins can repeat the subdivision process until they identify the switches on which the loop is located and the interfaces that are interconnected. It's a good idea to practice this in a lab environment to learn the process. Vendors have also created functions, such as Unidirectional Link Detection, Loop Guard, Root Guard and BPDU Guard, to prevent different types of loops.
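The subdivision process is a binary search, which can be sketched as follows. This is a simplified model: it assumes a single loop that lies entirely within one segment. The loop_present callback is hypothetical and stands in for the manual step of isolating a group of segments and observing whether the storm persists there.

```python
from typing import Callable, Sequence


def locate_loop(
    segments: Sequence[str],
    loop_present: Callable[[Sequence[str]], bool],
) -> str:
    """Narrow a list of network segments down to the one containing the loop.

    Assumes exactly one loop, fully contained in one segment.
    """
    candidates = list(segments)
    if not candidates:
        raise ValueError("no segments to search")
    while len(candidates) > 1:
        mid = len(candidates) // 2
        first_half = candidates[:mid]
        # Isolate the first half and ask whether the loop is inside it.
        if loop_present(first_half):
            candidates = first_half
        else:
            candidates = candidates[mid:]
    return candidates[0]


if __name__ == "__main__":
    # Lab-style simulation: pretend the loop is in "access-4".
    segments = [f"access-{n}" for n in range(1, 9)]
    print(locate_loop(segments, lambda group: "access-4" in group))
```

Each probe halves the search space, so eight segments need three isolation steps rather than up to seven one-by-one checks.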
In rarer cases, a switch might forget where an endpoint is located within a VLAN, resulting in a situation known as unicast flooding. This happens when the switch's media access control (MAC) address-to-port cache timer is shorter than the timer on the VLAN router's IP address-to-MAC address cache. An example is described in "Unicast Flooding in Switched Campus Networks." The switch forgets which port a given MAC address is on while the router still holds the address mapping, causing the switch to flood any frame destined for the MAC address to all ports in the VLAN. Several network topologies and scenarios can cause this flooding. If the affected systems send a lot of data, such as during a disk backup, all systems on the VLAN experience a large load.
Admins can identify this problem when the end systems on the affected VLAN become sluggish and the packet counters on all interfaces in the VLAN increment at the same rate. One option is to set the MAC address-to-port timer slightly higher than the IP address-to-MAC timer. Alternatively, switch vendors have implemented features that help avoid the high load by limiting the amount of unknown unicast flooding. These are vendor-specific commands, so admins should check their vendor's documentation.
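The timer relationship is easy to check automatically. The sketch below assumes both timers have already been collected in seconds. The function names, the 60-second margin and the example values are illustrative assumptions, not taken from any particular platform.

```python
def flooding_risk(mac_aging_s: int, arp_timeout_s: int) -> bool:
    """True when MAC table entries can expire while the router's
    IP-to-MAC entries are still valid, which allows unicast flooding."""
    return mac_aging_s <= arp_timeout_s


def suggested_mac_aging(arp_timeout_s: int, margin_s: int = 60) -> int:
    """Suggest a MAC aging time slightly higher than the IP-to-MAC timer.

    The margin is an arbitrary illustrative value.
    """
    return arp_timeout_s + margin_s


if __name__ == "__main__":
    mac_aging, arp_timeout = 300, 14400  # illustrative values, in seconds
    if flooding_risk(mac_aging, arp_timeout):
        print(f"at risk; consider MAC aging >= {suggested_mac_aging(arp_timeout)} s")
```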
Layer 3 (routed network) problems
Another class of problems affects a VLAN's connectivity to the rest of a Layer 3 network. In these cases, the VLAN operates correctly, but its external connectivity doesn't work. If admins can ping at least one other system on the subnet, basic Layer 2 connectivity is working, and it's likely a Layer 3 problem. There are exceptions, so be open to alternative scenarios.
If the problem is with a single endpoint, check that its IP address is in the right subnet and has the right subnet mask. An incorrect configuration could result from a typo in the configuration process or a wrong VLAN ID configuration on the endpoint's switch interface, which puts it in the wrong VLAN/subnet.
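A check like this can be sketched with Python's standard ipaddress module. The function name and the addresses are hypothetical examples. The expected subnet would come from the VLAN's documented addressing plan.

```python
import ipaddress
from typing import List


def check_endpoint(ip_with_mask: str, expected_subnet: str) -> List[str]:
    """Compare an endpoint's configured address and mask with the VLAN's subnet."""
    iface = ipaddress.ip_interface(ip_with_mask)    # e.g. "10.1.20.15/24"
    expected = ipaddress.ip_network(expected_subnet)
    problems = []
    if iface.ip not in expected:
        # Could be a typo, or the switch port is in the wrong VLAN.
        problems.append(f"{iface.ip} is outside expected subnet {expected}")
    elif iface.network.prefixlen != expected.prefixlen:
        problems.append(
            f"mask /{iface.network.prefixlen} differs from expected /{expected.prefixlen}"
        )
    return problems


if __name__ == "__main__":
    print(check_endpoint("10.1.20.15/24", "10.1.20.0/24"))
    print(check_endpoint("10.1.30.15/24", "10.1.20.0/24"))
    print(check_endpoint("10.1.20.15/255.255.255.128", "10.1.20.0/24"))
```

An empty list means the address and mask agree with the plan. It says nothing about whether the switch port itself is in the right VLAN.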
Admins should be able to ping the default gateway on the subnet, as well as adjacent systems on the same subnet. If adjacent systems respond to ping but the default gateway doesn't, then two possible scenarios are causing the issue.
The first possibility is that the default gateway isn't properly configured. This could be a missing switch virtual interface (SVI), or the router that connects the VLAN to the Layer 3 routed network is missing, misconfigured or not in an "up" operational state. Admins should diagnose the SVI or router connection next and, when it is validated, go back to the failing endpoint. Further testing may require admins to return to the Layer 2 testing scenarios described above.
The second possibility is that the endpoint's subnet mask is wrong. The symptom of this scenario is that the endpoint can ping some, but not all, other endpoints within the VLAN/subnet. Whether it can reach the default gateway and have packets properly routed back depends on the specific addresses involved. Again, this is a case where network validation automation is a great help.
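To see why the outcome depends on the specific addresses, it helps to compute which peers the endpoint itself considers local. The sketch below uses Python's standard ipaddress module. The function name and all addresses are hypothetical: an endpoint configured with a /25 mask sits in a subnet that is really a /24.

```python
import ipaddress
from typing import Dict, Iterable


def on_link_view(endpoint: str, peers: Iterable[str]) -> Dict[str, bool]:
    """For each peer, report whether the endpoint treats it as on its own subnet.

    Peers marked False are sent to the default gateway instead of directly.
    """
    iface = ipaddress.ip_interface(endpoint)
    return {p: ipaddress.ip_address(p) in iface.network for p in peers}


if __name__ == "__main__":
    # Endpoint believes its subnet is 10.1.20.0/25 (hosts .1 to .126).
    view = on_link_view("10.1.20.15/25", ["10.1.20.1", "10.1.20.100", "10.1.20.200"])
    for peer, local in view.items():
        print(peer, "on-link" if local else "via gateway")
```

If the gateway's own address falls outside what the endpoint considers local, the endpoint typically can't use it at all, which is one way the symptom varies by address.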
Network troubleshooting works best with a divide-and-conquer approach. Observe the symptoms, and determine if the problem is at the physical layer, data link layer, routed layer or application layer. Determine where connectivity fails and why, and then start checking specific items related to that layer. Test for each potential failure to identify where the problem lies and what needs to be corrected. VLAN troubleshooting is a valuable skill learned through experience.