Part 1, The basics of network troubleshooting
Part one of a two-part series
As an IBM AIX® systems administrator, it’s inevitable that at some point in your role you will encounter a problem that’s linked to or directly caused by an issue within the LAN or WAN. In such instances, it’s good practice to make an initial diagnosis of the problem before engaging a suitable network administrator to help identify the root cause, or at least to give the administrator a general direction in which to start his or her investigation.
Once engaged, you may be required to assist in the analysis, so it’s essential that you come armed with a relevant diagnostic toolkit. This article provides you with a set of commands available on AIX, many of which are also available on other flavors of UNIX®, which can help you to troubleshoot TCP/IP network-related issues.
For the purposes of this article, the target host system used in all sample commands and output is called testhost.
Is anybody there?
The first step in diagnosing any network-related issue is to verify whether the target host is running. You can use ping to test whether a host is reachable across a network (see Listing 1). This command sends an Internet Control Message Protocol (ICMP) echo request packet to the host and waits for an echo reply.
A successful ping means that:
- Your host has an active network adapter that can be used to send out the request.
- The target host is running and has an active network adapter configured with the IP address that you used.
- Name resolution is working for that host if a name was used rather than an IP address.
- There is a route from your host to the target and back.
- No firewalls on the route between hosts or running on either host are blocking ICMP traffic.
The output from a successful ping can also be useful in helping to determine network latency, as it reports on the time taken to receive the echo reply. Long response times are likely to mean poor performance for any applications that exchange data with the target host.
Listing 1. Pinging a responsive host
#ping testhost
PING testhost: (10.217.1.206): 56 data bytes
64 bytes from 10.217.1.206: icmp_seq=0 ttl=253 time=0 ms
64 bytes from 10.217.1.206: icmp_seq=1 ttl=253 time=0 ms
64 bytes from 10.217.1.206: icmp_seq=2 ttl=253 time=0 ms
64 bytes from 10.217.1.206: icmp_seq=3 ttl=253 time=0 ms
----testhost PING Statistics----
4 packets transmitted, 4 packets received, 0% packet loss
round-trip min/avg/max = 0/0/0 ms
#
If no echo reply is received, then one or more of the conditions described above hasn’t been met and the ping fails (see Listing 2). A ping fails when the number of packets received is less than the number sent and packet loss is greater than 0 percent.
Listing 2. Pinging an unresponsive host
#ping testhost
PING testhost.testdomain.com: (10.216.122.12): 56 data bytes
----testhost.testdomain.com PING Statistics----
5 packets transmitted, 0 packets received, 100% packet loss
#
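If you want to apply the same failure test from a script, the packet-loss figure can be pulled out of the summary line. The following is a minimal sketch, assuming the summary format shown in Listings 1 and 2 (a whole-number percentage followed by "packet loss"); the function name ping_loss is hypothetical, and the sample text stands in for real ping output.

```shell
# ping_loss: read ping output on stdin and print the packet-loss percentage.
# Assumes a summary line such as:
#   "4 packets transmitted, 4 packets received, 0% packet loss"
ping_loss() {
    awk '/packet loss/ {
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^[0-9]+%$/) { sub(/%/, "", $i); print $i; exit }
        }
    }'
}

# Example with captured output; on a live system you might instead run:
#   ping -c 4 testhost | ping_loss
sample='5 packets transmitted, 0 packets received, 100% packet loss'
loss=$(printf '%s\n' "$sample" | ping_loss)
if [ "$loss" -gt 0 ]; then
    echo "ping failed: ${loss}% packet loss"
fi
```

Any loss greater than 0 percent is treated as a failure here, matching the definition above; you may prefer a higher threshold on links where occasional loss is expected.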
If ping was unsuccessful, you can check whether the adapter used to send the request is up by using ifconfig.
The ifconfig command displays the status of an individual adapter (for example, en1 shown in Listing 3) or of all adapters when used with the -a switch, also shown in Listing 3. You should ensure that the adapter used to send packets out to the target host is showing as RUNNING. If it’s not, then you need to investigate further.
Listing 3. Displaying network adapter status
#ifconfig en1
en1: flags=7e080863,40<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD,CHECKSUM_SUPPORT,PSEG>
        inet 10.216.163.37 netmask 0xffffff00 broadcast 10.216.163.255
        tcp_sendspace 131072 tcp_recvspace 65536
#ifconfig -a
en2: flags=7e080863,40<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD,CHECKSUM_SUPPORT,PSEG>
        inet 10.203.35.14 netmask 0xffffff80 broadcast 10.203.35.127
en1: flags=7e080863,40<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD,CHECKSUM_SUPPORT,PSEG>
        inet 10.216.163.37 netmask 0xffffff00 broadcast 10.216.163.255
        tcp_sendspace 131072 tcp_recvspace 65536
en0: flags=7e080822,10<BROADCAST,NOTRAILERS,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD,CHECKSUM_SUPPORT,PSEG>
lo0: flags=e08084b<UP,BROADCAST,LOOPBACK,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT>
        inet 127.0.0.1 netmask 0xff000000 broadcast 127.255.255.255
        inet6 ::1/0
        tcp_sendspace 65536 tcp_recvspace 65536
#
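Checking the flags by eye is easy for one adapter but tedious for several. As a rough sketch, assuming the flags=...<...> layout shown in Listing 3, the following hypothetical helper (is_running is not an AIX command) reads ifconfig output for a single adapter and succeeds only if both UP and RUNNING appear in the flag list.

```shell
# is_running: read "ifconfig <adapter>" output on stdin; exit 0 if the
# flag list contains both UP and RUNNING, non-zero otherwise.
is_running() {
    awk '/flags=/ {
        # Isolate the comma-separated flag names between "<" and ">".
        flags = $0
        sub(/^[^<]*</, "", flags)
        sub(/>.*$/, "", flags)
        n = split(flags, f, ",")
        for (i = 1; i <= n; i++) seen[f[i]] = 1
        exit
    }
    END { exit !(("UP" in seen) && ("RUNNING" in seen)) }'
}

# Example using the en0 line from Listing 3; on a live system you might run:
#   ifconfig en1 | is_running
sample='en0: flags=7e080822,10<BROADCAST,NOTRAILERS,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD,CHECKSUM_SUPPORT,PSEG>'
if printf '%s\n' "$sample" | is_running; then
    echo "en0 is running"
else
    echo "en0 is not running"
fi
```

Matching whole flag names rather than searching for the string RUNNING avoids a false positive if another flag ever contains that text.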
You can display Ethernet statistics on an adapter with entstat (see Listing 4). The example shown uses the -d switch to display all statistics, including device-specific statistics, for the adapter en2. This command can also be useful for telling you the link status (up or down) and media speed (for example, 100 Mbps Full Duplex). The media speed is useful if you need to verify the setting based on the link partner and network that the adapter is connected to, as a speed or duplex mismatch can cause problems.
Listing 4. Displaying Ethernet statistics for a network adapter
#entstat -d en2
-------------------------------------------------------------
ETHERNET STATISTICS (en2) :
Device Type: 10/100/1000 Base-TX PCI-X Adapter (14106902)
Hardware Address: 00:02:55:d3:37:be
Elapsed Time: 114 days 22 hours 48 minutes 20 seconds

Transmit Statistics:                          Receive Statistics:
--------------------                          -------------------
Packets: 490645639                            Packets: 3225432063
Bytes: 9251643184881                          Bytes: 215598601362
Interrupts: 0                                 Interrupts: 3144149248
Transmit Errors: 0                            Receive Errors: 0
Packets Dropped: 0                            Packets Dropped: 0
                                              Bad Packets: 0
Max Packets on S/W Transmit Queue: 109
S/W Transmit Queue Overflow: 0
Current S/W+H/W Transmit Queue Length: 0

Broadcast Packets: 442                        Broadcast Packets: 10394992
Multicast Packets: 0                          Multicast Packets: 349
No Carrier Sense: 0                           CRC Errors: 0
DMA Underrun: 0                               DMA Overrun: 0
Lost CTS Errors: 0                            Alignment Errors: 0
Max Collision Errors: 0                       No Resource Errors: 0
Late Collision Errors: 0                      Receive Collision Errors: 0
Deferred: 0                                   Packet Too Short Errors: 0
SQE Test: 0                                   Packet Too Long Errors: 0
Timeout Errors: 0                             Packets Discarded by Adapter: 0
Single Collision Count: 0                     Receiver Start Count: 0
Multiple Collision Count: 0
Current HW Transmit Queue Length: 0

General Statistics:
-------------------
No mbuf Errors: 0
Adapter Reset Count: 0
Adapter Data Rate: 200
Driver Flags: Up Broadcast Running Simplex 64BitSupport ChecksumOffload PrivateSegment DataRateSet

10/100/1000 Base-TX PCI-X Adapter (14106902) Specific Statistics:
--------------------------------------------------------------------
Link Status: Up
Media Speed Selected: 100 Mbps Full Duplex
Media Speed Running: 100 Mbps Full Duplex
PCI Mode: PCI-X (100-133)
PCI Bus Width: 64-bit
Jumbo Frames: Disabled
TCP Segmentation Offload: Enabled
TCP Segmentation Offload Packets Transmitted: 260772859
TCP Segmentation Offload Packet Errors: 0
Transmit and Receive Flow Control Status: Disabled
Transmit and Receive Flow Control Threshold (High): 32768
Transmit and Receive Flow Control Threshold (Low): 24576
Transmit and Receive Storage Allocation (TX/RX): 16/48
#
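Because the entstat output is long, it can help to extract only the two media speed lines and compare them. The sketch below assumes the "Media Speed Selected" and "Media Speed Running" labels shown in Listing 4, and assumes that an adapter set to auto-negotiate reports a selected value beginning with "Auto"; the function name speed_check is hypothetical.

```shell
# speed_check: read "entstat -d <adapter>" output on stdin and compare the
# selected media speed with the running one.
# Exit status: 0 = match (or auto-negotiated), 1 = mismatch, 2 = not found.
speed_check() {
    awk -F': *' '
        /Media Speed Selected:/ { selected = $2 }
        /Media Speed Running:/  { running  = $2 }
        END {
            if (selected == "" || running == "") { print "unknown"; exit 2 }
            # Assumption: auto-negotiating adapters report "Auto ..." here.
            if (selected ~ /^Auto/) { print "auto-negotiated: " running; exit 0 }
            if (selected == running) { print "match: " running; exit 0 }
            print "mismatch: selected=" selected " running=" running
            exit 1
        }'
}

# Example using the two lines from Listing 4; on a live system you might run:
#   entstat -d en2 | speed_check
sample='Media Speed Selected: 100 Mbps Full Duplex
Media Speed Running: 100 Mbps Full Duplex'
printf '%s\n' "$sample" | speed_check
```

This only compares the adapter’s own configured and running values. Confirming that the switch port at the other end uses the same speed and duplex setting still requires the network administrator.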
If the adapter is up, you can establish whether the route from your host to the target is correct by using route get (see Listing 5). If there’s no route at all, then ping will inform you. If there is a route, you need to establish what it is so that you can verify with the network administrator that it’s correct. Based on the information in the routing table that your host uses, route get will tell you the gateway the packets will be routed to when leaving your host on the way to the target.
Listing 5. Getting routing table information for a host
#route get testhost
   route to: testhost
destination: 10.203.35.128
       mask: 255.255.255.128
    gateway: 10.203.35.1
  interface: en2
interf addr: myhost
      flags: <UP,GATEWAY,DONE,PRCLONING>
 recvpipe  sendpipe  ssthresh  rtt,msec    rttvar  hopcount      mtu     expire
       0         0         0         0         0         0         0   -9751026
#
If the route is correct, then you can use traceroute to determine the exact route that packets will take across the network to the target host. The output of a successful traceroute (see Listing 6) shows each router the packets travel through to reach the target host, along with the round-trip time of each of the three probes sent to that router.
Listing 6. Tracing a successful route to a host
#traceroute testhost
trying to get source for testhost
source should be 10.216.163.37
traceroute to testhost (10.217.1.206) from 10.216.163.37 (10.216.163.37), 30 hops max
outgoing MTU = 1500
 1  10.216.163.2 (10.216.163.2)  1 ms  0 ms  0 ms
 2  10.217.189.6 (10.217.189.6)  0 ms  0 ms  0 ms
 3  testhost (10.217.1.206)  1 ms  1 ms  1 ms
#
The output of an unsuccessful traceroute (see Listing 7) has asterisks (*) in the time fields, because the times cannot be determined when the probe to the next router times out. The example also shows the use of the -n switch, which prints numeric host addresses, thereby avoiding name lookup and resolution and speeding up the trace.
Listing 7. Tracing an unsuccessful route to a host
#traceroute -n testhost
traceroute testhost
trying to get source for testhost
source should be 10.216.163.37
traceroute to 10.216.122.12 from 10.216.163.37, 30 hops max
outgoing MTU = 1500
 1  10.216.163.2  1 ms  0 ms  0 ms
 2  10.216.191.238  1 ms  1 ms  1 ms
 3  10.216.143.10  2 ms  2 ms  2 ms
 4  * * *
 5  * * *
 6  * * *
#
Services running at the application layer of a TCP/IP network listen on one or more ports that are used to exchange data between clients and the host server as managed by the transport layer. If a valid route exists to the host, and it’s responding to pings but the application service is failing to respond, then you can check connectivity to the relevant ports using telnet.
The telnet command, used in its basic form, establishes a terminal connection to a host. However, you can also use it to establish a connection to a specific port on the host (the default being 23, the telnet service). For a list of standard ports, look in /etc/services.
If the connection is successful, a message indicating the telnet escape sequence is shown (see Listing 8). You need to enter this key sequence (typically, Control-]) to escape back to a telnet> prompt and enter quit to return to a shell prompt.
Listing 8. Testing port 80 (HTTP) on a host (successful)
#telnet testhost 80
Trying...
Connected to testhost.
Escape character is '^]'.
^]
telnet> quit
Connection closed.
#
Depending on the type of connection you’re making, the remote service you’re connecting to may generate a message similar to Listing 9.
Listing 9. Testing port 25 (SMTP) on a host (successful)
#telnet testhost 25
Trying...
Connected to testhost.
Escape character is '^]'.
220 testhost.testdomain.com ESMTP Sendmail Wed, 10 Feb 2010 15:52:28 GMT
^]
telnet> quit
Connection closed.
#
If the connection fails, then either a connection timeout or a connection refused message will be displayed (see Listing 10). This message can mean that the service on the target host isn’t running (and therefore nothing is listening on the port), or that a firewall running on the host (or somewhere en route) is blocking connections to the port.
Listing 10. Testing port 515 (remote printing) on a host (unsuccessful)
#telnet testhost 515
Trying...
telnet: Unable to connect to remote host: Connection timed out
#
Do I know you?
When using a host name in an application or any of the diagnostic commands covered in this article, it’s imperative that the host name can be resolved to an IP address. An IP address is what the Internet layer of a TCP/IP network uses when handling data packets.
A host name must resolve through one of the name-resolution services specified in /etc/irs.conf and /etc/netsvc.conf. The hosts record determines the order in which name resolution is performed. Only local and BIND/DNS resolution is covered here; the remaining options are outside the scope of this article.
If local is specified, the /etc/hosts file is used to resolve host names. So, check to see whether there’s an entry for the target host (see Listing 11).
Listing 11. Looking for a host in /etc/hosts
#grep testhost /etc/hosts
10.217.1.206 testhost testhost.testdomain.com aixserver
#
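A plain grep also matches partial names (for example, testhost2) and commented-out entries. If that matters, a stricter check compares whole fields. The following is a minimal sketch, assuming the usual /etc/hosts layout of an address followed by one or more names; hosts_lookup is a hypothetical helper, not an AIX command.

```shell
# hosts_lookup: print the address of every hosts-file entry whose host name
# or alias equals NAME exactly (ignoring case).
# Usage: hosts_lookup NAME [FILE]   FILE defaults to /etc/hosts; "-" is stdin.
hosts_lookup() {
    awk -v name="$1" '
        { sub(/#.*/, "") }                 # drop comments before matching
        {
            for (i = 2; i <= NF; i++) {
                if (tolower($i) == tolower(name)) { print $1; break }
            }
        }' "${2:-/etc/hosts}"
}

# Example using sample entries on stdin; on a live system you might run:
#   hosts_lookup testhost
printf '%s\n' '10.217.1.206 testhost testhost.testdomain.com aixserver' \
              '10.217.1.207 testhost2' | hosts_lookup testhost -
# prints 10.217.1.206
```

If more than one address is printed, the name appears in several entries, which is worth resolving because only the first match is normally used.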
If you specify dns, then DNS is used to resolve host names, and you can use nslookup to check whether the host name resolves (see Listing 12).
Listing 12. Resolving a host name via DNS
#nslookup testhost
Server: testdns.testdomain.com
Address: 184.108.40.206
Name: testhost.testdomain.com
Address: 10.217.1.206
#
A more powerful DNS interrogation tool is dig. This command has a much richer set of command-line options and arguments than nslookup, which offers its additional functionality only through an interactive mode. So, for more complex queries, particularly where the output will be parsed by a script, dig is preferred (see Listing 13).
Listing 13. Reverse lookup of an IP address in DNS
#dig -x 10.217.1.206
; <<>> DiG 9.2.0 <<>> -x 10.217.1.206
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 21351
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;206.1.217.10.in-addr.arpa. IN PTR
;; ANSWER SECTION:
206.1.217.10.in-addr.arpa. 3600 IN PTR testhost.testdomain.com.
;; Query time: 11 msec
;; SERVER: 10.217.1.206#53(10.217.1.206)
;; WHEN: Fri Feb 12 13:28:16 2010
;; MSG SIZE rcvd: 82
#
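For a reverse lookup, dig -x builds the query name by reversing the octets of the IPv4 address and appending in-addr.arpa. If you need that name yourself (for example, to search a zone file), it can be derived as in this small sketch; reverse_name is a hypothetical helper and handles only dotted-quad IPv4 addresses.

```shell
# reverse_name: print the in-addr.arpa name that a reverse (PTR) lookup
# queries for a dotted-quad IPv4 address. Prints nothing for other input.
reverse_name() {
    printf '%s\n' "$1" |
        awk -F. 'NF == 4 { print $4 "." $3 "." $2 "." $1 ".in-addr.arpa" }'
}

reverse_name 10.217.1.206   # prints 206.1.217.10.in-addr.arpa
```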
A host uses the Address Resolution Protocol (ARP) table, or ARP cache, to keep track of the media access control (MAC) addresses of other network devices alongside their IP addresses. The link layer of a TCP/IP network uses a device’s MAC address, so the ARP table is used to translate an IP address to the corresponding MAC address. If your host has communicated successfully with another host, it’s likely that there is an entry in the ARP table. You can use arp to display the entry for a particular host if one exists (see Listing 14).
Listing 14. Displaying a host entry in the ARP table
#arp testhost
testhost (10.217.1.206) at 0:c:29:44:90:28 [ethernet] stored in bucket 0
#
You can also display the entire table using the -a switch (see Listing 15). The -n switch specifies that addresses should be shown numerically, without resolving them to host names.
Listing 15. Displaying the contents of the ARP table
#arp -an
? (10.217.1.206) at 00:c:29:44:90:28 [ethernet] stored in bucket 0
? (10.203.35.1) at 0:10:db:27:d9:8 [ethernet] stored in bucket 4
? (10.216.163.40) at 0:1b:78:59:88:d8 [ethernet] stored in bucket 4
? (10.216.163.250) at 0:11:25:a6:20:78 [ethernet] stored in bucket 14
? (10.216.163.25) at 0:1b:78:57:a:d0 [ethernet] stored in bucket 14
? (10.216.163.1) at 0:0:c:7:ac:0 [ethernet] stored in bucket 15
? (10.216.163.4) at 0:d:65:e2:4c:c2 [ethernet] stored in bucket 18
? (10.216.163.60) at 0:11:25:a6:d7:9a [ethernet] stored in bucket 24
bucket: 0 contains: 1 entries
bucket: 1 contains: 0 entries
bucket: 2 contains: 0 entries
bucket: 3 contains: 0 entries
bucket: 4 contains: 2 entries
. . . . .
There are 8 entries in the arp table.
#
Can you hear me?
To establish a connection, TCP uses a three-way handshake. A client initiates a connection to a host (and a specific port) by sending a SYN synchronize packet. After successfully receiving it, the host responds with a SYN-ACK acknowledgement. If the client successfully receives this acknowledgement, the client completes the handshake with an ACK acknowledgement. All of this assumes that the host server is listening on the specified port, that a route exists from the client to the host and back again, and that no firewalls are blocking this kind of traffic.
You can use netstat to display existing connections from your host to other hosts and the current state of each. Using the command with the -a switch (show the state of all sockets) and the -n switch (show addresses numerically, avoiding lookup), you can pipe the output to a suitable grep to look for connections in a particular state (for example, ESTABLISHED for post-handshake, active connections) or connections to a particular host or port.
Listing 16 shows all connections and their state to a particular IP address (two connections to 10.217.1.206, both fully established), to a particular port at a particular IP address (to 10.217.1.206 at port 22), and all fully established connections to any host respectively.
Listing 16. Displaying the status of connections to hosts
#netstat -an | grep 10.217.1.206
tcp4 0 0 10.203.35.14.22 10.217.1.206.1023 ESTABLISHED
tcp4 0 0 10.203.35.14.46183 10.217.1.206.22 ESTABLISHED
#netstat -an | grep 10.217.1.206.22
tcp4 0 0 10.203.35.14.46183 10.217.1.206.22 ESTABLISHED
#netstat -an | grep ESTABLISHED
tcp4 0 0 10.203.35.14.22 10.217.1.206.1023 ESTABLISHED
tcp4 0 0 10.203.35.14.46183 10.217.1.206.22 ESTABLISHED
tcp4 0 0 10.216.163.37.1521 10.216.163.37.44122 ESTABLISHED
tcp4 0 0 10.216.163.37.44122 10.216.163.37.1521 ESTABLISHED
tcp4 0 0 127.0.0.1.199 127.0.0.1.32769 ESTABLISHED
tcp4 0 0 127.0.0.1.32769 127.0.0.1.199 ESTABLISHED
tcp4 0 0 10.203.35.14.46183 10.203.35.170.22 ESTABLISHED
tcp4 0 0 10.216.163.37.32770 10.216.163.37.32771 ESTABLISHED
#
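Rather than running grep once per state, you can summarize all TCP connections by state in a single pass. The following is a rough sketch, assuming the six-column tcp4 line layout shown in Listing 16, with the state in the last column; conn_states is a hypothetical name, and the LISTEN line in the sample is illustrative rather than taken from the listing.

```shell
# conn_states: read "netstat -an" output on stdin and print the number of
# TCP sockets in each state, one "STATE COUNT" pair per line.
conn_states() {
    awk '$1 ~ /^tcp/ && NF >= 6 { count[$NF]++ }
         END { for (s in count) print s, count[s] }' | sort
}

# Example using sample lines; on a live system you might run:
#   netstat -an | conn_states
sample='tcp4 0 0 10.203.35.14.22 10.217.1.206.1023 ESTABLISHED
tcp4 0 0 10.203.35.14.46183 10.217.1.206.22 ESTABLISHED
tcp4 0 0 *.22 *.* LISTEN'
printf '%s\n' "$sample" | conn_states
# prints:
#   ESTABLISHED 2
#   LISTEN 1
```

A large number of sockets stuck in a state other than ESTABLISHED or LISTEN (for example, SYN_SENT) can point to a handshake that is not completing.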
You can monitor traffic passing through a particular adapter using tcpdump, which displays a description of each packet as it is sent or received. The command takes various options to allow you to display more or less of the packet either in descriptive or raw form and allows a number of Boolean expressions to filter the type of data you want to see. For example, monitoring packets on adapter en2, you can show only data being sent to a specific host (see Listing 17).
Listing 17. Display packets destined for a specific host
#tcpdump -i en2 dst host testhost
tcpdump: listening on en2
10:08:24.912057892 myhost.46183 > testhost.22: P 1299060979:1299061027(48) ack 3373421618 win 17520 (DF) tos 0x10
10:08:25.009291439 myhost.46183 > testhost.22: P 1:49(48) ack 48 win 17520 (DF) tos 0x10
10:08:25.093832676 myhost.46183 > testhost.22: . ack 96 win 17520 (DF) tos 0x10
10:08:25.249319253 myhost.46183 > testhost.22: P 1299061075:1299061123(48) ack 3373421714 win 17520 (DF) tos 0x10
^C
53 packets received by filter
0 packets dropped by kernel
#
You can show only packets coming from a specific host (see Listing 18).
Listing 18. Display packets sent by a specific host
#tcpdump -i en2 src host testhost
tcpdump: listening on en2
10:10:38.505848354 testhost.22 > myhost.46183: . ack 130 win 24820 (DF) tos 0x10
10:10:38.505916972 testhost.22 > myhost.46183: F 529:529(0) ack 225 win 24820 (DF) tos 0x10
10:10:43.855153846 testhost > myhost: icmp: echo reply
10:10:44.855224394 testhost > myhost: icmp: echo reply
^C
102 packets received by filter
0 packets dropped by kernel
#
You can show only packets sent to or coming from a specific host on a specific port (see Listing 19).
Listing 19. Display packets destined for or sent by a specific host on a specific port
#tcpdump -i en2 host testhost port 22
12:15:38.033833162 myhost.47216 > testhost.22: . ack 610148954 win 17520 (DF) tos 0x10
12:15:38.113807903 myhost.47216 > testhost.22: P 145:193(48) ack 192 win 17520 (DF) tos 0x10
12:15:38.114291921 testhost.22 > myhost.47216: P 192:240(48) ack 193 win 24820 (DF) tos 0x10
12:15:38.241718122 myhost.47216 > testhost.22: P 193:241(48) ack 240 win 17520 (DF) tos 0x10
12:15:38.242344703 testhost.22 > myhost.47216: P 240:288(48) ack 241 win 24820 (DF) tos 0x10
12:15:38.243844593 myhost.47216 > testhost.22: . ack 288 win 17520 (DF) tos 0x10
12:15:38.497817604 myhost.47216 > testhost.22: P 241:289(48) ack 288 win 17520 (DF) tos 0x10
12:15:38.503088328 testhost.22 > myhost.47216: P 288:336(48) ack 289 win 24820 (DF) tos 0x10
12:15:38.503154802 testhost.22 > myhost.47216: P 336:432(96) ack 289 win 24820 (DF) tos 0x10
^C
145 packets received by filter
0 packets dropped by kernel
#
You can stop the trace by pressing Control-C. The tcpdump command is much more feature rich than the simple examples shown here, so I recommend that you familiarize yourself with its man pages.
As you can see from the output in these three examples, traffic is shown with:
- A timestamp
- Source Host.Source Port
- Destination Host.Destination Port
- Packet flags
- Other packet information
You can use the command to establish whether traffic is leaving your host destined for the target host and whether traffic is making its way back. If no inbound traffic appears, it may be that the host isn’t responding or there’s no valid route from your host to the target or vice versa. If a particular service (TCP port) isn’t responding or a firewall is blocking packets of the type you are sending, you will typically see an R in the packet flags field, indicating that the connection has been reset. For more information on the exact layout and format of a TCP packet, refer to RFC 793: Transmission Control Protocol.
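On a busy adapter, resets are easy to miss among ordinary traffic, so it can help to filter the trace for them. The sketch below assumes the older one-line output format shown in Listings 17 to 19, where the flags are the field immediately after the destination; newer tcpdump versions print the flags differently, so treat this as illustrative. The function name resets and both sample lines are hypothetical.

```shell
# resets: read tcpdump text output on stdin and print only the lines whose
# TCP flags field contains R (reset).
# Assumed field layout: time, source, ">", "destination:", flags, ...
resets() {
    awk '$3 == ">" && $5 ~ /^[SFPRWE.]+$/ && $5 ~ /R/'
}

# Example using hypothetical trace lines; on a live system you might run:
#   tcpdump -i en2 host testhost | resets
sample='10:08:24.912057892 myhost.46183 > testhost.515: S 1299060979:1299060979(0) win 16384
10:08:24.913000000 testhost.515 > myhost.46183: R 0:0(0) ack 1299060980 win 0'
printf '%s\n' "$sample" | resets
```

With the sample input, only the second line (the reset sent back from port 515) is printed.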
Depending on the nature of the problem, it is sometimes good practice to run a tcpdump for a period of time while capturing packet information to a file using the -w switch. Once you feel you have captured enough data, press Control-C to stop the trace. At this point, you can process the file using the -r option to read the packet data captured. You can then use the vast array of switches, options, and Boolean arguments to analyze the data. Listing 20 shows an example of this process.
Listing 20. Capture packet data to a file and analyze it
#tcpdump -w /var/tmp/tcpdump.out -i en1
tcpdump: listening on en1
^C
305 packets received by filter
0 packets dropped by kernel
#tcpdump -r /var/tmp/tcpdump.out host testhost
13:10:12.017777365 testhost.22 > myhost.47216: P 790304:790352(48) ack 1110769 win 24820 (DF) tos 0x10
13:10:12.129146164 myhost.47216 > testhost.22: P 135249:135297(48) ack 126560 win 17520 (DF) tos 0x10
13:10:12.129992465 testhost.22 > myhost.47216: P 790352:790416(64) ack 1110817 win 24820 (DF) tos 0x10
13:10:12.203827965 myhost.47216 > testhost.22: . ack 790416 win 17520 (DF) tos 0x10
13:11:35.707809458 myhost > testhost: icmp: echo request (DF)
13:11:35.709883978 testhost > myhost: icmp: echo reply (DF)
#tcpdump -r /var/tmp/tcpdump.out not port 22
13:11:35.707809458 myhost > testhost: icmp: echo request (DF)
13:11:35.709883978 testhost > myhost: icmp: echo reply (DF)
13:11:36.579874114 arp who-has 10.203.35.59 tell 10.203.35.57
13:11:37.077504208 0:2:16:9e:20:a 1:80:c2:0:0:0 0026 38: 4242 0300 0000 0000 8000 0002 1695 aecb 0000 0026 8000 0002 169e 2008 8017 0200 1400 0200 0f00
13:11:38.065119802 oraclehost.testdomain.com.2175 > myhost.tnslsnr: P 502:591(89) ack 421 win 64056
13:11:38.071526597 oraclehost.testdomain.com.2175 > myhost.tnslsnr: P 591:606(15) ack 548 win 63929
13:11:38.896664820 10.203.35.37.netbios-ns > 10.203.35.127.netbios-ns: udp 50
13:11:39.071526597 10.203.35.20.netbios-ns > 10.203.35.127.netbios-ns: udp 50
#
This article covered some of the AIX tools you can use to test connectivity to a host, extract useful network-related information about a host, and analyze data sent to and from a host. In the next article, you’ll get under the covers to see what is really going on when your host has problems communicating with another. The article will conclude with a step-by-step guide to logical problem diagnosis when encountering network-related issues.