We were transferring petabytes of our HDFS data from one data center to another through a router, and we never thought the router itself would become the bottleneck until we saw the statistics below:
#show interfaces Gi0/1
GigabitEthernet0/1 is up, line protocol is up
Hardware is iGbE, address is 7c0e.cece.dc01 (bia 7c0e.cece.dc01)
Internet address is 10.143.67.18/30
MTU 1500 bytes, BW 1000000 Kbit/sec, DLY 10 usec,
reliability 255/255, txload 250/255, rxload 3/255
Encapsulation ARPA, loopback not set
Keepalive set (10 sec)
Full Duplex, 1Gbps, media type is ZX
output flow-control is unsupported, input flow-control is unsupported
ARP type: ARPA, ARP Timeout 04:00:00
Last input 00:00:06, output 00:00:00, output hang never
Last clearing of "show interface" counters 1d22h
Input queue: 0/75/0/6 (size/max/drops/flushes); Total output drops: 22559915
Queueing strategy: fifo
Output queue: 39/40 (size/max)
The txload of 250/255 means the link is running at roughly 98% of its 1 Gbps capacity, so the output queue is full (39/40) and the router has already dropped more than 22 million packets. A quick calculation from these counters is sketched below.
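IOS reports load as a fraction of 255, so the utilization is simple arithmetic. A minimal Python sketch that pulls the relevant counters out of the capture above (the parsing is our own, not any router API):

import re

# Rough sketch: extract the counters we care about from a raw
# "show interfaces" capture. Field names follow standard Cisco IOS
# output; the `capture` string here is a trimmed copy of the above.
capture = """\
MTU 1500 bytes, BW 1000000 Kbit/sec, DLY 10 usec,
reliability 255/255, txload 250/255, rxload 3/255
Input queue: 0/75/0/6 (size/max/drops/flushes); Total output drops: 22559915
Output queue: 39/40 (size/max)
"""

bw_kbit = int(re.search(r"BW (\d+) Kbit/sec", capture).group(1))
txload = int(re.search(r"txload (\d+)/255", capture).group(1))
drops = int(re.search(r"Total output drops: (\d+)", capture).group(1))

# txload is a fraction of 255, so 250/255 is ~98% of the 1 Gbps link,
# i.e. roughly 980 Mbit/s of sustained egress.
util = txload / 255
print(f"egress utilization: {util:.1%} of {bw_kbit // 1000} Mbit/s")
print(f"estimated egress:   {util * bw_kbit / 1000:.0f} Mbit/s")
print(f"total output drops: {drops}")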
It was awful. At the beginning, we saw many failures and retransmissions during transfers between the two data centers. After adding some metrics, everything became clear: the latency between the two data centers was quite unstable, sometimes around 30 ms and sometimes reaching 100 ms or more, which is unacceptable for latency-sensitive applications. We then SSHed into the router and found the output above.
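For reference, a minimal sketch of the kind of probe behind those metrics: sample the TCP connect time to a host in the remote data center at a fixed interval. The host, port, and interval below are hypothetical placeholders, not our actual setup.

import socket
import time

REMOTE_HOST = "dc2-gateway.example.com"  # placeholder: a host in the remote DC
REMOTE_PORT = 8020                       # placeholder: e.g. the HDFS NameNode RPC port
INTERVAL_S = 5

while True:
    start = time.monotonic()
    try:
        # Time a full TCP handshake to the remote host as a rough RTT proxy.
        with socket.create_connection((REMOTE_HOST, REMOTE_PORT), timeout=2):
            rtt_ms = (time.monotonic() - start) * 1000
            print(f"connect RTT: {rtt_ms:.1f} ms")
    except OSError as e:
        print(f"probe failed: {e}")
    time.sleep(INTERVAL_S)

Plotting these samples over time is what exposed the swings between ~30 ms and 100+ ms.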
After that, we dropped it and replaced it with a more advanced router. Now everything has returned to normal: latency sits around 30 ms and packet loss is below 1%.