Router Matters

We were transferring PBs of our HDFS data from one data center to another via a router. We never thought the performance of a router would become the bottleneck, until we found the statistics below:

#show interfaces Gi0/1
GigabitEthernet0/1 is up, line protocol is up
  Hardware is iGbE, address is 7c0e.cece.dc01 (bia 7c0e.cece.dc01)
  Description: Connect-Shanghai-MSTP
  Internet address is 10.143.67.18/30
  MTU 1500 bytes, BW 1000000 Kbit/sec, DLY 10 usec,
     reliability 255/255, txload 250/255, rxload 3/255
  Encapsulation ARPA, loopback not set
  Keepalive set (10 sec)
  Full Duplex, 1Gbps, media type is ZX
  output flow-control is unsupported, input flow-control is unsupported
  ARP type: ARPA, ARP Timeout 04:00:00
  Last input 00:00:06, output 00:00:00, output hang never
  Last clearing of "show interface" counters 1d22h
  Input queue: 0/75/0/6 (size/max/drops/flushes); Total output drops: 22559915
  Queueing strategy: fifo
  Output queue: 39/40 (size/max)

The output queue is completely full (39/40), so the txload is obviously high: 250/255 means the 1Gbps link is running at roughly 98% of capacity, and the router has dropped more than 22 million packets since its counters were last cleared (1d22h ago).
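These numbers are easy to pull out programmatically if you want to track them over time. Here is a minimal Python sketch (an illustration, not the tooling we actually used) that parses the relevant lines out of the show interfaces text:

    import re

    def parse_show_interfaces(text):
        """Pull load and drop statistics out of 'show interfaces' output."""
        txload = re.search(r"txload (\d+)/255", text)
        drops = re.search(r"Total output drops: (\d+)", text)
        queue = re.search(r"Output queue: (\d+)/(\d+)", text)
        return {
            "txload_pct": int(txload.group(1)) / 255 * 100,
            "output_drops": int(drops.group(1)),
            "out_queue": f"{queue.group(1)}/{queue.group(2)}",
        }

    sample = """
      reliability 255/255, txload 250/255, rxload 3/255
      Input queue: 0/75/0/6 (size/max/drops/flushes); Total output drops: 22559915
      Output queue: 39/40 (size/max)
    """
    print(parse_show_interfaces(sample))
    # {'txload_pct': 98.03..., 'output_drops': 22559915, 'out_queue': '39/40'}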

How awful. At the beginning, we saw many failures and retransmissions during transfers between the two data centers. After adding some metrics, everything became clear: the latency between the two data centers was quite unstable, sometimes around 30ms and sometimes reaching 100ms or even more, which is unacceptable for latency-sensitive applications. We then SSHed into the router and found the result above.
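The latency metric itself doesn't need anything fancy. Below is a minimal sketch of such a probe, assuming Python and using TCP connect time as a rough RTT proxy; the hostname and port are placeholders, not our real setup:

    import socket
    import statistics
    import time

    def tcp_rtt_ms(host, port, samples=30, interval=1.0):
        """Use TCP connect time as a cheap RTT proxy; returns samples in ms."""
        rtts = []
        for _ in range(samples):
            start = time.monotonic()
            with socket.create_connection((host, port), timeout=5):
                pass  # handshake done, close immediately
            rtts.append((time.monotonic() - start) * 1000)
            time.sleep(interval)
        return rtts

    # Placeholder host/port, not our real cluster; 50010 is the classic
    # HDFS DataNode transfer port.
    rtts = tcp_rtt_ms("dn1.shanghai.example", 50010)
    p50 = statistics.median(rtts)
    p99 = statistics.quantiles(rtts, n=100)[98]
    print(f"p50={p50:.1f}ms  p99={p99:.1f}ms  max={max(rtts):.1f}ms")

An unstable link shows up immediately as a wide gap between p50 and p99, exactly the 30ms-vs-100ms pattern we were seeing.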

After that, we dropped it and replaced it with a more capable router. Now everything is back to normal: latency sits around 30ms and the packet drop rate is below 1%.
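In hindsight, the drop counter had been screaming long before the applications noticed. Here is a hedged sketch of a poller that could catch it early, assuming the router accepts command execution over SSH via paramiko (many Cisco IOS devices only support an interactive shell, in which case invoke_shell would be needed instead); the credentials are placeholders:

    import re
    import time
    import paramiko

    DROPS_RE = re.compile(r"Total output drops: (\d+)")

    def poll_output_drops(host, username, password, interval=60):
        """Alert whenever the interface's drop counter grows between polls."""
        last = None
        while True:
            client = paramiko.SSHClient()
            client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
            client.connect(host, username=username, password=password)
            # Some IOS devices reject exec_command; use invoke_shell there.
            _, stdout, _ = client.exec_command("show interfaces Gi0/1")
            output = stdout.read().decode()
            client.close()
            drops = int(DROPS_RE.search(output).group(1))
            if last is not None and drops > last:
                print(f"ALERT: {drops - last} new output drops in ~{interval}s")
            last = drops
            time.sleep(interval)

    # poll_output_drops("10.143.67.18", "admin", "***")  # placeholder credentials

Even a crude alert like this would likely have pointed us at the router days earlier.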