tcp_keepalive_time and the RST Flag in a NAT Environment

I'm not going to explain here what TCP keepalive is, or what the three related parameters tcp_keepalive_time, tcp_keepalive_intvl and tcp_keepalive_probes mean.
What you need to know is their default values: net.ipv4.tcp_keepalive_time = 7200, net.ipv4.tcp_keepalive_intvl = 75 and net.ipv4.tcp_keepalive_probes = 9. The first one in particular, tcp_keepalive_time, can and usually will cause nasty behavior in environments such as NAT.

Say your client wants to connect to a server through one or more NAT devices along the way, whether SNAT or DNAT. The NAT server, no matter whether it's a router or a Linux-based device with ip_forward on, needs to maintain a TCP connection mapping table to track each incoming and outgoing connection. Since the device's resources are limited, the table can't grow without bound, so it has to drop connections that have been idle, with no data exchanged between client and server, for a period of time.

This phenomenon is quite common, not only in consumer-grade, low-end or poorly implemented routers, but also in data center NAT servers. On most of these NAT servers the idle timeout (their equivalent of tcp_keepalive_time) is usually set to around 90s, so if your client sets the parameter higher than 90 and exchanges no data for more than 90s, the NAT server will send a TCP packet with the RST flag set to disconnect both ends.
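The idle-timeout behavior can be pictured with a toy model. This is only a sketch of the idea, not how any real NAT device is implemented: the class name, the 90s timeout and the addresses are all made up for illustration.

```python
class ConnTrackTable:
    """Toy connection-tracking table: entries idle longer than `timeout` are dropped."""

    def __init__(self, timeout=90):
        self.timeout = timeout
        self.last_seen = {}          # (client, server) -> time of the last packet

    def packet(self, conn, now):
        """Record traffic on `conn`; return False if no live mapping existed for it."""
        self.expire(now)
        known = conn in self.last_seen
        self.last_seen[conn] = now
        return known

    def expire(self, now):
        """Drop every mapping that has been idle for more than `timeout` seconds."""
        for conn, seen in list(self.last_seen.items()):
            if now - seen > self.timeout:
                del self.last_seen[conn]


table = ConnTrackTable(timeout=90)
conn = ("10.0.0.2:40000", "192.0.2.10:27017")   # hypothetical client and server
table.packet(conn, now=0)                        # connection established
print(table.packet(conn, now=30))                # True: traffic within 90s keeps the mapping
print(table.packet(conn, now=30 + 7200))         # False: idle for 7200s, the mapping is gone
```

With the default tcp_keepalive_time of 7200 the first keepalive probe arrives long after the mapping has been dropped, which is the situation described above.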

In this situation, if you can't control the NAT server, one workaround is to lower the keepalive time on your side to something below that timeout, like 30s.
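System-wide, the value can be lowered with sysctl (net.ipv4.tcp_keepalive_time = 30), but that only affects sockets that have SO_KEEPALIVE turned on. If you control the client code, a per-socket setting is another option. The following is a minimal sketch, assuming Linux and Python 3; the interval and probe count are illustrative values, not recommendations.

```python
import socket


def enable_keepalive(sock, idle=30, interval=10, probes=3):
    """Turn on TCP keepalive for one socket, probing after `idle` seconds of silence."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The TCP_KEEP* constants are platform-specific; skip any this platform lacks.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", probes)):
        if hasattr(socket, name):
            sock.setsockopt(socket.IPPROTO_TCP, getattr(socket, name), value)
    return sock


sock = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
print("keepalive enabled:", keepalive_on)
sock.close()
```

The per-socket values override the sysctl defaults for that socket only, so other programs on the machine are not affected.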

Recently, while syncing data between a MongoDB master and slave (note that this architecture is a little old, not the standard replica set), we often observed a large number of errors indicating that synchronisation had failed during the index creation phase, after the MongoDB data itself had finished syncing. The log below is from our slave server:

Fri Nov 14 12:52:00.356 [replslave] Socket recv() errno:110 Connection timed out
Fri Nov 14 12:52:00.356 [replslave] SocketException: remote: error: 9001 socket exception [RECV_ERROR] server []
Fri Nov 14 12:52:00.356 [replslave] DBClientCursor::init call() failed
Fri Nov 14 12:52:00.356 [replslave] repl: AssertionException socket error for mapping query
Fri Nov 14 12:52:00.356 [replslave] Socket flush send() errno:104 Connection reset by peer
Fri Nov 14 12:52:00.356 [replslave]   caught exception (socket exception [SEND_ERROR] for in destructor (~PiggyBackData)
Fri Nov 14 12:52:00.356 [replslave] repl: sleep 2 sec before next pass
Fri Nov 14 12:52:02.356 [replslave] repl: syncing from host:
Fri Nov 14 12:52:03.180 [replslave] An earlier initial clone of 'sso_production' did not complete, now resyncing.
Fri Nov 14 12:52:03.180 [replslave] resync: dropping database sso_production
Fri Nov 14 12:52:04.272 [replslave] resync: cloning database sso_production to get an initial copy

The host the slave syncs from is the DNAT IP address of the MongoDB master. As you can see, the slave receives a reset from that address.

After some packet capturing and analysis, we concluded that the root cause was related to tcp_keepalive_time. With some experience, you learn to suspect TCP stack related issues as soon as you see the "reset by peer" keyword.

Be careful: the same issue also occurs with some IMAP proxies, in IPVS environments, and so on. After many years of hands-on online operations experience, I have to say that DNS and NAT are two potential sources of all kinds of issues, even in so-called 100% uptime environments.

Migrating GitHub Enterprise From Beijing To Shanghai

We needed to migrate our GitHub Enterprise from a data center in Beijing to one in Shanghai. The whole process was not complex but time-consuming; it took us more than a week to finish the migration. I will share some of our practice here.

Our GitHub Enterprise is running version 11.10.331, which is a little out of date. It runs on VirtualBox with 4 CPU cores and 16GB of memory, and the code takes about 100GB of disk space in total. According to GitHub staff, VirtualBox has some performance issues and sometimes even data corruption, so since version 2.0 GitHub no longer supports VirtualBox; they recommend VMware's free vSphere hypervisor, AWS or OpenStack KVM as a replacement.

Our new environment in Shanghai has quite strict security restrictions on the choice of platform: it's impossible to install anything that hasn't been comprehensively investigated and tested, so VMware and KVM were out. AWS is not that widespread in China and, most importantly, can't be hosted in a private network, so it was out too. The only choice left for us was to keep using VirtualBox.

Next, we needed to export the data from VirtualBox. There are at least two ways: the most straightforward is to copy the vmdk file; the other is to use VirtualBox's export feature, which lets you move the image into another OS environment. We chose the first one and copied the vmdk file directly. Remember to shut down the VM before doing this, or your data will be corrupted.

Originally, we planned to rsync the 100GB of data from the Beijing data center to Shanghai directly over the WAN. Although both ends are private networks, we could temporarily open a DNAT connection on the Beijing side so that the server in Shanghai could actively connect to Beijing and start the transfer. This seemed fine except that the outbound bandwidth is only 20Mbps, which means that in theory the transfer time is 100G*8*1024/20 = 40960s, almost half a day. There was also another issue we couldn't control: as the data travels a long distance over the public network, stability is not guaranteed. What if the connection were lost due to some unknown issue? All the time spent so far would be in vain.
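The estimate is easy to reproduce. A small sketch of the same arithmetic, under the same idealized assumptions as the figure above (the link is fully used, with no protocol overhead or retransmissions):

```python
def transfer_seconds(size_gb, bandwidth_mbps):
    """Ideal transfer time, ignoring protocol overhead and retransmissions."""
    size_megabits = size_gb * 1024 * 8   # GB -> MB -> megabits
    return size_megabits / bandwidth_mbps


seconds = transfer_seconds(100, 20)
print(f"{seconds:.0f} s, about {seconds / 3600:.1f} hours")
# prints "40960 s, about 11.4 hours"
```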

So we got it done the traditional way: we took a USB HDD to the data center and copied the data onto it, which took about 2 hours. Later, we rsynced the data from the HDD to our data center in Shanghai over a leased line.

Now that the raw data was in Shanghai, the next step was to set up a new VirtualBox instance, which is quite easy. We recommend installing VNC or something similar, such as ssh X11 forwarding, so that you can use a GUI for the remaining operations. If you're familiar with VBoxManage, that works as well. All you need to do is create a new instance and attach the vmdk file as its storage disk.

After booting GitHub Enterprise, we found that we couldn't enter the management console even after uploading the license. After we opened a ticket, a GitHub engineer confirmed that there is a bug in this version which occasionally prevents users from entering the management console. To continue, we first needed to ssh into the GitHub instance as the admin user, then remove the previous license file via sudo rm /data/enterprise/enterprise.ghl, then run god restart enterprise-manage. But we were stuck at the first step because we had no sudo permission.

Knowing the issue, we now needed to root the instance. To gain root access you need to boot into recovery mode: shut down the VM and power it back on, then, inside the hypervisor console, hold the left shift key down while it boots. This takes you to the GRUB menu, where you can choose to boot into recovery mode. As the boot process goes by very fast, we came up with a way to delay it by executing the command below outside VirtualBox:
$ VBoxManage modifyvm YOUR_VM_INSTANCE_NAME --bioslogodisplaytime 10000

After that, we had enough time to press the key. Note that it's the *shift* key, not the tab key, for VirtualBox. We kept pressing tab without thinking until we noticed that stupid mistake.

Once in recovery mode, do as this documentation says: change the root password and add your ssh public key to /root/.ssh/authorized_keys. After the server has rebooted normally, you can log in as root using public key authentication.

If everything goes well, we can now upload the license and enter the management console.

Why did we want to enter the management console? Besides the usual adding and deleting of users, we also needed to switch authentication from the built-in auth to an LDAP backend. The setup is straightforward: we only needed to fill out the form with values such as Host, Port and User ID field that our IT support team provided. After we clicked the "Save settings" button and waited about 10 minutes for it to take effect, something weird happened: we couldn't log in with our LDAP accounts, and every attempt returned a 500 error. After some troubleshooting, we found that the LDAP settings hadn't been applied because the configuration run wasn't properly triggered when the settings were saved, so the default authentication was still active. We could see this on the login page, which should have said something like "Sign in via LDAP" but actually said "Sign in". Later, we ran the enterprise-configure command as our technical support suggested, and this time it worked. Why? We can only suspect that some customizations made to the VirtualBox instance caused the configuration process to fail.

After signing in to the new LDAP-backed portal with the new account (say the old username was barackobama and the LDAP account has a different name), we saw an empty page. That makes sense: it's a new account with no connection to the old account barackobama. Does that mean all of our code is lost, or that we have to git clone with the old account and git push with the new one manually to bring the code back? GitHub has this covered: just rename the user to the new username at the [username]/admin URL. Note that a dot "." will be changed to a hyphen, so remember to rename the user to "barackobama-bo".

Our GitHub Enterprise came back to life within the pre-scheduled downtime.

Finally, thanks to @michaeltwofish, @b4mboo, @buckelij and @donal. Your professional skills and conduct really impressed me!

A Quick Review of the totobobo Sports (supercool) Mask

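In the winter of 2013 I bought my first totobobo mask. It was great to use and blew 3M away, and I even wrote a blog post about it. By the summer of 2014, though, walking around with a mask on in the heat was not very comfortable (in hindsight this is a worldwide problem that no mask fully solves), like this: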

So I compared several masks that are better suited to summer, and ended up buying the totobobo sports model (supercool), which looks like this:

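As you can see, unlike the regular totobobo it requires you to breathe in through the mouth and out through the nose, just like swimming. After it arrived, I found a large gap between the HEPA filter and the two plastic pieces that are supposed to hold it tight.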

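This means the filtering performance may be badly compromised, because dirty air goes straight into your respiratory system without passing through the filter. I raised this with the official staff, and the answer was that this is normal and doesn't affect filtration; if a tight fit is really needed, the mask can be reshaped with hot water. So I reshaped both sides with hot water, and it did help: the fit was indeed tighter. Next it was time for a road test, so I wore the supercool on my walk to and from work every day (about 3 km each way). Although I'm used to breathing in through the mouth and out through the nose when swimming, doing the same on the street took some getting used to; fortunately I had mostly adapted after a week. Strangely, the air quality during this period was not good, yet to the naked eye the HEPA filter was still almost white when I got home, whereas with the non-supercool totobobo I wore before, the filter started to turn gray after a single day. In the spirit of science I kept wearing it for nearly three more weeks. The final conclusion: after a month of continuous use, the HEPA filter was the color shown in the second picture above, which means that for a month I had been personally inhaling quite a lot of dust and smog on its behalf ;-(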

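The experience above shows that the supercool performs very poorly for me. Besides, because of the dry northern climate combined with breathing in through the mouth, my throat would feel noticeably dry and itchy after a while regardless of the air quality. That was the second reason I later abandoned the supercool.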

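According to scientific evaluations (1, 2), totobobo's comfort is beyond question, but its actual filtering performance is only acceptable among the many mask models, roughly a little above average. All things considered, 3M is still the more reliable choice.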

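At the end of last year, the mask I had worn for over a year (the first picture above) got lost in some corner of the 欧美汇 mall, so now I'm back to using up the big box of cheap and good 3M 9501 masks I had stocked up at home ;-)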

The stdio Buffering Problem

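I was watching the log printed by a piece of code through tail -f and noticed that there was no output for a long time, and then a lot of lines came out all at once. I guessed it had something to do with buffering. I had actually run into this problem long ago and first assumed it was some kind of bug; only after my own code showed the same behavior did I decide to find out what was going on.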

$ cat
import time, sys
for i in range(50):
  sys.stdout.write("test")
  time.sleep(0.2)

$ python
As you can see, the test strings are all printed at once after a wait of several seconds.

If we change sys.stdout.write("test") to sys.stdout.write("test\n"), i.e. add a newline, or use print for the output, the behavior is different:
$ cat
import time, sys
for i in range(50):
  sys.stdout.write("test\n")
  time.sleep(0.2)

$ python

$ cat
import time, sys
for i in range(50):
  print "test"
  time.sleep(0.2)

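With both demo2 and demo3, test appears on the screen at a steady rate of one line every 0.2s.
Change print "test" in demo3 to print "test", (with a trailing comma) and see what happens.
Then try python3's print("test"), and try adding the end argument, such as print("test", end="\n"), print("test", end="\t") and print("test", end=""), to see how the results differ.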

Here is another demo:
$ cat
import time, sys
for i in range(50):

Add sys.stdout.flush() and see how the behavior differs from the ones above.

$ python > output
Watch the size of the output file in real time: it does not grow as time passes, and only changes after the run has finished.
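The same effect can be measured without watching a file by hand. The following is a sketch in Python 3 (not one of the original demos): it runs a small child program with its stdout connected to a pipe and records when each line reaches the parent, once without and once with sys.stdout.flush(). The child program and the function name are made up for this illustration.

```python
import os
import subprocess
import sys
import time

# Child program: write three lines, 0.2 s apart, optionally flushing each one.
CHILD = """
import sys, time
for i in range(3):
    sys.stdout.write("test\\n")
    if {flush}:
        sys.stdout.flush()
    time.sleep(0.2)
"""


def arrival_times(flush):
    """Run the child with stdout on a pipe; return when each line reached the parent."""
    # Drop PYTHONUNBUFFERED so the child uses its default buffering.
    env = {k: v for k, v in os.environ.items() if k != "PYTHONUNBUFFERED"}
    cmd = [sys.executable, "-c", CHILD.format(flush=flush)]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, env=env) as proc:
        start = time.monotonic()
        return [time.monotonic() - start for _ in proc.stdout]


buffered = arrival_times(flush=False)
flushed = arrival_times(flush=True)
print("no flush:", [round(t, 2) for t in buffered])
print("flush   :", [round(t, 2) for t in flushed])
```

Without the flush the three timestamps should be nearly identical, because everything is delivered when the child exits; with the flush they should be spaced roughly 0.2s apart.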

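Those are some of the symptoms I had run into. What is involved here is really the stdio buffering problem on UNIX. The rest of this post digs from the symptoms down to the underlying cause; if you are short on time, just read the conclusions at the end.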

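ISO C defines a library called standard I/O, also known as buffered I/O. It is implemented by many systems including UNIX, and by the many distributions we use every day. The well-known I/O system calls open, read, write, lseek and close, on the other hand, are defined by POSIX; they are usually called unbuffered I/O, precisely to distinguish them from the standard I/O library. These low-level system calls mostly revolve around the fd (file descriptor), whereas standard I/O revolves around the STREAM. The standard I/O library can be understood as a wrapper around the system I/O functions, because in the end it still has to call them; the fd behind a STREAM can be obtained with fileno(FILE *fp).
Why is the standard I/O library called buffered I/O? Because it automatically takes care of buffer allocation and the choice of I/O chunk size for you, so you no longer have to worry about picking a block size. This is unavoidable when using the I/O system calls directly: both read and write need a buffer address and the number of bytes to read or write, so you usually define a buffer size macro before calling read:
#define BUFFSIZE 4096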

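The main purpose of buffered I/O is to reduce the number of system calls such as read/write and to allocate buffers for the program automatically. It comes in the following three types of buffering:
1. full buffer: a flush happens when the standard I/O buffer is full. A flush can also be forced by calling fflush(), which moves the data in the buffer into the kernel buffer.
2. line buffer: a flush happens when a newline (usually "\n") is encountered, i.e. when a full line has been written.
3. unbuffered: data is read or written as soon as it is available.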

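Linux generally implements it like this:
1. stderr is unbuffered, so that error messages show up promptly.
2. The stdin/stdout streams are fully buffered if they are not associated with a terminal (for example a pipe, a redirect, or a file opened with fopen), and line buffered if they are.
These two rules are simply a trade-off between speed and responsiveness, which is easy to understand.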
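Python 3's own text layer follows similar, though not identical, rules, and they can be inspected from inside a script. A rough sketch, assuming a default Python 3 start-up (it does not account for the -u option); the helper name is made up:

```python
import sys


def buffering_mode(stream):
    """Describe how a Python 3 text stream is buffered."""
    if getattr(stream, "line_buffering", False):
        return "line buffered"
    return "full buffered"


for name in ("stdout", "stderr"):
    stream = getattr(sys, name)
    # Report on stderr so the result stays visible when stdout is piped away.
    print(f"{name}: tty={stream.isatty()} mode={buffering_mode(stream)}", file=sys.stderr)
```

Running it once directly in a terminal and once with stdout piped into cat should show stdout changing from line buffered to full buffered.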

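The buffering mode can be changed with setbuf/setvbuf; see man 3 setbuf for usage. Note that these two functions should be called after the stream has been opened and before any other I/O operation on it. Of course, if you need to do something special, you can call them after some I/O operations have been done, as in the second demo below. setvbuf is more capable than setbuf: for example, it can also change the size of the buffer.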

The buffer type and size of each STREAM can be verified with this code; on my machine, for example, the buffer sizes are all 8KB.

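The function int fflush(FILE *fp) is the key to the problem above: it flushes the data currently held in the STREAM into the kernel buffer. If fp is NULL, all open output streams are flushed. Strictly speaking, fflush should only be used on output streams, not on input streams; see here for the reason.
The demo here explains nicely what fflush/setvbuf do; try reducing the size_t size argument of setvbuf from the original 1024 to 20 and see what happens.
Clearly, by buffering a batch of writes and then issuing a single system call for all of them, this approach greatly reduces the number of switches between user space and kernel space.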
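The effect of the buffer size can also be counted rather than observed. The following is a Python analogue of that experiment, not the C demo linked above: io.BufferedWriter plays the role of the stdio buffer, and a made-up CountingRaw class stands in for the file descriptor and counts how many writes actually reach it.

```python
import io


class CountingRaw(io.RawIOBase):
    """Stand-in for a file descriptor that counts how often write() is reached."""

    def __init__(self):
        super().__init__()
        self.calls = 0

    def writable(self):
        return True

    def write(self, data):
        self.calls += 1
        return len(data)      # pretend everything was written


def raw_writes(buffer_size, chunks=1000, chunk=b"test\n"):
    """Push `chunks` small writes through a buffer and count the underlying writes."""
    raw = CountingRaw()
    writer = io.BufferedWriter(raw, buffer_size=buffer_size)
    for _ in range(chunks):
        writer.write(chunk)
    writer.flush()
    return raw.calls


print("buffer 1024:", raw_writes(1024))
print("buffer 20  :", raw_writes(20))
```

The exact counts depend on the implementation, but the small buffer should need far more underlying writes for the same 1000 calls.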

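Some people may think of the fsync system call, which seems to do the same thing as fflush. On closer inspection, the two do not operate at the same level at all.
fflush(FILE *stream) works on a FILE*. For stdout, it flushes the standard I/O stream's buffer from user space into the kernel cache. This is also what happens when exit is called.
fsync(int fd) controls when data and metadata are flushed from the kernel buffer to disk, and it takes an fd as its argument. FILE* is invisible to fsync, that is, fsync does not know that FILE* exists: one lives in user space and the other in kernel space.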
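In Python the two layers map to f.flush() and os.fsync(). A minimal sketch, using a temporary file whose name is made up for the example; the file size is checked before and after the flush to show that the data first sits in the user-space buffer:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.log")

with open(path, "w") as f:
    f.write("test")
    before = os.path.getsize(path)   # still 0: the data is in the user-space buffer
    f.flush()                        # user space -> kernel cache (the fflush step)
    after = os.path.getsize(path)    # now visible to other processes
    os.fsync(f.fileno())             # kernel cache -> disk (the fsync step)

print("size before flush:", before, "after flush:", after)
```

Only the fsync call says anything about the disk; flush alone is enough for tail -f or another process to see the data.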

So if we don't want full/line buffering and would rather get the output stream as soon as possible, we have to say so explicitly by calling fflush(stdout).

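The explanation above only covers C. For Python, what happens underneath is almost the same: Python implements its own flush in C, and the code can be seen here. In fact not just fflush: many low-level calls, including read/write, are implemented in C in Python.
The Python counterpart of fflush is sys.stdout.flush().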

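Whether it is fflush() or sys.stdout.flush(), it has to be called by hand every time you want stdout to show up immediately, which is tedious. Fortunately, the setvbuf mentioned above can take care of this directly: just call setvbuf() after the stream is opened. Its mode argument can be one of the following three:
1. _IOLBF, line buffer
2. _IOFBF, full buffer
3. _IONBF, no buffer
setvbuf(stdout, NULL, _IONBF, 0);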

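For Python, there are at least the following ways to avoid this kind of problem:
1. Turn off stdout buffering directly, similar to setvbuf (this form works on Python 2; Python 3 refuses unbuffered text I/O):
sys.stdout = os.fdopen(sys.stdout.fileno(), 'w', 0)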

2. A rather ugly way is to send the output to stderr instead, which guarantees that it appears as soon as possible at all times.

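3. Run the script with the -u option. Note, however, that xreadlines() and readlines() use an internal buffer that is not affected by -u, so iterating over stdin with them can still cause problems; see the content behind these two links (1, 2).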

4. Attach the stream to a pseudo terminal (pty), which script can do:
script -q -c "command1" /dev/null | command2
Or use the socat tool.
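On Python 3 the first option above has to be written differently. A minimal sketch, assuming Python 3.7 or later for reconfigure(); the helper name is made up. Setting the PYTHONUNBUFFERED environment variable is another route with the same effect as -u.

```python
import sys


def make_line_buffered(stream=None):
    """Switch a text stream to line buffering (Python 3.7+); return True on success."""
    stream = sys.stdout if stream is None else stream
    reconfigure = getattr(stream, "reconfigure", None)
    if reconfigure is None:          # e.g. stdout was replaced by another object
        return False
    reconfigure(line_buffering=True)
    return True


make_line_buffered()
print("test")               # flushed at the newline, even when stdout is a pipe
print("test", flush=True)   # or flush explicitly, one call at a time
```

reconfigure() changes the stream for the rest of the program, while flush=True only affects that single print call.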

Now let's look at a pipe-related problem. This command often shows nothing after you press Enter:
$ tail -f logfile | grep "foo" | awk '{print $1}'

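tail's stdout buffer is fully buffered by default, but with -f it calls fflush() on its output stream, so the tail -f part is fine. The problem lies in grep's stdout buffer: since its output goes to a pipe, it has an 8KB stdout buffer, and awk only receives data once that buffer is full. awk's stdout is associated with the terminal, so it is line buffered by default. How do we fix this? grep provides the --line-buffered option to do line buffering, which delivers output much sooner than full buffering:
tail -f logfile | grep --line-buffered "foo" | awk '{print $1}'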

Besides grep, sed has the corresponding -u (--unbuffered), awk (our default is mawk) has the -W option (-W interactive), and tcpdump has -l, to turn full buffering into line buffering or no buffering.
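What these per-tool options do for a pipe stage can be sketched in a few lines of Python. This is only an illustration of the idea, not how grep is implemented: it matches plain substrings rather than regular expressions, and it uses in-memory streams as stand-ins for the two ends of a pipe so that it is self-contained.

```python
import io


def filter_lines(pattern, source, sink):
    """Copy lines containing `pattern` from source to sink, flushing after each match."""
    matched = 0
    for line in source:
        if pattern in line:
            sink.write(line)
            sink.flush()      # hand the line to the next pipe stage right away
            matched += 1
    return matched


# In-memory stand-ins for the input and output of a pipe stage.
source = io.StringIO("foo 1\nbar 2\nfoo 3\n")
sink = io.StringIO()
count = filter_lines("foo", source, sink)
print(count, repr(sink.getvalue()))
# prints: 2 'foo 1\nfoo 3\n'
```

In a real pipeline you would pass sys.stdin and sys.stdout instead; the flush after every matching line is what keeps the next stage from waiting for a full buffer.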

It's not only stdin/stdout/stderr that have buffering problems; pipes have them too. See the related documents here (1, 2).

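The approaches above all involve specific function calls or tool-specific options and are not generally applicable, and ordinary users are unlikely to work this way. In fact, coreutils already provides a tool called stdbuf for this, and expect provides one called unbuffer, which disables buffering of the output stream; it may cause some problems when used in pipes, so check its man page for details. With these, the problem above can be solved in a more general way:
tail -f logfile | stdbuf -oL grep "foo" | awk '{print $1}'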