Migrating GitHub Enterprise From Beijing To Shanghai

We needed to migrate our GitHub Enterprise instance from a data center in Beijing to one in Shanghai. The process is not complex, but it is time-consuming: the migration took us more than a week. Here are some notes from the experience.

Our GitHub Enterprise instance runs version 11.10.331, which is a little out of date. It runs on VirtualBox with 4 CPU cores and 16 GB of memory, and the code takes up about 100 GB of disk space. According to GitHub staff, VirtualBox has some performance issues and can occasionally even corrupt data, so since version 2.0 GitHub no longer supports VirtualBox and recommends VMware's free vSphere hypervisor, AWS, or OpenStack KVM as a replacement.

Our new environment in Shanghai has strict security restrictions on which platforms we can use: it is impossible to install anything that hasn't been comprehensively investigated and tested, so VMware and KVM were out. AWS is not widely used in China and, most importantly, cannot be hosted in a private network, so it was out as well. Our only choice was to keep using VirtualBox.

Next, we needed to export the data from VirtualBox. There are at least two ways: the most straightforward is to copy the vmdk file; the other is to use VirtualBox's export feature, which produces an image you can import on another host. We chose the first and copied the vmdk file directly. Remember to shut down the VM before doing this, or your data will be corrupted.

Originally, we planned to rsync the 100 GB of data from the Beijing data center to Shanghai directly over the WAN. Although both ends are private networks, we could temporarily open a DNAT connection on the Beijing side, so that the server in Shanghai could initiate a connection to Beijing and start the transfer. This looked good, except that the outbound bandwidth is only 20 Mbps, which means the theoretical transfer time is 100 GB × 1024 × 8 / 20 Mbps = 40960 s (about 11.4 hours), almost half a day. There was another issue we couldn't control: because the data travels a long distance over the public network, stability is not guaranteed. What if the connection dropped for some unknown reason? All the time already spent would be wasted.
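The estimate above is easy to reproduce for other sizes and link speeds. Here is a minimal sketch; the function name `transfer_seconds` is ours, and it assumes an ideal link with no protocol overhead, retries, or competing traffic, so real transfers will be slower.

```python
def transfer_seconds(size_gb: float, link_mbps: float) -> float:
    """Ideal transfer time in seconds for size_gb gigabytes over a link_mbps link."""
    size_megabits = size_gb * 1024 * 8  # GB -> MB -> megabits
    return size_megabits / link_mbps


seconds = transfer_seconds(100, 20)
print(f"{seconds:.0f} s, about {seconds / 3600:.1f} hours")
# prints "40960 s, about 11.4 hours"
```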

So we did it the traditional way: we took a USB HDD to the data center and copied the data onto it, which took about 2 hours. Later, we rsynced the data from the HDD to our data center in Shanghai over a leased line.

With the raw data in Shanghai, the next step was to set up a new VirtualBox instance, which is quite easy. We recommend installing VNC, or using something like SSH X11 forwarding, so you can do the remaining steps in a GUI. If you are familiar with VBoxManage, that works too. All you need to do is create a new instance and attach the vmdk file as its storage disk.
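For the VBoxManage route, the sequence looks roughly like the sketch below, which only builds and prints the commands rather than running them. The VM name `ghe-shanghai`, the vmdk path, and the controller name `SATA` are placeholders we made up, and attaching the disk to a SATA controller is an assumption; use whatever controller type the original VM had. The CPU and memory values mirror the 4 cores and 16 GB mentioned earlier.

```python
import shlex
from typing import List


def vbox_commands(vm_name: str, vmdk_path: str) -> List[List[str]]:
    """Return the VBoxManage invocations that create a VM around an existing vmdk."""
    return [
        # Create and register an empty VM.
        ["VBoxManage", "createvm", "--name", vm_name, "--register"],
        # Match the old instance: 4 CPU cores, 16 GB of memory (value is in MB).
        ["VBoxManage", "modifyvm", vm_name, "--cpus", "4", "--memory", "16384"],
        # Add a storage controller (SATA is an assumption, see above).
        ["VBoxManage", "storagectl", vm_name, "--name", "SATA", "--add", "sata"],
        # Attach the copied vmdk as the first disk on that controller.
        ["VBoxManage", "storageattach", vm_name, "--storagectl", "SATA",
         "--port", "0", "--device", "0", "--type", "hdd", "--medium", vmdk_path],
    ]


# Hypothetical name and path, for illustration only.
for cmd in vbox_commands("ghe-shanghai", "/data/ghe/github-enterprise.vmdk"):
    print(shlex.join(cmd))
```

Printing first makes it easy to review the commands before pasting them into a shell on the host.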

After booting GitHub Enterprise, we found that we couldn't get into http://example.com/setup even after uploading the license. We opened a ticket, and a GitHub engineer confirmed that this version has a bug that occasionally prevents access to the management console. To continue, we first needed to ssh into the appliance as the admin user, remove the previous license file with sudo rm /data/enterprise/enterprise.ghl, and then run god restart enterprise-manage. But we were stuck at the first step because we had no sudo permission.

Knowing this, we now needed root on the instance. To gain root access we had to boot into recovery mode: shut down the VM and power it back on, then, in the hypervisor console, hold the left Shift key down while it boots. This brings up the GRUB menu, where we can choose to boot into recovery mode. Because the boot process goes by so fast, we came up with a way to delay it by running the following command on the host, outside the VM:
$ VBoxManage modifyvm YOUR_VM_INSTANCE_NAME --bioslogodisplaytime 10000

After that, we had enough time to press the key. Note that for VirtualBox it's the *Shift* key, not the Tab key. We kept pressing Tab out of habit until we noticed that silly mistake.

Once in recovery mode, follow the documentation: change the root password and add your SSH public key to /root/.ssh/authorized_keys. After the server has rebooted normally, we can log in as root using public key authentication.

If everything goes well, we can now upload the license and enter the management console.

Why did we want to get into the management console? Besides normal user management (adding and deleting users), we also needed to switch authentication from the built-in method to an LDAP backend. The setup is straightforward: we only needed to fill out the form with values such as Host, Port, and User ID field, which our IT support team provided. After clicking the "Save settings" button and waiting about 10 minutes for the change to take effect, something weird happened: we couldn't log in with our LDAP accounts, and every attempt returned a 500 error. After some troubleshooting, we found that the LDAP settings hadn't been applied because the configuration run wasn't properly triggered when the settings were saved, so the default authentication was still active. We could see this from the login page, which should have said something like "Sign in via LDAP" but instead said "Sign in". We then ran the enterprise-configure command as technical support suggested, and this time it worked. Why? We can only suspect that some customizations made to the VirtualBox instance caused the configuration process to fail.

After signing in through the new LDAP-backed portal, a new problem appeared. Say a user's old username was barackobama and the new LDAP account name is barackobama.bo. With the new account we saw an empty page, which makes sense: barackobama.bo is a brand-new account with no connection to the old barackobama account. Taken at face value, all of our code would be lost, or we would have to git clone with the old account and git push with the new one by hand to bring the code back. GitHub has a better answer: just rename the user to the new username at the http://example.com/stafftools/[username]/admin URL. Note that the dot "." is converted to a hyphen, so remember to rename the user to "barackobama-bo".
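With many users to rename, it helps to compute the target usernames up front. The sketch below covers only the dot-to-hyphen rule we observed; whether GitHub Enterprise rewrites other characters is something we did not check, and the helper name `ldap_login_to_github` is ours, not part of any GitHub tool.

```python
def ldap_login_to_github(ldap_login: str) -> str:
    """Map an LDAP login to the username to rename the old account to.

    Assumption: only '.' is rewritten (to '-'), as observed in our migration.
    """
    return ldap_login.replace(".", "-")


# Old username -> new LDAP login, using the example from the post.
renames = {"barackobama": "barackobama.bo"}
for old, ldap_login in renames.items():
    print(f"rename {old} -> {ldap_login_to_github(ldap_login)}")
# prints "rename barackobama -> barackobama-bo"
```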

Our GitHub Enterprise came back to life within the pre-scheduled downtime.

Finally, thanks to @michaeltwofish, @b4mboo, @buckelij, @donal. Your professional skills and attitude really impressed me!

On Choosing Open Source Software

11/05/2014 update:
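After finishing this post I synced it to weibo as usual, and the response turned out to be fairly big. A small portion dared to speak up in support (after all, the software's author is also on weibo, so one has to pick the right side), which is commendable; half took a completely neutral stance; and the other half ranted wildly, or, to put it more accurately, talked nonsense.
I wrote this post for two reasons: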

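  1. To answer the question in the title: how to choose an open source product that is likely to be reliable
  2. To say plainly, drawing on these criteria, what problems I recently ran into with tcpcopy, which show that tcpcopy still has a lot of room for improvement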

For those talking nonsense, I suggest (and it is only a suggestion):

  1. Read the whole post properly
  2. Look at the outside world more, and don't lock yourself into a narrow space

If you still want to keep ranting, let's continue once you have understood what @ayanamist said.

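To my surprise, I received several emails over the past two days. To the disappointment of the ranters, they were not written to keep ranting; on the contrary, they all expressed very positive views. The emails ended in much the same way, asking me to add them on QQ and be friends. Making friends is a good thing, but as this post already makes clear, compared with QQ (which I basically don't use), I prefer gmail, gtalk, and twitter. So, a friendly reminder: RTFB (read the friendly blog :) still matters. If you want to get in touch, my contact details are on the about/ page, and you are very welcome.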


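First, some background. Evaluating a piece of open source software, that is, deciding whether or not to use it, comes down to the following aspects.
1. Technical fit. This is the most basic layer: does the tool or software meet your needs? For example, does it need to be cross-platform, are the performance requirements high, are the security requirements high, and so on.
2. Documentation. Plenty of open source projects these days throw around phrases like "high performance", "easy to use", and "more powerful than the competition". But if there isn't even detailed documentation, and the demo doesn't describe the environment it was set up in, would you dare use it?
3. Clean, commented code, especially the latter. A lot of code on github runs from top to bottom without a single comment, which not only digs a grave for the author but also causes users a great deal of inconvenience.
4. Update frequency. If the last update was 5 years ago, think about whether it is worth using. This is not absolute, of course: quite a few projects have a latest stable release that is 3 or even 5 years old while the development version is quite active. On the other hand, if it is a new product that has only been out for a few months, be cautious.
5. Community. This is the most important of all.
5.1. Number of users. If a github project has hardly any stars, think twice.
5.2. Level of participation. If there is one issue every six months, and the mailing list and irc go days without a message, leave it alone for now.
5.3. How users participate. Using entertainment software like QQ or Wangwang to run an open source community is absurd: the information cannot be indexed publicly by search engines, and the user base is essentially limited to a very small Chinese-speaking community. Mailing lists and google groups are easier to use and, more importantly, far more open.
5.4. Number of committers. If one person has been the only one committing for years, think about what happens to your production services if one day that person is gone.
5.5. License. For internal use this is generally not a big problem.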


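Is Aliyun Reliable?

I happened to see that cnblogs (博客园) was using Aliyun's services and followed the story a bit. It was painful to watch.

From an engineering point of view, every post cnblogs shared about this is quite valuable and worth learning from. cnblogs is also a sad case; in their own words, it was "the right bad choice". Overall, Aliyun has certainly been improving over the past few years, no doubt about it, but I don't know how large the improvement is in absolute terms. AWS also had frequent outages a few years ago, so it is normal for aliyun to have some too, but I certainly won't be using it.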

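Maintaining GitHub Enterprise

As a deep-pocketed company, we bought the enterprise edition of github without flinching; by the end of March our cumulative spend was close to six figures in $$$. As the maintainer, let me share my impressions from an administration point of view. In short: well worth the money. If your company has fewer than 500 engineers, give it a try. Your developers will be grinning from ear to ear from using such a good product, which indirectly raises productivity, and in the end it is the company that benefits.
These "out of box" products are becoming more and more common. github is a typical example, and so is elasticsearch, which I mentioned before (github enterprise depends heavily on es). The benefits are obvious: far less maintenance work. You have no need, and not much chance, to learn how the product works internally, as you can tell from the ssh login account github provides: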
To preserve the integrity of the appliance and ensure it remains in a consistent state, we have the following limitations in place:

    Root access is not provided.
    The admin user password is not provided.
    Installation and execution of third party software is not permitted.
    Modification of the underlying VM configuration is not permitted.

Bypassing any of these limitations will void all warranties and may place your installation in an unsupportable state.

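When I get a new product or service, the first thing I do is read through the official documentation. When I first opened it I could hardly believe it: only a few dozen articles in total, which you can get through in two or three hours. Looking back, for an out of box product like this, a few dozen articles are more than enough. Here is a brief summary of the areas covered that I care about:
1. For monitoring, there is an official API that is easy to call
2. User management and log audit/forwarding take just a few clicks in the web portal
3. Several authentication methods are supported, including the default built-in method and LDAP
4. Backing up data is remarkably simple; according to the docs, a few cli commands do the job
5. Creating an instance with virtualbox/VMware, importing certificates, upgrading, and migrating are all remarkably convenient
6. If you under-provisioned disk, adding a block device on the fly is also remarkably quick

How valuable is the documentation? For example, when walking you through an upgrade, it tells you explicitly: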

  1. Shut down your Enterprise virtual machine.
  2. Take a snapshot of your virtual machine.
  3. Boot your virtual machine.
  4. Enable Maintenance Mode.

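Anything else to watch out for? No, it's that simple. And if you hit a problem the docs don't cover, just open a ticket; their engineers' response time, answer quality, and attitude are on a par with RedHat's.
If github ever IPOs, I will hold their stock for the long term. Here is a very old screenshot of a crack done by one of our early engineers; we have since "mended our ways" and have long been using officially licensed seats.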

 

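The APCN-2 Outage & Submarine Cables 101

Submarine cables are an extremely fundamental and important part of the Internet, or rather of the global physical network, yet they keep a low profile and few people ever pay attention to them. Like air and water, they are taken for granted and forgotten, and people only remember they exist when something goes wrong. When the air is bad, people think of turning on a purifier or wearing a mask, but when a cable breaks, ordinary people really can't do anything. The recent APCN-2 outage, for example, affected not only CN but many other countries, including PH.
Let's start with the APCN-2 problem.
Asia Pacific Cable Network-2 (APCN-2) was initiated and built by 26 investing organizations. It connects countries and regions in Asia, is about 19,000 km long, and has 10 landing stations; the landing stations in China are Chongming in Shanghai and Shantou in Guangdong. The backbone consists of four fiber pairs and uses Dense Wavelength Division Multiplexing (DWDM) to reach a maximum bandwidth of 2.56 Tbit/s: each fiber pair carries up to 640 Gbps (64 wavelengths × 10 Gbps), and the four pairs together give 10 × 64 × 4 = 2560 Gbps. The capacity currently in use should be the 280 Gbps from the second upgrade in 2006. The network uses a self-healing ring topology. China Telecom and China Unicom both took part in building this cable. The most authoritative information can be found here.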
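To put the design capacity next to the capacity in use, here is a small worked calculation using only the figures quoted above (64 wavelengths of 10 Gbps on each of four fiber pairs, and 280 Gbps in use after the 2006 upgrade). The variable names are ours.

```python
GBPS_PER_WAVELENGTH = 10
WAVELENGTHS_PER_PAIR = 64
FIBER_PAIRS = 4
IN_USE_GBPS = 280  # figure quoted for the 2006 upgrade

per_pair_gbps = GBPS_PER_WAVELENGTH * WAVELENGTHS_PER_PAIR  # 640
design_gbps = per_pair_gbps * FIBER_PAIRS                   # 2560

print(f"per pair: {per_pair_gbps} Gbps")
print(f"design capacity: {design_gbps / 1000} Tbit/s")
print(f"in use: {IN_USE_GBPS / design_gbps:.1%} of design capacity")
# prints 640 Gbps, 2.56 Tbit/s, and 10.9%
```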

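On March 21, APCN-2 segment S4A (Chongming, China to Pusan, Korea) was cut, and on March 24, S6 (Tanshui, Taiwan to Chikura, Japan) was cut as well. The approximate locations of the breaks can be seen here. Given that, the widespread packet loss that followed is no surprise. Service was finally restored on April 7.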


For how submarine cables are laid, there are a few explanatory Q&As on Quora:
* How are major undersea cables laid in the ocean?
* How would you go about laying undersea fibre cable?
* Do private telecommunications companies own the undersea cables that connect the internet across continents?
* What is the cost of laying high voltage underwater, sea or fresh, copper cable, dollar per meter or $/km?

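Still not clear? A video is worth a hundred words, so just watch this one from discovery. The documentary shows Tyco Resolute, a cable-laying ship owned by Tyco Telecommunications, laying a cable from Costa Rica to join the main Pan-American Crossing (PAC) cable. Seeing such an imposing monster, I would really love the chance to spend a couple of days on board.

Here are a few companies that lay submarine cables, for anyone interested in browsing; they all look like they have very deep pockets: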
* http://www.k-kcs.co.jp/english
* http://www.subcom.com/home.aspx
* http://www.te.com/en/home.html


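How Fast the OpenSSL (CVE-2014-0160) Vulnerability Got Patched

This is not the first time OpenSSL has had a vulnerability with this kind of severe impact. CVE-2012-2110 in April 2012 kept me busy for quite a while.
I spotted this vulnerability this morning while skimming feedly, quickly confirmed that our production servers were not affected, and then settled in to watch.
OpenSSL posted the official advisory first.
The major distributions all published the impact of the Heartbleed bug and how to fix it the same day (April 8): USN (1, 2), DSA, Bugzilla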

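As for impact, it is fairly certain that the vast majority of sites, taobao included, were affected, not to mention countless small e-commerce sites and other small sites. There are screenshots on wooyun as proof: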

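This vulnerability has now been out for quite a while (by Internet standards); as I write this post (08-21-04-08-2014) it has been at least five or six hours. Unfortunately, I ran a scan and found that the following sites still had not upgraded or patched:
1. kyfw.12306.cn, whose certificate was problematic to begin with
2. The various mail services under 126.com/163.com
3. login.yahoo.com and www.flickr.com, which belong together; www.tumblr.com has been acquired too but probably isn't integrated yet, and it was fixed quite quickly this time
4. www.quora.com, which I didn't expect to be this slow
5. JD (京东) is an interesting one, with two different standards for overseas and domestic login: passport.en.jd.com has no ssl certificate, while passport.jd.com does
6. fitbit.com; it isn't primarily an Internet company, so the slowness is understandable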

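Until those sites are confirmed to be upgraded, do not log in to them. Checking whether a site is safe is easy: grab a script from github and run it, or just use this. Others such as amazon.com, coursera.org, dropbox.com, evernote.com, facebook, twitter.com, foursquare.com, github.com, instagram.com, vimeo.com, and wikipedia.org showed no problem when I scanned them (or perhaps they were never affected in the first place).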
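For your own servers, a first triage step is to look at the OpenSSL version string: upstream releases 1.0.1 through 1.0.1f are affected by CVE-2014-0160 and 1.0.1g carries the fix. The sketch below checks only that range on a bare upstream version string; the function name is ours. It is indicative at best, because distributions backport the fix without changing the upstream version letter, and it does not cover other release series or probe a remote site the way the scanners mentioned above do.

```python
import re


def looks_heartbleed_affected(version: str) -> bool:
    """True if a bare upstream version string falls in 1.0.1 .. 1.0.1f."""
    match = re.fullmatch(r"1\.0\.1([a-z]?)", version.strip())
    if match is None:
        return False  # another release series, not covered by this sketch
    # "" (plain 1.0.1) and the letters a..f sort before "g", the fixed release.
    return match.group(1) < "g"


for v in ["0.9.8y", "1.0.0l", "1.0.1", "1.0.1f", "1.0.1g"]:
    print(v, looks_heartbleed_affected(v))
```

On a package-managed system, the distribution's security notice for the installed package is the more reliable source than this string comparison.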

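By a conservative estimate, this vulnerability was actually discovered a week earlier but not disclosed, so cloudflare had plenty of time to upgrade. The blog post they published today is a brilliant bit of PR and completely outclasses ali.

How to fix it?
For service providers: upgrade as soon as possible.
For users: it is best to change your passwords, since it is confirmed that some user information has already leaked, including from taobao. If you use the same username and password on every site, change them tonight.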


update: 07-15-04-09-2014, all 6 sites mentioned above have now been fixed; the speed is acceptable.

update: 04-09-2014, Cisco has published an official list of affected devices; IOS and Nexus are not on it.