Migrating GitHub Enterprise From Beijing To Shanghai

We needed to migrate our GitHub Enterprise installation from a data center in Beijing to one in Shanghai. The process is not complex, but it is time-consuming: the migration took us more than a week. Here are some of the lessons we learned along the way.

Our GitHub Enterprise instance runs version 11.10.331, which is a little out of date. It runs on VirtualBox with 4 CPU cores and 16 GB of memory, and the code takes up about 100 GB of disk space. According to GitHub staff, VirtualBox has some performance issues and can occasionally even corrupt data, so since version 2.0 GitHub no longer supports VirtualBox and recommends VMware's free vSphere hypervisor, AWS, or OpenStack KVM as a replacement.

Our new environment in Shanghai has strict security restrictions on which platforms we can choose: nothing can be installed that hasn't been comprehensively investigated and tested, which ruled out VMware and KVM. AWS is not widely available in China and, most importantly, cannot be hosted inside a private network, so it was out too. The only choice left was to keep using VirtualBox.

Next, we needed to export the data from VirtualBox. There are at least two ways to do this: the most straightforward is to copy the vmdk file; the other is to use VirtualBox's export feature, which lets you move the image into another host environment. We chose the first and copied the vmdk file directly. Remember to shut down the VM before doing this, or your data may be corrupted.

Originally, we planned to rsync the 100 GB of data from the Beijing data center to Shanghai directly over the WAN. Although both ends are private networks, we could temporarily open a DNAT connection on the Beijing side so that the server in Shanghai could initiate a connection to Beijing and start the transfer. This looked fine except that the outbound bandwidth is only 20 Mbps, which means that, in theory, the transfer takes 100 GB * 8 * 1024 / 20 Mbps = 40960 s, almost half a day. There was also an issue we couldn't control: since the data travels a long distance over the public network, stability is not guaranteed. If the connection dropped for some unknown reason, the time already spent would be wasted.
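
The estimate above can be reproduced with a few lines of Python. This is only a sketch of the ideal case: it ignores protocol overhead and retransmissions, and the function name is made up for illustration.

```python
def transfer_seconds(size_gb: float, bandwidth_mbps: float) -> float:
    """Ideal transfer time in seconds, ignoring overhead and retries."""
    size_megabits = size_gb * 1024 * 8  # GB -> MB -> megabits
    return size_megabits / bandwidth_mbps


seconds = transfer_seconds(100, 20)  # 100 GB over a 20 Mbps uplink
print(f"{seconds:.0f} s = {seconds / 3600:.1f} h")  # 40960 s = 11.4 h
```

In practice the real number would be higher, which is one more reason the USB HDD looked attractive.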

So we did it the traditional way: we took a USB HDD to the data center and copied the data onto it, which took about 2 hours. Later, we rsynced the data from the HDD to our data center in Shanghai over a leased line.

With the raw data in Shanghai, the next step was to set up a new VirtualBox instance, which is quite easy. We recommend installing VNC or something similar, such as SSH X11 forwarding, so you can use a GUI for the remaining steps. If you're comfortable with VBoxManage, that works too. All you need to do is create a new instance and attach the vmdk file as its storage disk.
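
For the VBoxManage route, the sequence looks roughly like the sketch below. It only prints the commands rather than running them. The VM name, disk path, controller name and OS type are placeholders we made up for this example, not values from our real setup, so adjust them before using anything here.

```python
import shlex


def build_vm_commands(name, vmdk_path, cpus=4, memory_mb=16384):
    """Return VBoxManage argument lists that create a VM around an existing vmdk."""
    return [
        # Create and register an empty VM (the OS type is an assumption).
        ["VBoxManage", "createvm", "--name", name, "--ostype", "Ubuntu_64", "--register"],
        # Match the sizing of the old instance: 4 cores, 16 GB of memory.
        ["VBoxManage", "modifyvm", name, "--cpus", str(cpus), "--memory", str(memory_mb)],
        # Add a SATA controller (the controller name "SATA" is arbitrary).
        ["VBoxManage", "storagectl", name, "--name", "SATA", "--add", "sata"],
        # Attach the copied vmdk as the first disk on that controller.
        ["VBoxManage", "storageattach", name, "--storagectl", "SATA",
         "--port", "0", "--device", "0", "--type", "hdd", "--medium", vmdk_path],
        # Boot without a GUI window.
        ["VBoxManage", "startvm", name, "--type", "headless"],
    ]


# Hypothetical VM name and disk path; print the commands instead of running them.
for cmd in build_vm_commands("ghe", "/data/ghe/ghe.vmdk"):
    print(shlex.join(cmd))
```

Printing first makes it easy to review the commands before pasting them into a shell on the host.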

After booting GitHub Enterprise, we found that we couldn't get into http://example.com/setup even after uploading the license. After we opened a ticket, a GitHub engineer confirmed that this version has a bug that occasionally prevents access to the management console. To continue, we first needed to ssh into the appliance as the admin user, remove the previous license file with sudo rm /data/enterprise/enterprise.ghl, and then run god restart enterprise-manage. But we were stuck at the first step because we had no sudo permission.

Knowing the issue, we now needed root access to the instance. To gain root access we had to boot into recovery mode: shut down the VM and power it back on, then, inside the hypervisor console, hold the left Shift key while it boots. This brings up the GRUB menu, where we can choose to boot into recovery mode. Because the boot process goes by so fast, we delayed it by running the command below on the host, outside the VM:
$ VBoxManage modifyvm YOUR_VM_INSTANCE_NAME --bioslogodisplaytime 10000

After that, we had enough time to press the key. Note that it's the *Shift* key, not the Tab key, for VirtualBox. We mistakenly kept pressing Tab for a while before we noticed.

Once in recovery mode, follow the documentation: change the root password and add your SSH public key to /root/.ssh/authorized_keys. After the server has rebooted normally, we can log in as root using public key authentication.

If everything goes well, we can now upload the license and enter the management console.

Why did we want to get into the management console? Besides ordinary user management (adding and deleting users), we also needed to switch authentication from the built-in method to an LDAP backend. The setup is straightforward: we only had to fill out a form with values such as Host, Port, and User ID field, which our IT support team provided. After we clicked the "Save settings" button and waited about 10 minutes for it to take effect, something weird happened: we couldn't log in with our LDAP accounts, and every attempt returned a 500 error. After some troubleshooting, we found that the LDAP settings hadn't taken effect because the configuration run wasn't properly triggered when the settings were saved, so the default authentication was still active. We could see this on the login page, which should have read something like "Sign in via LDAP" but actually read "Sign in". We then ran the enterprise-configure command, as technical support suggested, and this time it worked. Why? We can only suspect that some customization made to the VirtualBox instance caused the configuration process to fail.

Suppose a user's old username was barackobama and the new LDAP account name is barackobama.bo. After signing in to the new LDAP-backed portal with the new account, we saw an empty page. That makes sense: barackobama.bo is a brand-new account with no connection to the old barackobama account. Taken at face value, all of our code would be lost, or we would have to git clone with the old account and git push with the new one manually to bring the code back. Fortunately GitHub has a better answer: just rename the old user to the new username at the http://example.com/stafftools/[username]/admin URL. Note that the dot "." is changed to a hyphen, so remember to rename the user to "barackobama-bo".
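
With many users to rename, it helps to work out the target names up front. The sketch below applies only the one rule we observed, dot to hyphen. The function name is ours, and any other character handling GitHub Enterprise may do is not covered here.

```python
def github_username_for(ldap_login: str) -> str:
    """Map an LDAP login to the username to rename to (dot becomes hyphen)."""
    return ldap_login.replace(".", "-")


# Old local account -> LDAP login, using the example from this post.
accounts = {"barackobama": "barackobama.bo"}

for old_name, ldap_login in accounts.items():
    print(f"rename {old_name} -> {github_username_for(ldap_login)}")
```

The renames themselves still have to be done by hand on the stafftools page for each user.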

Our GitHub Enterprise came back to life within the pre-scheduled downtime window.

Finally, thanks to @michaeltwofish, @b4mboo, @buckelij, and @donal. Your professional skills and attitude really impressed me.

Maintaining GitHub Enterprise

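As a company with deep pockets, we didn't hesitate to buy GitHub Enterprise. As of the end of March, our cumulative spend is close to six figures in US dollars. As the maintainer, I'd like to share what it's like to use from an administrative point of view. In short: well worth the money. If your company has fewer than 500 engineers, give it a try. Your coders will be grinning from ear to ear at getting to use such a good product, which indirectly improves productivity, and in the end it is the company that benefits.
More and more products now work "out of the box". GitHub is a typical example, as is Elasticsearch, which I mentioned before (GitHub Enterprise depends heavily on ES). The benefits go without saying: much less maintenance work. You have no need, and not much chance, to understand how the product works internally, as you can tell from the terms attached to the SSH login account GitHub provides: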
To preserve the integrity of the appliance and ensure it remains in a consistent state, we have the following limitations in place:

    Root access is not provided.
    The admin user password is not provided.
    Installation and execution of third party software is not permitted.
    Modification of the underlying VM configuration is not permitted.

Bypassing any of these limitations will void all warranties and may place your installation in an unsupportable state.

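Whenever I get a new product or service, the first thing I do is read through the official documentation. When I first opened it, I could hardly believe it: there are only a few dozen articles in total, which you can get through in two or three hours. Looking back, for an out-of-the-box product like this, a few dozen articles are more than enough. Here is a brief summary of the areas covered that I care about:
1. For monitoring, there is an official API that is easy to call.
2. User management and log audit/forwarding take just a few clicks in the web portal.
3. Several authentication methods are supported, including the default built-in method and LDAP.
4. Backing up data is remarkably simple; according to the documentation, a few CLI commands are all it takes.
5. Creating an instance with VirtualBox/VMware, importing the certificate, upgrading, and migrating are all remarkably easy too.
6. If you under-provisioned disk space, adding a block device on the fly is also remarkably quick.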

How valuable is the documentation? For example, when guiding you through an upgrade, it tells you very clearly:

  1. Shut down your Enterprise virtual machine.
  2. Take a snapshot of your virtual machine.
  3. Boot your virtual machine.
  4. Enable Maintenance Mode.

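Anything else to watch out for? No, it really is that simple. And if you hit a problem the documentation doesn't cover, just open a ticket: their engineers' response time, answer quality, and attitude are on par with Red Hat's.
If GitHub ever IPOs, I'll hold their stock for the long term. Here is a screenshot of a crack that one of our early engineers made long ago; we have since "mended our ways" and have long been using officially licensed seats.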


GitHub is an internet company with a conscience

Note: this post was written in December 2012 but not published until February 2013 -.-

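Quite a few companies abroad (I have hardly seen any in mainland China) run a status.xx.oo page so that customers can check the current state of the site at any time. Few, however, go as far as GitHub and publish a summary of every outage. From that point of view GitHub is a company with a "conscience", even though their https://status.github.com/ page often lights up with yellow or even red alerts, and even though their App Server Availability last month was only 99.4643%.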
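
To put that availability figure in perspective, it can be turned into minutes of downtime. This is a rough sketch that assumes a 30-day month; the exact measurement window GitHub used is not something we know.

```python
def downtime_minutes(availability_percent: float, days: int = 30) -> float:
    """Minutes of downtime implied by an availability percentage over `days` days."""
    total_minutes = days * 24 * 60
    return (1 - availability_percent / 100) * total_minutes


# 99.4643% over an assumed 30-day month is a bit under four hours.
print(f"{downtime_minutes(99.4643):.0f} minutes")  # 231 minutes
```
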
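Last month they had the most severe outage in their history; they published both an early account and a follow-up summary. Both write-ups are very thorough. The causes boil down to:
1. The network equipment itself had bugs.
2. HA failed to be HA at the critical moment.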

Besides GitHub, CloudFlare also does this very well, as several of their blog posts show.

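Although CloudFlare, along with the likes of Evernote and Dropbox, has no "outage" tag on its blog, any single post they publish easily outclasses those of the unnamed companies in mainland China.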