Real-Time Visualization of Thousands of Self-Service Kiosks with noVNC

The background is simple: we operate and maintain the self-service kiosks deployed across Beijing's municipal Grade 3A hospitals, including registration/ticketing machines, test-report printers, and several other derived kiosk types. The fleet has already grown past a thousand units, and the different hospital campuses exchange data with our machine room over a dense mesh of leased lines, continuously transferring traffic over a variety of protocols.

Due to various technical and non-technical constraints, these machines can only run Microsoft Windows. To manage them, we use SaltStack for routine product upgrades and changes. For visibility into the desktop itself, for example when we need to see what the patient in front of a given kiosk is doing, we rely on VNC.

When the kiosks first went live a year ago, we had not given much thought to operating them at scale. At the time, RealVNC was baked into every kiosk image during the Clonezilla packaging stage, and we connected through the RealVNC client on a Windows jump host. RealVNC supports bookmarks, so checking each kiosk's screen was reasonably convenient back then. As the fleet grew, this traditional client-server workflow of first logging in to a Windows jump host and then browsing RealVNC bookmarks became inadequate. Moving to the web was inevitable. Following the principle of small changes over large ones, we investigated RealVNC's various SDKs; unfortunately none of them supported what we needed, and RealVNC also has licensing issues, so we dropped it. TightVNC, by contrast, has no such problems.

Our team then evaluated several web-based VNC clients: peer-vnc, noVNC, and Guacamole. peer-vnc has very little activity and is built on top of noVNC anyway; Guacamole was still in incubation (though quite powerful) and fairly heavyweight, so we dropped both. In the end we chose to build on top of noVNC.

Since TightVNC does not support WebSockets natively, noVNC provides the websockify tool to act as a TCP socket proxy: it accepts the WebSocket handshake, translates it into plain TCP socket traffic, and relays it between the two ends. Because we have a lot of machine information to maintain, we store all of it, such as the kiosk code, machine name, and IP/MAC details, in MySQL beforehand, and then generate a connection configuration file for each hospital campus in the established token: host:port format. This step can be wrapped in a Jenkins/Rundeck job so that additions, deletions, and updates stay easy. With that in place, the forwarder can be started.
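
As a rough sketch (the paths, tokens, and addresses below are made up for illustration), the generated per-campus file and the websockify invocation look roughly like this:

    # hypothetical per-campus token file, e.g. /etc/websockify/tokens/campus-a.conf
    # one kiosk per line, in websockify's "token: host:port" format
    zyj-0001: 10.1.2.11:5900
    zyj-0002: 10.1.2.12:5900

    # start the proxy on port 6080, resolving each target by its token
    websockify --token-plugin TokenFile --token-source /etc/websockify/tokens/ 6080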

Then simply open the local port 6080 in a browser. The screenshot below shows it integrated into our internal management platform, where we can quickly locate a kiosk's health status and jump into VNC to confirm its current state:

By default noVNC connects interactively, prompting for authentication and permissions (account credentials, read/write control), whether to use true color, and so on. This is not very practical for production, so we later moved authorization into Django, with LDAP for login authentication. Given the limited bandwidth of the leased lines, we disabled true color and enabled compression by default. VNC is still fairly bandwidth-hungry: on average each new connection consumes roughly 1 Mbps, so if you want a real-time display wall you need to keep this bottleneck in mind.

The image below is the live noVNC view, from which we can see the real-time state of this particular kiosk and take action on the problematic ones:

Finally we put everything behind HAProxy/Nginx (Nginx supports WebSocket since 1.3), and a quickly built, production-usable kiosk visualization platform was complete. Internally we named the platform bangkok.
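
A minimal sketch of the relevant Nginx snippet (the location and upstream address are illustrative):

    location /websockify {
        proxy_pass http://127.0.0.1:6080;
        proxy_http_version 1.1;                   # required for WebSocket
        proxy_set_header Upgrade $http_upgrade;   # pass the upgrade handshake through
        proxy_set_header Connection "upgrade";
        proxy_read_timeout 3600s;                 # keep long-lived VNC sessions alive
    }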

The same idea can be ported to *nix platforms; many mainstream clouds such as OpenStack, OpenNebula, and DigitalOcean have also integrated noVNC.

Big transaction and MySQL replication lag

This Saturday afternoon I was on call and got an alert: one of our production MySQL slaves began to lag like this:

After reviewing the monitoring metrics for a while, I got the big breakthrough: there was a large spike in lock structs and delete operations during the lag period:

Usually this is caused by a big transaction on the master node. Verification was quite straightforward: just analyze the binlog. The script was copied from Percona.
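
The idea, roughly (the awk summary script is the one from Percona's blog; its file name and the binlog file name here are placeholders):

    mysqlbinlog --base64-output=decode-rows -vv mysql-bin.000123 \
        | awk -f summarize_binlog_rows.awk    # prints per-transaction table, query type and row counts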

After running the script, I really did find something astonishing: during that period the master had one transaction per minute, and each was comparatively big, with nearly 10k delete operations per transaction:
Timestamp : #160904 13:26:54 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7559 row(s) affected
Timestamp : #160904 13:27:56 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7726 row(s) affected
Timestamp : #160904 13:28:57 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7864 row(s) affected
Timestamp : #160904 13:29:59 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8039 row(s) affected
Timestamp : #160904 13:31:01 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7961 row(s) affected
….
Timestamp : #160904 13:33:05 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7985 row(s) affected
Timestamp : #160904 13:34:06 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7849 row(s) affected
Timestamp : #160904 13:35:08 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7927 row(s) affected
Timestamp : #160904 13:36:10 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8014 row(s) affected
Timestamp : #160904 13:37:11 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8031 row(s) affected
Timestamp : #160904 13:38:13 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8104 row(s) affected
Timestamp : #160904 13:39:15 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7730 row(s) affected
Timestamp : #160904 13:40:16 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7668 row(s) affected
Timestamp : #160904 13:41:18 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7836 row(s) affected
Timestamp : #160904 13:42:20 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7757 row(s) affected

Timestamp : #160904 13:48:29 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7835 row(s) affected
Timestamp : #160904 13:49:31 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7839 row(s) affected
Timestamp : #160904 13:50:32 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8028 row(s) affected
Timestamp : #160904 13:51:34 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8223 row(s) affected
Timestamp : #160904 13:52:35 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8119 row(s) affected
Timestamp : #160904 13:53:37 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7924 row(s) affected
Timestamp : #160904 13:54:38 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 2139 row(s) affected
Timestamp : #160904 14:17:10 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 5592 row(s) affected
Timestamp : #160904 14:18:12 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8260 row(s) affected
Timestamp : #160904 14:19:13 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7876 row(s) affected
Timestamp : #160904 14:20:14 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7994 row(s) affected
Timestamp : #160904 14:21:16 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7994 row(s) affected
Timestamp : #160904 14:22:17 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 7813 row(s) affected

Timestamp : #160904 18:15:27 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8191 row(s) affected
Timestamp : #160904 18:16:27 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8180 row(s) affected
Timestamp : #160904 18:17:27 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8304 row(s) affected
Timestamp : #160904 18:18:28 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 8128 row(s) affected

If we compare transactions from other time periods, the difference is obvious; below are the top 10 transactions of Sep 2, each far smaller than 10,000 rows:
Timestamp : #160902 18:00:31 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 850 row(s) affected
Timestamp : #160902 10:00:33 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 813 row(s) affected
Timestamp : #160902 8:00:10 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 786 row(s) affected
Timestamp : #160902 10:55:42 Table : `qmq_backup`.`qmq_log_records` Query Type : DELETE 646 row(s) affected
Timestamp : #160902 12:30:01 Table : `qmq_backup`.`qmq_log_records` Query Type : INSERT 608 row(s) affected
Timestamp : #160902 12:30:01 Table : `qmq_backup`.`qmq_log_records` Query Type : INSERT 606 row(s) affected
Timestamp : #160902 19:07:37 Table : `qmq_backup`.`qmq_log_records` Query Type : INSERT 599 row(s) affected
Timestamp : #160902 19:10:09 Table : `qmq_backup`.`qmq_log_records` Query Type : INSERT 582 row(s) affected
Timestamp : #160902 19:07:23 Table : `qmq_backup`.`qmq_log_records` Query Type : INSERT 575 row(s) affected

That explains why the IO thread works fine yet the lag is still large: by default the slave has only a single SQL thread to apply everything, including big transactions from the master, so it cannot keep up.

What's next? Find out which service caused that big transaction. Tracing back from the DB log, there really is a job that purges old data every minute, like this:
[2016-09-04 14:21:53  INFO com.benmu.rapidq.backup.backup.task.HistoryCleaner:?] clean history qmq_backup_messages data by condition create_time < '2016-08-20 14:21:53', total clean: 892, elapse: 60 ms

After a discussion with the developer, we decided to stop the job. Besides that, we are considering increasing the slave's apply threads (slave_parallel_workers), since we currently run MySQL 5.7 and multi-threaded replication is one of its big selling points.
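
A minimal sketch of what enabling multi-threaded replication on the slave could look like in MySQL 5.7 (the worker count is illustrative, not a tuned value):

    # my.cnf on the slave
    slave_parallel_type    = LOGICAL_CLOCK   # parallelize by the master's commit groups
    slave_parallel_workers = 8               # 0 (the default) means the classic single SQL thread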

Nginx 200 response with empty body caused by a double slash

We set up a new Nginx cluster to forward requests to HIS. When testing with curl using the domain name as the Host header, we unfortunately got a 200 return code but an empty body, like this:

When we used the IP as the Host header, it worked well and returned the expected result:
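
Roughly how the two tests looked (the hostname and IPs are placeholders):

    curl -v -H "Host: his.example.com" http://10.0.0.10/some/path   # 200, but empty body
    curl -v -H "Host: 10.0.0.20"       http://10.0.0.10/some/path   # 200, expected body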

After some debugging turned up nothing, tcpdump came to the rescue:

In the tcpdump output, the upper red arrow marks the request using the domain name as the Host header and the lower one the request using the IP; we can see there is a double slash in the URI when the domain name is used.

After reviewing the Nginx config file, we found the misconfigured location:
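
Our exact block was in the config shown above; as a made-up illustration of this class of mistake, an extra slash in the proxy_pass URI is enough to produce a doubled slash upstream:

    # hypothetical: the matched prefix "/his/" is replaced by "//",
    # so a request for /his/api is proxied upstream as //api
    location /his/ {
        proxy_pass http://his_backend//;
    }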

OpenVPN Connectivity Issues over the Public Network

We established an OpenVPN tunnel between two IDCs in September 2014 and have been monitoring the availability and performance of both ends for a long time. The geographical distance between the two IDCs is quite short, but they sit on different telecom carriers; mtr shows about 7 hops from one end to the other. The screenshot below shows the standard ping loss.


The result is quite interesting. At first we used the UDP protocol and often experienced disconnections; later we switched to TCP and it improved a lot. From the diagram, the average packet loss is 1.11%.
Why 1.11%? The best explanation I have is that the tunnel is often fully saturated during peak hours, and this cannot be solved at the moment, so some packet loss will exist no matter which protocol we use. Another possible reason is the complexity of the public network, which I cannot quantify.
The current setup works under the current circumstances, but offers no "how many nines" guarantee. Anyway, if we want more stable connectivity, a DLL (dedicated leased line) is the better choice.
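
For reference, the UDP-to-TCP switch mentioned above is essentially a one-line change on each end (a sketch; the address and port are placeholders):

    # server.conf
    ;proto udp
    proto tcp-server

    # client.conf
    ;proto udp
    proto tcp-client
    remote vpn.example.com 1194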

Can Venta Airwasher Effectively Reduce PM2.5 or PM10?

There are no official tests stating whether the Venta airwasher can effectively reduce PM2.5 or PM10, so I ran some tests of my own with a Dylos air particle counter. In what follows, a PM2.5 index below 100 counts as good or moderate, and above 100 as unhealthy or even hazardous. My bedroom's door and windows stayed closed the whole time, so it is a confined space.

The answer: it is partially effective. Under certain conditions it works; at other times it does not.

When the outdoor PM2.5 is below 100, the Venta can effectively keep the indoor PM2.5 around 30 or even lower.

However, when the outdoor air is unhealthy, meaning the index is above 100, it really cannot effectively reduce the particles indoors. The indoor readings I get from the Dylos correlate positively with the outdoor index, and sometimes the Dylos even reads 100 or more, which is totally unacceptable for anyone in the bedroom.

So the answer is quite clear: if the air outside is good, just keep the Venta running and it can cope. When the outside air is unhealthy, don't rely on it; you really need to turn on an air purifier with a HEPA filter.

What about PM10? The Venta seems to work all the time: whether the outside air is good or bad, it keeps the index at 100 or lower. Most of the time my bedroom PM10 is around 50, and sometimes, after filtering for a long time without opening the door, it even drops to single digits.

Metrics Dashboard Comparison for the Linux Desktop

I need to know how well (or poorly) my desktops (a Loongson box and a Raspberry Pi 2) are running, so there should at least be a solution that lets me see real-time and historical system metrics such as CPU, memory, network, disk I/O, etc.

Tools like Glances, nmon, sar (sysstat), or atop provide nice real-time metrics, but they cannot easily keep historical data and have no web dashboard for a pleasant view, so I passed on them. Systems like Icinga, Nagios, Graphite, Zabbix, etc. are heavy and complicated, better suited to large production environments and not convenient for a desktop with 2 GB of memory.

Below are some tools that may help you set up a tiny-to-small monitoring/metrics system. At the end, I'll pick the one most suitable for my own use.

psdash
A dashboard overview of the system built with psutil and Flask.

Installation is quite straightforward with the pip package manager. Unfortunately, on my Debian jessie (testing) system I ran into a "PROTOCOL_SSLv3 is not defined" bug that stopped it from running; you can modify the Python gevent library to work around it (1, 2).
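
Getting it running is roughly the following (a sketch; the default listening port may differ on your install, so check psdash --help):

    pip install psdash
    psdash            # then browse to http://localhost:5000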

The web dashboard is quite lightweight and nice; the page refreshes every 3 seconds automatically, so you get real-time data. The fatal problem is that there is no historical data and no charts.

Linux-dash 
Quite similar to psdash, but supports a broader technology stack including Node.js, Go, and PHP. Its official slogan is "A simple, low-overhead web dashboard for GNU / Linux. (~1MB)"; beyond that, not much catches the eye.

ezservermonitor-web 
Besides all the features of psdash and Linux-dash, it also offers simple trigger configuration. For example, the load average metric is displayed with gauges: below 50% utilization it shows green, and above 76% it turns red. It also has built-in ping and server-check functions, which is handy for home use.

eZ Server Monitor also has a console-based tool called eZ Server Monitor`SH, without a web interface.

Web VMStat 
It's a small application written in Java and HTML that displays live Linux system statistics. It simply wraps the vmstat command in a pretty web page, streaming in real time with SmoothieCharts and websocketd.

munin 
The last is usually the best. Yes, it's the only one I really want to recommend: no matter how many desktops you have, Munin can handle them easily. More importantly, it only takes a few minutes to set up (depending on your network) if you install it directly with apt/yum, a real out-of-the-box product. It uses RRD as its backend storage engine, collects at a 300-second interval by default, which I think is enough for most desktop users, and keeps about one year of history.
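
A minimal sketch on a Debian-family system (the package names are the standard ones; the exact web path can differ per distro):

    apt-get install munin          # on the host that collects and renders the graphs
    apt-get install munin-node     # on every machine to be monitored (can be the same host)
    # graphs are generated under /var/cache/munin/www by default on Debian;
    # serve that directory with any web server, e.g. at http://<host>/munin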

If the hundreds of existing plugins don't satisfy you, just write a new one yourself, or port one from any Unix platform, in any scripting language. This is its killer feature.

After the above comparison, you can choose according to your needs.

Want near-real-time (5 minutes or so) and historical data with charts? Munin is the best option.

Want real-time data without history? psdash, Linux-dash, or Web VMStat are what you're looking for.

Still not satisfied? Consider Graphite or Zabbix if you have sysadmin experience, since both are enterprise-level open source products.