Introduction

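A backup system relies heavily on rsync to transfer files, but occasionally the rsync client hangs for a long time while uploading data. This article records the steps taken to track the problem down.

It covers only the I/O logic in the rsync client. The rsync transfer algorithm is out of scope, and the server side is mentioned only briefly.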

Symptoms

The rsync client process stays resident in memory. Attaching strace to it shows:

# strace -p 22819
Process 22819 attached - interrupt to quit
select(4, [3], [], NULL, {57, 106010})  = 0 (Timeout)
select(4, [3], [], NULL, {60, 0})       = 0 (Timeout)
select(4, [3], [], NULL, {60, 0})       = 0 (Timeout)
select(4, [3], [], NULL, {60, 0})       = 0 (Timeout)
select(4, [3], [], NULL, {60, 0})       = 0 (Timeout)
select(4, [3], [], NULL, {60, 0})       = 0 (Timeout)
select(4, [3], [], NULL, {60, 0})       = 0 (Timeout)

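Finding the cause

rsync was compiled without debugging information (that is, without the -g option), so gdb cannot show the call stack.
The trace above still tells us that the process keeps waiting for fd=3 to become readable, and that every select() call times out (60 seconds by default).
First, find out what fd=3 is: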

# ll /proc/22819/fd
total 0
lr-x------ 1 root root 64 Jan  4 19:24 0 -> /dev/null
l-wx------ 1 root root 64 Jan  4 19:24 1 -> pipe:[1247832664]
l-wx------ 1 root root 64 Jan  4 19:24 2 -> pipe:[1247832664]
lrwx------ 1 root root 64 Jan  4 19:24 3 -> socket:[1247890095]
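
The number inside socket:[...] is the socket's inode, which is the key used to find the connection in /proc/net/tcp. A minimal sketch of pulling it out of a link target programmatically; parse_socket_inode is a hypothetical helper name, not part of rsync or any library:

```c
#include <stdio.h>

/* Extract the inode from a /proc/<pid>/fd link target such as
 * "socket:[1247890095]". Returns 0 on success, -1 if the target
 * does not look like a socket link.
 * (parse_socket_inode is a hypothetical helper for illustration.) */
static int parse_socket_inode(const char *target, unsigned long *inode)
{
    char close_bracket;

    /* Expect the literal prefix, a decimal number, then ']'. */
    if (sscanf(target, "socket:[%lu%c", inode, &close_bracket) != 2)
        return -1;
    return close_bracket == ']' ? 0 : -1;
}
```

In practice the target string would come from readlink() on a path such as /proc/22819/fd/3.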

So fd=3 is a socket. Look up its source and destination addresses:

# grep 1247890095 /proc/net/tcp 
  13: 3ea8010a:4cc5 3d21010a:22a9 01 00000000:00000000 00:00000000 00000000     0        0 1247890095 1 ffff8808772ae940 23 3 24 3 7

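The source address [3ea8010a:4cc5] is [10.1.168.62:19653] in decimal, which is the local machine; the destination address [3d21010a:22a9] is [10.1.33.61:8873]. On a little-endian machine such as x86 the four address bytes appear in reverse order (0a.01.a8.3e reads as 10.1.168.62), while the port is plain hexadecimal. For background, see the discussion of host and network byte order in UNP (Unix Network Programming).
Logging in to the destination machine 10.1.33.61 shows that it has no connection with [10.1.168.62:19653] at all, meaning the server side of the connection is already gone.
That establishes the cause: the server side has dropped the connection, but the client still lists it as established (the state field 01 in /proc/net/tcp means ESTABLISHED) and keeps waiting to read from the server. The default 60 seconds is only the select() timeout; if no I/O timeout is specified for the connection, the client waits forever (according to the rsync 3.0.6 code, there is no keep-alive or similar probing on this path).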
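
The conversion can be sketched in a few lines of C. This assumes the /proc/net/tcp line was produced on a little-endian host, as it was here; decode_proc_tcp_addr is a hypothetical helper name used only for illustration:

```c
#include <stdio.h>

/* Decode one address field from /proc/net/tcp, e.g. "3ea8010a:4cc5".
 * Assumption: the file comes from a little-endian host, so the least
 * significant byte of the hex word is the first octet of the address.
 * Returns 0 on success, -1 on parse failure or if out is too small. */
static int decode_proc_tcp_addr(const char *field, char *out, size_t outlen)
{
    unsigned int addr, port;
    int n;

    if (sscanf(field, "%8x:%4x", &addr, &port) != 2)
        return -1;
    n = snprintf(out, outlen, "%u.%u.%u.%u:%u",
                 addr & 0xff, (addr >> 8) & 0xff,
                 (addr >> 16) & 0xff, (addr >> 24) & 0xff, port);
    return (n > 0 && (size_t)n < outlen) ? 0 : -1;
}
```

Fed the two fields from the line above, it yields 10.1.168.62:19653 and 10.1.33.61:8873.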
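
For comparison, a general way for a TCP client to notice a silently vanished peer is kernel-level keep-alive. The sketch below is a generic Linux technique, not something taken from the rsync source; enable_keepalive and its parameters are hypothetical names, and TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT are Linux-specific options:

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Turn on TCP keep-alive for fd (Linux-specific tuning options).
 * After idle_secs without traffic the kernel sends a probe every
 * interval_secs; after `probes` unanswered probes it marks the
 * connection as failed. Returns 0 on success, -1 on error. */
static int enable_keepalive(int fd, int idle_secs, int interval_secs, int probes)
{
    int on = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_secs, sizeof idle_secs) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval_secs, sizeof interval_secs) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probes, sizeof probes) < 0)
        return -1;
    return 0;
}
```

Once the probes go unanswered, the socket carries a pending error, so a select() on it returns instead of timing out and the following read fails rather than blocking.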

Talk is cheap, show me the code

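Out of curiosity, I read the source to see why rsync behaves this way. The code is rsync 3.0.6; a code-browsing tool gives the call relationships.
The client call path is roughly main->start_client->send_files->read_ndx_and_attrs->readfd_unbuffered->read_loop->read_timeout. The trail ends at read_timeout, shown below (the // comments are my annotations):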

// Read failures are only listed as a TODO in the comment below.
/**
 * Read from a socket with I/O timeout. Return the number of bytes
 * read. If no bytes can be read then exit, never return a number <= 0.
 *
 * TODO: If the remote shell connection fails, then current versions
 * actually report an "unexpected EOF" error here.  Since it's a
 * fairly common mistake to try to use rsh when ssh is required, we
 * should trap that: if we fail to read any data at all, we should
 * give a better explanation.  We can tell whether the connection has
 * started by looking e.g. at whether the remote version is known yet.
 */
static int read_timeout(int fd, char *buf, size_t len)
{
    int n, cnt = 0;

    io_flush(FULL_FLUSH);

    while (cnt == 0) {
        /* until we manage to read *something* */
        fd_set r_fds, w_fds;
        struct timeval tv;
        int maxfd = fd;
        int count;

        FD_ZERO(&r_fds);
        FD_ZERO(&w_fds);
        FD_SET(fd, &r_fds);  // we want to read from fd, so add it to the read set
        if (io_filesfrom_f_out >= 0) {
            int new_fd;
            if (ff_buf.len == 0) {
                if (io_filesfrom_f_in >= 0) {
                    FD_SET(io_filesfrom_f_in, &r_fds);
                    new_fd = io_filesfrom_f_in;
                } else {
                    io_filesfrom_f_out = -1;
                    new_fd = -1;
                }
            } else {
                FD_SET(io_filesfrom_f_out, &w_fds);
                new_fd = io_filesfrom_f_out;
            }
            if (new_fd > maxfd)
                maxfd = new_fd;
        }
        tv.tv_sec = select_timeout;  // select timeout; 60 seconds unless specified otherwise
        tv.tv_usec = 0;

        errno = 0;

        count = select(maxfd + 1, &r_fds, &w_fds, NULL, &tv);  // the select call

        // select returns 0 on timeout and -1 on error; both end up in this branch
        if (count <= 0) {
            if (errno == EBADF) {
                defer_forwarding_messages = 0;
                exit_cleanup(RERR_SOCKETIO);
            }
            check_timeout(); // on timeout the process must exit inside check_timeout(); otherwise the continue below loops forever
            continue;
        }
// the rest of the function is omitted

The check_timeout function (the key part):

static void check_timeout(void)
{
    time_t t;

    if (!io_timeout || ignore_timeout) // note: without --timeout on the client, io_timeout defaults to 0, so this returns immediately
        return;

    if (!last_io_in) {
        last_io_in = time(NULL);
        return;
    }

    t = time(NULL);

    if (t - last_io_in >= io_timeout) {
        if (!am_server && !am_daemon) {
            rprintf(FERROR, "io timeout after %d seconds -- exiting\n",
                (int)(t-last_io_in));
        }
        exit_cleanup(RERR_TIMEOUT);
    }
}

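To summarize:
- The client's select() times out after 60 seconds by default; after each timeout, check_timeout() tests whether the connection has exceeded its I/O timeout and, if so, calls exit_cleanup to exit.
- If the client was not given --timeout, io_timeout is 0, check_timeout() returns immediately, and the program keeps timing out in select() forever.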
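
The interaction between the two timeouts can be reproduced in isolation. The following is a simplified sketch of the pattern, not rsync's code: idle_expired, wait_readable and the max_rounds cap (added so the demonstration terminates) are all hypothetical names:

```c
#include <sys/select.h>
#include <time.h>

enum wait_result { WAIT_READABLE, WAIT_IO_TIMEOUT, WAIT_STILL_STUCK, WAIT_ERROR };

/* Same idea as the early return in check_timeout():
 * io_timeout == 0 means "never give up". */
static int idle_expired(time_t now, time_t last_io, int io_timeout)
{
    if (io_timeout == 0)
        return 0;
    return now - last_io >= io_timeout;
}

/* Wait for fd to become readable, polling with select() every
 * select_ms milliseconds. max_rounds bounds the loop for this demo;
 * the pattern described above has no such bound. */
static enum wait_result wait_readable(int fd, long select_ms,
                                      int io_timeout, int max_rounds)
{
    time_t last_io = time(NULL);
    int round;

    for (round = 0; round < max_rounds; round++) {
        fd_set r_fds;
        struct timeval tv;
        int count;

        FD_ZERO(&r_fds);
        FD_SET(fd, &r_fds);
        tv.tv_sec = select_ms / 1000;
        tv.tv_usec = (select_ms % 1000) * 1000;

        count = select(fd + 1, &r_fds, NULL, NULL, &tv);
        if (count > 0)
            return WAIT_READABLE;
        if (count < 0)
            return WAIT_ERROR;      /* simplification: EINTR is not retried */
        /* count == 0: select timed out, now apply the I/O timeout */
        if (idle_expired(time(NULL), last_io, io_timeout))
            return WAIT_IO_TIMEOUT;
    }
    return WAIT_STILL_STUCK;        /* io_timeout never fired */
}
```

With io_timeout set to 0 and nothing arriving on fd, every round ends in a select() timeout and the function can only leave through the artificial max_rounds cap, which corresponds to the hang seen in the strace output.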

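Has the latest version fixed this? rsync-3.1.3

Judging from the call stack of rsync-3.1.3, when checking for a timeout the newer rsync tries to send a keep-alive to the server. If the read/write succeeds, the server is still alive and perform_io() updates the timestamp, so the later checks in check_timeout() do not treat the connection as timed out.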
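
The idea of probing on timeout can be sketched as follows. This is only an illustration of the approach, not rsync's implementation: probe_peer is a hypothetical name and the one-byte payload is a placeholder rather than rsync's wire format:

```c
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>

/* Send a one-byte probe on fd. On success refresh *last_io and
 * return 0; on failure return -1 and leave *last_io untouched so
 * that the regular timeout check can fire. */
static int probe_peer(int fd, time_t *last_io)
{
    const char probe = 0;   /* placeholder payload, not rsync's format */

    /* MSG_NOSIGNAL: report a dead peer as an error return
     * instead of raising SIGPIPE. */
    if (send(fd, &probe, 1, MSG_NOSIGNAL) != 1)
        return -1;
    *last_io = time(NULL);
    return 0;
}
```

One caveat with this sketch: on a real TCP connection whose peer has vanished, the first send may still succeed because the kernel only queues the data, and the failure shows up on a later call once the peer answers with a reset or retransmissions give up.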

Summary and solutions

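In older versions of rsync, if the server drops the connection for some reason (for example, the server crashes or the process is killed) while the client is reading from it, the client waits in a loop and appears to hang. This is the problem investigated here.
The cause is that older rsync does not handle this case well: it relies purely on a timeout and never probes whether the server is alive or the connection is still valid.
The newer version's approach: after select() times out, it tries to write a keep-alive message to the socket to test whether I/O on the socket still works; if it does, it refreshes the socket's timestamp, which extends the connection's lifetime.
Possible solutions:
- Upgrade rsync to a newer version, especially on the client.
- Pass --timeout to the client to set an I/O timeout, for example --timeout=300.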