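When using Ansible for bulk configuration management in a private cloud, you will often find that intranet hosts are isolated and can only be reached over SSH through a bastion host, which is a thorny problem for Ansible. This is as much a management issue as a technical one, but here we discuss only the technical side: in this situation we can use the ProxyCommand option provided by OpenSSH to proxy SSH connections. When I searched online earlier, I could not find an article that covers this approach comprehensively, so below I walk through the concrete setup with Ansible and some of the deep pitfalls I ran into.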

What is OpenSSH ProxyCommand


ProxyCommand is a native OpenSSH feature. It is documented in man ssh_config; the original text reads:

ProxyCommand

    Specifies the command to use to connect to the server.  The command string extends to the end of the line, and is executed using the user's shell ‘exec’ directive to avoid a lingering shell process.

    Arguments to ProxyCommand accept the tokens described in the TOKENS section.  The command can be basically anything, and should read from its standard input and write to its standard output.  It should eventually connect an sshd(8) server running on some machine, or execute sshd -i somewhere.  Host key management will be done using the HostName of the host being connected (defaulting to the name typed by the user).  Setting the command to none disables this option entirely.  Note that CheckHostIP is not available for connects with a proxy command.

    This directive is useful in conjunction with nc(1) and its proxy support.  For example, the following directive would connect via an HTTP proxy at 192.0.2.0:
    
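        ProxyCommand /usr/bin/nc -X connect -x 192.0.2.0:8080 %h %p

ProxyCommand specifies the command used to connect to the server. The command string extends to the end of the line and is executed through the user's shell with exec, so the command replaces the shell process instead of leaving a lingering shell behind.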

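The command accepts the tokens described in the TOKENS section of the man page, which are expanded to their actual values (for example, %h becomes the target host name and %p the port). The command can be practically anything, as long as it reads from standard input and writes to standard output.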

Note that CheckHostIP is not available for ssh connections made through a ProxyCommand.

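The example in the documentation uses nc as the SSH proxy, but nc is an external command. If nc is not installed on the host, the -W option of ssh itself can do the same job. For bulk configuration management with Ansible, I recommend sticking to what ssh supports natively to keep dependencies down.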

From the ssh manual, man ssh:

-W host:port
   Requests that standard input and output on the client be forwarded to host on port over the secure channel.  Implies -N, -T, ExitOnForwardFailure and ClearAllForwardings, though these can be overridden in the configuration file or using -o command line options.

-W host:port forwards the client's standard input and output to the given host and port over the secure channel, which is exactly what ProxyCommand needs.
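
To make the pairing of ProxyCommand and -W concrete, here is a minimal shell sketch that only assembles and prints the corresponding ssh command line; nothing is executed. The helper names proxy_command and ssh_via_bastion are made up for illustration, and the addresses are the example values used later in this article.

```shell
# Sketch only: assemble an ssh command line that reaches a target through a
# bastion host by using "ssh -W" as the ProxyCommand. Nothing is executed.

# Print the ProxyCommand value; %h and %p are left for ssh to expand.
proxy_command() {
    bastion="$1"      # user@host of the bastion (example value)
    port="${2:-22}"   # SSH port of the bastion, defaults to 22
    printf 'ssh -W %%h:%%p -p %s %s' "$port" "$bastion"
}

# Print the complete ssh command for a target behind the bastion.
ssh_via_bastion() {
    target="$1"
    bastion="$2"
    port="${3:-22}"
    printf "ssh -o ProxyCommand='%s' %s\n" \
        "$(proxy_command "$bastion" "$port")" "$target"
}

ssh_via_bastion root@192.168.10.1 bastion_deploy@172.10.10.1 60022
```

The printed line can be pasted into a terminal to try the hop by hand before any Ansible configuration is written.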

Using OpenSSH ProxyCommand with Ansible


Method 1: the native OpenSSH way

Configure the local ~/.ssh/config file on the host that runs Ansible.

Host *
  ForwardAgent yes
  HashKnownHosts yes
  Compression yes

Host bastion
   HostName 172.10.10.1
   ProxyCommand none
   IdentityFile /home/ansible-playbook/ssh/rsa/deploy_blj
   BatchMode yes
   User bastion_deploy
   Port 60022
   
Host 192.168.10.1
   HostName 192.168.10.1
   ServerAliveInterval 60
   TCPKeepAlive        yes
   ProxyCommand        ssh bastion -W %h:%p
   ControlMaster       auto
   User root
   Port 22

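The bastion entry holds the settings for the bastion host, including the key used to authenticate to it. Because login from the bastion host to the target host 192.168.10.1 is passwordless, no key is needed for that hop.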
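
Before involving Ansible, it can help to confirm that ssh resolves this configuration the way you expect. The sketch below assumes an OpenSSH client that supports ssh -G, which prints the effective configuration for a host and exits without connecting. The helper name effective_ssh_settings is hypothetical.

```shell
# Sketch only: show the settings ssh would use for a host, without connecting.
# Relies on "ssh -G", which prints the resolved configuration as
# lowercase "keyword value" lines.
effective_ssh_settings() {
    # Keep only the keywords that matter for the bastion setup.
    ssh -G "$1" | grep -E '^(hostname|user|port|proxycommand|controlmaster) '
}
```

Running effective_ssh_settings 192.168.10.1 on the Ansible host should include a proxycommand line; if it does not, the Host block probably did not match the name you passed.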

How it works:

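  • The local machine first opens an SSH connection to the bastion host through ProxyCommand and uses that connection as a proxy;
  • The bastion host then connects to the target host 192.168.10.1; since that login is passwordless, no key is required;
  • The local host thereby gets an indirect ssh connection to the target host.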

Verifying with Ansible

Edit the [ssh_connection] section of the Ansible configuration file /etc/ansible/ansible.cfg:

ssh_args = -C -o ControlMaster=auto -F /root/.ssh/config

The relevant files are laid out as follows:

/home/ansible-playbook/
├── ansible-common-command
│   └── ping.yaml     # tests connectivity to the target hosts
├── inventory
│   └── bastion-dev   # target host information
└── ssh
    └── rsa
        └── deploy_blj     # bastion host authentication key

bastion-dev holds the host information:

[bastion-test]
192.168.10.1

ping.yaml is the test playbook:

---
    - hosts: "{{ host }}"
      remote_user: root
      tasks:
         - name: test ping
           ping:

Test connectivity with Ansible's ping module, adding -vvv to see the debug output. The result:

[root@master ansible-playbook]# ansible-playbook -i inventory/bastion-dev ansible-common-command/ping.yaml -e "host=bastion-test" -vvv

PLAYBOOK: ping.yaml ************************************************************************************************************************************
1 plays in ansible-common-command/ping.yaml

PLAY [bastion-test] ************************************************************************************************************************************
META: ran handlers

TASK [test ping] ***************************************************************************************************************************************
task path: /home/ansible-playbook/ansible-common-command/ping.yaml:5
Using module file /usr/lib/python2.7/site-packages/ansible/modules/system/ping.py
<192.168.10.1> ESTABLISH SSH CONNECTION FOR USER: root
<192.168.10.1> SSH: EXEC ssh -C -o ControlMaster=auto -F /root/.ssh/config -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 192.168.10.1 '/bin/sh -c '"'"'/usr/bin/python && sleep 0'"'"''
<192.168.10.1> (0, '\n{"invocation": {"module_args": {"data": "pong"}}, "ping": "pong"}\n', 'Killed by signal 1.\r\n')
ok: [192.168.10.1] => {
    "changed": false,
    "invocation": {
        "module_args": {
            "data": "pong"
        }
    },
    "ping": "pong"
}
META: ran handlers
META: ran handlers

PLAY RECAP *********************************************************************************************************************************************
192.168.10.1               : ok=1    changed=0    unreachable=0    failed=0

The test succeeds.

Method 2: Ansible's ansible_ssh_common_args variable

Edit the [ssh_connection] section of the Ansible configuration file /etc/ansible/ansible.cfg:

ssh_args = -C -o ControlMaster=auto -o HashKnownHosts=yes -o Compression=yes -o ForwardAgent=yes

First way to write the inventory file

bastion-dev holds the host information:

[gatewayd:children]
bastion-test

[gatewayd:vars]
ansible_ssh_common_args='-o ProxyCommand="ssh -i /home/ansible-playbook/ssh/rsa/deploy_blj -W %h:%p  bastion_deploy@172.10.10.1 -p 60022"'

[bastion-test]
192.168.10.1

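Host groups and group variables give every host in the group the same ssh proxy. This approach does not need the local ssh configuration file ~/.ssh/config; everything is defined in the inventory file, which makes the setup more flexible.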
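
For a quick one-off check, the same ProxyCommand can also be passed on the command line with the --ssh-common-args option of the ansible ad-hoc command. The sketch below is an illustration under the assumption that the inventory does not already set ansible_ssh_common_args; the function name ping_via_bastion is hypothetical.

```shell
# Sketch only: run the ping module through a bastion host without touching
# the inventory, by passing the proxy on the command line.
ping_via_bastion() {
    inventory="$1"   # inventory file, e.g. inventory/bastion-dev
    pattern="$2"     # host or group pattern, e.g. bastion-test
    proxy="$3"       # complete proxy command, %h and %p left for ssh to expand
    ansible "$pattern" -i "$inventory" -m ping \
        --ssh-common-args "-o ProxyCommand=\"$proxy\""
}
```

With the values from this article, the call would look like: ping_via_bastion inventory/bastion-dev bastion-test "ssh -i /home/ansible-playbook/ssh/rsa/deploy_blj -W %h:%p bastion_deploy@172.10.10.1 -p 60022".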

Again, test connectivity with Ansible's ping module, adding -vvv to see the debug output. The result:

[root@master ansible-playbook]# ansible-playbook -i inventory/bastion-dev ansible-common-command/ping.yaml -e "host=bastion-test" -vvv

PLAYBOOK: ping.yaml ************************************************************************************************************************************
1 plays in ansible-common-command/ping.yaml

PLAY [bastion-test] ************************************************************************************************************************************
META: ran handlers

TASK [test ping] ***************************************************************************************************************************************
task path: /home/ansible-playbook/ansible-common-command/ping.yaml:5
Using module file /usr/lib/python2.7/site-packages/ansible/modules/system/ping.py
<192.168.10.1> ESTABLISH SSH CONNECTION FOR USER: root
<192.168.10.1> SSH: EXEC ssh -C -o ControlMaster=auto -o HashKnownHosts=yes -o Compression=yes -o ForwardAgent=yes -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 -o 'ProxyCommand=ssh -i /home/ansible-playbook/ssh/rsa/deploy_blj -W %h:%p  bastion_deploy@172.10.10.1 -p 60022' 192.168.10.1 '/bin/sh -c '"'"'/usr/bin/python && sleep 0'"'"''
<192.168.10.1> (0, '\n{"invocation": {"module_args": {"data": "pong"}}, "ping": "pong"}\n', 'Killed by signal 1.\r\n')
ok: [192.168.10.1] => {
    "changed": false,
    "invocation": {
        "module_args": {
            "data": "pong"
        }
    },
    "ping": "pong"
}
META: ran handlers
META: ran handlers

PLAY RECAP *********************************************************************************************************************************************
192.168.10.1               : ok=1    changed=0    unreachable=0    failed=0

This test succeeds as well.

Second way to write the inventory file

If there are many inventory files, the approach above means repeating ansible_ssh_common_args in every one of them. The official Ansible FAQ gives an example that avoids this.

In the same directory as the inventory file bastion-dev, create a file group_vars/gatewayd.yml to hold the variables of the gatewayd group, with the following content:

# file /home/ansible-playbook/inventory/group_vars/gatewayd.yml
ansible_ssh_common_args: '-o ProxyCommand="ssh -i /home/ansible-playbook/ssh/rsa/deploy_blj -W %h:%p  bastion_deploy@172.10.10.1 -p 60022"'

Now every inventory file can use the gatewayd group variables directly. Change bastion-dev as follows:

# file /home/ansible-playbook/inventory/bastion-dev
[gatewayd:children]
bastion-test

[bastion-test]
192.168.10.1

Test connectivity to the target host:

[root@master ansible-playbook]# ansible-playbook -i inventory/bastion-dev ansible-common-command/ping.yaml -e "host=bastion-test" -vvv

PLAYBOOK: ping.yaml ************************************************************************************************************************************
1 plays in ansible-common-command/ping.yaml

PLAY [bastion-test] ************************************************************************************************************************************
META: ran handlers

TASK [test ping] ***************************************************************************************************************************************
task path: /home/ansible-playbook/ansible-common-command/ping.yaml:5
Using module file /usr/lib/python2.7/site-packages/ansible/modules/system/ping.py
<192.168.10.1> ESTABLISH SSH CONNECTION FOR USER: root
<192.168.10.1> SSH: EXEC ssh -C -o ControlMaster=auto -o HashKnownHosts=yes -o Compression=yes -o ForwardAgent=yes -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 -o 'ProxyCommand=ssh -i /home/ansible-playbook/ssh/rsa/deploy_blj -W %h:%p  bastion_deploy@172.10.10.1 -p 60022' 192.168.10.1 '/bin/sh -c '"'"'/usr/bin/python && sleep 0'"'"''
<192.168.10.1> (0, '\n{"invocation": {"module_args": {"data": "pong"}}, "ping": "pong"}\n', 'Killed by signal 1.\r\n')
ok: [192.168.10.1] => {
    "changed": false,
    "invocation": {
        "module_args": {
            "data": "pong"
        }
    },
    "ping": "pong"
}
META: ran handlers
META: ran handlers

PLAY RECAP *********************************************************************************************************************************************
192.168.10.1               : ok=1    changed=0    unreachable=0    failed=0

Each of these approaches has its trade-offs; choose according to your own environment.

Problems encountered when using ProxyCommand with Ansible


Reproducing the SSH multiplexing problem

This section covers an ssh connection problem you are likely to hit when using Ansible's ansible_ssh_common_args variable. It mainly shows up when the [ssh_connection] section of /etc/ansible/ansible.cfg is left at its defaults.

The test method is the same as above, using the ping module. Let me describe the symptoms first:

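  • After the inventory is configured, the first connectivity test succeeds;
  • Run a second test immediately afterwards and it fails;
  • Wait a while and test again and it succeeds, but one more test right after that fails again.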

First look at the [ssh_connection] section in the default /etc/ansible/ansible.cfg:

[ssh_connection]

# ssh arguments to use
# Leaving off ControlPersist will result in poor performance, so use
# paramiko on older platforms rather than removing it, -C controls compression use
#ssh_args = -C -o ControlMaster=auto -o ControlPersist=60s

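All the options are commented out, so Ansible falls back to its built-in default ssh_args, which has the same value as the commented-out line. The debug output of the tests accordingly shows -o ControlPersist=60s on the ssh command line.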

First, reproduce the problem. First test, result: connectivity OK

[root@master ansible-playbook]# ansible-playbook -i inventory/bastion-dev ansible-common-command/ping.yaml -e "host=bastion-test" -vvv

PLAYBOOK: ping.yaml ************************************************************************************************************************************
1 plays in ansible-common-command/ping.yaml

PLAY [bastion-test] ************************************************************************************************************************************
META: ran handlers

TASK [test ping] ***************************************************************************************************************************************
task path: /home/ansible-playbook/ansible-common-command/ping.yaml:5
Using module file /usr/lib/python2.7/site-packages/ansible/modules/system/ping.py
<192.168.10.1> ESTABLISH SSH CONNECTION FOR USER: root
<192.168.10.1> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 -o 'ProxyCommand=ssh -i /home/ansible-playbook/ssh/rsa/deploy_blj -W %h:%p  bastion_deploy@172.10.10.1 -p 60022' -o ControlPath=/root/.ansible/cp/54ec251972 192.168.10.1 '/bin/sh -c '"'"'/usr/bin/python && sleep 0'"'"''
<192.168.10.1> (0, '\n{"invocation": {"module_args": {"data": "pong"}}, "ping": "pong"}\n', '')
ok: [192.168.10.1] => {
    "changed": false,
    "invocation": {
        "module_args": {
            "data": "pong"
        }
    },
    "ping": "pong"
}
META: ran handlers
META: ran handlers

PLAY RECAP *********************************************************************************************************************************************
192.168.10.1               : ok=1    changed=0    unreachable=0    failed=0

Immediately run a second test. Result: "SSH Error: data could not be sent to remote host \"192.168.10.1\". Make sure this host can be reached over ssh"

[root@master ansible-playbook]# ansible-playbook -i inventory/bastion-dev ansible-common-command/ping.yaml -e "host=bastion-test" -vvv

PLAYBOOK: ping.yaml ************************************************************************************************************************************
1 plays in ansible-common-command/ping.yaml

PLAY [bastion-test] ************************************************************************************************************************************
META: ran handlers

TASK [test ping] ***************************************************************************************************************************************
task path: /home/ansible-playbook/ansible-common-command/ping.yaml:5
Using module file /usr/lib/python2.7/site-packages/ansible/modules/system/ping.py
<192.168.10.1> ESTABLISH SSH CONNECTION FOR USER: root
<192.168.10.1> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 -o 'ProxyCommand=ssh -i /home/ansible-playbook/ssh/rsa/deploy_blj -W %h:%p  bastion_deploy@172.10.10.1 -p 60022' -o ControlPath=/root/.ansible/cp/54ec251972 192.168.10.1 '/bin/sh -c '"'"'/usr/bin/python && sleep 0'"'"''
fatal: [192.168.10.1]: UNREACHABLE! => {
    "changed": false,
    "msg": "SSH Error: data could not be sent to remote host \"192.168.10.1\". Make sure this host can be reached over ssh",
    "unreachable": true
}
        to retry, use: --limit @/home/ansible-playbook/ansible-common-command/ping.retry

PLAY RECAP *********************************************************************************************************************************************
192.168.10.1               : ok=0    changed=0    unreachable=1    failed=0

Then wait two to three minutes and run a third test. Result: connectivity OK

[root@master ansible-playbook]# ansible-playbook -i inventory/bastion-dev ansible-common-command/ping.yaml -e "host=bastion-test" -vvv

PLAYBOOK: ping.yaml ************************************************************************************************************************************
1 plays in ansible-common-command/ping.yaml

PLAY [bastion-test] ************************************************************************************************************************************
META: ran handlers

TASK [test ping] ***************************************************************************************************************************************
task path: /home/ansible-playbook/ansible-common-command/ping.yaml:5
Using module file /usr/lib/python2.7/site-packages/ansible/modules/system/ping.py
<192.168.10.1> ESTABLISH SSH CONNECTION FOR USER: root
<192.168.10.1> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 -o 'ProxyCommand=ssh -i /home/ansible-playbook/ssh/rsa/deploy_blj -W %h:%p  bastion_deploy@172.10.10.1 -p 60022' -o ControlPath=/root/.ansible/cp/54ec251972 192.168.10.1 '/bin/sh -c '"'"'/usr/bin/python && sleep 0'"'"''
<192.168.10.1> (0, '\n{"invocation": {"module_args": {"data": "pong"}}, "ping": "pong"}\n', '')
ok: [192.168.10.1] => {
    "changed": false,
    "invocation": {
        "module_args": {
            "data": "pong"
        }
    },
    "ping": "pong"
}
META: ran handlers
META: ran handlers

PLAY RECAP *********************************************************************************************************************************************
192.168.10.1               : ok=1    changed=0    unreachable=0    failed=0

Root cause

Why does this happen?

The debug output shows the command Ansible runs during execution; the EXEC line is the ssh invocation that makes the connection:

ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 -o 'ProxyCommand=ssh -i /home/ansible-playbook/ssh/rsa/deploy_blj -W %h:%p  bastion_deploy@172.10.10.1 -p 60022' -o ControlPath=/root/.ansible/cp/54ec251972 192.168.10.1

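Because Ansible is a black box to us, we run this command by hand, with -v added for debugging, to see whether the local host actually reaches the target host on the second attempt.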

First attempt

[root@master ansible-playbook]# ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 -o 'ProxyCommand=ssh -i /home/ansible-playbook/ssh/rsa/deploy_blj -W %h:%p  bastion_deploy@172.10.10.1 -p 60022' -o ControlPath=/root/.ansible/cp/54ec251972 192.168.10.1 -v
OpenSSH_7.4p1, OpenSSL 1.0.2k-fips  26 Jan 2017
debug1: Reading configuration data /root/.ssh/config
debug1: /root/.ssh/config line 1: Applying options for *
debug1: /root/.ssh/config line 15: Applying options for 192.168.10.1
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 58: Applying options for *
debug1: auto-mux: Trying existing master
debug1: Control socket "/root/.ansible/cp/54ec251972" does not exist
debug1: Executing proxy command: exec ssh -i /home/ansible-playbook/ssh/rsa/deploy_blj -W 192.168.10.1:22  bastion_deploy@172.10.10.1 -p 60022
debug1: permanently_set_uid: 0/0
debug1: permanently_drop_suid: 0
debug1: identity file /root/.ssh/id_rsa type 1
debug1: key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_rsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_dsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_dsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_ecdsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_ecdsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_ed25519 type -1
debug1: key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_ed25519-cert type -1
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_7.4
debug1: Remote protocol version 2.0, remote software version OpenSSH_7.2p2 usm-0.6.3
debug1: match: OpenSSH_7.2p2 usm-0.6.3 pat OpenSSH* compat 0x04000000
debug1: Authenticating to 192.168.10.1:22 as 'root'
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: algorithm: curve25519-sha256@libssh.org
debug1: kex: host key algorithm: ssh-rsa
debug1: kex: server->client cipher: aes128-ctr MAC: hmac-sha1 compression: zlib@openssh.com
debug1: kex: client->server cipher: aes128-ctr MAC: hmac-sha1 compression: zlib@openssh.com
debug1: kex: curve25519-sha256@libssh.org need=20 dh_need=20
debug1: kex: curve25519-sha256@libssh.org need=20 dh_need=20
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
debug1: Server host key: ssh-rsa SHA256:Bu7vaKlAjLi6MXgzRY2pk3AAdzAZQPxPls/mKaq6WhI
debug1: Host '192.168.10.1' is known and matches the RSA host key.
debug1: Found key in /root/.ssh/known_hosts:10
debug1: rekey after 4294967296 blocks
debug1: SSH2_MSG_NEWKEYS sent
debug1: expecting SSH2_MSG_NEWKEYS
debug1: SSH2_MSG_NEWKEYS received
debug1: rekey after 4294967296 blocks
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug1: Enabling compression at level 6.
debug1: Authentication succeeded (none).
Authenticated to 192.168.10.1 (via proxy).
debug1: setting up multiplex master socket
debug1: channel 0: new [/root/.ansible/cp/54ec251972]
debug1: control_persist_detach: backgrounding master process
debug1: forking to background
debug1: Entering interactive session.
debug1: pledge: id
debug1: multiplexing control connection
debug1: channel 1: new [mux-control]
debug1: channel 2: new [client-session]
debug1: Requesting authentication agent forwarding.
debug1: Sending environment.
debug1: Sending env LANG = zh_CN.UTF-8
debug1: mux_client_request_session: master session id: 2
Last login: Sat May 16 17:38:28 2020 from 192.10.253.83
[root@localhost ~]# debug1: Received SSH2_MSG_UNIMPLEMENTED for 10

We are logged in to the target host 192.168.10.1, with a pile of debug output along the way. Exit and run the second attempt.

Second attempt

" Press <j>/<k> or <DOWN>/<UP> to move and then <ENTER> for login.
" Press </> for search and then <n>/<N> to go to next and previous.
" Press <u>/<p> or <PageUp>/<PageDown> for page turning.
" Press <:[ssh|telnet|rlogin] {user@host:port}> for login.
" Press <:{num}> for goto line and then Type 'q' for quit.
" Press <r>/<q> for reload and quit.
"=
  ShellCommand

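In the second attempt the shell lands on the bastion host and stops there, without logging in to the target host. Exit the bastion host and look at the debug output.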

[root@master ansible-playbook]# ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 -o 'ProxyCommand=ssh -i /home/ansible-playbook/ssh/rsa/deploy_blj -W %h:%p  bastion_deploy@172.10.10.1 -p 60022' -o ControlPath=/root/.ansible/cp/54ec251972 192.168.10.1 -v
OpenSSH_7.4p1, OpenSSL 1.0.2k-fips  26 Jan 2017
debug1: Reading configuration data /root/.ssh/config
debug1: /root/.ssh/config line 1: Applying options for *
debug1: /root/.ssh/config line 15: Applying options for 192.168.10.1
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 58: Applying options for *
debug1: auto-mux: Trying existing master
debug1: multiplexing control connection
debug1: channel 1: new [mux-control]
debug1: channel 2: new [client-session]
debug1: Requesting authentication agent forwarding.
debug1: Sending environment.
debug1: Sending env LANG = zh_CN.UTF-8
debug1: mux_client_request_session: master session id: 2

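The second connection goes through no authentication at all. It jumps straight to "debug1: multiplexing control connection" and "debug1: mux_client_request_session: master session id: 2", which shows that it reused the previous SSH connection. In other words, the ssh connection was not torn down when we exited the target host. Now check the connections from the local host to the bastion host's ssh port: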

[root@master ansible-playbook]# netstat -anp | grep 172.10.10.1:60022
tcp        0      0 10.100.100.91:57628     172.10.10.1:60022      ESTABLISHED 29757/ssh

After exiting, the ssh connection is still there. The second connection reused the previous session, and ssh ended up stuck at the bastion host.

The OpenSSH Cookbook has a detailed description of the multiplexing feature: https://en.wikibooks.org/wiki/OpenSSH/Cookbook/Multiplexing#Errors_Preventing_Multiplexing

In short, multiplexing means reusing one ssh connection for multiple sessions.
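
The shared master connection can also be inspected and closed by hand through its control socket, using the -O option of ssh (check and exit are two of its control commands). The following is a minimal sketch; the function names mux_check and mux_exit are made up for illustration, and the socket path in the usage note is the one from the debug output in this article.

```shell
# Sketch only: talk to the multiplexing master behind a control socket.

# Ask the master process whether it is still running.
mux_check() {
    control_path="$1"   # path of the control socket
    host="$2"           # destination the master was opened for
    ssh -O check -o ControlPath="$control_path" "$host"
}

# Ask the master process to exit, which closes the shared connection.
mux_exit() {
    control_path="$1"
    host="$2"
    ssh -O exit -o ControlPath="$control_path" "$host"
}
```

For example, mux_check /root/.ansible/cp/54ec251972 192.168.10.1 reports whether a master is still alive, and mux_exit with the same arguments closes it so that the next connection starts from scratch.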

Let's analyze what happened:

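  • The first connection succeeds. Because ssh connection multiplexing is enabled on the local host, the ssh connection does not fully shut down after we exit.
  • The second connection directly reuses the previous one, so reaching the bastion host is no problem, but the connection from the bastion host to the target host appears to have been dropped.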

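Here is a hypothesis: for some reason specific to the bastion host, its ssh connection to the target host is closed as soon as we exit, so the connection degrades from local host to target host into local host to bastion host.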

The hypothesis is easy to verify: log in to the target host and compare its ssh connections from the bastion host before and after the disconnect.

While the connection is up:

[root@localhost ~]# netstat -anp | grep  172.10.10.1
tcp        0      0 192.168.10.1:22          172.10.10.1:45631     ESTABLISHED 102048/sshd: root@p
tcp        0      0 192.168.10.1:22          172.10.10.1:5667      ESTABLISHED 102111/sshd: root@p

After disconnecting:

[root@localhost ~]# netstat -anp | grep  172.10.10.1
tcp        0      0 192.168.10.1:22          172.10.10.1:45631     ESTABLISHED 102048/sshd: root@p

After the disconnect, one of the connections from the bastion host is gone.

Solution

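The bastion host is a commercial product, so there is no way to fix this inside the bastion itself; the only option is to work from the ssh client side. According to the documentation, disabling the reuse of ssh connections should solve the problem.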

The documentation contains this passage:

ControlPersist can be used in conjunction with ControlMaster. If ControlPersist is set to 'yes', then it will leave the master connection open in the background to accept new connections until either killed explicitly or closed with -O or ends at a pre-defined timeout. If ControlPersist is set to a time, then it will leave the master connection open for the designated time or until the last multiplexed session is closed, whichever is longer.

In essence, ControlPersist controls how long a shared ssh session is kept alive while idle.

Now look again at the command Ansible uses for the remote connection:

ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 -o 'ProxyCommand=ssh -i /home/ansible-playbook/ssh/rsa/deploy_blj -W %h:%p  bastion_deploy@172.10.10.1 -p 60022' -o ControlPath=/root/.ansible/cp/54ec251972 192.168.10.1

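It contains -o ControlPersist=60s. Because ssh_args is not set in our ansible.cfg, Ansible applies its built-in default, which keeps the multiplexed master connection alive for 60s after it becomes idle. Once the cause is known, the fix is simple. Following the principle of minimal change, drop ControlPersist from ssh_args so that ssh falls back to its own default, ControlPersist=no. Do not set it to 0: according to ssh_config, ControlPersist=0 keeps the master connection open indefinitely.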

The simplest way is to edit the [ssh_connection] section of /etc/ansible/ansible.cfg:

[ssh_connection]

# ssh arguments to use
# Leaving off ControlPersist will result in poor performance, so use
# paramiko on older platforms rather than removing it, -C controls compression use
ssh_args = -C -o ControlMaster=auto
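
To try the change for a single run without editing ansible.cfg, the same value can be supplied through the ANSIBLE_SSH_ARGS environment variable, which is the environment counterpart of ssh_args. This is a sketch under that assumption; the function name run_without_persist and the ANSIBLE_PLAYBOOK_BIN override are hypothetical and exist only to make the example easy to dry-run.

```shell
# Sketch only: run a playbook once with ControlPersist left out of ssh_args.
run_without_persist() {
    # ANSIBLE_PLAYBOOK_BIN is a made-up override for dry runs; by default
    # the real ansible-playbook command is used.
    ANSIBLE_SSH_ARGS='-C -o ControlMaster=auto' \
        "${ANSIBLE_PLAYBOOK_BIN:-ansible-playbook}" "$@"
}
```

A call such as run_without_persist -i inventory/bastion-dev ansible-common-command/ping.yaml -e "host=bastion-test" -vvv should then show an EXEC line without -o ControlPersist=60s.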

Bingo, problem solved.

References