Linux - Systemd & Syncer: A Love-Hate Relationship - AirPlan

0x00 Background

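Syncer data synchronization can be interrupted by network latency, packet loss, or excessive load on the upstream or downstream. These cases can be recovered simply by restarting the process.
Initially we used the small supervise tool, the same one TiDB uses. Its problem is that it cannot set a retry limit, so it restarts the process indefinitely, which does not suit a job that links an upstream to a downstream for synchronization.
In some Syncer scenarios, such as unsupported DDL / DML, no retry can ever succeed. In those cases the process needs to stop and trigger an alert so that someone can fix it manually.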

0x01 Case

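Background
- Syncer exits automatically when it hits any of the conditions below. We currently use the Supervise daemon to bring the service back up after the process exits, but Supervise cannot limit the number of restarts, so it can never let the process stop for good
  - Network glitch (this one should be retried and restarted)
  - tidb or tikv is too busy, resulting in server is busy
  - Unsupported DDL
  - Unsupported DML syntax
Requirements
- The daemon tool needs a maximum-retry feature
  - Candidates are Supervisord and Systemd
    - Supervisord can do this with its startretries parameter, but it requires deploying the Supervisord binary
      - Quoted from liyangliang
    - Systemd ships with CentOS, so nothing extra needs to be deployed. It has no official parameter equivalent to startretries, but a Google search turned up another way to get the same behavior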
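
To make the requirement concrete, here is a minimal sketch of what "a supervisor with a maximum number of starts" means. The function name supervise_limited and its arguments are hypothetical and are not part of Supervise, Supervisord, or Systemd. Note that Systemd itself does not count starts this way; it limits the start rate within a time window, as configured in the next section.

```shell
#!/bin/bash
# Hypothetical helper (illustration only): start a command at most
# max_starts times, waiting `delay` seconds between starts, then give up.
supervise_limited() {
    local max_starts=$1 delay=$2
    shift 2
    local starts=0
    while [ "$starts" -lt "$max_starts" ]; do
        starts=$((starts + 1))
        "$@"
        echo "start ${starts}/${max_starts} exited with status $?" >&2
        # No need to wait after the final start.
        if [ "$starts" -lt "$max_starts" ]; then
            sleep "$delay"
        fi
    done
    # The start budget is used up: stay stopped so that an alert can fire.
    return 1
}

# Example (not executed here):
#   supervise_limited 5 15 /root/systemd-syncer/run_syncer.sh
```

A supervisor of this shape keeps recovering from transient exits such as a network glitch, yet still ends in a stopped state when every start fails, which is the state alerting can pick up.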

Systemd configuration file

cat /etc/systemd/system/syncer-port.service
- Registers the systemd service
- After modifying the service file, run systemctl daemon-reload

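Service

RestartSec: number of seconds to wait before automatically restarting the service
StartLimitInterval=, StartLimitBurst=
- Limit how often the service may be started. The default is no more than 5 starts within 10 seconds (StartLimitInterval=10s StartLimitBurst=5)
- Quoted from jinbuguo
- man systemd
- systemd conf
- Quoted from CSDN
- Quoted from Ruan Yifeng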

[Unit]
Description=syncer service
After=syslog.target network.target remote-fs.target nss-lookup.target

[Service]
LimitNOFILE=1000000
User=root
ExecStart=/root/systemd-syncer/run_syncer.sh
Restart=always
RestartSec=15s
StartLimitInterval=300
StartLimitBurst=5

[Install]
WantedBy=multi-user.target
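
The reference list at the end of this post quotes a rule of thumb: StartLimitInterval must be greater than RestartSec * StartLimitBurst, otherwise the service is restarted indefinitely. The sketch below checks that rule for the values in this unit file. The function name start_limit_reachable is hypothetical, and the check only encodes the quoted rule (it ignores how long the process runs before exiting).

```shell
#!/bin/bash
# Returns 0 when StartLimitInterval > RestartSec * StartLimitBurst
# (all values in seconds), i.e. when the start limit can be reached.
start_limit_reachable() {
    local restart_sec=$1 interval=$2 burst=$3
    [ "$interval" -gt $((restart_sec * burst)) ]
}

# Values from syncer-port.service:
# RestartSec=15s, StartLimitInterval=300, StartLimitBurst=5
if start_limit_reachable 15 300 5; then
    echo "ok: 300 > 15 * 5, the start limit can be reached"
else
    echo "warning: the start limit may never trigger"
fi
```

With RestartSec=15 and StartLimitBurst=5 the five starts fit into roughly 75 seconds, well inside the 300-second window, so the sixth start request is refused.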

Syncer script

cat run_syncer.sh

Syncer service startup script
All paths used in the startup script must be absolute paths

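#!/bin/bash
# Runs one Syncer job in the foreground so that systemd can supervise it.

svr=syncer-job
work_dir="/root/systemd-syncer"
bin_dir="${work_dir}/bin/syncer"   # path to the syncer binary itself
conf_dir="${work_dir}/conf"
log_dir="${work_dir}/log"
nohup_dir="${work_dir}/nohup"

# stdout and stderr are appended to ${nohup_dir}/${svr}.out
"${bin_dir}" --config "${conf_dir}/${svr}.toml" --enable-gtid --auto-fix-gtid --log-file "${log_dir}/${svr}.log" --status-addr :11105 &>>"${nohup_dir}/${svr}.out"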

systemd-syncer directory structure

[root@ip-172-16-10-65 systemd-syncer]# tree ./
./
├── bin
│   └── syncer
├── conf
│   └── syncer-job.toml
├── log -> /data1/systemd-syncer.log
│   └── syncer-job.log
├── meta
│   └── syncer-job.meta
├── nohup
│   └── syncer-job.out
└── run-syncer
    └── run_syncer-job.sh

0x02 Testing and verification

Systemd operations

systemctl start syncer-port
- Start the service
systemctl stop syncer-port
- Stop the service
systemctl restart syncer-port
- Restart the service

systemctl status syncer-port -l

Show the service status; -l prints full, untruncated lines

[root@ip-172-16-10-65 bin]# systemctl status syncer-port -l
● syncer-port.service - syncer service
  Loaded: loaded (/etc/systemd/system/syncer-port.service; disabled; vendor preset: disabled)
  Active: failed (Result: start-limit) since Fri 2018-02-09 16:43:52 CST; 9min ago
  Process: 393938 ExecStart=/root/systemd-syncer/run_syncer.sh (code=exited, status=0/SUCCESS)
Main PID: 393938 (code=exited, status=0/SUCCESS)

Feb 09 16:43:52 ip-172-16-10-65 systemd[1]: syncer-port.service holdoff time over, scheduling restart.
Feb 09 16:43:52 ip-172-16-10-65 systemd[1]: start request repeated too quickly for syncer-port.service
Feb 09 16:43:52 ip-172-16-10-65 systemd[1]: Failed to start syncer service.
Feb 09 16:43:52 ip-172-16-10-65 systemd[1]: Unit syncer-port.service entered failed state.
Feb 09 16:43:52 ip-172-16-10-65 systemd[1]: syncer-port.service failed.

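Simulating a sync failure

MySQL
- create index idx on hello (id,status,size) lock=none
- Early TiDB versions do not support lock=none, which we use to reproduce Syncer exiting on unsupported syntax
Syncer
- Syncer keeps restarting because TiDB cannot execute this DDL
- The process exits and Systemd restarts it; after 5 starts, Systemd refuses further restarts and the service stays stopped

journalctl -f -u syncer-port

Use journalctl to get more log information for the service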

Feb 09 16:33:42 ip-172-16-10-65 systemd[1]: Unit syncer-port.service entered failed state.
Feb 09 16:33:42 ip-172-16-10-65 systemd[1]: syncer-port.service failed.
Feb 09 16:42:36 ip-172-16-10-65 systemd[1]: Started syncer service.
Feb 09 16:42:36 ip-172-16-10-65 systemd[1]: Starting syncer service...
Feb 09 16:42:51 ip-172-16-10-65 systemd[1]: syncer-port.service holdoff time over, scheduling restart.
Feb 09 16:42:51 ip-172-16-10-65 systemd[1]: Started syncer service.
Feb 09 16:42:51 ip-172-16-10-65 systemd[1]: Starting syncer service...
Feb 09 16:43:06 ip-172-16-10-65 systemd[1]: syncer-port.service holdoff time over, scheduling restart.
Feb 09 16:43:06 ip-172-16-10-65 systemd[1]: Started syncer service.
Feb 09 16:43:06 ip-172-16-10-65 systemd[1]: Starting syncer service...
Feb 09 16:43:22 ip-172-16-10-65 systemd[1]: syncer-port.service holdoff time over, scheduling restart.
Feb 09 16:43:22 ip-172-16-10-65 systemd[1]: Started syncer service.
Feb 09 16:43:22 ip-172-16-10-65 systemd[1]: Starting syncer service...
Feb 09 16:43:37 ip-172-16-10-65 systemd[1]: syncer-port.service holdoff time over, scheduling restart.
Feb 09 16:43:37 ip-172-16-10-65 systemd[1]: Started syncer service.
Feb 09 16:43:37 ip-172-16-10-65 systemd[1]: Starting syncer service...
Feb 09 16:43:52 ip-172-16-10-65 systemd[1]: syncer-port.service holdoff time over, scheduling restart.
Feb 09 16:43:52 ip-172-16-10-65 systemd[1]: start request repeated too quickly for syncer-port.service
Feb 09 16:43:52 ip-172-16-10-65 systemd[1]: Failed to start syncer service.
Feb 09 16:43:52 ip-172-16-10-65 systemd[1]: Unit syncer-port.service entered failed state.
Feb 09 16:43:52 ip-172-16-10-65 systemd[1]: syncer-port.service failed.
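
To confirm how many starts Systemd actually made before giving up, the "Started" messages can be counted. This is a small sketch; the function name count_starts is hypothetical, and it assumes the journal message format shown in the excerpt above.

```shell
#!/bin/bash
# Count "Started ..." messages from systemd (PID 1) in journal text on stdin.
# grep -c exits non-zero when the count is 0, so the status is ignored.
count_starts() {
    grep -c 'systemd\[1\]: Started ' || true
}

# Example (not executed here):
#   journalctl -u syncer-port --since "2018-02-09 16:42:00" | count_starts
```

For the excerpt above, the lines from 16:42:36 to 16:43:37 contain five "Started syncer service." messages, which matches StartLimitBurst=5.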

Restart records in the log

[root@ip-172-16-10-65 bin]# grep "status exits" s.log
2018/02/09 16:42:36 syncer.go:795: [info] print status exits, err:context canceled
2018/02/09 16:42:51 syncer.go:795: [info] print status exits, err:context canceled
2018/02/09 16:43:06 syncer.go:795: [info] print status exits, err:context canceled
2018/02/09 16:43:22 syncer.go:795: [info] print status exits, err:context canceled
2018/02/09 16:43:37 syncer.go:795: [info] print status exits, err:context canceled

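0x03 Alerting

Configure alerting according to the official documentation (Syncer link)
For process-exit alerts, add the Blackbox component to monitor the state of the service port.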
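
Before wiring up Blackbox, the same kind of TCP check can be run by hand. The sketch below is not Blackbox itself; it is a manual stand-in that uses bash's /dev/tcp redirection, and the function name port_is_open is hypothetical. Port 11105 is the --status-addr value from run_syncer.sh.

```shell
#!/bin/bash
# Returns 0 if a TCP connection to host:port succeeds within 2 seconds.
port_is_open() {
    local host=$1 port=$2
    timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# 11105 is the Syncer status port configured in run_syncer.sh.
if port_is_open 127.0.0.1 11105; then
    echo "syncer status port is up"
else
    echo "syncer status port is down"
fi
```

Once Systemd has hit the start limit and left the service stopped, this check reports the port as down, which is the condition the alert should fire on.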

0x04 FAQ

Error

Job for Test-syncer.service failed because start of the service was attempted too often. See "systemctl status Test-syncer.service" and "journalctl -xe" for details. To force a start use "systemctl reset-failed Test-syncer.service" followed by "systemctl start Test-syncer.service" again.
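- Within StartLimitInterval, the number of restarts reached the StartLimitBurst value. Reset the counter with systemctl reset-failed Test-syncer.service, then start the service again
systemd startup fails with (code=exited, status=1/FAILURE)
- Check the service logs (look at the syncer log first, then try starting it manually with the run_syncer script)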

Systemd references

systemd and system initialization (link)
Systemd Tutorial for Beginners: Commands (Ruan Yifeng)
Writing systemd service scripts
Writing a systemd service file
systemd restart // How to set up a systemd service to retry 5 times on a cycle of 30 seconds
- Number of restarts allowed within a given time window
- This worked for me for a service that runs a script using ‘Type=idle’. Note that ‘StartLimitInterval’ must be greater than ‘RestartSec * StartLimitBurst’ otherwise the service will be restarted indefinitely.
