故障排查 – 朱金贺

业务日志监控中报告, 每天会有大约250次连接redis失败.

通过strace追踪发现.故障的时间点时写磁盘时间超过了10s.一般在10-15s之间. redis第二次重试使用的是10s.

这个实例所有的操作都是INCR, fdatasync 会block写.

strace -Ttt -f -p 11302 -T -e trace=fdatasync

11309 10:21:31.153900 fdatasync(116) = 0 <0.034295>

11309 10:21:32.078747 fdatasync(116) = 0 <7.592478>

11309 10:21:39.774959 fdatasync(116) = 0 <10.098802>

11309 10:21:49.990623 fdatasync(116) = 0 <2.026147>

11309 10:21:52.129676 fdatasync(116) = 0 <0.002802>

治标:

超时时间改为15s.

治本:

正在用watchdog抓一下超过5s的堆栈.

堆栈:

[11302 | signal handler] (1499754857)

--- WATCHDOG TIMER EXPIRED ---

/usr/local/bin/redis-server-2.8 10.160.86.216:6699(logStackTrace+0x3e)[0x445ace]

/lib64/libpthread.so.0(write+0x2d)[0x7f19ef3b06fd]

/lib64/libpthread.so.0(+0xf710)[0x7f19ef3b1710]

/lib64/libpthread.so.0(write+0x2d)[0x7f19ef3b06fd]

/usr/local/bin/redis-server-2.8 10.160.86.216:6699(flushAppendOnlyFile+0x4e)[0x44116e]

/usr/local/bin/redis-server-2.8 10.160.86.216:6699(serverCron+0x3b7)[0x41bb17]

/usr/local/bin/redis-server-2.8 10.160.86.216:6699(aeProcessEvents+0x1e9)[0x416b69]

/usr/local/bin/redis-server-2.8 10.160.86.216:6699(aeMain+0x2b)[0x416deb]

/usr/local/bin/redis-server-2.8 10.160.86.216:6699(main+0x31d)[0x41e49d]

/lib64/libc.so.6(__libc_start_main+0xfd)[0x7f19ef02cd5d]

/usr/local/bin/redis-server-2.8 10.160.86.216:6699[0x415bd9]

[11302 | signal handler] (1499754857) --------

fdatasync会在某个时间点超过10s.

看来因为写磁盘堵塞了, 把机械硬盘换成了SSD, 解决了.

收到报警, Postgresql数据库服务器出现瞬时CPU占用高峰, load/cpu能从0.5瞬间飙升到5以上,每次持续15s-2mins. 波峰比较严重时候,会导致PHP进程数占满,进而影响API接口的新入请求.

由于每次发生的时间间隔大概2个小时, 但是没有规律. zabbix监控对这种突发最好只做到在1.5分钟后通知, 失去了事故现场.

由于报警延迟, 只能一边开着terminal实时观察, 一边在下次故障发生前排查其他的可能性.

怀疑是最近变更. 近期postgresql刚安装了jieba分词插件. 有可能会导致这个问题.

怀疑是由5%慢请求拖累整个数据库,造成服务器性能不够.这个根据经验是最可能的.

怀疑定时任务导致瞬时大量请求.

怀疑是因为数据量增大后内存热数据不够导致.

怀疑是vacuum或其他后台进程导致.

排查的过程中, 突然抓到事故现场. 立即查看内存和CPU相关指标.

1. _spin_unlock_irqrestore 占用89%的CPU时间.

2. 单个sql请求消耗内存大概是25M.

grep Pss /proc/`ps auxf | grep hbforce_v4| grep -ir select | tail -n 1 | awk '{print $2}'`/smaps |awk '{total+=$2}; END {print total}'

3. 内存现状,显示还有大量的cached内存, 系统故障时候伴随着内存也扫描和回收.(低于/proc/zoneinfo的low的值时触发回收,回收到high停止)

用到的知识点:

在系统空闲内存低于 watermark[low]时，开始启动内核线程kswapd进行内存回收.直到该zone的空闲内存数量达到watermark[high]后停止回收。如果上层申请内存的速度太快，导致空闲内存降至watermark[min]后，内核就会进行direct reclaim（直接回收），即直接在应用程序的进程上下文中进行回收，再用回收上来的空闲页满足内存申请，因此实际会阻塞应用程序.当前情况下.min-low-high之间只有15320KB的内存差.

而一个postgresql请求中需要分配的内存一般会在25MB左右.会导致直接触发强制回收.

下面是定位出来的触发逻辑:

1.内存不足时,单次申请内存过大,导致直接强制回收.触发compat_zone, 导致自旋锁,这个锁的特点是等待锁的过程中不休眠,而是占着CPU空转.spin_unlock_irqrestore — 禁止内核抢占并屏蔽中断，事先保存中断屏蔽位并事后恢复原状.

2.CPU空转导致新进入的任务堆积,load升高,并占用连接.

3.PHP不能处理导致进程占用,PHP-FPM进程达到最大值时候阻塞新请求进入.

4.API接口超时,导致前端调用失败,显示接口等待或失败.

5.内存回收完毕后释放锁,然后恢复正常.

总结原因是单次内存申请超过了min-low-high间隔,导致强制回收,触发了自旋锁.

调整内核参数解决:

vm.min_free_kbytes = 1081344

vm.vfs_cache_pressure = 200

当时的现场的数据:

[root@DATABASE ~]# cat /proc/zoneinfo

Node 0, zone Normal

pages free 20198

min 15319

low 19148

high 22978

[root@DATABASE ~]# free -m

total used free shared buffers cached

Mem: 32109 31778 330 0 256 30215

-/+ buffers/cache: 1307 30802

Swap: 0 0 0

[root@DATABASE ~]# cat /proc/meminfo

MemTotal: 32879936 kB

MemFree: 2506208 kB

Buffers: 892660 kB

Cached: 28122840 kB

SwapCached: 0 kB

Active: 16954896 kB

Inactive: 12560144 kB

Active(anon): 3459900 kB

Inactive(anon): 310148 kB

Active(file): 13494996 kB

Inactive(file): 12249996 kB

Unevictable: 0 kB

Mlocked: 0 kB

SwapTotal: 0 kB

SwapFree: 0 kB

Dirty: 7004 kB

Writeback: 0 kB

AnonPages: 258360 kB

Mapped: 3255272 kB

Shmem: 3271380 kB

Slab: 471896 kB

SReclaimable: 419744 kB

SUnreclaim: 52152 kB

KernelStack: 3680 kB

PageTables: 38140 kB

NFS_Unstable: 0 kB

Bounce: 0 kB

WritebackTmp: 0 kB

CommitLimit: 16439968 kB

Committed_AS: 8292496 kB

VmallocTotal: 34359738367 kB

VmallocUsed: 65256 kB

VmallocChunk: 34359667396 kB

HardwareCorrupted: 0 kB

AnonHugePages: 0 kB

HugePages_Total: 0

HugePages_Free: 0

HugePages_Rsvd: 0

HugePages_Surp: 0

Hugepagesize: 2048 kB

DirectMap4k: 6016 kB

DirectMap2M: 2091008 kB

DirectMap1G: 31457280 kB

[root@DATABASE ~]# numactl --hardware

available: 1 nodes (0)

node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

node 0 size: 32767 MB

node 0 free: 1141 MB

node distances:

node 0

0: 10

[root@DATABASE ~]# cat /proc/zoneinfo

Node 0, zone Normal

pages free 20198

min 15319

low 19148

high 22978

scanned 0

spanned 7602176

present 7498240

[root@DATABASE ~]# sar -B 1

Linux 2.6.32-279.el6.x86_64 (DATABASE) 06/21/2017 _x86_64_ (16 CPU)

12:51:57 AM pgpgin/s pgpgout/s fault/s majflt/s pgfree/s pgscank/s pgscand/s pgsteal/s %vmeff

12:51:58 AM 129.17 612.50 13683.33 0.00 23310.42 0.00 7663.54 7704.17 100.53

12:51:59 AM 1872.34 12.77 17207.45 0.00 29784.04 0.00 2325.53 2325.53 100.00

12:52:00 AM 148.84 0.00 19734.30 11.05 24716.86 0.00 4424.42 4424.42 100.00

12:52:01 AM 84.21 0.00 2047.37 21.05 18484.21 0.00 4347.37 4178.95 96.13

12:52:02 AM 1025.00 0.00 18497.92 2.08 22652.08 0.00 5256.25 5289.58 100.63

12:52:03 AM 2235.79 450.53 52714.74 4.21 87222.11 0.00 24298.95 24297.89 100.00

12:52:05 AM 13.45 6.72 13763.87 0.00 22565.55 0.00 1813.45 1813.45 100.00

12:52:06 AM 8.00 0.00 14222.00 0.00 16852.00 0.00 5255.00 5199.00 98.93

12:52:07 AM 11734.69 114.29 190776.53 1.02 223285.71 0.00 43768.37 43814.29 100.10

12:52:08 AM 58848.98 1559.18 226263.27 0.00 233448.98 0.00 1057.14 1057.14 100.00

12:52:09 AM 108673.47 24.49 140714.29 0.00 142710.20 0.00 0.00 0.00 0.00

12:52:10 AM 65834.34 16.16 41429.29 0.00 42891.92 0.00 0.00 0.00 0.00

12:52:11 AM 59104.17 0.00 59959.38 0.00 60518.75 0.00 0.00 0.00 0.00

12:52:12 AM 56040.40 137.37 46878.79 0.00 40731.31 0.00 0.00 0.00 0.00

对应的是

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----

82 1 0 581128 587036 30372132 0 0 17092 616 63036 2132 1 99 0 0 0

73 1 0 687572 586204 30274540 0 0 1476 24 18942 581 0 100 0 0 0

83 0 0 700872 585844 30250296 0 0 432 588 16374 427 0 100 0 0 0

51 1 0 715412 585816 30238628 0 0 1752 12 16660 508 1 99 0 0 0

87 1 0 726340 585784 30229336 0 0 212 0 17575 480 1 99 0 0 0

88 1 0 742876 585760 30196484 0 0 116 0 25373 712 1 99 0 0 0

80 1 0 753324 585620 30180436 0 0 1328 428 17074 504 1 99 0 0 0

90 0 0 828596 585088 30092832 0 0 1736 8 19115 1378 1 99 0 0 0

86 0 0 835460 585000 30088000 0 0 16 0 17247 461 0 100 0 0 0

82 0 0 888980 584364 29995256 0 0 24 8 18090 574 1 99 0 0 0

1 1 0 1012756 583688 29915708 0 0 20796 112 19206 4382 6 73 20 1 0

1 1 0 1018156 583952 29982312 0 0 68868 1520 8720 5912 4 7 85 3 0

0 1 0 962080 584016 30083852 0 0 101396 24 6240 4274 3 4 89 4 0

r b swpd free buff cache si so bi bo in cs us sy id wa st

1 1 0 906804 584032 30144308 0 0 61180 16 4026 3513 1 2 93 4 0

这些信息没有在默认的监控中体现出来, 后续又细化zabbix内存监控.

细化的模板在这里

参考文章:

http://blog.2baxb.me/archives/1065

https://engineering.linkedin.com/performance/optimizing-linux-memory-management-low-latency-high-throughput-databases

https://www.kernel.org/doc/gorman/html/understand/understand013.html

http://os.51cto.com/art/201310/412638.htm

http://www.cs.fsu.edu/~baker/devices/lxr/source/2.6.11.8/linux/mm/mempool.c

http://blog.csdn.net/longxibendi/article/details/44625703

https://www.google.com/#newwindow=1&hl=en&q=kswapd+spin_unlock_irqrestore+cpu

https://www.google.com/#q=%22compact_zone%22+spin_unlock_irqrestore

分类：故障排查

redis刷磁盘可能会导致瞬时无法连接

direct reclaim内存回收导致postgresql CPU占用激增