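A service went down.

After the machine was force-rebooted, I found that the LVM volume was full and the files on it were inaccessible: every file operation returned Input/output error.

dmesg showed a flood of filesystem errors, most likely the result of a forced power-off while processes kept reading and writing after the disk had filled up.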

[ 1714.217864] XFS (dm-0): page discard on page 00000000161e11d5, inode 0xd861b703d, offset 937984.
[ 1714.219674] XFS (dm-0): page discard on page 000000001d433e5e, inode 0xd861b703d, offset 942080.
[ 1714.221132] XFS (dm-0): page discard on page 00000000820efe8d, inode 0xd861b703d, offset 946176.
[ 1714.222431] XFS (dm-0): page discard on page 00000000518c8216, inode 0xd861b703d, offset 950272.
[ 1714.223744] XFS (dm-0): page discard on page 00000000753db760, inode 0xd861b703d, offset 954368.
[ 1714.225041] XFS (dm-0): page discard on page 00000000da40787d, inode 0xd861b703d, offset 958464.
[ 1714.226341] XFS (dm-0): page discard on page 00000000ba8adb4b, inode 0xd861b703d, offset 962560.
[ 1714.227629] XFS (dm-0): page discard on page 00000000784c4724, inode 0xd861b703d, offset 966656.
[ 1714.228923] XFS (dm-0): page discard on page 0000000063b2c764, inode 0xd861b703d, offset 970752.
[ 1714.228990] XFS (dm-0): page discard on page 0000000046a36fd8, inode 0xd861b703d, offset 974848.
[ 1714.337426] dm-0: writeback error on inode 58084519997, offset 905216, sector 34365282240
[ 1716.586318] dm-0: writeback error on inode 58084519997, offset 905216, sector 34365309816
[ 1728.444718] xfs_discard_page: 9674 callbacks suppressed

...

[ 1763.990454] XFS (dm-0): xfs_do_force_shutdown(0x8) called from line 955 of file fs/xfs/xfs_trans.c. Return address = 00000000ea9478e4
[ 1763.990459] XFS (dm-0): Corruption of in-memory data detected. Shutting down filesystem
[ 1763.992696] XFS (dm-0): Please unmount the filesystem and rectify the problem(s)
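
With this many messages it can help to condense the kernel log before reading it in detail. The following is a minimal sketch, not part of the original troubleshooting: `summarize_xfs_log` is a hypothetical helper name, and the patterns it matches are taken from the log lines shown above. Because the kernel rate-limits these messages (note the "callbacks suppressed" line), the counts are only a lower bound.

```shell
# Hypothetical helper: summarize XFS trouble in kernel log text read from stdin.
# The patterns are copied from the dmesg excerpt in this post.
summarize_xfs_log() {
    awk '
        /page discard on page/     { discards++ }    # dirty pages XFS gave up on
        /writeback error on inode/ { writeback++ }   # failed writeback attempts
        /xfs_do_force_shutdown/    { shutdown = 1 }  # filesystem was shut down
        END {
            printf "page discards: %d\n", discards
            printf "writeback errors: %d\n", writeback
            printf "forced shutdown: %s\n", (shutdown ? "yes" : "no")
        }
    '
}
```

On a live system it would be fed from the kernel log, for example `dmesg | summarize_xfs_log`, which may require root depending on the distribution's settings.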

The log spells it out clearly enough, so let's unmount and repair.

~> umount /data
~> xfs_repair /dev/mapper/data

Output:

Phase 1 - find and verify superblock...
- reporting progress in intervals of 15 minutes
Phase 2 - using internal log
- zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed. Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair. If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.

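Huh, why do I have to mount it again? As the message says, the pending log has to be replayed first, and mounting the filesystem is what does that.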

~> mount -a
~> umount /data
~> xfs_repair /dev/mapper/data

Then came a long wait... (these are HDDs)

Phase 1 - find and verify superblock...
- reporting progress in intervals of 15 minutes
Phase 2 - using internal log
- zero log...
- 19:33:58: zeroing log - 521728 of 521728 blocks done
- scan filesystem freespace and inode maps...
- 19:34:11: scanning filesystem freespace - 33 of 33 allocation groups done
- found root inode chunk
Phase 3 - for each AG...
- scan and clear agi unlinked lists...
- 19:34:11: scanning agi unlinked lists - 33 of 33 allocation groups done
- process known inodes and perform inode discovery...
- agno = 0
- agno = 15
- agno = 30
...
- agno = 27
- agno = 28
- agno = 29
- 19:43:14: process known inodes and inode discovery - 16555072 of 16555072 inodes done
- process newly discovered inodes...
- 19:43:14: process newly discovered inodes - 33 of 33 allocation groups done
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- 19:43:15: setting up duplicate extent list - 33 of 33 allocation groups done
- check for inodes claiming duplicate blocks...
- agno = 7
- agno = 3
- agno = 8
...
- agno = 30
- agno = 31
- agno = 32
- 19:43:24: check for inodes claiming duplicate blocks - 16555072 of 16555072 inodes done
Phase 5 - rebuild AG headers and trees...
- 19:43:27: rebuild AG headers and trees - 33 of 33 allocation groups done
- reset superblock...
Phase 6 - check inode connectivity...
- resetting contents of realtime bitmap and summary inodes
- traversing filesystem ...
- 19:48:58: rebuild AG headers and trees - 33 of 33 allocation groups done
- traversal finished ...
- moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
- 19:50:02: verify and correct link counts - 33 of 33 allocation groups done
done

Once the repair finished, I rebooted, and the files on the LVM volume were accessible again.

Postscript:
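
The mount, unmount, repair sequence used above can be wrapped in a small function. This is only a sketch under a few assumptions: `replay_and_repair` and the `RUN` variable are names invented here, and the function mounts the device explicitly instead of relying on `mount -a` and an fstab entry. If the mount fails it stops instead of reaching for `xfs_repair -L`, since destroying the log is the last resort that xfs_repair itself warns about.

```shell
# Hypothetical helper: replay the XFS log by mounting, then run xfs_repair.
# Set RUN=echo to print the commands instead of executing them (dry run).
replay_and_repair() {
    device=$1
    mountpoint=$2
    run=${RUN:-}

    # Mounting replays the pending log; xfs_repair refuses to run until then.
    if ! $run mount "$device" "$mountpoint"; then
        echo "mount failed; the log could not be replayed" >&2
        return 1
    fi

    # xfs_repair needs the filesystem unmounted.
    $run umount "$mountpoint" || return 1
    $run xfs_repair "$device"
}
```

A dry run such as `RUN=echo replay_and_repair /dev/mapper/data /data` prints the three commands in order. Without `RUN` it needs root and really does modify the filesystem.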

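Luckily, the root partition was on its own disk this time. If the root partition and the LVM volume had shared the same physical disk, I would have had to reboot into rescue mode, activate LVM by hand, and then run the repair.

Also, XFS really is dependable (glancing at a certain btrfs developer who slacks off every day watching anime and posting spoilers).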
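
For that rescue-mode case, the manual steps might look roughly like the sketch below. I did not have to run this, so treat it as an assumption rather than a tested procedure: `rescue_repair` and `RUN` are made-up names, `vgscan` and `vgchange -ay` are the standard LVM tools for finding and activating volume groups, and the device path is the one from this post. If the log is still dirty, the mount and unmount step shown earlier is still needed before the repair.

```shell
# Hypothetical helper for a rescue shell: activate LVM, then repair.
# Set RUN=echo to print the commands instead of executing them (dry run).
rescue_repair() {
    device=$1
    run=${RUN:-}

    $run vgscan || return 1        # scan block devices for volume groups
    $run vgchange -ay || return 1  # activate every logical volume found
    $run xfs_repair "$device"      # device node exists once the LV is active
}
```

For example, `RUN=echo rescue_repair /dev/mapper/data` lists the commands without touching anything.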