Once again it's everyone's favorite time: fixing somebody else's system.

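The reported symptom was "stuck at 'Starting switch root'", so my first guess was a broken initramfs. Sure enough, after asking around: a certain someone who loves running updates for no good reason was in the middle of one when the data center's brain-dead remote hands hit the power button ¯\_(ツ)_/¯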

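If it won't boot, into Rescue we go. CentOS isn't much of a system, but at least the stock ISO ships a Rescue option, and it even finds and mounts the root partition on its own. I figured this would be a trivial job, but just as I was about to chroot, out popped "chroot: cannot run command `/bin/sh': Exec format error" and my heart skipped a beat $#%#$^$#@#%#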

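Generally speaking, "Exec format error" means the binary is in the wrong format, for example a 64-bit binary being run on a 32-bit system. But that can't be it here: both sides are x86_64, the installed system and the ISO are the same version, and the disk reported no errors. So why would this happen?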
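
For what it's worth, the kernel returns this error (ENOEXEC) for any executable it cannot recognize, and that includes a file that has been truncated or emptied, not only one built for another architecture. Below is a minimal sketch of a check for that from the rescue shell. It assumes `head`, `od` and `tr` are available there, and `is_elf` is a helper name made up for this post.

```shell
# Hypothetical helper (not part of the original session): succeed only if the
# file starts with the 4-byte ELF magic 0x7f 'E' 'L' 'F'.
is_elf() {
    # Dump the first four bytes as hex and strip od's spacing and newline.
    magic=$(head -c 4 "$1" 2>/dev/null | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "7f454c46" ]
}
```

Something like `is_elf /mnt/sysimage/bin/bash || echo "not a valid ELF file"` would then tell a damaged binary apart from an architecture mismatch.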

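Files getting trashed by a power cut in the middle of an update is not out of the question, but the odd part is that running /bin/bash fails with the same error. If things are this broken, could glibc itself be toast...? Either way, step one is getting into the system to see what is going on, so I gritted my teeth and overwrote the on-disk /usr with the one from the Live CD.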

~> cp -r /usr /mnt/sysimage/
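
A side note on that cp: plain `cp -r` does not preserve ownership or timestamps, and files it newly creates lose their setuid bits, which matters for a tree like /usr. It was good enough here because everything gets reinstalled later anyway, but a more careful overlay might look like the sketch below. `overlay_tree` is a hypothetical name, and `cp -a` is assumed to be the GNU coreutils one.

```shell
# Hypothetical helper: copy the contents of SRC over DST in archive mode, so
# symlinks stay symlinks and mode, ownership and timestamps are preserved
# where the caller has the privileges to do so.
overlay_tree() {
    cp -a "$1/." "$2/"
}
```

In the rescue environment the call would be roughly `overlay_tree /usr /mnt/sysimage/usr`.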

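Then chroot again. Sure enough, this time I got into the system, and yum worked too. Since the power was cut in the middle of an update, the first thing to try is repairing the unfinished update.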

~> yum-complete-transaction
Loaded plugins: fastestmirror, langpacks
Loading mirror speeds from cached hostfile
No unfinished transactions left.

Huh.
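
A guess at why nothing was found: as far as I know, yum-complete-transaction works from journal files named `transaction-all.<timestamp>` and `transaction-done.<timestamp>` under /var/lib/yum, so if those never made it to disk before the power cut, there is nothing for it to replay. That is an assumption about yum's internals, not something I verified on this machine. A sketch for checking, with the made-up helper name `list_yum_journals`:

```shell
# Hypothetical helper: print any leftover yum transaction journals.
# Assumption: they live in /var/lib/yum as transaction-all.<timestamp>.
list_yum_journals() {
    dir=${1:-/var/lib/yum}
    for f in "$dir"/transaction-all.*; do
        # With no match the pattern stays literal, so check that it exists.
        [ -e "$f" ] && printf '%s\n' "$f"
    done
    return 0
}
```

No output from `list_yum_journals` would be consistent with the "No unfinished transactions left." message above.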

~> yum update
--> Finished Dependency Resolution
Error: Multilib version problems found. This often means that the root
cause is something else and multilib version checking is just
pointing out that there is a problem. Eg.:

1. You have an upgrade for libselinux which is missing some
dependency that another package requires. Yum is trying to
solve this by installing an older version of glibc of the
different architecture. If you exclude the bad architecture
yum will tell you what the root cause is (which package
requires what). You can try redoing the upgrade with
--exclude glibc.otherarch ... this should give you an error
message showing the root cause of the problem.

2. You have multiple architectures of glibc installed, but
yum can only see an upgrade for one of those arcitectures.
If you don't want/need both architectures anymore then you
can remove the one with the missing update and everything
will work.

3. You have duplicate versions of glibc installed already.
You can use "yum check" to get yum show these errors.

...you can also use --setopt=protected_multilib=false to remove
this checking, however this is almost never the correct thing to
do as something else is very likely to go wrong (often causing
much more problems).

Protected multilib versions: glibc..... != glibc.....
kbd... x86_64 != kbd ... x86_64

...so it really is glibc.

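On further inquiry, this system never had any 32-bit packages installed, so where did a 32-bit glibc come from? yum check turned up nothing, and on top of that there is an odd kbd package that has nothing to do with 32-bit at all. Pretty badly broken, then. But since I know for sure there are no 32-bit packages on the system, I can keep brute-forcing my way through.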
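
The kbd line, with x86_64 on both sides, looks more like case 3 in yum's message (two versions of the same package recorded in the RPM database) than a real 32-bit package. A rough way to list such duplicates is sketched below. `find_dupes` is a made-up name, and `package-cleanup --dupes` from yum-utils does a more thorough job of the same thing.

```shell
# Hypothetical helper: read "name.arch" lines on stdin and print every
# entry that occurs more than once.
find_dupes() {
    sort | uniq -d
}

# On the real system the input would come from rpm, for example:
#   rpm -qa --qf '%{NAME}.%{ARCH}\n' | find_dupes
# kernel and gpg-pubkey normally show up several times, so ignore those.
```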

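# Finish updating every package that is still pending
~> yum update --setopt=protected_multilib=false
# Prepare for the reinstall (an empty directory, so the unquoted * below reaches yum unexpanded)
~> mkdir /root/tmp
~> cd /root/tmp
# Reinstall every package
~> yum reinstall * --setopt=protected_multilib=false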

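A flurry of heroic keystrokes later: on the first pass of reinstalling everything, ldconfig still complained that some libraries were empty files; on the second full reinstall the errors were gone. Looks workable, so I rebooted without hesitation, and after a pile of OKs the login prompt appeared.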
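
The ldconfig complaint is worth a second look: a power cut in the middle of a write can leave files that still exist but have zero length. A minimal sketch for listing them follows. `list_empty_files` is a name made up for this post, and which directories to scan is only my guess.

```shell
# Hypothetical helper: print every zero-length regular file under the
# given directories.
list_empty_files() {
    find "$@" -type f -size 0 2>/dev/null
}
```

Running `list_empty_files /usr/lib64 /etc` gives a list of candidates only, since some files are legitimately empty; `rpm -qf <path>` shows which package owns each one.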

localhost login: root

Login incorrect

Login incorrect


Login incorrect

Wait, what??? I hadn't even typed a password yet???

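An error before any password has been entered means it is at least not a PAM problem. What comes before that? Probably systemd-logind. Luckily the network came up fine and I could log in over SSH. Had the network or SSH been broken too, it would have been back to rescue mode.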

A lot of low-level services that should have been running were not, systemd-logind among them. I tried to check its status and got an error instead:

Failed to activate service 'org.freedesktop.systemd1': timed out

Huh? Even systemd itself hasn't come up properly. Time to look at the logs:

...
localhost dbus-daemon[1166]: Encountered error 'Error in file /etc/dbus-1/system.d/org.freedesktop.hostname1.conf, line 1, column 0: no element found
' while parsing '/etc/dbus-1/system.d/org.freedesktop.hostname1.conf'
localhost dbus-daemon[1166]: Encountered error 'Error in file /etc/dbus-1/system.d/org.freedesktop.import1.conf, line 1, column 0: no element found
' while parsing '/etc/dbus-1/system.d/org.freedesktop.import1.conf'
localhost dbus-daemon[1166]: Encountered error 'Error in file /etc/dbus-1/system.d/org.freedesktop.locale1.conf, line 1, column 0: no element found
' while parsing '/etc/dbus-1/system.d/org.freedesktop.locale1.conf'
localhost dbus-daemon[1166]: Encountered error 'Error in file /etc/dbus-1/system.d/org.freedesktop.login1.conf, line 1, column 0: no element found
' while parsing '/etc/dbus-1/system.d/org.freedesktop.login1.conf'
localhost dbus-daemon[1166]: Encountered error 'Error in file /etc/dbus-1/system.d/org.freedesktop.machine1.conf, line 1, column 0: no element found
' while parsing '/etc/dbus-1/system.d/org.freedesktop.machine1.conf'
localhost dbus-daemon[1166]: Encountered error 'Error in file /etc/dbus-1/system.d/org.freedesktop.systemd1.conf, line 1, column 0: no element found
' while parsing '/etc/dbus-1/system.d/org.freedesktop.systemd1.conf'
localhost dbus-daemon[1166]: Encountered error 'Error in file /etc/dbus-1/system.d/org.freedesktop.timedate1.conf, line 1, column 0: no element found
' while parsing '/etc/dbus-1/system.d/org.freedesktop.timedate1.conf'
...

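This thing is seriously ill. The contents of these files are gone, and reinstalling the packages does not overwrite them, presumably because they are packaged as config files, which RPM avoids clobbering. So they have to be overwritten by hand.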
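
The excerpt above is abridged, so rather than picking the affected files out by eye, the paths can be pulled from the log itself. A small sketch, assuming the messages keep the "while parsing '<path>'" wording shown above; `broken_dbus_files` is a hypothetical name.

```shell
# Hypothetical helper: read dbus-daemon log lines on stdin and print each
# file named after "while parsing", once, in sorted order.
broken_dbus_files() {
    sed -n "s/.*while parsing '\([^']*\)'.*/\1/p" | sort -u
}
```

It can be fed from wherever the log lines came from, for example `journalctl -b | broken_dbus_files`.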

# Install the tools needed
~> yum install yum-utils rpm2cpio
# Download and unpack the RPMs
~> cd /root/tmp
~> yumdownloader dbus systemd
~> rpm2cpio dbus-1.10.24-15.el7.x86_64.rpm | cpio -idmv
~> rpm2cpio systemd-219-78.el7_9.5.x86_64.rpm | cpio -idmv

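That gave me the contents of both packages. I manually copied everything under etc/dbus-1 over the files on the real filesystem, then ran kill 1, which sends SIGTERM to systemd and makes it re-execute itself: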
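
Overwriting the whole directory worked for me, but it would also flatten any local edits. A more cautious variant, sketched below under the assumption that damaged files are exactly the ones that are missing or zero-length, copies only those. `restore_empty` is a made-up helper, not something I actually ran.

```shell
# Hypothetical helper: for every file under SRC, copy it to the same
# relative path under DST only if the destination is missing or empty.
restore_empty() {
    src=$1
    dst=$2
    (cd "$src" && find . -type f) | while IFS= read -r rel; do
        target="$dst/${rel#./}"
        if [ ! -s "$target" ]; then
            mkdir -p "$(dirname "$target")"
            cp -p "$src/$rel" "$target"
            printf 'restored %s\n' "$target"
        fi
    done
}
```

With the packages unpacked in /root/tmp as above, the call would be `restore_empty /root/tmp/etc/dbus-1 /etc/dbus-1`.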

localhost systemd[1]: Received SIGTERM from PID 4332 (bash).
localhost systemd[1]: Reexecuting.
localhost systemd[1]: systemd 219 running in system mode. (+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ
localhost systemd[1]: Detected architecture x86-64.
localhost dbus[1166]: [system] Successfully activated service 'org.freedesktop.systemd1'
localhost systemd[1]: Started Login Service.
localhost systemd-logind[4723]: New seat seat0.

After a careful once-over, all the services are starting up normally!

(๑•̀ㅂ•́)و✧

Postscript: don't update production systems for no reason.

Postscript 2: not a single data center remote-hands tech in the West can be relied on.