A disk in my home proxmox/zfs server raised an alert, and this post records how I handled it. One more thing worth saying: proxmox/zfs really is great :)
Discovering the problem
I received two alert emails:
SMART error (CurrentPendingSector) detected on host: pve
This message was generated by the smartd daemon running on:
host name: pve
DNS domain: local
The following warning/error was logged by the smartd daemon:
Device: /dev/sdb [SAT], 24 Currently unreadable (pending) sectors
Device info:
ST2000DM001-1CH164, S/N:W1E816VY, WWN:5-000c50-073c58600, FW:CC29, 2.00 TB
For details see host's SYSLOG.
You can also use the smartctl utility for further investigation.
The original message about this issue was sent at Thu Jul 5 09:19:59 2018 CST
Another message will be sent in 24 hours if the problem persists.
SMART error (OfflineUncorrectableSector) detected on host: pve
This message was generated by the smartd daemon running on:
host name: pve
DNS domain: local
The following warning/error was logged by the smartd daemon:
Device: /dev/sdb [SAT], 24 Offline uncorrectable sectors
Device info:
ST2000DM001-1CH164, S/N:W1E816VY, WWN:5-000c50-073c58600, FW:CC29, 2.00 TB
For details see host's SYSLOG.
You can also use the smartctl utility for further investigation.
The original message about this issue was sent at Thu Jul 5 09:19:59 2018 CST
Another message will be sent in 24 hours if the problem persists.
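The two alerts correspond to the SMART attributes Current_Pending_Sector and Offline_Uncorrectable, and the same counters can be read directly from the disk. Below is a minimal sketch, assuming the usual ATA attribute table printed by `smartctl -A`, where the second field is the attribute name and the tenth field is the raw value. The helper name `smart_raw_value` is made up for illustration.

```shell
# Print the raw value of one SMART attribute.
# Reads `smartctl -A` output on stdin; $1 is the attribute name.
# Assumption: standard ATA table layout (field 2 = name, field 10 = raw value).
smart_raw_value() {
  awk -v name="$1" '$2 == name { print $10 }'
}

# Usage on the real host (needs root and smartmontools):
#   smartctl -A /dev/sdb | smart_raw_value Current_Pending_Sector
#   smartctl -A /dev/sdb | smart_raw_value Offline_Uncorrectable
```

A non-zero value that keeps growing between checks is a stronger signal than a single reading.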
Checking the disk for bad blocks
badblocks -v /dev/sdb > /tmp/sdb-bad-blocks.txt 2>&1
Contents of the output file:
Checking blocks 0 to 1953514583
Checking for bad blocks (read-only test): done
Pass completed, 12 bad blocks found. (12/0/0 errors)
91950516
91950517
91950518
91950519
728082060
728082061
728082062
728082063
728083428
728083429
728083430
728083431
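The twelve block numbers fall into three runs of four consecutive blocks. Assuming badblocks' default 1024-byte block size (consistent with the range 0 to 1953514583 on a 2 TB disk) and 512-byte logical sectors, 12 blocks × 1024 bytes equals 24 sectors × 512 bytes, which matches the 24 sectors in the smartd alert. The sketch below groups a badblocks output file into contiguous ranges; the function name `bad_block_ranges` is hypothetical.

```shell
# Collapse a list of bad block numbers into contiguous ranges.
# Reads the badblocks output file on stdin, ignores non-numeric lines,
# and prints one "first-last count" line per run of consecutive blocks.
bad_block_ranges() {
  grep -E '^[0-9]+$' | sort -n | awk '
    NR == 1        { start = prev = $1; next }
    $1 == prev + 1 { prev = $1; next }
                   { print (start "-" prev), (prev - start + 1); start = prev = $1 }
    END            { if (NR > 0) print (start "-" prev), (prev - start + 1) }
  '
}

# Usage:
#   bad_block_ranges < /tmp/sdb-bad-blocks.txt
```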
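There are not many bad blocks, but they do show that this disk's health is deteriorating. My pool is raidz1, which can tolerate the failure of only one disk, so for the safety of my data I decided to replace it. To keep costs down, my personal server uses the cheapest ordinary desktop-grade Seagate Barracuda disks, and this one has been in service for more than five years, so it is time to let it rest :)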
Replacing the disk
For the replacement procedure I followed this official article.
First, check the current status:
$ zpool status -v
pool: poolA
state: ONLINE
scan: scrub repaired 96K in 4h0m with 0 errors on Sun May 12 04:24:44 2019
config:
NAME STATE READ WRITE CKSUM
poolA ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
ata-ST2000DM001-1CH164_W1E816VY ONLINE 0 0 0
ata-ST2000DM006-2DM164_Z4ZAKL4K ONLINE 0 0 0
ata-TOSHIBA_HDWD120_48GH796CS ONLINE 0 0 0
logs
ata-INTEL_SSDSC2KW256G8_BTLA81300DD1256CGN-part4 ONLINE 0 0 0
cache
ata-INTEL_SSDSC2KW256G8_BTLA81300DD1256CGN-part5 ONLINE 0 0 0
errors: No known data errors
pool: poolB
state: ONLINE
scan: scrub repaired 0B in 5h17m with 0 errors on Sun May 12 05:41:54 2019
config:
NAME STATE READ WRITE CKSUM
poolB ONLINE 0 0 0
ata-HGST_HTS541010B7E610_WX11A187J862 ONLINE 0 0 0
errors: No known data errors
Take the failing disk offline:
$ zpool offline poolA /dev/disk/by-id/ata-ST2000DM001-1CH164_W1E816VY
Status after taking it offline:
$ zpool status -v
pool: poolA
state: DEGRADED
status: One or more devices has been taken offline by the administrator.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Online the device using 'zpool online' or replace the device with
'zpool replace'.
scan: scrub repaired 96K in 4h0m with 0 errors on Sun May 12 04:24:44 2019
config:
NAME STATE READ WRITE CKSUM
poolA DEGRADED 0 0 0
raidz1-0 DEGRADED 0 0 0
ata-ST2000DM001-1CH164_W1E816VY OFFLINE 0 0 0
ata-ST2000DM006-2DM164_Z4ZAKL4K ONLINE 0 0 0
ata-TOSHIBA_HDWD120_48GH796CS ONLINE 0 0 0
logs
ata-INTEL_SSDSC2KW256G8_BTLA81300DD1256CGN-part4 ONLINE 0 0 0
cache
ata-INTEL_SSDSC2KW256G8_BTLA81300DD1256CGN-part5 ONLINE 0 0 0
errors: No known data errors
pool: poolB
state: ONLINE
scan: scrub repaired 0B in 5h17m with 0 errors on Sun May 12 05:41:54 2019
config:
NAME STATE READ WRITE CKSUM
poolB ONLINE 0 0 0
ata-HGST_HTS541010B7E610_WX11A187J862 ONLINE 0 0 0
errors: No known data errors
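With several pools and devices it is easy to overlook a single row in the status output. The following sketch prints only the rows of the config table whose state is not ONLINE. It assumes the row layout shown above (name, state, then the READ, WRITE and CKSUM counters); the function name `not_online` is made up for illustration.

```shell
# List every pool, vdev, or device whose STATE is not ONLINE.
# Reads `zpool status` output on stdin. A config row is recognised by
# having at least five fields with a numeric counter in the third one.
not_online() {
  awk 'NF >= 5 && $3 ~ /^[0-9]/ && $2 != "ONLINE" { print $1, $2 }'
}

# Usage:
#   zpool status -v | not_online
```

For the status above this would report poolA and raidz1-0 as DEGRADED and the old Seagate disk as OFFLINE; empty output means every row is ONLINE.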
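Swapping the hardware
Pull the failing disk and install the newly purchased one in its place (insert the new disk into the slot the old one occupied).
You can inspect the new disk's details with smartctl -i /dev/sdb.
Copying the partition table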
$ sgdisk --replicate=/dev/sdb /dev/sdc
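The official documentation explains this step. Whatever you do, do not reverse the arguments, or you will have to live with the consequences: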
sgdisk --replicate=/dev/target /dev/source
Make sure you get the devices in the right order: you’re invoking sgdisk on the WORKING disk (source), replicating TO the new disk (target).
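Because the disk being replaced, /dev/sdb, is the target and /dev/sdc is a working disk (the source), the command is sgdisk --replicate=/dev/sdb /dev/sdc.
Generating new GUIDs for the new disk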
$ sgdisk --randomize-guids /dev/sdb
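Since the argument order of `--replicate` is easy to get backwards, one option is to have a small helper print the two sgdisk commands for review before anything is run. This is only a sketch: the function name `clone_table_cmds` is hypothetical, it prints the commands instead of executing them, and it can only catch the trivial mistake of passing the same device twice.

```shell
# Print (do not run) the sgdisk commands that copy a partition table.
#   $1 = healthy source disk that is still in the pool
#   $2 = new, empty target disk
# sgdisk expects the TARGET as the value of --replicate and the SOURCE
# as the positional argument, so the order is swapped on purpose.
clone_table_cmds() {
  src=$1
  dst=$2
  if [ -z "$src" ] || [ -z "$dst" ] || [ "$src" = "$dst" ]; then
    echo "usage: clone_table_cmds SOURCE TARGET (must differ)" >&2
    return 1
  fi
  echo "sgdisk --replicate=$dst $src"
  echo "sgdisk --randomize-guids $dst"
}

# Usage for the disks in this post:
#   clone_table_cmds /dev/sdc /dev/sdb
```

Reading the printed lines against the device names from smartctl -i before pasting them into a root shell adds one more chance to catch a swapped source and target.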
[Optional] Installing GRUB on the new disk
$ grub-install /dev/sdb
On my machine this step returned:
Installing for x86_64-efi platform.
grub-install: error: cannot find EFI directory.
This is probably because my boot disk is a separate SSD that is not part of ZFS, so it does not matter that this step fails.
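Switching to the new disk in ZFS
My pool was created using disk IDs rather than references such as /dev/sda, so the replace command differs slightly from the official documentation. The newly inserted disk is ata-ST2000DM005-2CW102_WFM0PY3X, as it appears under /dev/disk/by-id/ata-ST2000DM005-2CW102_WFM0PY3X.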
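Before running the replace, it can help to confirm which by-id name belongs to the freshly inserted /dev/sdb. The sketch below walks the by-id directory and prints every link that resolves to a given device. It assumes GNU `readlink -f` is available (as on Debian-based Proxmox); the function name `ids_for_device` and its optional second argument are made up for illustration.

```shell
# Print the /dev/disk/by-id names that point at a given block device.
#   $1 = device node, e.g. /dev/sdb
#   $2 = directory of symlinks (defaults to /dev/disk/by-id)
ids_for_device() {
  dev=$(readlink -f "$1")
  dir=${2:-/dev/disk/by-id}
  for link in "$dir"/*; do
    # Skip anything that is not a symlink (including an unmatched glob).
    [ -L "$link" ] || continue
    if [ "$(readlink -f "$link")" = "$dev" ]; then
      basename "$link"
    fi
  done
}

# Usage:
#   ids_for_device /dev/sdb
```

A whole disk usually shows up under more than one name (for example an ata- and a wwn- link); either form should identify the same device.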
$ zpool replace poolA ata-ST2000DM001-1CH164_W1E816VY ata-ST2000DM005-2CW102_WFM0PY3X
$ zpool status -v
pool: poolA
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Fri Jun 7 17:51:40 2019
161M scanned out of 1.76T at 1.44M/s, 356h33m to go
53.4M resilvered, 0.01% done
config:
NAME STATE READ WRITE CKSUM
poolA DEGRADED 0 0 0
raidz1-0 DEGRADED 0 0 0
replacing-0 DEGRADED 0 0 0
ata-ST2000DM001-1CH164_W1E816VY OFFLINE 0 0 0
ata-ST2000DM005-2CW102_WFM0PY3X ONLINE 0 0 0 (resilvering)
ata-ST2000DM006-2DM164_Z4ZAKL4K ONLINE 0 0 0
ata-TOSHIBA_HDWD120_48GH796CS ONLINE 0 0 0
logs
ata-INTEL_SSDSC2KW256G8_BTLA81300DD1256CGN-part4 ONLINE 0 0 0
cache
ata-INTEL_SSDSC2KW256G8_BTLA81300DD1256CGN-part5 ONLINE 0 0 0
errors: No known data errors
pool: poolB
state: ONLINE
scan: scrub repaired 0B in 5h17m with 0 errors on Sun May 12 05:41:54 2019
config:
NAME STATE READ WRITE CKSUM
poolB ONLINE 0 0 0
ata-HGST_HTS541010B7E610_WX11A187J862 ONLINE 0 0 0
errors: No known data errors
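Note: Once “zpool status” finally shows nothing but ONLINE, it is safe to reboot. Do not reboot until the new disk has finished resilvering.
Summary
As this replacement shows, the ZFS file system is very easy to use and maintain.
PS: I may publish an article on setting up proxmox/zfs later.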
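Because the resilver can take many hours, it is convenient to poll its progress instead of rereading the full status by hand. A minimal sketch, assuming the progress line has the form shown above ("53.4M resilvered, 0.01% done"); other ZFS versions may word this line differently, and the function name `resilver_percent` is hypothetical.

```shell
# Extract the resilver completion percentage from `zpool status` output.
# Reads the status text on stdin and prints the number before "% done";
# prints nothing when no resilver progress line is present.
resilver_percent() {
  sed -n 's/.*resilvered, \([0-9.]*\)% done.*/\1/p'
}

# Usage: report progress every ten minutes until the resilver ends.
#   while zpool status poolA | grep -q 'resilver in progress'; do
#     echo "resilver: $(zpool status poolA | resilver_percent)% done"
#     sleep 600
#   done
```

When the loop exits, a final look at zpool status should still confirm that every device reports ONLINE before rebooting.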