Checking the RAID controller status

  • Check the RAID controller status
    gengee@HP-Z240:~$ sudo megasasctl
    a0       LSI MegaRAID SAS 9265-8i encl:1 ldrv:1  batt:FAULT, module missing, pack missing, charge failed
    a0d0      7451GiB RAID 5   1x3  optimal
    a0e252s1   3726GiB  a0d0  online
    a0e252s2   3726GiB  a0d0  online
    a0e252s3   3726GiB  a0d0  online
  • Check the status of the disks in the virtual drive (VD)
    gengee@HP-Z240:~$ sudo megaraidsas-status
    -- Arrays informations --
    -- ID | Type | Size | Status
    a0d0 | RAID 5 | 7451GiB | optimal
    
    -- Disks informations
    -- ID | Model | Status | Warnings
    a0e252s1 | ATA ST4000NM0165 3726GiB | online
    a0e252s2 | ATA ST4000NM0165 3726GiB | online
    a0e252s3 | ATA ST4000NM0165 3726GiB | online
  • Check the S.M.A.R.T. status of each disk
    gengee@HP-Z240:~$ sudo megacli -CfgDsply -aALL -nolog |grep -i -E "Physical Disk:|Slot Number|Drive has flagged a S.M.A.R.T alert"
    Physical Disk: 0
    Slot Number: 1
    Drive has flagged a S.M.A.R.T alert : No
    Physical Disk: 1
    Slot Number: 2
    Drive has flagged a S.M.A.R.T alert : No
    Physical Disk: 2
    Slot Number: 3
    Drive has flagged a S.M.A.R.T alert : No
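
The first check above can also be scripted. The following is a minimal sketch, not part of the original procedure: a hypothetical helper, check_megasasctl, reads megasasctl output and reports anything that differs from the healthy states shown in the transcript. It assumes the line layout shown above (state in the last column) and only recognizes the batt:FAULT marker seen here; other versions of the tool may print different fields.

```shell
#!/usr/bin/env bash
# Sketch: flag unhealthy entries in `megasasctl` output.
# check_megasasctl is a hypothetical helper name, not part of any tool.

check_megasasctl() {
    # Reads megasasctl output on stdin; prints one line per problem found.
    awk '
        # Adapter line, e.g. "a0  LSI MegaRAID ...  batt:FAULT, ..."
        $1 ~ /^a[0-9]+$/ && /batt:FAULT/ {
            print "WARN battery: " $1 " reports batt:FAULT"
        }
        # Logical drive line, e.g. "a0d0  7451GiB RAID 5  1x3  optimal"
        $1 ~ /^a[0-9]+d[0-9]+$/ && $NF != "optimal" {
            print "WARN array: " $1 " is " $NF
        }
        # Physical disk line, e.g. "a0e252s1  3726GiB  a0d0  online"
        $1 ~ /^a[0-9]+e[0-9]+s[0-9]+$/ && $NF != "online" {
            print "WARN disk: " $1 " is " $NF
        }
    '
}

# Sample input copied from the transcript above.
sample='a0       LSI MegaRAID SAS 9265-8i encl:1 ldrv:1  batt:FAULT, module missing, pack missing, charge failed
a0d0      7451GiB RAID 5   1x3  optimal
a0e252s1   3726GiB  a0d0  online
a0e252s2   3726GiB  a0d0  online
a0e252s3   3726GiB  a0d0  online'

printf '%s\n' "$sample" | check_megasasctl

# On a live system (assumption: megasasctl is installed):
#   sudo megasasctl | check_megasasctl
```

With the sample above, only the battery warning is printed, because the array is optimal and all three disks are online.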

Notes

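  • The third-party tool collection HWraid for GNU/Linux is recommended.
  • MegaRAID Storage Manager is not recommended (this applies to Ubuntu only). The official documentation says that alien can be used to convert the RPM package into a DEB package for installation.

    In practice, however, this caused unpredictable errors in this setup (for example, a large number of directories under the Linux root directory had their permissions set to 744).
    If that happens, the following link describes how to repair the filesystem permissions from a USB boot disk: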
    https://askubuntu.com/questions/831216/reinstalling-grub2-efi-partition
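
To see whether a system has been affected by the permission problem described above, the top-level directories can be listed by mode. This is only a sketch: list_dirs_with_mode is a hypothetical helper, and it relies on the -mindepth and -maxdepth options of find, which GNU find on Ubuntu provides.

```shell
#!/usr/bin/env bash
# Sketch: list the immediate subdirectories of a directory whose
# permission bits match a given octal mode exactly.
# list_dirs_with_mode is a hypothetical helper name.

list_dirs_with_mode() {
    # $1: directory to inspect (for example /)
    # $2: octal mode to look for (for example 744)
    find "$1" -mindepth 1 -maxdepth 1 -type d -perm "$2"
}

# Example (not run here): show top-level directories set to 744.
#   list_dirs_with_mode / 744
```

An empty result means that no directory directly under the inspected path has exactly that mode.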

Disk self-tests and periodic monitoring

  • Scan and list the disks in the system
    gengee@HP-Z240:~$ sudo smartctl --scan
    /dev/sda -d scsi # /dev/sda, SCSI device
    /dev/sdb -d scsi # /dev/sdb, SCSI device
    /dev/bus/0 -d megaraid,93 # /dev/bus/0 [megaraid_disk_93], SCSI device
    /dev/bus/0 -d megaraid,95 # /dev/bus/0 [megaraid_disk_95], SCSI device
    /dev/bus/0 -d megaraid,97 # /dev/bus/0 [megaraid_disk_97], SCSI device
  • View disk information
    gengee@HP-Z240:~$ sudo smartctl --info /dev/bus/0 -d megaraid,93
    smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.10.0-40-generic] (local build)
    Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
    
    === START OF INFORMATION SECTION ===
    Device Model:     ST4000NM0165
    Serial Number:    ZAD0N5VA
    LU WWN Device Id: 5 000c50 0a1e6a3e8
    Firmware Version: HPS0
    User Capacity:    4,000,787,030,016 bytes [4.00 TB]
    Sector Sizes:     512 bytes logical, 4096 bytes physical
    Rotation Rate:    7200 rpm
    Form Factor:      3.5 inches
    Device is:        Not in smartctl database [for details use: -P showall]
    ATA Version is:   ACS-3 T13/2161-D revision 5
    SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
    Local Time is:    Wed Jan 31 10:45:59 2018 CST
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled
  • Run a disk self-test (short test)
    gengee@HP-Z240:~$ sudo smartctl -t short /dev/bus/0 -d megaraid,93
    smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.10.0-40-generic] (local build)
    Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
    
    === START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
    Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
    Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
    Testing has begun.
    Please wait 1 minutes for test to complete.
    Test will complete after Wed Jan 31 10:51:50 2018
    
    Use smartctl -X to abort test.
  • View the self-test results
    gengee@HP-Z240:~$ sudo smartctl -l selftest /dev/bus/0 -d  megaraid,93
    smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.10.0-40-generic] (local build)
    Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
    
    === START OF READ SMART DATA SECTION ===
    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Short offline       Completed without error       00%      1705         -
    # 2  Short offline       Completed without error       00%      1696         -
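  • Periodic disk monitoring
  1. Enable periodic monitoring by adding the devices that the system does not detect automatically (megaraid,93, megaraid,95, and megaraid,97)
    gengee@HP-Z240:/var/log$ sudo vim /etc/smartd.conf
    # Note: the "DEVICESCAN" line below must be commented out
    # DEVICESCAN -d removable -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
    # Short self-test of each MegaRAID disk daily at 01:00; long self-test every Saturday at 02:00.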
    /dev/bus/0 -d megaraid,93 -a -o on -S on -s (S/../.././01|L/../../6/02)
    /dev/bus/0 -d megaraid,95 -a -o on -S on -s (S/../.././01|L/../../6/02)
    /dev/bus/0 -d megaraid,97 -a -o on -S on -s (S/../.././01|L/../../6/02)
    #
    # Monitor SMART health status
    /dev/bus/0 -d megaraid,93 -H -l error -l selftest -t -I 194
    /dev/bus/0 -d megaraid,95 -H -l error -l selftest -t -I 194
    /dev/bus/0 -d megaraid,97 -H -l error -l selftest -t -I 194
  2. Check the smartd monitoring messages
    gengee@HP-Z240:/var/log$ sudo grep smartd  syslog
    Jan 31 08:36:53 HP-Z240 smartd[1346]: smartd 6.6 2016-05-31 r4324 [x86_64-linux-4.10.0-40-generic] (local build)
    Jan 31 08:36:53 HP-Z240 smartd[1346]: Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
    Jan 31 08:36:53 HP-Z240 smartd[1346]: Opened configuration file /etc/smartd.conf
    Jan 31 08:36:53 HP-Z240 smartd[1346]: Configuration file /etc/smartd.conf parsed.
    Jan 31 08:36:53 HP-Z240 smartd[1346]: Device: /dev/bus/0, type changed from 'megaraid,93' to 'sat+megaraid,93'
    Jan 31 08:36:53 HP-Z240 smartd[1346]: Device: /dev/bus/0 [megaraid_disk_93] [SAT], opened
    Jan 31 08:36:53 HP-Z240 smartd[1346]: Device: /dev/bus/0 [megaraid_disk_93] [SAT], ST4000NM0165, S/N:ZAD0N5VA, WWN:5-000c50-0a1e6a3e8, FW:HPS0, 4.00 TB
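
The smartd.conf entries above were written by hand, one per disk. As a rough sketch, they could instead be generated from the output of smartctl --scan. The helper name gen_smartd_entries is hypothetical, and the default schedule is simply the one used in the configuration above; the sketch assumes scan lines in the "device -d megaraid,N # comment" form shown earlier.

```shell
#!/usr/bin/env bash
# Sketch: turn `smartctl --scan` output into smartd.conf entries for
# every disk behind the MegaRAID controller.
# gen_smartd_entries is a hypothetical helper name.

gen_smartd_entries() {
    # $1 (optional): self-test schedule regex for the smartd -s directive.
    local default_schedule='(S/../.././01|L/../../6/02)'
    local schedule=${1:-$default_schedule}
    # Keep only lines such as "/dev/bus/0 -d megaraid,93 # ..." and
    # rewrite them in the form used in /etc/smartd.conf above.
    awk -v sched="$schedule" '
        $2 == "-d" && $3 ~ /^megaraid,[0-9]+$/ {
            print $1, "-d", $3, "-a -o on -S on -s", sched
        }
    '
}

# Sample input copied from the scan transcript above.
scan='/dev/sda -d scsi # /dev/sda, SCSI device
/dev/sdb -d scsi # /dev/sdb, SCSI device
/dev/bus/0 -d megaraid,93 # /dev/bus/0 [megaraid_disk_93], SCSI device
/dev/bus/0 -d megaraid,95 # /dev/bus/0 [megaraid_disk_95], SCSI device
/dev/bus/0 -d megaraid,97 # /dev/bus/0 [megaraid_disk_97], SCSI device'

printf '%s\n' "$scan" | gen_smartd_entries

# On a live system:
#   sudo smartctl --scan | gen_smartd_entries
```

Review the generated lines before adding them to /etc/smartd.conf, then restart smartd so that the new entries are read.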

References

MegaCli useful commands with examples
S.M.A.R.T.
HWRAID
HWRAID Ubuntu repository and GPG key

1. `deb http://hwraid.le-vert.net/ubuntu xenial main`
2. `wget -O - https://hwraid.le-vert.net/ubuntu/hwraid.le-vert.net.gpg.key | sudo apt-key add -`