from lin.wang
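storcli is the successor to megacli. On Dell servers the equivalent tool is perccli, and its usage is identical.
smartctl reads the SMART information reported by a disk's controller chip.
lsscsi lists the system's SCSI devices, drawing on /proc/scsi/scsi and related sources; it is not covered in this document.
All of these are common tools for inspecting disks, and they help when troubleshooting disk state and RAID card problems.
Install storcli or perccli, then symlink the binary into /usr/bin/ so the command is easy to call: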
ln -s /opt/MegaRAID/storcli/storcli64 /usr/bin/
ln -s /opt/MegaRAID/perccli/perccli64 /usr/bin/
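To map a system device name such as /dev/sdf to its physical drive bay, proceed as follows:
perccli64 /c0/eall/sall show lists the physical disks on the controller:
img-/c0/eall/sall The output shows four JBOD disks. As a rule of thumb, JBOD disks are assigned device names before RAID virtual drives, so the JBOD disks occupy /dev/sda through /dev/sdd and the RAID virtual drives start at /dev/sde.
dg stands for drive group and reflects the order in which groups were created when RAID was configured. In the output, 32:4 and 32:5 belong to the same drive group.
perccli64 /c0/vall show shows how dg maps to vd:
img-/c0/vall In this output, dg/vd pairs each RAID drive group with the virtual drive that the operating system sees. If the server has only RAID virtual drives, vd0 is usually /dev/sda in the operating system, vd1 is /dev/sdb, and so on. If the server also has JBOD disks, the RAID virtual drives are ordered after them. In this example vd0 = /dev/sde, so /dev/sdf is vd = 1, which corresponds to dg = 1.
Back in img-/c0/eall/sall, the row with dg = 1 has did = 6. did is the device ID, which is used again later. The slot number slt = 6 corresponds to the seventh bay on the server (slots are numbered from 0), which pins down the physical bay of /dev/sdf.
Conversely, starting from a drive whose fault LED is lit on the server, you can work backwards to the system device name.
Note:
If the server has no JBOD disks and everything is RAID, the mapping shown by /c0/vall is enough on its own.
In practice you can also run perccli64 /c0/e32/s6 start locate and perccli64 /c0/e32/s6 stop locate to switch the drive's locate LED on and off and confirm that the bay was identified correctly.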
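The lookup chain described above (device name to vd, vd to dg, dg to slot and did) can be sketched in Python. This is only an illustration under stated assumptions: the tables are typed in by hand from the example server, the JBOD-first ordering is the rule of thumb from this article rather than something the controller guarantees, and the function name locate is made up for this sketch.

```python
# Values transcribed by hand from the example server in this article.
# ASSUMPTION: JBOD disks are enumerated before RAID virtual drives.
JBOD_COUNT = 4
VD_TO_DG = {0: 0, 1: 1, 2: 2}
PD_LIST = [  # (eid, slot, did, dg)
    (32, 4, 4, 0),
    (32, 5, 5, 0),
    (32, 6, 6, 1),
    (32, 7, 7, 2),
]


def locate(device, cli="perccli64", controller=0):
    """Return slot, did and a locate command for each disk behind `device`."""
    name = device.rsplit("/", 1)[-1]
    letter = name[2:]
    if not name.startswith("sd") or len(letter) != 1 or not "a" <= letter <= "z":
        raise ValueError("this sketch only handles /dev/sda to /dev/sdz")
    vd = ord(letter) - ord("a") - JBOD_COUNT
    if vd not in VD_TO_DG:
        raise LookupError(f"{device} does not map to a known virtual drive")
    dg = VD_TO_DG[vd]
    return [
        {
            "slot": slot,
            "did": did,
            "locate_cmd": f"{cli} /c{controller}/e{eid}/s{slot} start locate",
        }
        for eid, slot, did, pd_dg in PD_LIST
        if pd_dg == dg
    ]


for disk in locate("/dev/sdf"):
    print(disk)
```

A RAID 1 virtual drive such as /dev/sde returns two entries, one per member disk, so both bays would need checking.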
perccli64 show ctrlcount shows how many controllers, that is, RAID cards, are present
perccli64 show displays RAID card information
[root@node-15 ~]# perccli64 show
status code = 0
status = success
description = none

number of controllers = 1
host name = node-15.domain.tld
operating system = linux3.10.0-327.20.1.es2.el7.x86_64

system overview :
===============

------------------------------------------------------------------------
ctl model        ports pds dgs dnopt vds vnopt bbu spr ds ehs asos hlth
------------------------------------------------------------------------
  0 perch730mini     8  16  11     0  11     0 opt on  3  n      0 opt
------------------------------------------------------------------------

ctl=controller index|dgs=drive groups|vds=virtual drives|fld=failed
pds=physical drives|dnopt=dg notoptimal|vnopt=vd notoptimal|opt=optimal
msng=missing|dgd=degraded|ndatn=need attention|unkwn=unknown
spr=scheduled patrol read|ds=dimmerswitch|ehs=emergency hot spare
y=yes|n=no|asos=advanced software options|bbu=battery backup unit
hlth=health|safe=safe-mode boot
There is only one RAID card: ctl 0, which is addressed as /c0.
storcli64 /c0 show
[root@node-15 ~]# perccli64 /c0 show
generating detailed summary of the adapter, it may take a while to complete.

controller = 0
status = success
description = none

product name = perc h730 mini
serial number = 663021z
sas address = 51866da066153000
pci address = 00:03:00:00
system time = 01/10/2019 20:48:38
mfg. date = 06/17/16
controller time = 01/10/2019 12:44:21
fw package build = 25.4.0.0017
bios version = 6.29.00.0_4.16.07.00_0x06120100
fw version = 4.260.00-6259
driver name = megaraid_sas
driver version = 06.807.10.00-rh1
current personality = raid-mode
vendor id = 0x1000
device id = 0x5d
subvendor id = 0x1028
subdevice id = 0x1f49
host interface = pci-e
device interface = sas-12g
bus number = 3
device number = 0
function number = 0
drive groups = 11

topology :
========

---------------------------------------------------------------------------
dg arr row eid:slot did type  state bt size     pdc  pi sed ds3  fspace tr
---------------------------------------------------------------------------
 0 -   -   -        -   raid1 optl  n  931.0 gb dflt n  n   dflt n      n
 0 0   -   -        -   raid1 optl  n  931.0 gb dflt n  n   dflt n      n
 0 0   0   32:4     4   drive onln  n  931.0 gb dflt n  n   dflt -      n
 0 0   1   32:5     5   drive onln  n  931.0 gb dflt n  n   dflt -      n
 1 -   -   -        -   raid0 optl  n  931.0 gb dflt n  n   dflt n      n
 1 0   -   -        -   raid0 optl  n  931.0 gb dflt n  n   dflt n      n
 1 0   0   32:6     6   drive onln  n  931.0 gb dflt n  n   dflt -      n
 2 -   -   -        -   raid0 optl  n  931.0 gb dflt n  n   dflt n      n
 2 0   -   -        -   raid0 optl  n  931.0 gb dflt n  n   dflt n      n
 2 0   0   32:7     7   drive onln  n  931.0 gb dflt n  n   dflt -      n
 3 -   -   -        -   raid0 optl  n  931.0 gb dflt n  n   dflt n      n
 3 0   -   -        -   raid0 optl  n  931.0 gb dflt n  n   dflt n      n
 3 0   0   32:8     8   drive onln  n  931.0 gb dflt n  n   dflt -      n
 4 -   -   -        -   raid0 optl  n  931.0 gb dflt n  n   dflt n      n
 4 0   -   -        -   raid0 optl  n  931.0 gb dflt n  n   dflt n      n
 4 0   0   32:9     9   drive onln  n  931.0 gb dflt n  n   dflt -      n
 5 -   -   -        -   raid0 optl  n  931.0 gb dflt n  n   dflt n      n
 5 0   -   -        -   raid0 optl  n  931.0 gb dflt n  n   dflt n      n
 5 0   0   32:10    10  drive onln  n  931.0 gb dflt n  n   dflt -      n
 6 -   -   -        -   raid0 optl  n  931.0 gb dflt n  n   dflt n      n
 6 0   -   -        -   raid0 optl  n  931.0 gb dflt n  n   dflt n      n
 6 0   0   32:11    11  drive onln  n  931.0 gb dflt n  n   dflt -      n
 7 -   -   -        -   raid0 optl  n  931.0 gb dflt n  n   dflt n      n
 7 0   -   -        -   raid0 optl  n  931.0 gb dflt n  n   dflt n      n
 7 0   0   32:12    12  drive onln  n  931.0 gb dflt n  n   dflt -      n
 8 -   -   -        -   raid0 optl  n  931.0 gb dflt n  n   dflt n      n
 8 0   -   -        -   raid0 optl  n  931.0 gb dflt n  n   dflt n      n
 8 0   0   32:13    13  drive onln  n  931.0 gb dflt n  n   dflt -      n
 9 -   -   -        -   raid0 optl  n  931.0 gb dflt n  n   dflt n      n
 9 0   -   -        -   raid0 optl  n  931.0 gb dflt n  n   dflt n      n
 9 0   0   32:14    14  drive onln  n  931.0 gb dflt n  n   dflt -      n
10 -   -   -        -   raid0 optl  n  931.0 gb dflt n  n   dflt n      n
10 0   -   -        -   raid0 optl  n  931.0 gb dflt n  n   dflt n      n
10 0   0   32:15    15  drive onln  n  931.0 gb dflt n  n   dflt -      n
---------------------------------------------------------------------------

dg=disk group index|arr=array index|row=row index|eid=enclosure device id
did=device id|type=drive type|onln=online|rbld=rebuild|dgrd=degraded
pdgd=partially degraded|offln=offline|bt=background task active
pdc=pd cache|pi=protection info|sed=self encrypting drive|frgn=foreign
ds3=dimmer switch 3|dflt=default|msng=missing|fspace=free space present
tr=transport ready

virtual drives = 11

vd list :
=======

-------------------------------------------------------------
dg/vd type  state access consist cache cac scc size     name
-------------------------------------------------------------
0/0   raid1 optl  rw     yes     rwbd  -   off 931.0 gb
1/1   raid0 optl  rw     yes     rwbd  -   off 931.0 gb
2/2   raid0 optl  rw     yes     rwbd  -   off 931.0 gb
3/3   raid0 optl  rw     yes     rwbd  -   off 931.0 gb
4/4   raid0 optl  rw     yes     rwbd  -   off 931.0 gb
5/5   raid0 optl  rw     yes     rwbd  -   off 931.0 gb
6/6   raid0 optl  rw     yes     rwbd  -   off 931.0 gb
7/7   raid0 optl  rw     yes     rwbd  -   off 931.0 gb
8/8   raid0 optl  rw     yes     rwbd  -   off 931.0 gb
9/9   raid0 optl  rw     yes     rwbd  -   off 931.0 gb
10/10 raid0 optl  rw     yes     rwbd  -   off 931.0 gb
-------------------------------------------------------------

cac=cachecade|rec=recovery|ofln=offline|pdgd=partially degraded|dgrd=degraded
optl=optimal|ro=read only|rw=read write|hd=hidden|trans=transportready|b=blocked|
consist=consistent|r=read ahead always|nr=no read ahead|wb=writeback|
fwb=force writeback|wt=writethrough|c=cached io|d=direct io|scc=scheduled
check consistency

physical drives = 16

pd list :
=======

----------------------------------------------------------------------------
eid:slt did state dg size      intf med sed pi sesz model                sp
----------------------------------------------------------------------------
32:0      0 jbod  -  185.75 gb sata ssd n   n  512b intel ssdsc2bx200g4r u
32:1      1 jbod  -  185.75 gb sata ssd n   n  512b intel ssdsc2bx200g4r u
32:2      2 jbod  -  185.75 gb sata ssd n   n  512b intel ssdsc2bx200g4r u
32:3      3 jbod  -  185.75 gb sata ssd n   n  512b intel ssdsc2bx200g4r u
32:4      4 onln  0  931.0 gb  sata hdd n   n  512b st91000640ns         u
32:5      5 onln  0  931.0 gb  sata hdd n   n  512b st91000640ns         u
32:6      6 onln  1  931.0 gb  sata hdd n   n  512b st91000640ns         u
32:7      7 onln  2  931.0 gb  sata hdd n   n  512b st91000640ns         u
32:8      8 onln  3  931.0 gb  sata hdd n   n  512b st91000640ns         u
32:9      9 onln  4  931.0 gb  sata hdd n   n  512b st91000640ns         u
32:10    10 onln  5  931.0 gb  sata hdd n   n  512b st91000640ns         u
32:11    11 onln  6  931.0 gb  sata hdd n   n  512b st91000640ns         u
32:12    12 onln  7  931.0 gb  sata hdd n   n  512b st91000640ns         u
32:13    13 onln  8  931.0 gb  sata hdd n   n  512b st91000640ns         u
32:14    14 onln  9  931.0 gb  sata hdd n   n  512b st91000640ns         u
32:15    15 onln  10 931.0 gb  sata hdd n   n  512b st91000640ns         u
----------------------------------------------------------------------------

eid-enclosure device id|slt-slot no.|did-device id|dg-drivegroup
dhs-dedicated hot spare|ugood-unconfigured good|ghs-global hotspare
ubad-unconfigured bad|onln-online|offln-offline|intf-interface
med-media type|sed-self encryptive drive|pi-protection info
sesz-sector size|sp-spun|u-up|d-down/powersave|t-transition|f-foreign
ugunsp-unsupported|ugshld-unconfigured shielded|hspshld-hotspare shielded
cfshld-configured shielded|cpybck-copyback|cbshld-copyback shielded

bbu_info :
========

----------------------------------------------
model state   retentiontime temp mode mfgdate
----------------------------------------------
bbu   optimal 0 hour(s)     38c  -    0/00/00
----------------------------------------------
[root@node-15 ~]# perccli64 /c0/eall/sall show
controller = 0
status = success
description = show drive information succeeded.

drive information :
=================

----------------------------------------------------------------------------
eid:slt did state dg size      intf med sed pi sesz model                sp
----------------------------------------------------------------------------
32:0      0 jbod  -  185.75 gb sata ssd n   n  512b intel ssdsc2bx200g4r u
32:1      1 jbod  -  185.75 gb sata ssd n   n  512b intel ssdsc2bx200g4r u
32:2      2 jbod  -  185.75 gb sata ssd n   n  512b intel ssdsc2bx200g4r u
32:3      3 jbod  -  185.75 gb sata ssd n   n  512b intel ssdsc2bx200g4r u
32:4      4 onln  0  931.0 gb  sata hdd n   n  512b st91000640ns         u
32:5      5 onln  0  931.0 gb  sata hdd n   n  512b st91000640ns         u
32:6      6 onln  1  931.0 gb  sata hdd n   n  512b st91000640ns         u
32:7      7 onln  2  931.0 gb  sata hdd n   n  512b st91000640ns         u
32:8      8 onln  3  931.0 gb  sata hdd n   n  512b st91000640ns         u
32:9      9 onln  4  931.0 gb  sata hdd n   n  512b st91000640ns         u
32:10    10 onln  5  931.0 gb  sata hdd n   n  512b st91000640ns         u
32:11    11 onln  6  931.0 gb  sata hdd n   n  512b st91000640ns         u
32:12    12 onln  7  931.0 gb  sata hdd n   n  512b st91000640ns         u
32:13    13 onln  8  931.0 gb  sata hdd n   n  512b st91000640ns         u
32:14    14 onln  9  931.0 gb  sata hdd n   n  512b st91000640ns         u
32:15    15 onln  10 931.0 gb  sata hdd n   n  512b st91000640ns         u
----------------------------------------------------------------------------

eid-enclosure device id|slt-slot no.|did-device id|dg-drivegroup
dhs-dedicated hot spare|ugood-unconfigured good|ghs-global hotspare
ubad-unconfigured bad|onln-online|offln-offline|intf-interface
med-media type|sed-self encryptive drive|pi-protection info
sesz-sector size|sp-spun|u-up|d-down/powersave|t-transition|f-foreign
ugunsp-unsupported|ugshld-unconfigured shielded|hspshld-hotspare shielded
cfshld-configured shielded|cpybck-copyback|cbshld-copyback shielded
Note:
As a rule of thumb, JBOD disks are enumerated before RAID virtual drives.
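To apply that rule of thumb you need the number of JBOD disks and the dg of each physical drive, both of which appear in the drive listing. The following Python sketch pulls them out of rows copied from the listing. The names Drive and parse_pd_rows are made up for this example, and the parser assumes the column order shown in this article (eid:slt, did, state, dg first), which may differ between tool versions.

```python
from typing import List, NamedTuple, Optional


class Drive(NamedTuple):
    eid: int
    slot: int
    did: int
    state: str
    dg: Optional[int]  # None for drives outside any drive group (e.g. JBOD)


def parse_pd_rows(text: str) -> List[Drive]:
    """Parse 'eid:slt did state dg ...' rows; header and legend lines are skipped."""
    drives = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4 or ":" not in fields[0]:
            continue
        eid, _, slot = fields[0].partition(":")
        if not (eid.isdigit() and slot.isdigit() and fields[1].isdigit()):
            continue  # e.g. the 'eid:slt did state ...' header row
        dg = int(fields[3]) if fields[3].isdigit() else None
        drives.append(Drive(int(eid), int(slot), int(fields[1]), fields[2], dg))
    return drives


SAMPLE = """\
eid:slt did state dg size      intf med sed pi sesz model                sp
32:3      3 jbod  -  185.75 gb sata ssd n   n  512b intel ssdsc2bx200g4r u
32:4      4 onln  0  931.0 gb  sata hdd n   n  512b st91000640ns         u
32:5      5 onln  0  931.0 gb  sata hdd n   n  512b st91000640ns         u
32:6      6 onln  1  931.0 gb  sata hdd n   n  512b st91000640ns         u
"""

drives = parse_pd_rows(SAMPLE)
# Compare case-insensitively, since the state may be printed as JBOD or jbod.
jbod_count = sum(1 for d in drives if d.state.lower() == "jbod")
in_dg1 = [d for d in drives if d.dg == 1]
print(jbod_count, in_dg1)
```

Feeding the full 16-row listing instead of SAMPLE would give the JBOD count of four used throughout this article.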
[root@node-15 ~]# perccli64 /c0/e32/s6 show all
controller = 0
status = success
description = show drive information succeeded.

drive /c0/e32/s6 :
================

-------------------------------------------------------------------
eid:slt did state dg size     intf med sed pi sesz model        sp
-------------------------------------------------------------------
32:6      6 onln  1  931.0 gb sata hdd n   n  512b st91000640ns u
-------------------------------------------------------------------

eid-enclosure device id|slt-slot no.|did-device id|dg-drivegroup
dhs-dedicated hot spare|ugood-unconfigured good|ghs-global hotspare
ubad-unconfigured bad|onln-online|offln-offline|intf-interface
med-media type|sed-self encryptive drive|pi-protection info
sesz-sector size|sp-spun|u-up|d-down/powersave|t-transition|f-foreign
ugunsp-unsupported|ugshld-unconfigured shielded|hspshld-hotspare shielded
cfshld-configured shielded|cpybck-copyback|cbshld-copyback shielded

drive /c0/e32/s6 - detailed information :
=======================================

drive /c0/e32/s6 state :
======================
shield counter = 0
media error count = 46431 *** a clear problem: 46431 media errors have occurred ***
other error count = 0
drive temperature = 31c (87.80 f)
predictive failure count = 126 *** 126 predictive failures ***
s.m.a.r.t alert flagged by drive = yes

drive /c0/e32/s6 device attributes :
==================================
sn = 9xga228l
manufacturer id = ata
model number = st91000640ns
nand vendor = na
wwn = 5000c500918f2f8a
firmware revision = aa63
raw size = 931.512 gb [0x74706db0 sectors]
coerced size = 931.0 gb [0x74600000 sectors]
non coerced size = 931.012 gb [0x74606db0 sectors]
device speed = 6.0gb/s
link speed = 12.0gb/s
ncq setting = n/a
write cache = enabled
logical sector size = 512b
physical sector size = 512b
connector name = 00

drive /c0/e32/s6 policies/settings :
==================================
drive position = drivegroup:1, span:0, row:0
enclosure position = 0
connected port number = 0(path0)
sequence number = 2
commissioned spare = no
emergency spare = no
last predictive failure event sequence number = 95183 *** sequence number of the last predictive failure event: 95183 ***
successful diagnostics completion on = n/a
sed capable = no
sed enabled = no
secured = no
cryptographic erase capable = no
locked = no
needs ekm attention = no
pi eligible = no
certified = yes
wide port capable = no

port information :
================

-----------------------------------------
port status linkspeed sas address
-----------------------------------------
   0 active 12.0gb/s  0x500056b33fefe586
-----------------------------------------

inquiry data =
5a 0c ff 3f 37 c8 10 00 00 00 00 00 3f 00 00 00
00 00 00 00 20 20 20 20 20 20 20 20 20 20 20 20
58 39 41 47 32 32 4c 38 00 00 00 00 04 00 20 20
20 20 41 41 33 36 54 53 31 39 30 30 36 30 30 34
53 4e 20 20 20 20 20 20 20 20 20 20 20 20 20 20
20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 80
00 40 00 2f 00 40 00 02 00 02 07 00 ff 3f 10 00
3f 00 10 fc fb 00 10 00 ff ff ff 0f 00 00 07 00
Note:
Inspecting this single drive reveals media errors, which shows that the disk has a problem.
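Checking those counters by eye does not scale to sixteen drives, so a small Python sketch for extracting them from the "show all" text may help. It assumes the "key = value" layout shown in this article; parse_counters is a made-up name, and treating any non-zero media error or predictive failure count as suspect is a conservative choice made for this example, not a vendor rule.

```python
import re

WANTED = ("media error count", "other error count", "predictive failure count")


def parse_counters(text):
    """Collect the integer counters named in WANTED from 'key = value' lines."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        key = key.strip().lower()
        if sep and key in WANTED:
            match = re.match(r"\s*(\d+)", value)
            if match:
                counters[key] = int(match.group(1))
    return counters


SAMPLE = """\
shield counter = 0
media error count = 46431
other error count = 0
drive temperature = 31c (87.80 f)
predictive failure count = 126
"""

counters = parse_counters(SAMPLE)
# Conservative rule for this sketch: any media error or predictive failure is suspect.
suspect = (
    counters.get("media error count", 0) > 0
    or counters.get("predictive failure count", 0) > 0
)
print(counters, suspect)
```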
[root@node-15 ~]# perccli64 /c0/vall show
controller = 0
status = success
description = none

virtual drives :
==============

-------------------------------------------------------------
dg/vd type  state access consist cache cac scc size     name
-------------------------------------------------------------
0/0   raid1 optl  rw     yes     rwbd  -   off 931.0 gb
1/1   raid0 optl  rw     yes     rwbd  -   off 931.0 gb
2/2   raid0 optl  rw     yes     rwbd  -   off 931.0 gb
3/3   raid0 optl  rw     yes     rwbd  -   off 931.0 gb
4/4   raid0 optl  rw     yes     rwbd  -   off 931.0 gb
5/5   raid0 optl  rw     yes     rwbd  -   off 931.0 gb
6/6   raid0 optl  rw     yes     rwbd  -   off 931.0 gb
7/7   raid0 optl  rw     yes     rwbd  -   off 931.0 gb
8/8   raid0 optl  rw     yes     rwbd  -   off 931.0 gb
9/9   raid0 optl  rw     yes     rwbd  -   off 931.0 gb
10/10 raid0 optl  rw     yes     rwbd  -   off 931.0 gb
-------------------------------------------------------------

cac=cachecade|rec=recovery|ofln=offline|pdgd=partially degraded|dgrd=degraded
optl=optimal|ro=read only|rw=read write|hd=hidden|trans=transportready|b=blocked|
consist=consistent|r=read ahead always|nr=no read ahead|wb=writeback|
fwb=force writeback|wt=writethrough|c=cached io|d=direct io|scc=scheduled
check consistency
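Note:
vd: generally taken to be the drive's device order in the system. If there are only RAID virtual drives, vd = 0 is /dev/sda, vd = 1 is /dev/sdb, and so on. If there are JBOD disks, they are enumerated first: for example, if the JBOD disks run up to /dev/sdc, vd0 is /dev/sdd, and so on.
dg: the order in which drive groups were configured on the RAID card.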
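The ordering rule in this note amounts to a letter offset, which the following Python sketch makes explicit. It parses the dg/vd column and lists the device name each vd would be expected to have. The function names are made up, the JBOD-first rule is the heuristic from this article, and the sketch stops at /dev/sdz.

```python
import string


def parse_vd_rows(text):
    """Return {vd: dg} from the dg/vd column of a 'vall show' listing."""
    mapping = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields or "/" not in fields[0]:
            continue
        dg, _, vd = fields[0].partition("/")
        if dg.isdigit() and vd.isdigit():  # skips the 'dg/vd' header row
            mapping[int(vd)] = int(dg)
    return mapping


def expected_devices(vd_to_dg, jbod_count):
    """Map each vd to the device name predicted by the JBOD-first heuristic."""
    names = {}
    for vd in sorted(vd_to_dg):
        index = jbod_count + vd
        if index >= len(string.ascii_lowercase):
            raise ValueError("this sketch only handles /dev/sda to /dev/sdz")
        names[vd] = "/dev/sd" + string.ascii_lowercase[index]
    return names


SAMPLE = """\
dg/vd type  state access consist cache cac scc size     name
0/0   raid1 optl  rw     yes     rwbd  -   off 931.0 gb
1/1   raid0 optl  rw     yes     rwbd  -   off 931.0 gb
2/2   raid0 optl  rw     yes     rwbd  -   off 931.0 gb
"""

vd_to_dg = parse_vd_rows(SAMPLE)
devices = expected_devices(vd_to_dg, jbod_count=4)
print(devices)
```

Because the result is only a prediction, it is worth confirming it against the locate LED or the drive serial number before pulling a disk.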
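storcli64 /c0 show time
Shows the RAID controller's time
storcli64 /c0 show alilog logfile=node-x.alilog
Collects the alilog, which bundles all of the logs
storcli64 /c0 show all logfile=node-x.all.log
Full RAID card information
storcli64 /c0 show badblocks
Bad block information for the disks
perccli64 /c0 show events filter=fatal
Shows events with severity fatal. This surfaces every fatal event and helps spot disk or RAID card failures
perccli64 /c0 show cc
Consistency check. RAID levels 1 and above, which keep data across multiple disks, need consistency checks, whereas a single-disk RAID 0 probably does not. Whether the check affects performance is unclear.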
--scan
scan for devices
--scan-open
scan for devices and try to open each device
-x, --xall
show all information for device
-a, --all
show all smart information for device
-i, --info
show identity information for device
-d type, --device=type
specify device type to one of: ata, scsi, nvme[,nsid], sat[,auto][,n][+type], usbcypress[,x], usbjmicron[,p][,x][,n], usbprolific, usbsunplus, marvell, areca,n/e, 3ware,n, hpt,l/m/n, megaraid,n, aacraid,h,l,id, cciss,n, auto, test
-s value, --smart=value
enable/disable smart on device (on/off)
-o value, --offlineauto=value(ata)
enable/disable automatic offline testing on device (on/off)
-S value, --saveauto=value(ata)
enable/disable attribute autosave on device (on/off)
-H, --health
show device smart health status
-c, --capabilities(ata,nvme)
show device smart capabilities
-A, --attributes
show device smart vendor-specific attributes and values
-l type, --log=type
show device log. type: error, selftest, selective, directory[,g|s],
xerror[,n][,error], xselftest[,n][,selftest],
background, sasphy[,reset], sataphy[,reset],
scttemp[sts,hist], scttempint,n[,p],
scterc[,n,m], devstat[,n], ssd,
gplog,n[,range], smartlog,n[,range],
nvmelog,n,size
-t test, --test=test
run test. test: offline, short, long, conveyance, force, vendor,n,
select,m-n, pending,n, afterselect,[on|off]
-X, --abort
abort any non-captive test on device
[root@node-15 ~]# smartctl --scan
/dev/sda -d scsi # /dev/sda, scsi device
/dev/sdb -d scsi # /dev/sdb, scsi device
/dev/sdc -d scsi # /dev/sdc, scsi device
/dev/sdd -d scsi # /dev/sdd, scsi device
/dev/sde -d scsi # /dev/sde, scsi device
/dev/sdf -d scsi # /dev/sdf, scsi device
/dev/sdg -d scsi # /dev/sdg, scsi device
/dev/sdh -d scsi # /dev/sdh, scsi device
/dev/sdi -d scsi # /dev/sdi, scsi device
/dev/sdj -d scsi # /dev/sdj, scsi device
/dev/sdk -d scsi # /dev/sdk, scsi device
/dev/sdl -d scsi # /dev/sdl, scsi device
/dev/sdm -d scsi # /dev/sdm, scsi device
/dev/sdn -d scsi # /dev/sdn, scsi device
/dev/sdo -d scsi # /dev/sdo, scsi device
/dev/bus/0 -d megaraid,0 # /dev/bus/0 [megaraid_disk_00], scsi device
/dev/bus/0 -d megaraid,1 # /dev/bus/0 [megaraid_disk_01], scsi device
/dev/bus/0 -d megaraid,2 # /dev/bus/0 [megaraid_disk_02], scsi device
/dev/bus/0 -d megaraid,3 # /dev/bus/0 [megaraid_disk_03], scsi device
/dev/bus/0 -d megaraid,4 # /dev/bus/0 [megaraid_disk_04], scsi device
/dev/bus/0 -d megaraid,5 # /dev/bus/0 [megaraid_disk_05], scsi device
/dev/bus/0 -d megaraid,6 # /dev/bus/0 [megaraid_disk_06], scsi device
/dev/bus/0 -d megaraid,7 # /dev/bus/0 [megaraid_disk_07], scsi device
/dev/bus/0 -d megaraid,8 # /dev/bus/0 [megaraid_disk_08], scsi device
/dev/bus/0 -d megaraid,9 # /dev/bus/0 [megaraid_disk_09], scsi device
/dev/bus/0 -d megaraid,10 # /dev/bus/0 [megaraid_disk_10], scsi device
/dev/bus/0 -d megaraid,11 # /dev/bus/0 [megaraid_disk_11], scsi device
/dev/bus/0 -d megaraid,12 # /dev/bus/0 [megaraid_disk_12], scsi device
/dev/bus/0 -d megaraid,13 # /dev/bus/0 [megaraid_disk_13], scsi device
/dev/bus/0 -d megaraid,14 # /dev/bus/0 [megaraid_disk_14], scsi device
/dev/bus/0 -d megaraid,15 # /dev/bus/0 [megaraid_disk_15], scsi device
Note:
The earlier steps established that /dev/sdf has did (device ID) 6 in perccli, which corresponds to /dev/bus/0 -d megaraid,6.
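Once the did is known, the smartctl invocation follows a fixed pattern. A minimal Python sketch for assembling it, assuming the did has already been looked up; smartctl_cmd is a name invented here, and the command is only built and printed, not executed.

```python
import shlex


def smartctl_cmd(did, device, option="-i"):
    """Build the argument list for querying one disk behind a MegaRAID/PERC card."""
    if did < 0:
        raise ValueError("did must be non-negative")
    return ["smartctl", option, "-d", f"megaraid,{did}", device]


# -i: identity, -A: vendor attributes, -H: health status
for option in ("-i", "-A", "-H"):
    print(shlex.join(smartctl_cmd(6, "/dev/sdf", option)))
```

Keeping the command as a list makes it straightforward to hand to subprocess.run without shell quoting issues.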
[root@node-15 ~]# smartctl -i -d megaraid,6 /dev/sdf
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-327.20.1.es2.el7.x86_64] (local build)
copyright (c) 2002-16, bruce allen, christian franke, www.smartmontools.org

=== start of information section ===
model family:     seagate constellation.2 (sata)
device model:     st91000640ns
serial number:    9xga228l
lu wwn device id: 5 000c50 0918f2f8a
add. product id:  dell(tm)
firmware version: aa63
user capacity:    1,000,204,886,016 bytes [1.00 tb]
sector size:      512 bytes logical/physical
rotation rate:    7200 rpm
form factor:      2.5 inches
device is:        in smartctl database [for details use: -P show]
ata version is:   ata8-acs t13/1699-d revision 4
sata version is:  sata 3.0, 6.0 gb/s (current: 6.0 gb/s)
local time is:    fri jan 11 11:28:46 2019 cst
smart support is: available - device has smart capability.
smart support is: enabled
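This is generally where you check the indicators of the disk's overall health.
The fields in the output below are explained as follows:
- id: the attribute ID, usually a decimal or hexadecimal number between 1 and 255.
- attribute_name: the attribute name defined by the disk manufacturer.
- flag: the attribute handling flags (can be ignored).
- value: one of the most important columns in the table. It is the normalized value of the attribute, between 1 and 253, where 253 is the best case and 1 is the worst. Depending on the attribute and the manufacturer, the initial value may be set to 100 or 200.
- worst: the lowest value recorded so far.
- thresh: the failure threshold for the attribute. The attribute counts as failed when value is less than or equal to thresh; if only worst has dropped to or below thresh, the attribute failed at some point in the past.
- type: the attribute type (Pre-fail or Old_age). A Pre-fail attribute is a critical attribute that takes part in the disk's overall SMART health assessment (passed/failed); if any Pre-fail attribute fails, the disk should be considered about to fail. An Old_age attribute is non-critical (normal wear, for example) and does not by itself mean that the disk is failing.
- updated: how often the attribute is updated. Offline means it is updated only when offline tests are run on the disk.
- when_failed: set to "failing_now" if value is less than or equal to thresh, to "in_the_past" if worst is less than or equal to thresh, and to "-" otherwise. With "failing_now", back up important files as soon as possible, especially when the attribute is of type Pre-fail. "in_the_past" means the attribute has failed before but was fine when the test ran. "-" means the attribute has never failed.
- raw_value: the manufacturer-defined raw value, from which value is derived.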
[root@node-15 ~]# smartctl -A -d megaraid,6 /dev/sdf
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-327.20.1.es2.el7.x86_64] (local build)
copyright (c) 2002-16, bruce allen, christian franke, www.smartmontools.org

=== start of read smart data section ===
smart attributes data structure revision number: 10
vendor specific smart attributes with thresholds:
id# attribute_name          flag   value worst thresh type     updated when_failed raw_value
  1 raw_read_error_rate     0x010f 081   038   044    pre-fail always  in_the_past 151546765
  3 spin_up_time            0x0103 094   094   000    pre-fail always  -           0
  4 start_stop_count        0x0032 100   100   020    old_age  always  -           21
  5 reallocated_sector_ct   0x0133 100   100   036    pre-fail always  -           0
  7 seek_error_rate         0x000f 085   060   030    pre-fail always  -           338813105
  9 power_on_hours          0x0032 079   079   000    old_age  always  -           18784
 10 spin_retry_count        0x0013 100   100   097    pre-fail always  -           0
 12 power_cycle_count       0x0032 100   100   020    old_age  always  -           21
184 end-to-end_error        0x0032 100   100   099    old_age  always  -           0
187 reported_uncorrect      0x0032 001   001   000    old_age  always  -           1710
188 command_timeout         0x0032 100   100   000    old_age  always  -           0
189 high_fly_writes         0x003a 100   100   000    old_age  always  -           0
190 airflow_temperature_cel 0x0022 069   053   045    old_age  always  -           31 (min/max 24/40)
191 g-sense_error_rate      0x0032 100   100   000    old_age  always  -           0
192 power-off_retract_count 0x0032 100   100   000    old_age  always  -           19
193 load_cycle_count        0x0032 100   100   000    old_age  always  -           852
194 temperature_celsius     0x0022 031   047   000    old_age  always  -           31 (0 14 0 0 0)
195 hardware_ecc_recovered  0x001a 117   099   000    old_age  always  -           151546765
197 current_pending_sector  0x0012 084   084   000    old_age  always  -           688
198 offline_uncorrectable   0x0010 084   084   000    old_age  offline -           688
199 udma_crc_error_count    0x003e 200   200   000    old_age  always  -           0
240 head_flying_hours       0x0000 100   253   000    old_age  offline -           8093 (164 214 0)
241 total_lbas_written      0x0000 100   253   000    old_age  offline -           1870535293
242 total_lbas_read         0x0000 100   253   000    old_age  offline -           1530387871
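Note:
In the result below, the overall health test is passed, meaning the disk is still usable. One marginal attribute is listed, however: worst < thresh, type is Pre-fail, and when_failed is in_the_past, which suggests that the disk is likely to fail soon.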
[root@node-15 ~]# smartctl -H -d megaraid,6 /dev/sdf
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-327.20.1.es2.el7.x86_64] (local build)
copyright (c) 2002-16, bruce allen, christian franke, www.smartmontools.org

=== start of read smart data section ===
smart status not supported: ata return descriptor not supported by controller firmware
smart overall-health self-assessment test result: passed
warning: this result is based on an attribute check.
please note the following marginal attributes:
id# attribute_name          flag   value worst thresh type     updated when_failed raw_value
  1 raw_read_error_rate     0x010f 081   038   044    pre-fail always  in_the_past 151546765
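The marginal-attribute check shown in this output can be reproduced from the attribute table with a few lines of Python. The sketch below applies the value, worst and thresh comparison described in the field list earlier. The function names are made up, the row layout is assumed to be the ten-column one printed by this smartctl version, and this is an illustration rather than smartctl's own implementation.

```python
def when_failed(value, worst, thresh):
    """Apply the value/worst/thresh comparison described in this article."""
    if value <= thresh:
        return "failing_now"
    if worst <= thresh:
        return "in_the_past"
    return "-"


def parse_attribute(line):
    """Split one attribute row; the raw value may contain spaces, so cap the split."""
    f = line.split(None, 9)
    return {
        "id": int(f[0]),
        "name": f[1],
        "value": int(f[3]),
        "worst": int(f[4]),
        "thresh": int(f[5]),
        "type": f[6],
        "raw": f[9],
    }


ROWS = [
    "1 raw_read_error_rate 0x010f 081 038 044 pre-fail always in_the_past 151546765",
    "5 reallocated_sector_ct 0x0133 100 100 036 pre-fail always - 0",
    "190 airflow_temperature_cel 0x0022 069 053 045 old_age always - 31 (min/max 24/40)",
]

attributes = [parse_attribute(row) for row in ROWS]
marginal = [
    a["name"]
    for a in attributes
    if when_failed(a["value"], a["worst"], a["thresh"]) != "-"
]
print(marginal)
```

Only raw_read_error_rate is flagged for these rows, matching the single marginal attribute that smartctl reported.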
[root@node-15 ~]# smartctl -l error -d megaraid,6 /dev/sdf
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-327.20.1.es2.el7.x86_64] (local build)
copyright (c) 2002-16, bruce allen, christian franke, www.smartmontools.org

=== start of read smart data section ===
smart error log version: 1
ata error count: 46431 (device log contains only the most recent five errors)
        cr = command register [hex]
        fr = features register [hex]
        sc = sector count register [hex]
        sn = sector number register [hex]
        cl = cylinder low register [hex]
        ch = cylinder high register [hex]
        dh = device/head register [hex]
        dc = device command register [hex]
        er = error register [hex]
        st = status register [hex]
powered_up_time is measured from power on, and printed as
ddd+hh:mm:ss.sss where dd=days, hh=hours, mm=minutes,
ss=sec, and sss=millisec. it "wraps" after 49.710 days.

error 46431 occurred at disk power-on lifetime: 18640 hours (776 days + 16 hours)
  when the command that caused the error occurred, the device was active or idle.

  after command completion occurred, registers were:
  er st sc sn cl ch dh
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  error: unc at lba = 0x0fffffff = 268435455

  commands leading to the command that caused the error were:
  cr fr sc sn cl ch dh dc  powered_up_time   command/feature_name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  42 00 00 ff ff ff 4f 00  46d+15:15:32.968  read verify sector(s) ext
  42 00 00 ff ff ff 4f 00  46d+15:15:29.901  read verify sector(s) ext
  42 00 00 ff ff ff 4f 00  46d+15:15:26.825  read verify sector(s) ext
  42 00 00 ff ff ff 4f 00  46d+15:15:23.965  read verify sector(s) ext
  42 00 00 ff ff ff 4f 00  46d+15:15:20.905  read verify sector(s) ext

error 46430 occurred at disk power-on lifetime: 18640 hours (776 days + 16 hours)
  when the command that caused the error occurred, the device was active or idle.

  after command completion occurred, registers were:
  er st sc sn cl ch dh
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  error: unc at lba = 0x0fffffff = 268435455

  commands leading to the command that caused the error were:
  cr fr sc sn cl ch dh dc  powered_up_time   command/feature_name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  42 00 00 ff ff ff 4f 00  46d+15:15:29.901  read verify sector(s) ext
  42 00 00 ff ff ff 4f 00  46d+15:15:26.825  read verify sector(s) ext
  42 00 00 ff ff ff 4f 00  46d+15:15:23.965  read verify sector(s) ext
  42 00 00 ff ff ff 4f 00  46d+15:15:20.905  read verify sector(s) ext
  42 00 00 ff ff ff 4f 00  46d+15:15:18.093  read verify sector(s) ext

error 46429 occurred at disk power-on lifetime: 18640 hours (776 days + 16 hours)
  when the command that caused the error occurred, the device was active or idle.

  after command completion occurred, registers were:
  er st sc sn cl ch dh
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  error: unc at lba = 0x0fffffff = 268435455

  commands leading to the command that caused the error were:
  cr fr sc sn cl ch dh dc  powered_up_time   command/feature_name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  42 00 00 ff ff ff 4f 00  46d+15:15:26.825  read verify sector(s) ext
  42 00 00 ff ff ff 4f 00  46d+15:15:23.965  read verify sector(s) ext
  42 00 00 ff ff ff 4f 00  46d+15:15:20.905  read verify sector(s) ext
  42 00 00 ff ff ff 4f 00  46d+15:15:18.093  read verify sector(s) ext
  b0 da 00 00 4f c2 00 00  46d+15:15:17.838  smart return status

error 46428 occurred at disk power-on lifetime: 18640 hours (776 days + 16 hours)
  when the command that caused the error occurred, the device was active or idle.

  after command completion occurred, registers were:
  er st sc sn cl ch dh
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  error: unc at lba = 0x0fffffff = 268435455

  commands leading to the command that caused the error were:
  cr fr sc sn cl ch dh dc  powered_up_time   command/feature_name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  42 00 00 ff ff ff 4f 00  46d+15:15:23.965  read verify sector(s) ext
  42 00 00 ff ff ff 4f 00  46d+15:15:20.905  read verify sector(s) ext
  42 00 00 ff ff ff 4f 00  46d+15:15:18.093  read verify sector(s) ext
  b0 da 00 00 4f c2 00 00  46d+15:15:17.838  smart return status
  2f 00 01 e0 00 00 40 00  46d+15:15:17.703  read log ext

error 46427 occurred at disk power-on lifetime: 18640 hours (776 days + 16 hours)
  when the command that caused the error occurred, the device was active or idle.

  after command completion occurred, registers were:
  er st sc sn cl ch dh
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  error: unc at lba = 0x0fffffff = 268435455

  commands leading to the command that caused the error were:
  cr fr sc sn cl ch dh dc  powered_up_time   command/feature_name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  42 00 00 ff ff ff 4f 00  46d+15:15:20.905  read verify sector(s) ext
  42 00 00 ff ff ff 4f 00  46d+15:15:18.093  read verify sector(s) ext
  b0 da 00 00 4f c2 00 00  46d+15:15:17.838  smart return status
  2f 00 01 e0 00 00 40 00  46d+15:15:17.703  read log ext
  42 00 00 ff ff ff 4f 00  46d+15:15:15.276  read verify sector(s) ext
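- If SMART is not enabled on the disk, enable it with -s on followed by the device.
- Generally, if smartctl -i returns little information even though SMART support is available and enabled, you may need to run a self-test first with -t short or -t long. These tests do not destroy data on the disk, but for storage systems the offline test is generally not appropriate.
- When collecting data, use -x or -a to get more complete disk information.
- SMART monitoring can also run as a service (smartd), configured through /etc/smartmontools/smartd.conf. I have not looked into this yet and will update the article once I have.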