Locating disk fault information with storcli64 and smartctl
Source: cnblogs  Author: lin.wang  Date: 2019/5/10 8:51:34
How to map a disk's physical slot to its system device name

From Lin.Wang

Section One : Introduction

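storcli is the successor to megacli. On Dell servers the equivalent tool is perccli, and its usage is identical.

smartctl reads the SMART information reported by a disk's controller chip.

lsscsi shows the system's SCSI information, taken from /proc/scsi/scsi and related sources. It is not covered in this document.

These are all common tools for inspecting disks, and they help when troubleshooting disk state and RAID card problems.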

Section Two : Install package

Install storcli or perccli, then symlink the binary into /usr/bin/ so that it can be run from anywhere:

ln -s /opt/MegaRAID/storcli/storcli64 /usr/bin/

ln -s /opt/MegaRAID/perccli/perccli64 /usr/bin/
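
Since the rest of this article switches between storcli64 and perccli64, a small helper can pick whichever one is installed. This is only a sketch: the function name find_raid_cli is made up for illustration, and the two default binary names are simply the ones used in this article.

```shell
# Print the first RAID CLI binary found in PATH.
# With no arguments, fall back to the two binary names used in this article.
find_raid_cli() {
    [ "$#" -gt 0 ] || set -- storcli64 perccli64
    for candidate in "$@"; do
        if command -v "$candidate" >/dev/null 2>&1; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    echo "no RAID CLI found in PATH" >&2
    return 1
}
```

A caller could then write RAID_CLI=$(find_raid_cli) and run "$RAID_CLI" /c0 show.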

Section Three : Step

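To go from a system device name such as /dev/sdf to the physical slot, work through these steps:

  1. Run perccli64 /c0/eall/sall show to list the physical disks.

    [Figure: output of perccli64 /c0/eall/sall show]

    The listing shows four JBOD disks. In my experience, the operating system usually enumerates JBOD disks before RAID virtual drives, so the JBOD disks occupy /dev/sda through /dev/sdd and the RAID virtual drives start at /dev/sde.

    DG stands for drive group and reflects the order in which the RAID groups were configured. The listing shows that 32:4 and 32:5 belong to the same drive group.

  2. Run perccli64 /c0/vall show to see how each DG maps to a VD.

    [Figure: output of perccli64 /c0/vall show]

    • The DG/VD column pairs each RAID drive group with the virtual drive the system sees. On a server that has only RAID virtual drives, VD0 is normally /dev/sda, and so on. If the server also has JBOD disks, the virtual drives are ordered after them. In this example VD0 = /dev/sde, so /dev/sdf is VD 1, which belongs to DG 1.

    • Back in the /c0/eall/sall listing, the disk with DG = 1 has DID = 6. DID is the device ID, a value that is needed again later. Its slot number (Slt) is also 6, which is the seventh bay on the server (slots count from 0). That is the physical location of /dev/sdf.

Conversely, starting from a drive whose fault LED is lit on the server, the same tables lead back to the system device name.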
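
The arithmetic behind that mapping can be sketched as a small function. It encodes only the rule of thumb described above (JBOD disks first, then virtual drives in index order), which is an assumption about enumeration order rather than a guarantee. The name vd_to_dev is hypothetical.

```shell
# Map a virtual drive index to the expected /dev/sdX name, assuming JBOD disks
# are enumerated first and virtual drives follow in index order.
# Usage: vd_to_dev JBOD_COUNT VD_INDEX
vd_to_dev() {
    jbod_count=$1
    vd_index=$2
    pos=$((jbod_count + vd_index))    # 0-based position among the sd devices
    if [ "$pos" -lt 0 ] || [ "$pos" -gt 25 ]; then
        echo "position $pos is outside sda..sdz" >&2
        return 1
    fi
    letter=$(printf 'abcdefghijklmnopqrstuvwxyz\n' | cut -c "$((pos + 1))")
    printf '/dev/sd%s\n' "$letter"
}
```

With the four JBOD disks in this example, vd_to_dev 4 1 gives the device name expected for VD 1. The result is worth confirming with the locate LED before pulling a drive.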

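Note:

• If the server has no JBOD disks and everything is RAID, the /c0/vall output alone gives the mapping.

• In practice, you can confirm the result by switching the drive's locate LED on and off with perccli64 /c0/e32/s6 start locate and perccli64 /c0/e32/s6 stop locate.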

Section Four : storcli/perccli Usage

View controller information

perccli64 show ctrlcount   shows how many controllers (RAID cards) are present

perccli64 show   displays RAID card information

    [root@node-15 ~]# perccli64 show
    Status Code = 0
    Status = Success
    Description = None
    Number of Controllers = 1
    Host Name = node-15.domain.tld
    Operating System = Linux3.10.0-327.20.1.es2.el7.x86_64
    System Overview :
    ===============
    ------------------------------------------------------------------------
    Ctl Model Ports PDs DGs DNOpt VDs VNOpt BBU sPR DS EHS ASOs Hlth
    ------------------------------------------------------------------------
    0 PERCH730Mini 8 16 11 0 11 0 Opt On 3 N 0 Opt
    ------------------------------------------------------------------------
    Ctl=Controller Index|DGs=Drive groups|VDs=Virtual drives|Fld=Failed
    PDs=Physical drives|DNOpt=DG NotOptimal|VNOpt=VD NotOptimal|Opt=Optimal
    Msng=Missing|Dgd=Degraded|NdAtn=Need Attention|Unkwn=Unknown
    sPR=Scheduled Patrol Read|DS=DimmerSwitch|EHS=Emergency Hot Spare
    Y=Yes|N=No|ASOs=Advanced Software Options|BBU=Battery backup unit
    Hlth=Health|Safe=Safe-mode boot

There is only one RAID card here; controller 0 is addressed as /c0.
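
For scripting, the controller count can be pulled out of that output. The sketch below assumes the "Number of Controllers = N" line appears exactly as in the listing above; the function name controller_count is made up for illustration.

```shell
# Read "perccli64 show" (or "storcli64 show") output on stdin and print the
# number after "Number of Controllers =".
controller_count() {
    awk -F'=' '/^[[:space:]]*Number of Controllers/ {
        gsub(/[[:space:]]/, "", $2)   # strip the padding around the number
        print $2
        exit
    }'
}
```

Typical use would be perccli64 show | controller_count, followed by a loop over /c0, /c1 and so on.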

storcli64 /c0 show

    [root@node-15 ~]# perccli64 /c0 show
    Generating detailed summary of the adapter, it may take a while to complete.
    Controller = 0
    Status = Success
    Description = None
    Product Name = PERC H730 Mini
    Serial Number = 663021Z
    SAS Address = 51866da066153000
    PCI Address = 00:03:00:00
    System Time = 01/10/2019 20:48:38
    Mfg. Date = 06/17/16
    Controller Time = 01/10/2019 12:44:21
    FW Package Build = 25.4.0.0017
    BIOS Version = 6.29.00.0_4.16.07.00_0x06120100
    FW Version = 4.260.00-6259
    Driver Name = megaraid_sas
    Driver Version = 06.807.10.00-rh1
    Current Personality = RAID-Mode
    Vendor Id = 0x1000
    Device Id = 0x5D
    SubVendor Id = 0x1028
    SubDevice Id = 0x1F49
    Host Interface = PCI-E
    Device Interface = SAS-12G
    Bus Number = 3
    Device Number = 0
    Function Number = 0
    Drive Groups = 11
    TOPOLOGY :
    ========
    ---------------------------------------------------------------------------
    DG Arr Row EID:Slot DID Type State BT Size PDC PI SED DS3 FSpace TR
    ---------------------------------------------------------------------------
    0 - - - - RAID1 Optl N 931.0 GB dflt N N dflt N N
    0 0 - - - RAID1 Optl N 931.0 GB dflt N N dflt N N
    0 0 0 32:4 4 DRIVE Onln N 931.0 GB dflt N N dflt - N
    0 0 1 32:5 5 DRIVE Onln N 931.0 GB dflt N N dflt - N
    1 - - - - RAID0 Optl N 931.0 GB dflt N N dflt N N
    1 0 - - - RAID0 Optl N 931.0 GB dflt N N dflt N N
    1 0 0 32:6 6 DRIVE Onln N 931.0 GB dflt N N dflt - N
    2 - - - - RAID0 Optl N 931.0 GB dflt N N dflt N N
    2 0 - - - RAID0 Optl N 931.0 GB dflt N N dflt N N
    2 0 0 32:7 7 DRIVE Onln N 931.0 GB dflt N N dflt - N
    3 - - - - RAID0 Optl N 931.0 GB dflt N N dflt N N
    3 0 - - - RAID0 Optl N 931.0 GB dflt N N dflt N N
    3 0 0 32:8 8 DRIVE Onln N 931.0 GB dflt N N dflt - N
    4 - - - - RAID0 Optl N 931.0 GB dflt N N dflt N N
    4 0 - - - RAID0 Optl N 931.0 GB dflt N N dflt N N
    4 0 0 32:9 9 DRIVE Onln N 931.0 GB dflt N N dflt - N
    5 - - - - RAID0 Optl N 931.0 GB dflt N N dflt N N
    5 0 - - - RAID0 Optl N 931.0 GB dflt N N dflt N N
    5 0 0 32:10 10 DRIVE Onln N 931.0 GB dflt N N dflt - N
    6 - - - - RAID0 Optl N 931.0 GB dflt N N dflt N N
    6 0 - - - RAID0 Optl N 931.0 GB dflt N N dflt N N
    6 0 0 32:11 11 DRIVE Onln N 931.0 GB dflt N N dflt - N
    7 - - - - RAID0 Optl N 931.0 GB dflt N N dflt N N
    7 0 - - - RAID0 Optl N 931.0 GB dflt N N dflt N N
    7 0 0 32:12 12 DRIVE Onln N 931.0 GB dflt N N dflt - N
    8 - - - - RAID0 Optl N 931.0 GB dflt N N dflt N N
    8 0 - - - RAID0 Optl N 931.0 GB dflt N N dflt N N
    8 0 0 32:13 13 DRIVE Onln N 931.0 GB dflt N N dflt - N
    9 - - - - RAID0 Optl N 931.0 GB dflt N N dflt N N
    9 0 - - - RAID0 Optl N 931.0 GB dflt N N dflt N N
    9 0 0 32:14 14 DRIVE Onln N 931.0 GB dflt N N dflt - N
    10 - - - - RAID0 Optl N 931.0 GB dflt N N dflt N N
    10 0 - - - RAID0 Optl N 931.0 GB dflt N N dflt N N
    10 0 0 32:15 15 DRIVE Onln N 931.0 GB dflt N N dflt - N
    ---------------------------------------------------------------------------
    DG=Disk Group Index|Arr=Array Index|Row=Row Index|EID=Enclosure Device ID
    DID=Device ID|Type=Drive Type|Onln=Online|Rbld=Rebuild|Dgrd=Degraded
    Pdgd=Partially degraded|Offln=Offline|BT=Background Task Active
    PDC=PD Cache|PI=Protection Info|SED=Self Encrypting Drive|Frgn=Foreign
    DS3=Dimmer Switch 3|dflt=Default|Msng=Missing|FSpace=Free Space Present
    TR=Transport Ready
    Virtual Drives = 11
    VD LIST :
    =======
    -------------------------------------------------------------
    DG/VD TYPE State Access Consist Cache Cac sCC Size Name
    -------------------------------------------------------------
    0/0 RAID1 Optl RW Yes RWBD - OFF 931.0 GB
    1/1 RAID0 Optl RW Yes RWBD - OFF 931.0 GB
    2/2 RAID0 Optl RW Yes RWBD - OFF 931.0 GB
    3/3 RAID0 Optl RW Yes RWBD - OFF 931.0 GB
    4/4 RAID0 Optl RW Yes RWBD - OFF 931.0 GB
    5/5 RAID0 Optl RW Yes RWBD - OFF 931.0 GB
    6/6 RAID0 Optl RW Yes RWBD - OFF 931.0 GB
    7/7 RAID0 Optl RW Yes RWBD - OFF 931.0 GB
    8/8 RAID0 Optl RW Yes RWBD - OFF 931.0 GB
    9/9 RAID0 Optl RW Yes RWBD - OFF 931.0 GB
    10/10 RAID0 Optl RW Yes RWBD - OFF 931.0 GB
    -------------------------------------------------------------
    Cac=CacheCade|Rec=Recovery|OfLn=OffLine|Pdgd=Partially Degraded|Dgrd=Degraded
    Optl=Optimal|RO=Read Only|RW=Read Write|HD=Hidden|TRANS=TransportReady|B=Blocked|
    Consist=Consistent|R=Read Ahead Always|NR=No Read Ahead|WB=WriteBack|
    FWB=Force WriteBack|WT=WriteThrough|C=Cached IO|D=Direct IO|sCC=Scheduled
    Check Consistency
    Physical Drives = 16
    PD LIST :
    =======
    ----------------------------------------------------------------------------
    EID:Slt DID State DG Size Intf Med SED PI SeSz Model Sp
    ----------------------------------------------------------------------------
    32:0 0 JBOD - 185.75 GB SATA SSD N N 512B INTEL SSDSC2BX200G4R U
    32:1 1 JBOD - 185.75 GB SATA SSD N N 512B INTEL SSDSC2BX200G4R U
    32:2 2 JBOD - 185.75 GB SATA SSD N N 512B INTEL SSDSC2BX200G4R U
    32:3 3 JBOD - 185.75 GB SATA SSD N N 512B INTEL SSDSC2BX200G4R U
    32:4 4 Onln 0 931.0 GB SATA HDD N N 512B ST91000640NS U
    32:5 5 Onln 0 931.0 GB SATA HDD N N 512B ST91000640NS U
    32:6 6 Onln 1 931.0 GB SATA HDD N N 512B ST91000640NS U
    32:7 7 Onln 2 931.0 GB SATA HDD N N 512B ST91000640NS U
    32:8 8 Onln 3 931.0 GB SATA HDD N N 512B ST91000640NS U
    32:9 9 Onln 4 931.0 GB SATA HDD N N 512B ST91000640NS U
    32:10 10 Onln 5 931.0 GB SATA HDD N N 512B ST91000640NS U
    32:11 11 Onln 6 931.0 GB SATA HDD N N 512B ST91000640NS U
    32:12 12 Onln 7 931.0 GB SATA HDD N N 512B ST91000640NS U
    32:13 13 Onln 8 931.0 GB SATA HDD N N 512B ST91000640NS U
    32:14 14 Onln 9 931.0 GB SATA HDD N N 512B ST91000640NS U
    32:15 15 Onln 10 931.0 GB SATA HDD N N 512B ST91000640NS U
    ----------------------------------------------------------------------------
    EID-Enclosure Device ID|Slt-Slot No.|DID-Device ID|DG-DriveGroup
    DHS-Dedicated Hot Spare|UGood-Unconfigured Good|GHS-Global Hotspare
    UBad-Unconfigured Bad|Onln-Online|Offln-Offline|Intf-Interface
    Med-Media Type|SED-Self Encryptive Drive|PI-Protection Info
    SeSz-Sector Size|Sp-Spun|U-Up|D-Down/PowerSave|T-Transition|F-Foreign
    UGUnsp-Unsupported|UGShld-UnConfigured shielded|HSPShld-Hotspare shielded
    CFShld-Configured shielded|Cpybck-CopyBack|CBShld-Copyback Shielded
    BBU_Info :
    ========
    ----------------------------------------------
    Model State RetentionTime Temp Mode MfgDate
    ----------------------------------------------
    BBU Optimal 0 hour(s) 38C - 0/00/00
    ----------------------------------------------

View each disk's Device ID, Slot No. and Drive Group
    [root@node-15 ~]# perccli64 /c0/eall/sall show
    Controller = 0
    Status = Success
    Description = Show Drive Information Succeeded.
    Drive Information :
    =================
    ----------------------------------------------------------------------------
    EID:Slt DID State DG Size Intf Med SED PI SeSz Model Sp
    ----------------------------------------------------------------------------
    32:0 0 JBOD - 185.75 GB SATA SSD N N 512B INTEL SSDSC2BX200G4R U
    32:1 1 JBOD - 185.75 GB SATA SSD N N 512B INTEL SSDSC2BX200G4R U
    32:2 2 JBOD - 185.75 GB SATA SSD N N 512B INTEL SSDSC2BX200G4R U
    32:3 3 JBOD - 185.75 GB SATA SSD N N 512B INTEL SSDSC2BX200G4R U
    32:4 4 Onln 0 931.0 GB SATA HDD N N 512B ST91000640NS U
    32:5 5 Onln 0 931.0 GB SATA HDD N N 512B ST91000640NS U
    32:6 6 Onln 1 931.0 GB SATA HDD N N 512B ST91000640NS U
    32:7 7 Onln 2 931.0 GB SATA HDD N N 512B ST91000640NS U
    32:8 8 Onln 3 931.0 GB SATA HDD N N 512B ST91000640NS U
    32:9 9 Onln 4 931.0 GB SATA HDD N N 512B ST91000640NS U
    32:10 10 Onln 5 931.0 GB SATA HDD N N 512B ST91000640NS U
    32:11 11 Onln 6 931.0 GB SATA HDD N N 512B ST91000640NS U
    32:12 12 Onln 7 931.0 GB SATA HDD N N 512B ST91000640NS U
    32:13 13 Onln 8 931.0 GB SATA HDD N N 512B ST91000640NS U
    32:14 14 Onln 9 931.0 GB SATA HDD N N 512B ST91000640NS U
    32:15 15 Onln 10 931.0 GB SATA HDD N N 512B ST91000640NS U
    ----------------------------------------------------------------------------
    EID-Enclosure Device ID|Slt-Slot No.|DID-Device ID|DG-DriveGroup
    DHS-Dedicated Hot Spare|UGood-Unconfigured Good|GHS-Global Hotspare
    UBad-Unconfigured Bad|Onln-Online|Offln-Offline|Intf-Interface
    Med-Media Type|SED-Self Encryptive Drive|PI-Protection Info
    SeSz-Sector Size|Sp-Spun|U-Up|D-Down/PowerSave|T-Transition|F-Foreign
    UGUnsp-Unsupported|UGShld-UnConfigured shielded|HSPShld-Hotspare shielded
    CFShld-Configured shielded|Cpybck-CopyBack|CBShld-Copyback Shielded

Note:

• In my experience, JBOD disks are enumerated before RAID virtual drives.
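
The two lookups done by eye in this listing (how many JBOD disks there are, and which slot and DID belong to a drive group) can be sketched with awk. The column positions are assumed to match the listing above (EID:Slt, DID, State, DG), and the function names are hypothetical.

```shell
# Read "/c0/eall/sall show" output on stdin and print "EID:Slt DID" for every
# disk that belongs to drive group $1.
pds_in_dg() {
    awk -v dg="$1" '$1 ~ /^[0-9]+:[0-9]+$/ && $4 == dg { print $1, $2 }'
}

# Read the same output on stdin and count the disks whose State is JBOD.
jbod_count() {
    awk '$1 ~ /^[0-9]+:[0-9]+$/ && $3 == "JBOD" { n++ } END { print n + 0 }'
}
```

For example, perccli64 /c0/eall/sall show | pds_in_dg 1 would print the slot and device ID of the disk behind DG 1.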

View the details of a specific disk
    [root@node-15 ~]# perccli64 /c0/e32/s6 show all
    Controller = 0
    Status = Success
    Description = Show Drive Information Succeeded.
    Drive /c0/e32/s6 :
    ================
    -------------------------------------------------------------------
    EID:Slt DID State DG Size Intf Med SED PI SeSz Model Sp
    -------------------------------------------------------------------
    32:6 6 Onln 1 931.0 GB SATA HDD N N 512B ST91000640NS U
    -------------------------------------------------------------------
    EID-Enclosure Device ID|Slt-Slot No.|DID-Device ID|DG-DriveGroup
    DHS-Dedicated Hot Spare|UGood-Unconfigured Good|GHS-Global Hotspare
    UBad-Unconfigured Bad|Onln-Online|Offln-Offline|Intf-Interface
    Med-Media Type|SED-Self Encryptive Drive|PI-Protection Info
    SeSz-Sector Size|Sp-Spun|U-Up|D-Down/PowerSave|T-Transition|F-Foreign
    UGUnsp-Unsupported|UGShld-UnConfigured shielded|HSPShld-Hotspare shielded
    CFShld-Configured shielded|Cpybck-CopyBack|CBShld-Copyback Shielded
    Drive /c0/e32/s6 - Detailed Information :
    =======================================
    Drive /c0/e32/s6 State :
    ======================
    Shield Counter = 0
    Media Error Count = 46431 *** an obvious problem: 46431 media errors have occurred ***
    Other Error Count = 0
    Drive Temperature = 31C (87.80 F)
    Predictive Failure Count = 126 *** 126 predicted failures ***
    S.M.A.R.T alert flagged by drive = Yes
    Drive /c0/e32/s6 Device attributes :
    ==================================
    SN = 9XGA228L
    Manufacturer Id = ATA
    Model Number = ST91000640NS
    NAND Vendor = NA
    WWN = 5000c500918f2f8a
    Firmware Revision = AA63
    Raw size = 931.512 GB [0x74706db0 Sectors]
    Coerced size = 931.0 GB [0x74600000 Sectors]
    Non Coerced size = 931.012 GB [0x74606db0 Sectors]
    Device Speed = 6.0Gb/s
    Link Speed = 12.0Gb/s
    NCQ setting = N/A
    Write Cache = Enabled
    Logical Sector Size = 512B
    Physical Sector Size = 512B
    Connector Name = 00
    Drive /c0/e32/s6 Policies/Settings :
    ==================================
    Drive position = DriveGroup:1, Span:0, Row:0
    Enclosure position = 0
    Connected Port Number = 0(path0)
    Sequence Number = 2
    Commissioned Spare = No
    Emergency Spare = No
    Last Predictive Failure Event Sequence Number = 95183 *** sequence number of the most recent predictive failure event ***
    Successful diagnostics completion on = N/A
    SED Capable = No
    SED Enabled = No
    Secured = No
    Cryptographic Erase Capable = No
    Locked = No
    Needs EKM Attention = No
    PI Eligible = No
    Certified = Yes
    Wide Port Capable = No
    Port Information :
    ================
    -----------------------------------------
    Port Status Linkspeed SAS address
    -----------------------------------------
    0 Active 12.0Gb/s 0x500056b33fefe586
    -----------------------------------------
    Inquiry Data =
    5a 0c ff 3f 37 c8 10 00 00 00 00 00 3f 00 00 00
    00 00 00 00 20 20 20 20 20 20 20 20 20 20 20 20
    58 39 41 47 32 32 4c 38 00 00 00 00 04 00 20 20
    20 20 41 41 33 36 54 53 31 39 30 30 36 30 30 34
    53 4e 20 20 20 20 20 20 20 20 20 20 20 20 20 20
    20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 80
    00 40 00 2f 00 40 00 02 00 02 07 00 ff 3f 10 00
    3f 00 10 fc fb 00 10 00 ff ff ff 0f 00 00 07 00

Note:

The details for this single drive show media errors, which indicates that the disk has a problem.
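
When checking many drives, the two counters highlighted above can be extracted automatically. This is a sketch that assumes the field labels "Media Error Count" and "Predictive Failure Count" appear as in the listing; the function name drive_error_counters is made up for illustration.

```shell
# Read "perccli64 /cX/eY/sZ show all" output on stdin, print both counters on
# one line, and return non-zero if either counter is above zero.
drive_error_counters() {
    awk -F'=' '
        /Media Error Count/        { media = $2 + 0 }
        /Predictive Failure Count/ { pred  = $2 + 0 }
        END {
            printf "media=%d predictive=%d\n", media, pred
            exit ((media > 0 || pred > 0) ? 1 : 0)
        }'
}
```

A loop over slots could call perccli64 /c0/e32/s6 show all | drive_error_counters and report only the drives for which the function returns a non-zero status.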

View how the disks map to system devices
    [root@node-15 ~]# perccli64 /c0/vall show
    Controller = 0
    Status = Success
    Description = None
    Virtual Drives :
    ==============
    -------------------------------------------------------------
    DG/VD TYPE State Access Consist Cache Cac sCC Size Name
    -------------------------------------------------------------
    0/0 RAID1 Optl RW Yes RWBD - OFF 931.0 GB
    1/1 RAID0 Optl RW Yes RWBD - OFF 931.0 GB
    2/2 RAID0 Optl RW Yes RWBD - OFF 931.0 GB
    3/3 RAID0 Optl RW Yes RWBD - OFF 931.0 GB
    4/4 RAID0 Optl RW Yes RWBD - OFF 931.0 GB
    5/5 RAID0 Optl RW Yes RWBD - OFF 931.0 GB
    6/6 RAID0 Optl RW Yes RWBD - OFF 931.0 GB
    7/7 RAID0 Optl RW Yes RWBD - OFF 931.0 GB
    8/8 RAID0 Optl RW Yes RWBD - OFF 931.0 GB
    9/9 RAID0 Optl RW Yes RWBD - OFF 931.0 GB
    10/10 RAID0 Optl RW Yes RWBD - OFF 931.0 GB
    -------------------------------------------------------------
    Cac=CacheCade|Rec=Recovery|OfLn=OffLine|Pdgd=Partially Degraded|Dgrd=Degraded
    Optl=Optimal|RO=Read Only|RW=Read Write|HD=Hidden|TRANS=TransportReady|B=Blocked|
    Consist=Consistent|R=Read Ahead Always|NR=No Read Ahead|WB=WriteBack|
    FWB=Force WriteBack|WT=WriteThrough|C=Cached IO|D=Direct IO|sCC=Scheduled
    Check Consistency

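Note:

VD: generally taken to be the order in which the drive appears to the operating system. With only RAID virtual drives, VD 0 is /dev/sda, VD 1 is /dev/sdb, and so on. If JBOD disks are present, they are enumerated first: for example, if the JBOD disks run up to /dev/sdc, then VD 0 is /dev/sdd, and so on.
DG: the order in which the drive groups were configured on the RAID card.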

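RAID card log collection commands

storcli64 /c0 show time   shows the RAID controller's time

storcli64 /c0 show alilog logfile=node-x.alilog   collects the alilog, which bundles all of the controller's logs

storcli64 /c0 show all logfile=node-x.all.log   dumps the full RAID card information

storcli64 /c0 show badblocks   shows bad-block information

perccli64 /c0 show events filter=fatal   shows only events of severity fatal, which helps spot disk or RAID card failures

perccli64 /c0 show cc   shows the consistency check status. Levels such as RAID 1 and above, which keep data across several disks, need consistency checks; a single-disk RAID 0 probably does not. Whether the check affects performance is not certain.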
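
To run the same collection on several nodes, the commands above can be generated from a template. The sketch below only prints the commands (a dry run) so that they can be reviewed first; the function name raid_log_commands and its three parameters are assumptions, while the subcommands are exactly the ones listed above.

```shell
# Print the log collection commands for one controller.
# Usage: raid_log_commands CLI CONTROLLER_INDEX NODE_NAME
raid_log_commands() {
    cli=$1; ctrl=$2; node=$3
    printf '%s /c%s show time\n' "$cli" "$ctrl"
    printf '%s /c%s show alilog logfile=%s.alilog\n' "$cli" "$ctrl" "$node"
    printf '%s /c%s show all logfile=%s.all.log\n' "$cli" "$ctrl" "$node"
    printf '%s /c%s show badblocks\n' "$cli" "$ctrl"
    printf '%s /c%s show events filter=fatal\n' "$cli" "$ctrl"
    printf '%s /c%s show cc\n' "$cli" "$ctrl"
}
```

Once the printed list looks right, piping it to sh (for example raid_log_commands storcli64 0 node-15 | sh) runs the collection.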

Section Five : Smartctl Get Error info of Disks

Common Commands Usage Description

--scan Scan for devices

--scan-open Scan for devices and try to open each device

-x, --xall Show all information for device

-a, --all Show all SMART information for device

-i, --info Show identity information for device

-d TYPE, --device=TYPE Specify device type to one of: ata, scsi, nvme[,NSID], sat[,auto][,N][+TYPE], usbcypress[,X], usbjmicron[,p][,x][,N], usbprolific, usbsunplus, marvell, areca,N/E, 3ware,N, hpt,L/M/N, megaraid,N, aacraid,H,L,ID, cciss,N, auto, test

-s VALUE, --smart=VALUE Enable/disable SMART on device (on/off)

-o VALUE, --offlineauto=VALUE(ATA) Enable/disable automatic offline testing on device (on/off)

-S VALUE, --saveauto=VALUE(ATA) Enable/disable Attribute autosave on device (on/off)

-H, --health Show device SMART health status

-c, --capabilities(ATA,NVMe) Show device SMART capabilities

-A, --attributes Show device SMART vendor-specific Attributes and values

-l TYPE, --log=TYPE Show device log. TYPE: error, selftest, selective, directory[,g|s],
    xerror[,N][,error], xselftest[,N][,selftest],
    background, sasphy[,reset], sataphy[,reset],
    scttemp[sts,hist], scttempint,N[,p],
    scterc[,N,M], devstat[,N], ssd,
    gplog,N[,RANGE], smartlog,N[,RANGE],
    nvmelog,N,SIZE

-t TEST, --test=TEST Run test. TEST: offline, short, long, conveyance, force, vendor,N,
    select,M-N, pending,N, afterselect,[on|off]

-X, --abort Abort any non-captive test on device

Get info for /dev/sdf

List all devices
    [root@node-15 ~]# smartctl --scan
    /dev/sda -d scsi # /dev/sda, SCSI device
    /dev/sdb -d scsi # /dev/sdb, SCSI device
    /dev/sdc -d scsi # /dev/sdc, SCSI device
    /dev/sdd -d scsi # /dev/sdd, SCSI device
    /dev/sde -d scsi # /dev/sde, SCSI device
    /dev/sdf -d scsi # /dev/sdf, SCSI device
    /dev/sdg -d scsi # /dev/sdg, SCSI device
    /dev/sdh -d scsi # /dev/sdh, SCSI device
    /dev/sdi -d scsi # /dev/sdi, SCSI device
    /dev/sdj -d scsi # /dev/sdj, SCSI device
    /dev/sdk -d scsi # /dev/sdk, SCSI device
    /dev/sdl -d scsi # /dev/sdl, SCSI device
    /dev/sdm -d scsi # /dev/sdm, SCSI device
    /dev/sdn -d scsi # /dev/sdn, SCSI device
    /dev/sdo -d scsi # /dev/sdo, SCSI device
    /dev/bus/0 -d megaraid,0 # /dev/bus/0 [megaraid_disk_00], SCSI device
    /dev/bus/0 -d megaraid,1 # /dev/bus/0 [megaraid_disk_01], SCSI device
    /dev/bus/0 -d megaraid,2 # /dev/bus/0 [megaraid_disk_02], SCSI device
    /dev/bus/0 -d megaraid,3 # /dev/bus/0 [megaraid_disk_03], SCSI device
    /dev/bus/0 -d megaraid,4 # /dev/bus/0 [megaraid_disk_04], SCSI device
    /dev/bus/0 -d megaraid,5 # /dev/bus/0 [megaraid_disk_05], SCSI device
    /dev/bus/0 -d megaraid,6 # /dev/bus/0 [megaraid_disk_06], SCSI device
    /dev/bus/0 -d megaraid,7 # /dev/bus/0 [megaraid_disk_07], SCSI device
    /dev/bus/0 -d megaraid,8 # /dev/bus/0 [megaraid_disk_08], SCSI device
    /dev/bus/0 -d megaraid,9 # /dev/bus/0 [megaraid_disk_09], SCSI device
    /dev/bus/0 -d megaraid,10 # /dev/bus/0 [megaraid_disk_10], SCSI device
    /dev/bus/0 -d megaraid,11 # /dev/bus/0 [megaraid_disk_11], SCSI device
    /dev/bus/0 -d megaraid,12 # /dev/bus/0 [megaraid_disk_12], SCSI device
    /dev/bus/0 -d megaraid,13 # /dev/bus/0 [megaraid_disk_13], SCSI device
    /dev/bus/0 -d megaraid,14 # /dev/bus/0 [megaraid_disk_14], SCSI device
    /dev/bus/0 -d megaraid,15 # /dev/bus/0 [megaraid_disk_15], SCSI device

Note:

The earlier sections established that /dev/sdf has DID (device ID) 6 in perccli, which corresponds to the entry /dev/bus/0 -d megaraid,6 here.
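
Going from a DID to the matching smartctl arguments can be sketched as a lookup in the scan output. It assumes the scan lines have the shape shown above (device, -d, megaraid,N); the function name megaraid_args_for_did is hypothetical.

```shell
# Read "smartctl --scan" output on stdin and print the smartctl device
# arguments for the disk whose megaraid ID equals $1.
# Returns non-zero if no such entry exists.
megaraid_args_for_did() {
    awk -v did="$1" '
        $2 == "-d" && $3 == ("megaraid," did) { print $2, $3, $1; found = 1; exit }
        END { exit (found ? 0 : 1) }'
}
```

The printed arguments can be appended to a smartctl call, for example smartctl -H followed by the output of smartctl --scan | megaraid_args_for_did 6.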

View disk identity information
    [root@node-15 ~]# smartctl -i -d megaraid,6 /dev/sdf
    smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-327.20.1.es2.el7.x86_64] (local build)
    Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
    === START OF INFORMATION SECTION ===
    Model Family: Seagate Constellation.2 (SATA)
    Device Model: ST91000640NS
    Serial Number: 9XGA228L
    LU WWN Device Id: 5 000c50 0918f2f8a
    Add. Product Id: DELL(tm)
    Firmware Version: AA63
    User Capacity: 1,000,204,886,016 bytes [1.00 TB]
    Sector Size: 512 bytes logical/physical
    Rotation Rate: 7200 rpm
    Form Factor: 2.5 inches
    Device is: In smartctl database [for details use: -P show]
    ATA Version is: ATA8-ACS T13/1699-D revision 4
    SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
    Local Time is: Fri Jan 11 11:28:46 2019 CST
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled
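
View the disk's attributes

This output is the usual place to check the indicators of the disk's overall health.

The fields in the output below mean the following:

  • ID: the attribute ID, usually a decimal or hexadecimal number between 1 and 255.
  • ATTRIBUTE_NAME: the attribute name as defined by the drive manufacturer.
  • FLAG: the attribute handling flags (can be ignored).
  • VALUE: one of the most important columns in the table. It is the normalized value of the attribute, between 1 and 253, where 253 is the best case and 1 the worst. Depending on the attribute and the manufacturer, the initial VALUE may be set to 100 or 200.
  • WORST: the lowest VALUE ever recorded.
  • THRESH: the failure threshold for the attribute. The attribute counts as failing when VALUE is at or below THRESH, and as having failed in the past when WORST is at or below THRESH.
  • TYPE: the attribute type (Pre-fail or Old_age). A Pre-fail attribute is a critical one that takes part in the disk's overall SMART health assessment (PASSED/FAILED); if any Pre-fail attribute fails, the disk should be considered about to fail. An Old_age attribute is a non-critical one (such as normal wear) whose failure does not by itself mean the disk will fail.
  • UPDATED: how the attribute is updated. Always means it is updated during normal operation as well as during offline tests; Offline means it is updated only during offline testing.
  • WHEN_FAILED: set to "FAILING_NOW" if VALUE is less than or equal to THRESH, to "In_the_past" if WORST is less than or equal to THRESH, and to "-" otherwise. With "FAILING_NOW", back up important files as soon as possible, especially when the attribute is of type Pre-fail. "In_the_past" means the attribute failed at some point but was fine when the check ran. "-" means the attribute has never failed.
  • RAW_VALUE: the manufacturer-defined raw value, from which VALUE is derived.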
    [root@node-15 ~]# smartctl -A -d megaraid,6 /dev/sdf
    smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-327.20.1.es2.el7.x86_64] (local build)
    Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
    === START OF READ SMART DATA SECTION ===
    SMART Attributes Data Structure revision number: 10
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    1 Raw_Read_Error_Rate 0x010f 081 038 044 Pre-fail Always In_the_past 151546765
    3 Spin_Up_Time 0x0103 094 094 000 Pre-fail Always - 0
    4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 21
    5 Reallocated_Sector_Ct 0x0133 100 100 036 Pre-fail Always - 0
    7 Seek_Error_Rate 0x000f 085 060 030 Pre-fail Always - 338813105
    9 Power_On_Hours 0x0032 079 079 000 Old_age Always - 18784
    10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
    12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 21
    184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
    187 Reported_Uncorrect 0x0032 001 001 000 Old_age Always - 1710
    188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
    189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
    190 Airflow_Temperature_Cel 0x0022 069 053 045 Old_age Always - 31 (Min/Max 24/40)
    191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
    192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 19
    193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 852
    194 Temperature_Celsius 0x0022 031 047 000 Old_age Always - 31 (0 14 0 0 0)
    195 Hardware_ECC_Recovered 0x001a 117 099 000 Old_age Always - 151546765
    197 Current_Pending_Sector 0x0012 084 084 000 Old_age Always - 688
    198 Offline_Uncorrectable 0x0010 084 084 000 Old_age Offline - 688
    199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
    240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 8093 (164 214 0)
    241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 1870535293
    242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 1530387871
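
The VALUE/WORST/THRESH rule from the field list can be applied to this table mechanically. The sketch below re-derives the WHEN_FAILED status from the three numeric columns instead of trusting the printed column. It assumes the column order shown above and skips attributes with a threshold of 0, since a normalized value of at least 1 can never reach it. The function name smart_failed_attrs is made up for illustration.

```shell
# Read "smartctl -A" output on stdin and print "NAME STATUS TYPE" for every
# attribute whose VALUE or WORST is at or below a non-zero THRESH.
smart_failed_attrs() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 {
        value = $4 + 0; worst = $5 + 0; thresh = $6 + 0
        if (thresh == 0) next                    # threshold 0 never triggers
        if (value <= thresh)      print $2, "FAILING_NOW", $7
        else if (worst <= thresh) print $2, "In_the_past", $7
    }'
}
```

Fed the table above, it should report only Raw_Read_Error_Rate, whose WORST of 38 is below its THRESH of 44. Note that raw counters such as Current_Pending_Sector (688 here) do not show up this way because their threshold is 0, so they still need a separate look.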
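
Check the disk's health status

Note:

In the result below, the overall health check is PASSED, meaning the disk is still usable. However, one marginal attribute is listed with WORST below THRESH, TYPE Pre-fail and WHEN_FAILED In_the_past, which suggests the disk is close to failing.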

    [root@node-15 ~]# smartctl -H -d megaraid,6 /dev/sdf
    smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-327.20.1.es2.el7.x86_64] (local build)
    Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
    === START OF READ SMART DATA SECTION ===
    SMART Status not supported: ATA return descriptor not supported by controller firmware
    SMART overall-health self-assessment test result: PASSED
    Warning: This result is based on an Attribute check.
    Please note the following marginal Attributes:
    ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    1 Raw_Read_Error_Rate 0x010f 081 038 044 Pre-fail Always In_the_past 151546765

View the disk's error log
    [root@node-15 ~]# smartctl -l error -d megaraid,6 /dev/sdf
    smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-327.20.1.es2.el7.x86_64] (local build)
    Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
    === START OF READ SMART DATA SECTION ===
    SMART Error Log Version: 1
    ATA Error Count: 46431 (device log contains only the most recent five errors)
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
    Powered_Up_Time is measured from power on, and printed as
    DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
    SS=sec, and sss=millisec. It "wraps" after 49.710 days.
    Error 46431 occurred at disk power-on lifetime: 18640 hours (776 days + 16 hours)
    When the command that caused the error occurred, the device was active or idle.
    After command completion occurred, registers were:
    ER ST SC SN CL CH DH
    -- -- -- -- -- -- --
    40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
    Commands leading to the command that caused the error were:
    CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
    -- -- -- -- -- -- -- -- ---------------- --------------------
    42 00 00 ff ff ff 4f 00 46d+15:15:32.968 READ VERIFY SECTOR(S) EXT
    42 00 00 ff ff ff 4f 00 46d+15:15:29.901 READ VERIFY SECTOR(S) EXT
    42 00 00 ff ff ff 4f 00 46d+15:15:26.825 READ VERIFY SECTOR(S) EXT
    42 00 00 ff ff ff 4f 00 46d+15:15:23.965 READ VERIFY SECTOR(S) EXT
    42 00 00 ff ff ff 4f 00 46d+15:15:20.905 READ VERIFY SECTOR(S) EXT
    Error 46430 occurred at disk power-on lifetime: 18640 hours (776 days + 16 hours)
    When the command that caused the error occurred, the device was active or idle.
    After command completion occurred, registers were:
    ER ST SC SN CL CH DH
    -- -- -- -- -- -- --
    40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
    Commands leading to the command that caused the error were:
    CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
    -- -- -- -- -- -- -- -- ---------------- --------------------
    42 00 00 ff ff ff 4f 00 46d+15:15:29.901 READ VERIFY SECTOR(S) EXT
    42 00 00 ff ff ff 4f 00 46d+15:15:26.825 READ VERIFY SECTOR(S) EXT
    42 00 00 ff ff ff 4f 00 46d+15:15:23.965 READ VERIFY SECTOR(S) EXT
    42 00 00 ff ff ff 4f 00 46d+15:15:20.905 READ VERIFY SECTOR(S) EXT
    42 00 00 ff ff ff 4f 00 46d+15:15:18.093 READ VERIFY SECTOR(S) EXT
    Error 46429 occurred at disk power-on lifetime: 18640 hours (776 days + 16 hours)
    When the command that caused the error occurred, the device was active or idle.
    After command completion occurred, registers were:
    ER ST SC SN CL CH DH
    -- -- -- -- -- -- --
    40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
    Commands leading to the command that caused the error were:
    CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
    -- -- -- -- -- -- -- -- ---------------- --------------------
    42 00 00 ff ff ff 4f 00 46d+15:15:26.825 READ VERIFY SECTOR(S) EXT
    42 00 00 ff ff ff 4f 00 46d+15:15:23.965 READ VERIFY SECTOR(S) EXT
    42 00 00 ff ff ff 4f 00 46d+15:15:20.905 READ VERIFY SECTOR(S) EXT
    42 00 00 ff ff ff 4f 00 46d+15:15:18.093 READ VERIFY SECTOR(S) EXT
    b0 da 00 00 4f c2 00 00 46d+15:15:17.838 SMART RETURN STATUS
    Error 46428 occurred at disk power-on lifetime: 18640 hours (776 days + 16 hours)
    When the command that caused the error occurred, the device was active or idle.
    After command completion occurred, registers were:
    ER ST SC SN CL CH DH
    -- -- -- -- -- -- --
    40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
    Commands leading to the command that caused the error were:
    CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
    -- -- -- -- -- -- -- -- ---------------- --------------------
    42 00 00 ff ff ff 4f 00 46d+15:15:23.965 READ VERIFY SECTOR(S) EXT
    42 00 00 ff ff ff 4f 00 46d+15:15:20.905 READ VERIFY SECTOR(S) EXT
    42 00 00 ff ff ff 4f 00 46d+15:15:18.093 READ VERIFY SECTOR(S) EXT
    b0 da 00 00 4f c2 00 00 46d+15:15:17.838 SMART RETURN STATUS
    2f 00 01 e0 00 00 40 00 46d+15:15:17.703 READ LOG EXT
    Error 46427 occurred at disk power-on lifetime: 18640 hours (776 days + 16 hours)
    When the command that caused the error occurred, the device was active or idle.
    After command completion occurred, registers were:
    ER ST SC SN CL CH DH
    -- -- -- -- -- -- --
    40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
    Commands leading to the command that caused the error were:
    CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
    -- -- -- -- -- -- -- -- ---------------- --------------------
    42 00 00 ff ff ff 4f 00 46d+15:15:20.905 READ VERIFY SECTOR(S) EXT
    42 00 00 ff ff ff 4f 00 46d+15:15:18.093 READ VERIFY SECTOR(S) EXT
    b0 da 00 00 4f c2 00 00 46d+15:15:17.838 SMART RETURN STATUS
    2f 00 01 e0 00 00 40 00 46d+15:15:17.703 READ LOG EXT
    42 00 00 ff ff ff 4f 00 46d+15:15:15.276 READ VERIFY SECTOR(S) EXT
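
The total on the "ATA Error Count" line is the figure worth tracking over time, and in this example it matches the Media Error Count of 46431 that perccli reported for the same drive. A minimal sketch for extracting it, assuming the line format shown above (the function name ata_error_count is hypothetical):

```shell
# Read "smartctl -l error" output on stdin and print the number that follows
# "ATA Error Count:". Prints nothing if the line is absent.
ata_error_count() {
    sed -n 's/^[[:space:]]*ATA Error Count: \([0-9][0-9]*\).*/\1/p'
}
```

Recording this number periodically shows whether errors are still accumulating or date from a single past incident.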
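
Additional notes
  • If SMART is not enabled on a disk, enable it with smartctl -s on DEVICE.
  • If smartctl -i prints little information even though SMART support is reported as available and enabled, a self-test may need to be run first with -t short or -t long. These tests do not destroy data on the disk, but the offline test is generally not suitable for storage servers.
  • When collecting data, the -x or -a options give more complete disk information.
  • smartmontools can also run as a service, configured through /etc/smartmontools/smartd.conf. I have not looked into this yet and will update this article when I have.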

Original link: http://www.cnblogs.com/wangl-blog/p/10839635.html

