Self-Monitoring, Analysis, and Reporting Technology
From Wikipedia, the free encyclopedia
Self-Monitoring, Analysis, and Reporting Technology, or S.M.A.R.T., is a monitoring system for computer hard disks to detect and report on various indicators of reliability, in the hope of anticipating failures.
Background
Fundamentally, hard drives can suffer one of two classes of failures:
- Predictable failures, in which failure modes such as mechanical wear and aging develop gradually over time. A monitoring device can detect these, much as a temperature dial on the dashboard of an automobile can warn a driver that the engine has started to overheat before serious damage occurs.
- Unpredictable failures, which occur suddenly and without warning, such as an electronic component burning out.
Mechanical failures, which are usually predictable, account for 60 percent of drive failures.[1] The purpose of S.M.A.R.T. is to warn a user or system administrator of impending drive failure while time remains to take preventive action, such as copying the data to a replacement device. Approximately 30% of failures can be predicted by S.M.A.R.T.[2] Work at Google on over 100,000 drives found that S.M.A.R.T. status as a whole has little overall predictive value, but that certain sub-categories of information that S.M.A.R.T. implementations might track do correlate with actual failure rates. Specifically, following the first scan error, drives are 39 times more likely to fail within 60 days than drives with no such errors, and first errors in reallocations, offline reallocations, and probational counts are also strongly correlated with higher failure probabilities.[3]
Pctechguide's page on S.M.A.R.T. (2003)[4] comments that the technology has gone through three phases:
- "In its original incarnation SMART provided failure prediction by monitoring certain online hard drive activities. A subsequent version improved failure prediction by adding an automatic off-line read scan to monitor additional operations. The latest SMART III technology not only monitors hard drive activities but adds failure prevention by attempting to detect and repair sector errors. Also, whilst earlier versions of the technology only monitored hard drive activity for data that was retrieved by the operating system, SMART III tests all data and all sectors of a drive by using off-line data collection to confirm the drive's health during periods of inactivity."
History and predecessors
The industry's first hard disk monitoring technology was introduced by IBM in 1992 in its IBM 9337 Disk Arrays for AS/400 servers,[5] which used IBM 0662 SCSI-2 disk drives. It was later named Predictive Failure Analysis (PFA). The technology measured several key device health parameters and evaluated them within the drive firmware. Communication between the physical unit and the monitoring software was limited to a binary result: either the device is OK, or it is likely to fail soon.
Later,[6] another variant, named IntelliSafe, was created by computer manufacturer Compaq and disk drive manufacturers Seagate, Quantum, and Conner. The disk drives measured their own health parameters, and the values were transferred to the operating system and user-space monitoring software. Each disk drive vendor was free to decide which parameters to monitor and what their thresholds should be. The unification was at the level of the protocol with the host.
Compaq submitted its implementation to the Small Form Factor Committee for standardization in early 1995.[7] It was supported by IBM, by Compaq's development partners Seagate, Quantum, and Conner, and by Western Digital, which did not have a failure prediction system at the time. IntelliSafe's approach was chosen because it gave more flexibility. The resulting jointly developed standard was named S.M.A.R.T.
Standards and implementation
Many motherboards will display a warning message when a disk drive approaches failure. Although S.M.A.R.T. is an industry standard among most major hard drive manufacturers,[8] some issues remain, and individual manufacturers hold much proprietary "secret knowledge" about their specific approach. As a result, S.M.A.R.T. is not always implemented correctly on computer platforms, owing to the absence of industry-wide software and hardware standards for S.M.A.R.T. data interchange.[citation needed]
From a legal perspective, the term "S.M.A.R.T." refers only to a signaling method between internal disk drive electromechanical sensors and the host computer. A disk drive manufacturer could therefore include a sensor for just one physical attribute and then advertise the product as S.M.A.R.T. compatible. For example, a drive manufacturer might claim to support S.M.A.R.T. but not include a temperature sensor, which the customer might reasonably expect to be present: reliability is typically inversely related to temperature, so temperature would be a crucial predictor of failure.
Some S.M.A.R.T.-enabled motherboards and related software may not communicate with certain S.M.A.R.T.-capable drives, depending on the type of interface. Few external drives connected via USB and FireWire correctly send S.M.A.R.T. data over those interfaces. With so many ways to connect a hard drive (e.g. SCSI, Fibre Channel, ATA, SATA, SAS, SSA), it is difficult to predict whether S.M.A.R.T. reports will function correctly.
Even on hard drives and interfaces that support it, S.M.A.R.T. data may not be reported correctly to the computer's operating system. Some disk controllers can duplicate all write operations on a secondary "backup" drive in real time, a feature known as "RAID mirroring". However, many programs designed to analyze changes in drive behavior and relay S.M.A.R.T. alerts to the operator do not function when a computer system is configured for RAID. Under normal operating conditions, the RAID subsystem usually does not permit the computer to "see" (or directly access) the individual physical drives, only the logical volumes.
On the Windows platform, many programs designed to monitor and report S.M.A.R.T. information will only function under an administrator account. At present S.M.A.R.T. is implemented individually by manufacturers, and while some aspects are standardized for compatibility, others are not.
Another fundamental problem with S.M.A.R.T. is that it slows performance, and for this reason it is disabled by default in many motherboard BIOSes.[citation needed]
Attributes
Each drive manufacturer defines a set of attributes and selects threshold values below which the attributes should not fall under normal operation. Attribute values can range from 1 to 253, with 1 representing the worst case and 253 the best. Depending on the manufacturer, a value of 100 or 200 will often be chosen as the "normal" value. Manufacturers that have supported one or more S.M.A.R.T. attributes in various products include Samsung, Seagate, IBM (Hitachi), Fujitsu, Maxtor, and Western Digital. These manufacturers do not necessarily agree on precise attribute definitions and measurement units; the following list should therefore be regarded as a general reference only.
Attribute values are always mapped to the range 1 to 253 so that higher values are better. For example, the "Reallocated Sectors Count" attribute value decreases as the number of reallocated sectors increases. In this case, the attribute's raw value will often indicate the actual number of sectors that were reallocated, although vendors are in no way required to adhere to this convention.
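The relationship between a normalized value, its threshold, and the raw value can be sketched in a few lines of Python. This is only an illustration: the class and field names are hypothetical, the sample numbers are invented, and treating a value at or below the threshold as tripped is an assumption of the sketch (the text above says only that values should not fall below the threshold).

```python
from dataclasses import dataclass


@dataclass
class SmartAttribute:
    """One attribute as described above (names are illustrative only)."""
    name: str
    value: int       # normalized value, 1 (worst) to 253 (best)
    threshold: int   # vendor-chosen limit for the normalized value
    raw: int         # vendor-specific raw counter, e.g. sectors reallocated


def has_tripped(attr: SmartAttribute) -> bool:
    """Return True when the normalized value is at or below its threshold.

    Assumption: "at or below" counts as tripped; a vendor may differ.
    """
    if not 1 <= attr.value <= 253:
        raise ValueError(f"normalized value out of range: {attr.value}")
    return attr.value <= attr.threshold


# Invented sample data: the normalized value falls as the raw count rises.
healthy = SmartAttribute("Reallocated Sectors Count", value=100, threshold=36, raw=0)
worn = SmartAttribute("Reallocated Sectors Count", value=30, threshold=36, raw=1800)

print(has_tripped(healthy))  # False
print(has_tripped(worn))     # True
```

The comparison uses only the normalized value, since the meaning of the raw value is vendor-specific.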
Known S.M.A.R.T. attributes
Attributes marked "CRITICAL" are potential indicators of imminent electromechanical failure. In the legend, one symbol means that a higher value is better, while the other means the opposite. Note that even though some attributes are called "Number of errors", it may still be that lower is better. These are marked with "?", meaning that they need to be verified.
Threshold Exceeds Condition
Threshold Exceeds Condition (TEC) is an estimated date on which a critical drive statistic attribute will reach its threshold value. When drive health software reports a "Nearest T.E.C.", it should be treated as a predicted "failure date".
The predicted date is based on the "speed of attribute change" factor: how many points per month the value is decreasing or increasing. This factor is recalculated automatically for each attribute individually whenever its S.M.A.R.T. value changes. Note that TEC dates are not guarantees; hard drives can and will either last much longer or fail much sooner than the date given by a TEC.
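The extrapolation described above can be sketched as a straight-line projection from the observed rate of change. This is a minimal sketch, not the method of any particular drive health program: the function name and sample readings are hypothetical, only a decreasing normalized value is handled, and the rate is taken from the first and last readings alone.

```python
from datetime import date, timedelta


def estimate_tec(readings, threshold):
    """Linearly extrapolate when a normalized value reaches its threshold.

    readings: list of (date, value) pairs, oldest first.
    Returns the estimated date, or None if the value is not decreasing.
    """
    (first_day, first_value), (last_day, last_value) = readings[0], readings[-1]
    elapsed_days = (last_day - first_day).days
    points_lost = first_value - last_value
    if elapsed_days <= 0 or points_lost <= 0:
        return None  # no usable history, or value stable or improving
    if last_value <= threshold:
        return last_day  # threshold already reached
    # Days needed to lose the remaining points at the observed rate.
    days_left = (last_value - threshold) * elapsed_days / points_lost
    return last_day + timedelta(days=round(days_left))


# Invented history: 6 points lost over 60 days, threshold at 36.
history = [(date(2024, 1, 1), 100), (date(2024, 3, 1), 94)]
print(estimate_tec(history, threshold=36))
```

In line with the caveat above, the result is only a projection of the current trend, and a drive can fail long before or after it.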
References
S.M.A.R.T. attribute meaning. PalickSoft. Retrieved on February 3, 2006.
Zbigniew Chlondowski. S.M.A.R.T. Site: attributes reference table. S.M.A.R.T. Linux. Retrieved on Jan 17, 2007.
- ^ Seagate statement on enhanced smart attributes
- ^ http://smartlinux.sourceforge.net/smart/faq.php?#2 ("How does S.M.A.R.T. work?")
- ^ Failure Trends in a Large Disk Drive Population (Conclusion section) by Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz André Barroso, Google Inc. 1600 Amphitheatre Pkwy Mountain View, CA 94043
- ^ pctechguide's page on S.M.A.R.T. (2003)
- ^ IBM Announcement Letter No. ZG92-0289 dated September 01, 1992
- ^ Seagate - The evolution of S.M.A.R.T.
- ^ Compaq. IntelliSafe. Technical Report SSF-8035, Small Form Factor Committee, January 1995.
- ^ pctechguide: "Industry acceptance of PFA technology eventually led to SMART (Self-Monitoring, Analysis and Reporting Technology) becoming the industry-standard reliability prediction indicator..." [1]
External links
Software
Various operating-system-specific software can extend the user's ability to monitor disk drive conditions through the S.M.A.R.T. interface and to predict when a failure is likely to occur by logging deviations in attribute values. This software may also be able to distinguish between gradual degradation over time (representing normal wear) and a sudden change (which may indicate a more serious problem).
- S.M.A.R.T site; links to several SMART tools.
- smartmontools: open-source, for Linux, Mac OS X / Darwin, FreeBSD, NetBSD, OpenBSD, Solaris, OS/2, Windows (native) and Cygwin.
- SMARTReporter: open-source, for Apple Macintosh.
- HDTune: freeware diagnostic utility for Windows.
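The distinction drawn above between gradual degradation and a sudden change can be illustrated with a small Python sketch over logged attribute values. It does not reflect how any of the listed tools work; the function name and the `sudden_drop` cutoff of 10 points are hypothetical choices made for illustration.

```python
def classify_change(values, sudden_drop=10):
    """Label a series of normalized readings as 'stable', 'gradual' or 'sudden'.

    values: normalized attribute readings, oldest first.
    sudden_drop: hypothetical cutoff; a fall of at least this many points
                 between two consecutive readings counts as sudden.
    """
    # Positive entries are decreases between consecutive readings.
    drops = [earlier - later for earlier, later in zip(values, values[1:])]
    if not drops or max(drops) <= 0:
        return "stable"
    if max(drops) >= sudden_drop:
        return "sudden"
    return "gradual"


print(classify_change([100, 100, 100]))     # stable
print(classify_change([100, 99, 98, 97]))   # gradual
print(classify_change([100, 100, 70]))      # sudden
```

A real monitor would likely weigh the time between readings as well; this sketch looks only at the size of each step.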