
Study says failure rates are up to 15 times what manufacturers indicate

A study released this week by Carnegie Mellon University revealed that hard drive manufacturers may be exaggerating the mean time between failures (MTBF) ratings on their drives. In fact, Carnegie Mellon researchers found that, on average, observed failure rates were as high as 15 times the rated MTBFs.

Rounding up roughly 100,000 hard drives from a variety of manufacturers, the researchers examined the drives in various operating conditions as well as real-world scenarios. Some drives were at Internet service providers, others at large data centers, and some were at research labs. According to the results, the majority of the drives did not appear to be affected by their operating environment. In fact, the researchers found that drive operating temperature had little to no effect on failure rates -- a cool hard drive survived no longer than one running hot.

The types of drives used in the study ranged from Serial ATA (SATA) drives to SCSI and even high-end Fibre Channel (FC) drives. Customers typically pay a much larger premium for SCSI and FC drives, which also tend to carry longer warranty periods and higher MTBF ratings.

Carnegie researchers found that these high-end drives did not outlast their mainstream counterparts:
"In our data sets, the replacement rates of SATA disks are not worse than the replacement rates of SCSI or FC disks. This may indicate that disk-independent factors, such as operating conditions, usage and environmental factors, affect replacement rates more than component specific factors."
According to the study, the number one cause of drive failure was simply age: the longer a drive has been in operation, the more likely it is to fail. Drives tended to start showing signs of failure after roughly five to seven years of service, after which there was a significant increase in annualized failure rate (AFR). Notably, the failure rate of drives in their first year of service was just as high as that of drives past the seven-year mark.

According to Carnegie Mellon researchers, manufacturer MTBF ratings are highly overstated. Take for example the Seagate Cheetah X15 series, which carries an MTBF rating of 1.5 million hours -- roughly 171 years of constant service before a failure would be expected. Carnegie's researchers said, however, that customers should expect a more realistic 9 to 11 years. Interestingly, the real-world data in the study showed drives failing after an average of about six years.
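The rating-to-years arithmetic above is easy to check, assuming constant 24/7 operation:

```python
# Sanity check: a 1.5M-hour MTBF, read naively as a per-drive lifetime,
# implies about 171 years of round-the-clock service.
HOURS_PER_YEAR = 24 * 365.25          # 8766
naive_years = 1_500_000 / HOURS_PER_YEAR
print(round(naive_years))             # ~171
```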

The annual replacement rate of drives ranged from 2 percent to a whopping 13 percent, indicating a need for manufacturers to reevaluate the way an MTBF rating is generated. Worst of all, these rates were for drives with MTBF ratings between 1 million and 1.5 million hours.
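Under the constant-failure-rate (exponential) assumption behind MTBF ratings, the annual failure rate a rating implies can be computed directly -- a quick sketch of why 2 to 13 percent is so far off the ratings:

```python
import math

HOURS_PER_YEAR = 24 * 365.25   # 8766

def nominal_afr(mtbf_hours: float) -> float:
    """Annual failure probability implied by an MTBF, assuming a constant
    (exponential) failure rate -- the very assumption the study challenges."""
    return 1 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

for mtbf in (1_000_000, 1_500_000):
    print(f"{mtbf:>9,} h -> {nominal_afr(mtbf):.2%} implied annual failure rate")
# Rated drives imply well under 1% per year; the study observed 2 to 13%.
```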

Garth Gibson, associate professor of computer science at Carnegie Mellon, indicated that the study is proof that MTBF ratings are not a reliable way of measuring drive quality. "We had no evidence that SATA drives are less reliable than the SCSI or Fiber Channel drives," said Gibson.

Carnegie researchers concluded that backup measures are a necessity for critically important data, no matter what kind of hard drive is being used. It is interesting to note that even Google's data centers rely mainly on SATA and PATA drives. At the current rate, it is only a matter of time before SATA performs as well as or better than SCSI and FC drives, offering the same reliability for much less money.

Comments

RE: Not really.
By zippercow on 3/9/2007 8:28:31 PM , Rating: 5
You beat me to it. I actually wrote a short document for my company a while back that shows the formula (in case that makes more sense to anyone):

MTBF (Mean Time Between Failure) measures the average time that a device works properly without failure. The MTBF of any hardware is calculated using the following formula:

MTBF = ([short time period] * [number of units tested]) / [number of units that failed within that time period]

The MTBF rating of our DVR is 449,616.52 hours (~51 years). This means that if 51 DVRs were run for 1 year, 1 failure out of those 51 could be expected.
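The formula above can be sketched in a few lines (the drive counts and hours below are made up purely for illustration):

```python
def mtbf(test_hours: float, units_tested: int, failures: int) -> float:
    """MTBF per the formula above: (time period * units tested) / failures."""
    if failures == 0:
        return float("inf")   # nothing failed in the window: estimate is unbounded
    return test_hours * units_tested / failures

# Hypothetical example: 1,000 drives run for 1,000 hours with 2 failures.
print(mtbf(1_000, 1_000, 2))   # 500000.0 hours, i.e. ~57 "years" per drive
```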

RE: Not really.
By BladeVenom on 3/9/2007 9:47:35 PM , Rating: 4
Which kind of underscores their point, since no hard drive is going to last 51 years.

RE: Not really.
By oab on 3/9/2007 10:21:49 PM , Rating: 1
It might last 51 years, but it won't be in service any more.

It still might mechanically work (though it might lose some information due to that superparamagnetic effect that I don't understand).

RE: Not really.
By S3anister on 3/10/2007 3:22:19 AM , Rating: 1
superparamagnetic effect lol. I did an analytical paper on that... not too fun.

RE: Not really.
By TomZ on 3/9/2007 10:19:30 PM , Rating: 4
This formula shows precisely the flaw that the researchers empirically proved, which is that this calculation doesn't take into account the "bathtub curve" increase in failures at the end of the product's life.

RE: Not really.
By JeffDM on 3/10/2007 10:09:20 AM , Rating: 2
I thought that study proved that the "bathtub" curve doesn't exist for hard drives. The trailing edge does exist as you suggest, but not the leading edge.

RE: Not really.
By TomZ on 3/10/2007 10:45:37 AM , Rating: 3
The leading edge of the "bathtub curve" is eliminated by testing performed at the factory prior to shipping the product to customers, in addition to focused QA activities that help to manage the yield.

RE: Not really.
By rgsaunders on 3/9/2007 11:00:40 PM , Rating: 4
The explanation given by an engineer from Seagate or WD several years ago was a variation on yours. As I recall, it went something like this: if the MTBF is 150,000 hours, and you replace the drives on the regular recommended schedule (e.g. every 3 years), then you will see 150,000 hours of operation before you have a failure. The critical point, which is always left out, is that these figures are predicated on replacing drives on that recommended schedule, i.e. every 3 to 5 years. I am not saying this is either ethical or technically correct; it is the rather lame explanation given by the hard drive manufacturing industry when this issue arose 10 or 15 years ago. The process of assigning an MTBF would appear to be relatively simple, but as you have seen, it is anything but.

RE: Not really.
By nothingtoseehere on 3/10/2007 4:43:03 PM , Rating: 3
That equation means nothing: simply test for 0.1 second, and if no drive smokes in the first fraction of a second, you have "proven" any MTBF number you wish to publish. You can then choose to publish a number lower than the measured value of... INFINITY... decisions, decisions.

Should a drive fail, simply reduce the test time until it doesn't.

The 'time period' in your equation should be at least as long as the MTBF value itself; otherwise the published value is only a (grossly mistaken) estimate of the MTBF, as the article shows.

Perhaps it's time to rename MTBF to 'Misleading Term By Frauds' or something of that sort, which would be a more accurate description of what the value stands for.
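The short-test objection can be put in numbers. A sketch with an assumed Weibull wear-out model (shape > 1 means the failure rate rises with age, as the study observed; all parameters here are hypothetical):

```python
import math

# Assumed wear-out model: Weibull with shape k > 1 (rising hazard with age).
k = 3.0
scale = 6.0 * 8766                    # characteristic life in hours (~6 years)

true_mean = scale * math.gamma(1 + 1 / k)   # actual expected lifetime, ~5.4 years

# A short factory-style test: n drives run for only T hours.
n, T = 100_000, 1_000
frac_failed = 1 - math.exp(-((T / scale) ** k))
expected_failures = n * frac_failed         # well under one failure expected

# Plugging into the MTBF formula from the thread above:
mtbf_estimate = T * n / expected_failures
print(f"true mean life:  {true_mean / 8766:.1f} years")
print(f"MTBF 'measured': {mtbf_estimate / 8766:.0f} years")
```

Because almost nothing fails inside the short window, the formula "measures" an MTBF thousands of times longer than the drives actually last -- exactly the gap between rated MTBFs and the six-year average the article reports.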

RE: Not really.
By mindless1 on 3/12/2007 2:40:23 AM , Rating: 2
You mean, MTBF _ATTEMPTS_ to measure the average time.

Just because one device can use a certain test methodology to reach a reasonable MTBF does not mean another device can use the same test. When the test produces data inconsistent with true failure rates, the test is invalid, inapplicable, and potentially fraudulent when it is so obviously deceiving.

RE: Not really.
By tcsenter on 3/12/2007 4:15:17 AM , Rating: 3
Mean Time BETWEEN Failure(s) = the expected time between two successive failures of a system or sub-system.

It is misleading to consumers, yes, but not by design or intent. It's misleading because uninformed consumers have read into it something it does not mean and has never meant.

MTBF and MTTF mean the same thing today that they always have. Nothing has changed except the level of education and understanding of the audience [erroneously] interpreting their meaning and practical significance.

Some of the widely accepted reliability engineering or prediction standards:

RDF 2000
IEC 62380
NSWC-98/LE1 (Mechanical)
Chinese 299B

"When an individual makes a copy of a song for himself, I suppose we can say he stole a song." -- Sony BMG attorney Jennifer Pariser


Copyright 2016 DailyTech LLC.