


Study says failure rates are up to 15 times what manufacturers indicate

A study released this week by Carnegie Mellon University revealed that hard drive manufacturers may be exaggerating the mean time between failures (MTBF) ratings on their hard drives. In fact, the researchers found that, on average, failure rates were as high as 15 times those implied by the rated MTBFs.

Gathering data on roughly 100,000 hard drives from a variety of manufacturers, the Carnegie Mellon researchers examined the drives under various operating conditions as well as real-world scenarios. Some drives were at Internet service providers, others at large data centers, and some were at research labs. According to the results, the majority of the drives did not appear to be affected by their operating environment. In fact, the researchers indicated that drive operating temperature had little to no effect on failure rates -- a cool hard drive survived no longer than one running hot.

The drives in the study ranged from Serial ATA (SATA) to SCSI and even high-end Fibre Channel (FC) drives. Customers typically pay a much larger premium for SCSI and FC drives, which also usually carry longer warranty periods and higher MTBF ratings.

The Carnegie Mellon researchers found that these high-end drives did not outlast their mainstream counterparts:
In our data sets, the replacement rates of SATA disks are not worse than the replacement rates of SCSI or FC disks. This may indicate that disk-independent factors, such as operating conditions, usage and environmental factors affect replacement rates more than component specific factors.
According to the study, the number one cause of drive failures was simply age: the longer a drive has been in operation, the more likely it is to fail. Drives tended to start showing signs of failure after roughly five to seven years of service, after which there was a significant increase in annualized failure rates (AFR). Failure rates during the first year of service, however, were just as high as those after the seven-year mark.

According to the Carnegie Mellon researchers, manufacturer MTBF ratings are greatly overstated. Take, for example, the Seagate Cheetah X15 series, which has an MTBF rating of 1.5 million hours -- roughly 171 years of constant service before problems. The researchers said, however, that customers should expect a more realistic 9 to 11 years. Interestingly, the real-world data in the study showed an average time to failure of about six years.
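As a quick check on that conversion, here is a minimal sketch in Python; only the 1.5-million-hour rating comes from the article, and the helper name is mine:

```python
# Convert a manufacturer MTBF rating in hours into years of continuous operation.
HOURS_PER_YEAR = 24 * 365  # 8,760 hours

def mtbf_hours_to_years(mtbf_hours: float) -> float:
    return mtbf_hours / HOURS_PER_YEAR

# Seagate Cheetah X15 rating cited above: 1.5 million hours.
print(f"{mtbf_hours_to_years(1_500_000):.0f} years")  # -> 171 years
```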

Annual drive replacement rates ranged from 2 percent to a whopping 13 percent, indicating that manufacturers need to reevaluate the way MTBF ratings are generated. Worst of all, these rates were for drives with MTBF ratings between 1 million and 1.5 million hours.
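One way to see where the "as high as 15 times" figure comes from is the sketch below. It assumes the common approximation that an MTBF rating implies a constant failure rate of 1/MTBF, so AFR is roughly hours-per-year divided by MTBF; the exact model used in the study may differ:

```python
# Compare the annualized failure rate (AFR) implied by an MTBF rating with the
# 2% to 13% annual replacement rates reported in the study.
HOURS_PER_YEAR = 24 * 365

def implied_afr(mtbf_hours: float) -> float:
    """AFR under the assumption of a constant failure rate of 1/MTBF."""
    return HOURS_PER_YEAR / mtbf_hours

for mtbf in (1_000_000, 1_500_000):
    afr = implied_afr(mtbf)
    print(f"MTBF {mtbf:>9,} h: implied AFR {afr:.2%}; "
          f"a 13% replacement rate is ~{0.13 / afr:.0f}x higher")
```

For a 1-million-hour MTBF this gives an implied AFR of about 0.88 percent, so the observed 13 percent replacement rate is roughly 15 times higher.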

Garth Gibson, associate professor of computer science at Carnegie Mellon, indicated that the study was proof that MTBF ratings are not a reliable way of measuring drive quality. "We had no evidence that SATA drives are less reliable than the SCSI or Fiber Channel drives," said Gibson.

The Carnegie Mellon researchers concluded that backups are a necessity for critically important data, no matter what kind of hard drive is being used. It is interesting to note that even Google's own data centers rely mainly on SATA and PATA drives. At the current rate, it is only a matter of time before SATA drives perform as well as or better than SCSI and FC drives, offering the same reliability for much less money.


Comments



RE: Not really.
By masher2 (blog) on 3/10/2007 11:46:50 AM , Rating: 2
> "So why it's called: "MEAN time BETWEEEN failure"? Since the current metric has nothing to do with mean time between failures..."

The term means exactly what it says. An MTBF of one million hours means that, if you operate a hard drive within its service life, you will average one failure every million hours.

The confusion occurs because drives have MTBFs greater than their service life. If the service life is, for example, 100,000 hours, then you'd expect a 10% chance of that drive failing during its service life.
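A minimal sketch of the arithmetic behind that 10% figure, assuming the usual constant-failure-rate (exponential) model, which the comment implies but does not name:

```python
import math

def failure_prob(mtbf_hours: float, service_life_hours: float) -> float:
    """Probability of a failure during the service life, assuming a constant
    failure rate of 1/MTBF (exponential lifetime model)."""
    return 1 - math.exp(-service_life_hours / mtbf_hours)

# The comment's example: 1,000,000-hour MTBF, 100,000-hour service life.
print(f"{failure_prob(1_000_000, 100_000):.1%}")  # ~9.5%, i.e. roughly 10%
```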


RE: Not really.
By TomZ on 3/10/2007 4:55:58 PM , Rating: 2
Yes, that makes sense; however, most consumers would assume that MTBF gives some indication of useful life, which it clearly does not, both in theory and empirically.

It is interesting that even the cited researchers and the writer of this DT article infer such a connection. Maybe it is all a misunderstanding.


RE: Not really.
By TomZ on 3/10/2007 4:56:40 PM , Rating: 1
Oh, and how lame of folks to downrate your post. Your comment was spot on, if you ask me.


RE: Not really.
By Oregonian2 on 3/12/2007 1:25:39 PM , Rating: 2
MTBF is defined in standards; the most common ones used are Telcordia's and a MIL spec. Those define what it is. How companies come up with those numbers for their products is the part in contention.

If a device's MTBF is a million hours, it's supposed to mean that if you've a thousand of them, after a million hours 500 of them will still be working. It's the MEAN time to failure using the word MEAN mathematically, it's not the same as AVERAGE (although can be sometimes). IOW - half take less time than MTBF to fail and half take more (that's what "mean" means). The first five hundred could have failed much earlier or could have all failed at the 990,000 hour point. It's the point where half of them will have failed which very obviously means that one has only a 50-50 chance of one's device lasting that long. It is NOT the expected lifetime of the device.


RE: Not really.
By martyspants on 3/12/2007 9:59:20 PM , Rating: 2
No.

Mean means the average (even mathematically). The term you are describing is the Median Time Between Failures.
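To make the mean-versus-median distinction concrete, here is a small sketch; the exponential (constant failure rate) distribution is my assumption for illustration, not something stated in the thread:

```python
import math

MTBF = 1_000_000  # hours; the *mean* of an exponential lifetime distribution

median = MTBF * math.log(2)           # ~693,000 h: the point where half have failed
still_running_at_mtbf = math.exp(-1)  # ~36.8% of drives survive to t = MTBF

print(f"median time to failure: {median:,.0f} hours")
print(f"fraction still running at the MTBF: {still_running_at_mtbf:.1%}")
```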


RE: Not really.
By Oregonian2 on 3/13/2007 5:41:37 PM , Rating: 2
Yup, you're right. I'm wrong. I need to sleep more; my brain went bonkers that day. I just took another peek at Telcordia SR-332 (been a few years), the most often used spec (along with the MIL one), and it's worse than I had thought on my brain-dead day. Everything depends heavily on the assumed failure distribution (heavy on the bathtub curve as an assumed distribution, but that may not hold, especially if the MTBF is projected for products that have already been burned-in so that the front end of the curve is mostly gone). So, depending upon the assumptions one wants to make, things can be very different. If one made the rash assumption of the burn-in mentioned above (for reliable devices) and that the failure rate was mostly at the back end of a steep bathtub curve, 100% of the devices could be defective at the MTBF number of hours (but in this scenario, ALL of them lasted until then too and dropped all together). So how things spread out just depends upon assumptions and how well they match the assumptions made in the prediction model. Which is pretty tricky and probably why even the "standard" SR-332 is pretty hand-wavy at best in its discussions. :-)
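A small illustration of how strongly the "how many survive to the MTBF" question depends on the assumed distribution. This sketch uses Weibull shapes as stand-ins for infant-mortality, constant-rate, and steep wear-out behavior; the shape values are arbitrary choices for illustration, not anything from the standards discussed above:

```python
import math

MTBF = 1_000_000  # hours; each distribution below is rescaled to this mean

def surviving_fraction_at_mtbf(shape: float) -> float:
    """Fraction of units still working at t = MTBF for a Weibull distribution
    with the given shape parameter, scaled so that its mean equals the MTBF."""
    scale = MTBF / math.gamma(1 + 1 / shape)
    return math.exp(-((MTBF / scale) ** shape))

for shape, label in [(0.7, "infant-mortality heavy"),
                     (1.0, "constant failure rate"),
                     (4.0, "steep wear-out")]:
    print(f"{label:>22}: {surviving_fraction_at_mtbf(shape):.0%} survive to the MTBF")
```

With the same nominal MTBF, roughly 31%, 37%, or 51% of units survive to the MTBF under these three shapes, which is why the fraction still working at the rated figure depends entirely on the assumed distribution.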


"Can anyone tell me what MobileMe is supposed to do?... So why the f*** doesn't it do that?" -- Steve Jobs











botimage
Copyright 2014 DailyTech LLC. - RSS Feed | Advertise | About Us | Ethics | FAQ | Terms, Conditions & Privacy Information | Kristopher Kubicki