

73 comment(s) - last by Chernobyl68.. on Mar 15 at 6:01 PM

Study says failure rates are 15 times what manufacturers indicate

A study released this week by Carnegie Mellon University revealed that hard drive manufacturers may be exaggerating the mean time between failures (MTBF) ratings of their hard drives. In fact, researchers at Carnegie indicated that observed failure rates were as high as 15 times what the rated MTBFs imply.

Rounding up roughly 100,000 hard drives across a variety of manufacturers, researchers at Carnegie tested the drives in various operating conditions as well as real-world scenarios. Some drives were at Internet service providers, others at large data centers and some at research labs. According to test results, the majority of the drives did not appear to be affected by their operating environment. In fact, researchers indicated that drive operating temperatures had little to no effect on failure rates -- a cool hard drive survived no longer than one running hot.

The types of drives used in the study ranged from Serial ATA to SCSI and even high-end Fibre Channel (FC) drives. Typically, customers pay a much larger premium for SCSI and FC drives, which also usually carry longer warranty periods and higher MTBF ratings.

Carnegie researchers found that these high-end drives did not outlast their mainstream counterparts:

"In our data sets, the replacement rates of SATA disks are not worse than the replacement rates of SCSI or FC disks. This may indicate that disk-independent factors, such as operating conditions, usage and environmental factors affect replacement rates more than component specific factors."

According to the study, the number one cause of drive failures was simply age: the longer a drive has been in operation, the more likely it is to fail. Drives tended to start showing signs of failure after roughly five to seven years of service, after which there was a significant increase in annualized failure rates (AFR). Failure rates during the first year of service were just as high as those after the seven-year mark.

According to Carnegie researchers, manufacturer MTBF ratings are highly overrated. Take for example the Seagate Cheetah X15 series, which has an MTBF rating of 1.5 million hours. This equates to roughly 171 years of constant service before problems. Carnegie's researchers said, however, that customers should expect a more reasonable 9 to 11 years. Interestingly, real-world tests in the study showed a consistent average time to failure of about six years.

The average replacement rate of drives ranged from 2 percent to a whopping 13 percent annually, indicating that there is a need for manufacturers to reevaluate the way an MTBF rating is generated. Worst of all, these rates were for drives with MTBF ratings between 1 million and 1.5 million hours.
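To put those ratings next to the observed replacement rates, here is a minimal Python sketch that converts an MTBF rating into years and into a nominal annual failure probability. It assumes a constant failure rate (exponential lifetimes), which is the simplification usually behind an MTBF figure and not something the study itself endorses; the function names are made up for illustration.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def mtbf_to_years(mtbf_hours: float) -> float:
    """Express an MTBF rating in years of continuous operation."""
    return mtbf_hours / HOURS_PER_YEAR


def nominal_afr(mtbf_hours: float) -> float:
    """Probability that one drive fails within a year of continuous use,
    assuming a constant failure rate of 1 / MTBF (an assumption)."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


for mtbf in (1_000_000, 1_500_000):
    years = mtbf_to_years(mtbf)
    afr = nominal_afr(mtbf)
    print(f"{mtbf:>9,} h -> {years:6.1f} years, nominal AFR {afr:.2%}")
```

Under that assumption, ratings of 1 million and 1.5 million hours correspond to nominal annual failure rates of about 0.87 percent and 0.58 percent, well below the 2 to 13 percent replacement rates reported above.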

Garth Gibson, associate professor of computer science at Carnegie, indicated that the study was proof that MTBFs are not a reliable way of measuring drive quality. "We had no evidence that SATA drives are less reliable than the SCSI or Fiber Channel drives," said Gibson.

Carnegie researchers concluded that backup measures are a necessity for critically important data, no matter what kind of hard drive is being used. It is interesting to note that even Google's own data centers use mainly SATA and PATA drives. At the current rate, it is only a matter of time before SATA performs as well as or better than SCSI and FC drives, offering the same reliability for much less money.


Comments

Not really.
By retrospooty on 3/9/2007 8:12:18 PM , Rating: 5
I don't think they are so much exaggerated as the term MTBF is just misunderstood. Here is how they do it.

Since they cannot test drives for 10 years or more prior to release, they test many drives. For the purpose of easy math, let's use 1000.

The manufacturer runs 1000 drives on a test bed for 1000 hours. If one fails, it is 1,000,000 hours MTBF (1000x1000). In reality it's a bit more complex, but that is the gist of it. MTBF is not a relative rating on how long your hard drive should last, but a benchmark of many drives tested. Would you rather have a drive that tested at a low or a high MTBF? Of course, the higher the better.

The other thing this cannot find is latent heat/longevity related issues. If a set of 1000 drives lasted for 1000 hours and none failed, that is great, but what if a certain component has a defect that causes 100% failure after 20-30,000 hours? The MTBF test would not find the defect. Still, it is a great test and an absolute requirement. I wish Ford and GM could be tested like that against Honda and Toyota.




RE: Not really.
By zippercow on 3/9/2007 8:28:31 PM , Rating: 5
You beat me to it. I actually wrote a short document for my company a while back that shows the formula (not sure if that will make more sense to anyone):

MTBF (Mean Time Between Failure) measures the average time that a device works properly without failure. The MTBF of any hardware is calculated using the following formula:

[short time period]*[number of pieces tested]/[number of pieces tested which failed within that time period]=MTBF

The MTBF rating of our DVR is 449,616.52 hours (~51 years). This means that if 51 DVRs were to be run for 1 year, 1 failure out of those 51 could be expected.
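For anyone who prefers code to brackets, the formula above can be sketched in a few lines of Python. The function names are invented for illustration, and the sketch assumes every unit runs for the full test period, which a real test plan would probably not.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours


def estimate_mtbf(test_hours, units_tested, failures):
    """MTBF estimate: total unit-hours on test divided by failures seen."""
    if failures <= 0:
        raise ValueError("at least one failure is needed for a finite estimate")
    return test_hours * units_tested / failures


def expected_failures_per_year(mtbf_hours, units_in_service):
    """Average number of failures per year across a fleet running nonstop."""
    return units_in_service * HOURS_PER_YEAR / mtbf_hours


# 1000 drives tested for 1000 hours with a single failure
print(estimate_mtbf(1000, 1000, 1))  # 1000000.0

# 51 DVRs rated at 449,616.52 hours, each run for one year
print(round(expected_failures_per_year(449_616.52, 51), 2))  # 0.99
```

The second figure matches the "1 failure out of 51" reading: the rating describes how often failures show up across a fleet, not how long any single unit lasts.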


RE: Not really.
By BladeVenom on 3/9/2007 9:47:35 PM , Rating: 4
Which kind of underscores their point, since no hard drive is going to last 51 years.


RE: Not really.
By oab on 3/9/2007 10:21:49 PM , Rating: 1
It might last 51 years, but it won't be in service any more.

It still might mechanically work (though it might lose some information due to that paramagnetic effect that I don't understand).


RE: Not really.
By S3anister on 3/10/2007 3:22:19 AM , Rating: 1
superparamagnetic effect lol. I did an analytical paper on that... not too fun.


RE: Not really.
By TomZ on 3/9/2007 10:19:30 PM , Rating: 4
This formula shows precisely the flaw that the researchers empirically proved, which is that this calculation doesn't take into account the "bathtub curve" increase in failures at the end of the product's life.


RE: Not really.
By JeffDM on 3/10/2007 10:09:20 AM , Rating: 2
I thought that study proved that the "bathtub" curve doesn't exist for hard drives. The trailing edge does exist as you suggest, but not the leading edge.


RE: Not really.
By TomZ on 3/10/2007 10:45:37 AM , Rating: 3
The leading edge of the "bathtub curve" is eliminated by testing performed at the factory prior to shipping the product to customers, in addition to focused QA activities that help to manage the yield.


RE: Not really.
By rgsaunders on 3/9/2007 11:00:40 PM , Rating: 4
The explanation given by an engineer from Seagate or WD several years ago was a variation on your explanation. As I recall, it went something like this: if the MTBF is 150,000 hours and you replace the drives on a regular recommended schedule (e.g. every 3 years), then it will be 150,000 hours of operation before you have a failure. The critical point, which is always left out, is that these figures are predicated on you replacing drives on a regular recommended schedule, i.e. every 3 to 5 years. I am not saying this is either ethical or technically correct; this is the rather lame explanation given by the hard drive manufacturing industry when this issue arose 10 or 15 years ago. The process of assigning an MTBF would appear to be relatively simple, but as you have seen, it is anything but.


RE: Not really.
By nothingtoseehere on 3/10/2007 4:43:03 PM , Rating: 3
That equation means nothing: simply test for 0.1 second, and if no drive smokes in the first fraction of a second, you have done the test to prove any MTBF number you wish to publish. You can choose to publish a number lower than the measured value of .... INFINITY... decisions, decisions.

Should a drive fail, simply reduce the test time until it doesn't.

The 'Time period' in your equation should be at least the MTBF value itself, otherwise the published value is only a (grossly mistaken) estimate of the MTBF, as the article shows.

Perhaps it's time to rename MTBF to 'Misleading Term By Frauds' or something of that sort, which would be a more accurate description of what the value stands for.
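The zero-failure case can be made concrete with a rough Python sketch. It assumes a constant failure rate (exponential lifetimes), which is an assumption and not something the equation above states, and the function name is invented. Under that assumption, a test with no failures only supports a lower bound on MTBF, and that bound scales with the total unit-hours actually run.

```python
import math


def mtbf_lower_bound_zero_failures(total_unit_hours, confidence=0.95):
    """One-sided lower confidence bound on MTBF after a test with no failures.

    With exponential lifetimes, the chance of seeing zero failures in
    total_unit_hours is exp(-total_unit_hours / mtbf). Solving
    exp(-T / mtbf) = 1 - confidence for mtbf gives the bound.
    """
    return total_unit_hours / -math.log(1.0 - confidence)


# 1000 drives run for 0.1 second each, no failures
short_test = 1000 * 0.1 / 3600  # total unit-hours
print(mtbf_lower_bound_zero_failures(short_test))

# 1000 drives run for 1000 hours each, no failures
long_test = 1000 * 1000
print(mtbf_lower_bound_zero_failures(long_test))
```

The 0.1 second test supports an MTBF of less than a hundredth of an hour at 95 percent confidence, while the 1000 hour test supports roughly 334,000 hours. A failure-free test does not demonstrate an arbitrarily large number unless the confidence statement is dropped.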


RE: Not really.
By mindless1 on 3/12/2007 2:40:23 AM , Rating: 2
You mean, MTBF _ATTEMPTS_ to measure the average time.

Just because one device can use a certain test methodology to reach a reasonable result applicable towards an MTBF does not mean another device can use the same test. When the test produces data inconsistent with true MTBF rates, the test is invalid, inapplicable, and potentially fraudulent when so obviously deceiving.


RE: Not really.
By tcsenter on 3/12/2007 4:15:17 AM , Rating: 3
Mean Time BETWEEN Failure(s) = the expected time between two successive failures of a system or sub-system.

It is misleading to consumers, yes, but not by design nor intent. It's misleading because uninformed consumers have read into it something it does not mean and has never meant.

MTBF and MTTF mean the same thing today as they always have. Nothing has changed, except the level of education and understanding of the audience [erroneously] interpreting the meaning and practical significance.

Here is a great primer on statistical reliability models or standards:

http://www.relex.com/resources/art/art_mttf.asp

Some of the widely accepted reliability engineering or prediction standards:

MIL-HDBK-217
Telcordia
PRISM
RDF 2000
IEC 62380
NSWC-98/LE1 (Mechanical)
Chinese 299B
HRD5


RE: Not really.
By fic2 on 3/9/07, Rating: -1


RE: Not really.
By JeffDM on 3/10/2007 10:13:57 AM , Rating: 2
It's fine if you don't like them, then you can refer to the Google study that says a lot of the same things.


RE: Not really.
By fic2 on 3/12/2007 11:30:58 AM , Rating: 2
Man. This forum needs some kind of sarcasm thing since at least half the people don't know it when it hits them in the face.


RE: Not really.
By defter on 3/10/2007 4:31:52 AM , Rating: 5
quote:
I don't think they are so much exaggerated as the term MTBF is just misunderstood.


It's not misunderstood, it's exaggerated...

quote:
The manufacturer runs 1000 drives on a test bed for 1000 hours. If one fails, it is 1,000,000 hours MTBF (1000x1000). In reality it's a bit more complex, but that is the gist of it. MTBF is not a relative rating on how long your hard drive should last


So why is it called "MEAN time BETWEEN failure"? Since the current metric has nothing to do with mean time between failures, manufacturers shouldn't call it "MTBF". Quoting unrealistic MTBF times is exaggerating.


RE: Not really.
By retrospooty on 3/10/2007 9:35:12 AM , Rating: 3
OK, perhaps it is named badly and should be changed, but my point remains the same.

What an MTBF of 1,000,000 hours means is not that your drive will last for 1,000,000 hours, but that after testing x amount of drives for a total of 1,000,000 hours combined, only one failed. That is what it means now, today, and it is not necessarily false nor exaggerated, just misunderstood.


RE: Not really.
By retrospooty on 3/10/2007 3:54:17 PM , Rating: 3
Hilarious.

I get rated down to Zero for explaining what MTBF means. LOL.


RE: Not really.
By Hoser McMoose on 3/11/2007 12:40:02 PM , Rating: 2
Note that there are some caveats to the measure though. For example, at least certain drives are/have been rated with the assumption that the first 30 or 90 days ("burn in" period) will be excluded. As most people in IT know, drives are by far the most likely to fail when they are first installed, so if a drive company ignores those early failures they instantly get a much higher MTBF rating.

Also, some desktop drives are rated assuming a limited usage pattern. This means that they basically assume that the computer will be turned off, and therefore the drive not used, for e.g. 1/3rd of the time, allowing them to multiply the failure rating by 3.

Of course, in the end there actually is NO testing done to determine the MTBF at all! If they actually DID test the drives then we would see variability from one drive to the next, for example Seagate's Cheetah 15K.4 and 15K.5 would have different failure rates, and even drives with 1 platter vs. 2 or 3 platter drives would be different. However if you look at the ratings, they're always all the same (1,400,000 hours for the Seagate Cheetahs).

In reality MTBF ceased being any sort of ACTUAL measure ages ago! It's now just a marketing term. These drives are not really being tested to determine their MTBF; they've been assigned an estimated failure rate and then they are tested to see if they are likely to at least come close to that failure rate. The REAL estimated failure rate of these drives is something that never gets posted publicly; we only ever see the marketing-assigned figure.


RE: Not really.
By masher2 (blog) on 3/10/2007 11:46:50 AM , Rating: 2
> "So why is it called "MEAN time BETWEEN failure"? Since the current metric has nothing to do with mean time between failures..."

The term means exactly what it says. An MTBF of one million hours means that, if you operate a hard drive within its service life, you will average one failure every million hours.

The confusion occurs because drives have MTBFs greater than their service life. If the service life is, for example, 100,000 hours, then you'd expect a 10% chance of that drive failing during its service life.
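The 10% figure can be checked with a small Python sketch. It assumes a constant failure rate over the service life (exponential lifetimes), which is the usual reading of an MTBF rating but still an assumption here; the function name is made up for illustration.

```python
import math


def failure_probability(service_hours, mtbf_hours):
    """Chance that a single drive fails at some point during its service
    life, assuming a constant failure rate of 1 / MTBF."""
    return 1.0 - math.exp(-service_hours / mtbf_hours)


p = failure_probability(100_000, 1_000_000)
print(f"{p:.1%}")  # 9.5%
```

The simple ratio 100,000 / 1,000,000 gives 10%, and the exponential model gives about 9.5%. The two agree closely whenever the service life is short compared with the MTBF.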


RE: Not really.
By TomZ on 3/10/2007 4:55:58 PM , Rating: 2
Yes, that makes sense, however, most consumers would assume that MTBF gives some indication of the useful life, which it clearly does not, both in theory and empirically.

It is even interesting that the cited researchers and the writer of this DT article also infer such a connection. Maybe it is all a misunderstanding.


RE: Not really.
By TomZ on 3/10/2007 4:56:40 PM , Rating: 1
Oh, and how lame of folks to downrate your post. Your comment was spot on, if you ask me.


RE: Not really.
By Oregonian2 on 3/12/2007 1:25:39 PM , Rating: 2
MTBF is defined in the standards. Most common ones used are Telcordia's and there's a MIL spec that's used often. Those define what it is. How companies come up with those numbers for their products is the part in contention.

If a device's MTBF is a million hours, it's supposed to mean that if you've a thousand of them, after a million hours 500 of them will still be working. It's the MEAN time to failure using the word MEAN mathematically, it's not the same as AVERAGE (although can be sometimes). IOW - half take less time than MTBF to fail and half take more (that's what "mean" means). The first five hundred could have failed much earlier or could have all failed at the 990,000 hour point. It's the point where half of them will have failed which very obviously means that one has only a 50-50 chance of one's device lasting that long. It is NOT the expected lifetime of the device.


RE: Not really.
By martyspants on 3/12/2007 9:59:20 PM , Rating: 2
No.

Mean means the average (even mathematically). The term you are describing is the Median Time Between Failures.
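The gap between mean and median is easy to see in a quick simulation. This Python sketch draws lifetimes from an exponential distribution with a mean of one million hours; that distribution is an assumption chosen for illustration (real drives age, as the article notes), and the seed and sample size are arbitrary.

```python
import math
import random
import statistics

MTBF_HOURS = 1_000_000
SAMPLES = 100_000

random.seed(0)  # fixed seed so repeated runs give the same numbers

# expovariate takes the rate (1 / mean) and returns one simulated lifetime
lifetimes = [random.expovariate(1.0 / MTBF_HOURS) for _ in range(SAMPLES)]

mean_life = statistics.mean(lifetimes)
median_life = statistics.median(lifetimes)
failed_by_mtbf = sum(t <= MTBF_HOURS for t in lifetimes) / SAMPLES

print(f"mean   : {mean_life:,.0f} hours")
print(f"median : {median_life:,.0f} hours")
print(f"theory : median = MTBF * ln 2 = {MTBF_HOURS * math.log(2):,.0f} hours")
print(f"failed by the MTBF mark: {failed_by_mtbf:.1%}")
```

With these assumptions the mean comes out close to one million hours, the median near 693,000 hours, and roughly 63% of the simulated drives have failed by the time the MTBF figure is reached, not 50%.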


RE: Not really.
By Oregonian2 on 3/13/2007 5:41:37 PM , Rating: 2
Yup, you're right. I'm wrong. I need to sleep more, my brain went bonkers that day. I just took another peek at Telcordia SR-332 (been a few years), the most often used spec (along with the MIL one), and so it's worse than I had thought on my brain-dead day. Everything depends very heavily on the distribution of failures (heavy on the bathtub curve as an assumed distribution, but that may not be true, especially if the MTBF is projected on products that have been burned in already so that the front end is mostly gone already). So, depending upon assumptions one wants to make... things can be very different. If one would make the rash assumption of the burn-in mentioned above (for reliable devices) and that the failure rate was mostly at the back end of a steep bathtub curve, 100% of the devices could be defective at the MTBF number of hours (but in this scenario, ALL of them lasted until then too and dropped all together). So how things spread out just depends upon how well reality matches the assumptions made in the prediction model. Which is pretty tricky and probably is why even the "standard" SR-332 is pretty hand-wavy at best in its discussions. :-)


RE: Not really.
By walk2k on 3/10/2007 1:56:15 PM , Rating: 2
I wonder what they mean by "extreme" heat, because in my experience hard drives running a server (i.e. nearly 100% usage, all the time) DO last a lot longer if they are kept cool. After installing cooling fans the rate of failure dropped to basically zero (the drives were lasting longer than those cheap fans, actually...). This was in a non-airconditioned room with multiple drives stacked in a regular tower-PC case.


RE: Not really.
By nothingtoseehere on 3/10/2007 4:38:04 PM , Rating: 2
MTBF stands for 'mean time between failure' and should be the mean time between failure. 'Mean' is what statisticians say when they mean 'average' (and they say 'mean' instead of 'average' to disambiguate from 'median').

MTBF should be the mean time between failure for the entire life of the drive, not the first N hours only, nor a convoluted result of a calculation based on a measurement of the first N hours only.

Your definition makes it a number that depends on the age of the drive... Sure, it may last an average of 1M hours for the first 1000 hours of operation, but that is only useful information if the MTBF-rating includes that first value of '1000', because it says nothing about the failure rate starting at hour 1001.

The article is absolutely correct in stating that MTBF ratings are misleading consumers.

Simply put, the warranty should be as long as the MTBF.


RE: Not really.
By nah on 3/11/2007 8:33:42 AM , Rating: 2
Reminds me of the Challenger debacle---when Richard Feynman proved that the MTBF of space shuttles was not 1 in 100,000 but a more realistic 1 in 100. For a fascinating insight on NASA and the way stuff works in Washington, read "What Do You Care What Other People Think?" by Dick Feynman.













Copyright 2014 DailyTech LLC.