


Study says failure rates are as much as 15 times what manufacturers indicate

A study released this week by Carnegie Mellon University revealed that hard drive manufacturers may be exaggerating the mean time between failures (MTBF) ratings on their hard drives. In fact, the researchers indicated that, on average, failure rates were as high as 15 times the rated MTBFs.

Gathering data on roughly 100,000 hard drives from a variety of manufacturers, the researchers examined the drives in various operating conditions as well as real-world scenarios. Some drives were at Internet service providers, others at large data centers and some were at research labs. According to the results, the majority of the drives did not appear to be affected by their operating environment. In fact, the researchers indicated that drive operating temperature had little to no effect on failure rates -- a cool hard drive survived no longer than one running hot.

The types of drives used in the study ranged from Serial ATA (SATA) to SCSI and even high-end Fibre Channel (FC) drives. Customers typically pay a much larger premium for SCSI and FC drives, which also usually carry longer warranty periods and higher MTBF ratings.

Carnegie Mellon researchers found that these high-end drives did not outlast their mainstream counterparts:
In our data sets, the replacement rates of SATA disks are not worse than the replacement rates of SCSI or FC disks. This may indicate that disk-independent factors, such as operating conditions, usage and environmental factors affect replacement rates more than component specific factors.
According to the study, the number one cause of drive failures was simply age: the longer a drive has been in operation, the more likely it is to fail. Drives tended to start showing signs of failure after roughly five to seven years of service, after which there was a significant increase in annualized failure rates (AFR). The failure rates of drives in their first year of service were reported to be just as high as those of drives past the seven-year mark.

According to the Carnegie Mellon researchers, manufacturer MTBF ratings are highly overstated. Take, for example, the Seagate Cheetah X15 series, which carries an MTBF rating of 1.5 million hours -- roughly 171 years of constant service before problems. The researchers said, however, that customers should expect a more reasonable 9 to 11 years. Interestingly, the real-world data in the study showed a consistent average time to failure of about six years.
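
For readers who want to check the arithmetic, here is a minimal sketch (not part of the study) that converts the quoted 1.5-million-hour rating into years and into the annualized failure rate it implies, assuming a constant failure rate, and then shows the effective MTBF that the study's observed 2-13 percent annual replacement rates would correspond to.

```python
# Rough arithmetic behind the figures quoted above (an illustrative sketch,
# not the manufacturers' actual rating methodology).
HOURS_PER_YEAR = 24 * 365  # 8,760

mtbf_hours = 1_500_000  # Seagate Cheetah X15 rating cited in the article
print(f"MTBF in years: {mtbf_hours / HOURS_PER_YEAR:.0f}")  # ~171 years

# Under a constant-failure-rate assumption, the implied annualized failure
# rate (AFR) is roughly hours-per-year divided by the MTBF.
afr = HOURS_PER_YEAR / mtbf_hours
print(f"Implied AFR: {afr:.2%}")  # ~0.58% per year

# The study's observed 2-13% annual replacement rates correspond to far
# lower effective MTBFs than the rated 1-1.5 million hours.
for observed_afr in (0.02, 0.13):
    print(f"AFR {observed_afr:.0%} -> effective MTBF of about "
          f"{HOURS_PER_YEAR / observed_afr:,.0f} hours")
```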

The average replacement rate of drives ranged from 2 percent to a whopping 13 percent annually, indicating a need for manufacturers to reevaluate the way MTBF ratings are generated. Worst of all, these rates were for drives with MTBF ratings between 1 million and 1.5 million hours.

Garth Gibson, associate professor of computer science at Carnegie Mellon, indicated that the study was proof that MTBFs are not a reliable way of measuring drive quality. "We had no evidence that SATA drives are less reliable than the SCSI or Fiber Channel drives," said Gibson.

The researchers concluded that backup measures are a necessity for critically important data, no matter what kind of hard drive is being used. It is interesting to note that even Google's data centers rely mainly on SATA and PATA drives. At the current rate, it is only a matter of time before SATA drives perform as well as or better than SCSI and FC drives, offering the same reliability for much less money.


Comments



Not really.
By retrospooty on 3/9/2007 8:12:18 PM , Rating: 5
I don't think they are so much exaggerated as the term MTBF is just misunderstood. Here is how they do it.

Since they cannot test drives for 10 years or more prior to release, they test many drives. For the purpose of easy math, let's use 1000.

The manufacturer runs 1000 drives on a test bed for 1000 hours. If one fails, that is a 1,000,000-hour MTBF (1000 x 1000). In reality it's a bit more complex, but that is the gist of it. MTBF is not a rating of how long your individual hard drive should last, but a benchmark derived from many drives tested. Would you rather have a drive that tested at a low or a high MTBF? Of course, the higher the better.

The other thing this cannot find is latent heat or longevity related issues. If a set of 1000 drives lasted 1000 hours and none failed, that is great, but what if a certain component has a defect that causes 100% failure after 20,000-30,000 hours? The MTBF test would not find that defect. Still, it is a great test and an absolute requirement. I wish Ford and GM could be tested like that against Honda and Toyota.
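
To make that latent-defect point concrete, here is a small sketch with hypothetical numbers (not from any manufacturer): every simulated drive wears out between 20,000 and 30,000 hours, yet a 1000-hour test of 1000 drives still produces an enormous MTBF estimate.

```python
import random

# Hypothetical illustration: drives that all wear out between 20,000 and
# 30,000 hours still look superb in a 1,000-hour qualification test.
random.seed(0)

N_DRIVES = 1000
TEST_HOURS = 1000

lifetimes = []
for _ in range(N_DRIVES):
    wear_out = random.uniform(20_000, 30_000)   # guaranteed wear-out failure
    early = random.expovariate(1 / 1_000_000)   # rare random early failure
    lifetimes.append(min(early, wear_out))

failed_in_test = sum(1 for t in lifetimes if t <= TEST_HOURS)
mtbf_estimate = N_DRIVES * TEST_HOURS / max(failed_in_test, 1)

print(f"Failures during the {TEST_HOURS}-hour test: {failed_in_test}")
print(f"MTBF estimated from the test: {mtbf_estimate:,.0f} hours")
print(f"Actual mean lifetime: {sum(lifetimes) / len(lifetimes):,.0f} hours")
```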




RE: Not really.
By zippercow on 3/9/2007 8:28:31 PM , Rating: 5
You beat me to it. I actually wrote a short document for my company a while back that shows the formula (not sure if that will make more sense to anyone):

MTBF (Mean Time Between Failure) measures the average time that a device works properly without failure. The MTBF of any hardware is calculated using the following formula:

[short time period]*[number of pieces tested]/[number of pieces tested which failed within that time period]=MTBF

The MTBF rating of our DVR is 449,616.52 hours (~51 years). This means that if 51 DVRs were to be run for 1 year, 1 failure out of those 51 could be expected.
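
For anyone who wants to plug numbers into that formula, here is a minimal sketch of it; the 51-DVRs-for-one-year reading is an approximation (51 unit-years is about 446,760 unit-hours, close to the stated rating).

```python
def mtbf(test_hours, units_tested, units_failed):
    """MTBF as defined in the comment above:
    (test period * units tested) / units failed."""
    return test_hours * units_tested / units_failed

HOURS_PER_YEAR = 24 * 365

# The DVR example: where the ~51 years comes from.
rating = 449_616.52
print(rating / HOURS_PER_YEAR)       # ~51.3 "years"

# Reading it the other way: 51 units run for one year is roughly one
# expected failure (51 unit-years = 446,760 unit-hours, near the rating).
print(51 * HOURS_PER_YEAR)           # 446,760

# And the test-bed direction: 1000 drives for 1000 hours with one failure
# yields the oft-quoted 1,000,000-hour figure.
print(mtbf(1000, 1000, 1))           # 1,000,000.0
```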


RE: Not really.
By BladeVenom on 3/9/2007 9:47:35 PM , Rating: 4
Which kind of underscores their point, since no hard drive is going to last 51 years.


RE: Not really.
By oab on 3/9/2007 10:21:49 PM , Rating: 1
It might last 51 years, but it won't be in service any more.

It still might mechanically work (though it might lose some information due to that paramagnetic effect that I don't understand).


RE: Not really.
By S3anister on 3/10/2007 3:22:19 AM , Rating: 1
superparamagnetic effect lol. I did an analytical paper on that... not too fun.


RE: Not really.
By TomZ on 3/9/2007 10:19:30 PM , Rating: 4
This formula shows precisely the flaw that the researchers empirically proved, which is that this calculation doesn't take into account the "bathtub curve" increase in failures at the end of the product's life.


RE: Not really.
By JeffDM on 3/10/2007 10:09:20 AM , Rating: 2
I thought that study proved that the "bathtub" curve doesn't exist for hard drives. The trailing edge does exist as you suggest, but not the leading edge.


RE: Not really.
By TomZ on 3/10/2007 10:45:37 AM , Rating: 3
The leading edge of the "bathtub curve" is eliminated by testing performed at the factory prior to shipping the product to customers, in addition to focused QA activities that help to manage the yield.


RE: Not really.
By rgsaunders on 3/9/2007 11:00:40 PM , Rating: 4
The explanation given by an engineer from Seagate or WD several years ago was a variation on your explanation. As I recall, it went something like this: if the MTBF is 150,000 hours, and you replace the drives on the regular recommended schedule (e.g. every 3 years), then you can expect 150,000 hours of operation before you have a failure. The critical point, which is always left out, is that these figures are predicated on replacing drives on that recommended schedule, i.e. every 3 to 5 years. I am not saying this is either ethical or technically correct; it is the rather lame explanation given by the hard drive manufacturing industry when this issue arose 10 or 15 years ago. The process of assigning an MTBF would appear to be relatively simple, but as you have seen, it is anything but.


RE: Not really.
By nothingtoseehere on 3/10/2007 4:43:03 PM , Rating: 3
That equation means nothing: simply test for 0.1 seconds, and if no drive smokes in that first fraction of a second, you have "proven" whatever MTBF number you wish to publish. You can then choose to publish a number lower than the measured value of... INFINITY... decisions, decisions.

Should a drive fail, simply reduce the test time until it doesn't.

The 'time period' in your equation should be at least as long as the MTBF value itself; otherwise the published value is only a (grossly mistaken) estimate of the MTBF, as the article shows.

Perhaps it's time to rename MTBF to 'Misleading Term By Frauds' or something of that sort, which would be a more accurate description of what the value stands for.


RE: Not really.
By mindless1 on 3/12/2007 2:40:23 AM , Rating: 2
You mean, MTBF _ATTEMPTS_ to measure the average time.

Just because one device can use a certain test methodology to reach a reasonable MTBF does not mean another device can use the same test. When the test produces data inconsistent with true failure rates, the test is invalid, inapplicable, and potentially fraudulent when it is so obviously deceiving.


RE: Not really.
By tcsenter on 3/12/2007 4:15:17 AM , Rating: 3
Mean Time BETWEEN Failure(s) = the expected time between two successive failures of a system or sub-system.

It is misleading to consumers, yes, but not by design or intent. It's misleading because uninformed consumers have read into it something it does not mean and has never meant.

MTBF and MTTF mean the same thing today that they always have. Nothing has changed, except the level of education and understanding of the audience [erroneously] interpreting the meaning and practical significance.

Here is a great primer on statistical reliability models or standards:

http://www.relex.com/resources/art/art_mttf.asp

Some of the widely accepted reliability engineering or prediction standards:

MIL-HDBK-217
Telcordia
PRISM
RDF 2000
IEC 62380
NSWC-98/LE1 (Mechanical)
Chinese 299B
HRD5


RE: Not really.
By fic2 on 3/9/07, Rating: -1
RE: Not really.
By JeffDM on 3/10/2007 10:13:57 AM , Rating: 2
That's fine if you don't like them; then you can refer to the Google study, which says a lot of the same things.


RE: Not really.
By fic2 on 3/12/2007 11:30:58 AM , Rating: 2
Man. This forum needs some kind of sarcasm thing since at least half the people don't know it when it hits them in the face.


RE: Not really.
By defter on 3/10/2007 4:31:52 AM , Rating: 5
quote:
I don't think they are so much exaggerated as the term MTBF is just misunderstood.


It's not misunderstood, it's exaggerated...

quote:
The manufacturer runs 1000 drives on a test bed for 1000 hours. If one fails, that is a 1,000,000-hour MTBF (1000 x 1000). In reality it's a bit more complex, but that is the gist of it. MTBF is not a rating of how long your individual hard drive should last


So why is it called "MEAN time BETWEEN failures"? Since the current metric has nothing to do with the mean time between failures, manufacturers shouldn't call it "MTBF". Quoting unrealistic MTBF times is exaggerating.


RE: Not really.
By retrospooty on 3/10/2007 9:35:12 AM , Rating: 3
OK, perhaps it is named badly and should be changed, but my point remains the same.

What an MTBF of 1,000,000 hours means is not that your drive will last for 1,000,000 hours, but that after testing some number of drives for a total of 1,000,000 hours combined, only one failed. That is what it means now, today, and it is not necessarily false or exaggerated, just misunderstood.


RE: Not really.
By retrospooty on 3/10/2007 3:54:17 PM , Rating: 3
Hilarious.

I get rated down to Zero for explaining what MTBF means. LOL.


RE: Not really.
By Hoser McMoose on 3/11/2007 12:40:02 PM , Rating: 2
Note that there are some caveats to the measure, though. For example, at least certain drives are or have been rated with the assumption that the first 30 or 90 days (the "burn-in" period) are excluded. As most people in IT know, drives are by far the most likely to fail when they are first installed, so if a drive company ignores those early failures it instantly gets a much higher MTBF rating.

Also, some desktop drives are rated assuming a limited usage pattern. They basically assume the computer will be turned off, and therefore the drive not in use, most of the time -- e.g. powered on only a third of the time -- allowing them to multiply the failure rating by 3.

Of course, in the end there is actually NO testing done to determine the MTBF at all! If they actually DID test the drives, we would see variability from one drive to the next; for example, Seagate's Cheetah 15K.4 and 15K.5 would have different failure rates, and even single-platter drives would differ from 2- or 3-platter drives. However, if you look at the ratings, they're always all the same (1,400,000 hours for the Seagate Cheetahs).

In reality, MTBF ceased being any sort of ACTUAL measurement ages ago! It's now just a marketing term. These drives are not really being tested to determine their MTBF; they've been assigned an estimated failure rate, and then they are tested to see if they are likely to at least come close to that figure. The REAL estimated failure rate of these drives never gets posted publicly; we only ever see the marketing-assigned number.


RE: Not really.
By masher2 (blog) on 3/10/2007 11:46:50 AM , Rating: 2
> "So why it's called: "MEAN time BETWEEEN failure"? Since the current metric has nothing to do with mean time between failures..."

The term means exactly what it says. An MTBF of one million hours means that, if you operate a hard drive within its service life, you will average one failure every million hours.

The confusion occurs because drives have MTBFs far greater than their service life. If the service life is, for example, 100,000 hours, then you'd expect roughly a 10% chance of that drive failing during its service life.
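
The 10% figure above can be reproduced with a quick sketch, assuming a constant failure rate (exponential lifetime model); the simple ratio of service life to MTBF is a close approximation.

```python
import math

# Assuming a constant failure rate (exponential lifetime model), the chance
# of failing within the service life is 1 - exp(-service_life / MTBF).
mtbf_hours = 1_000_000
service_life_hours = 100_000

p_exact = 1 - math.exp(-service_life_hours / mtbf_hours)
p_linear = service_life_hours / mtbf_hours  # the back-of-envelope 10% above

print(f"Exponential model: {p_exact:.1%}")   # ~9.5%
print(f"Linear shortcut:   {p_linear:.1%}")  # 10.0%
```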


RE: Not really.
By TomZ on 3/10/2007 4:55:58 PM , Rating: 2
Yes, that makes sense; however, most consumers would assume that MTBF gives some indication of the useful life, which it clearly does not, both in theory and empirically.

It is even interesting that the cited researchers and the writer of this DT article also infer such a connection. Maybe it is all a misunderstanding.


RE: Not really.
By TomZ on 3/10/2007 4:56:40 PM , Rating: 1
Oh, and how lame of folks to downrate your post. Your comment was spot on, if you ask me.


RE: Not really.
By Oregonian2 on 3/12/2007 1:25:39 PM , Rating: 2
MTBF is defined in the standards. The most commonly used ones are Telcordia's, and there's a MIL spec that's used often. Those define what it is. How companies come up with those numbers for their products is the part in contention.

If a device's MTBF is a million hours, it's supposed to mean that if you've a thousand of them, after a million hours 500 of them will still be working. It's the MEAN time to failure using the word MEAN mathematically, it's not the same as AVERAGE (although can be sometimes). IOW - half take less time than MTBF to fail and half take more (that's what "mean" means). The first five hundred could have failed much earlier or could have all failed at the 990,000 hour point. It's the point where half of them will have failed which very obviously means that one has only a 50-50 chance of one's device lasting that long. It is NOT the expected lifetime of the device.


RE: Not really.
By martyspants on 3/12/2007 9:59:20 PM , Rating: 2
No.

Mean means the average (even mathematically). The term you are describing is the Median Time Between Failures.
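
A short sketch, assuming the constant-failure-rate (exponential) model usually behind these ratings: the mean and median are indeed different numbers, and far fewer than half of the units survive to the MTBF.

```python
import math

# Assuming exponential (constant failure rate) lifetimes, the median is
# MTBF * ln(2), and only exp(-1) of the units are still working at the MTBF.
mtbf = 1_000_000  # hours

median = mtbf * math.log(2)
still_working_at_mtbf = math.exp(-1)

print(f"Median lifetime: {median:,.0f} hours (~{median / mtbf:.0%} of the MTBF)")
print(f"Fraction still working at the MTBF: {still_working_at_mtbf:.0%}")  # ~37%
```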


RE: Not really.
By Oregonian2 on 3/13/2007 5:41:37 PM , Rating: 2
Yup, you're right, I'm wrong. I need to sleep more; my brain went bonkers that day. I just took another peek at Telcordia SR-332 (it's been a few years), the most often used spec (along with the MIL one), and it's worse than I had thought on my brain-dead day. Everything depends heavily on the assumed distribution of failures (heavy on the bathtub curve as an assumed distribution, which may not hold, especially if the MTBF is projected for products that have already been burned in, so that the front end of the curve is mostly gone). So, depending upon the assumptions one wants to make, things can be very different. If one made the rash assumption of the burn-in mentioned above (for reliable devices) and that the failure rate sat mostly at the back end of a steep bathtub curve, 100% of the devices could be defective at the MTBF number of hours (though in this scenario all of them would have lasted until then and then dropped together). So how things spread out just depends upon the assumptions and how well they match those made in the prediction model. Which is pretty tricky, and probably why even the "standard" SR-332 is pretty hand-wavy at best in its discussions. :-)


RE: Not really.
By walk2k on 3/10/2007 1:56:15 PM , Rating: 2
I wonder what they mean by "extreme" heat, because in my experience hard drives running a server (i.e. nearly 100% usage, all the time) DO last a lot longer if they are kept cool. After installing cooling fans, the rate of failure dropped to basically zero (the drives were actually lasting longer than those cheap fans...). This was in a non-air-conditioned room with multiple drives stacked in a regular tower-PC case.


RE: Not really.
By nothingtoseehere on 3/10/2007 4:38:04 PM , Rating: 2
MTBF stands for 'mean time between failures' and should be the mean time between failures. 'Mean' is what statisticians say when they mean 'average' (and they say 'mean' instead of 'average' to disambiguate it from 'median').

MTBF should be the mean time between failure for the entire life of the drive, not the first N hours only, nor a convoluted result of a calculation based on a measurement of the first N hours only.

Your definition makes it a number that depends on the age of the drive... Sure, it may last an average of 1M hours for the first 1000 hours of operation, but that is only useful information if the MTBF-rating includes that first value of '1000', because it says nothing about the failure rate starting at hour 1001.

The article is absolutely correct in stating that MTBF ratings are misleading consumers.

Simply put, the warranty should be as long as the MTBF.


RE: Not really.
By nah on 3/11/2007 8:33:42 AM , Rating: 2
Reminds me of the Challenger debacle -- when Richard Feynman showed that the shuttle's failure rate was not 1 in 100,000 but a more realistic 1 in 100. For a fascinating insight into NASA and the way things work in Washington, read "What Do You Care What Other People Think?" by Dick Feynman.


Point is?
By Nik00117 on 3/10/2007 5:31:02 AM , Rating: 2
I work on PCs a lot, and I always like to ask how old they are, and how old the parts are that break.

Normally I get an average of 6-8 years of operation. If a HDD lasts 6-8 years for me, I'm happy. And normally they do last that long.

Quite frankly, I really don't care whether my HDD will still be working in 171 years. If anything, in 171 years this HDD will be laughed at. Or in a museum. One of the two.




RE: Point is?
By zsdersw on 3/10/2007 8:34:02 AM , Rating: 2
I doubt it. 171 years from now we may very well be using hard drives.. still, because every time something else comes along that could be a viable replacement, the naysayers chime in and say "it'll never work out".


RE: Point is?
By JeffDM on 3/10/2007 10:24:00 AM , Rating: 2
I think flash drives may very well supersede hard drives within a decade. The cost per gigabyte of flash is going down a lot faster than that of hard drives, and that trend doesn't seem to be slowing down. I think it might require a change in how operating systems are written to make sure they don't need to swap.

During the transition, I think it's possible that the operating system, apps and critical data will reside on a flash drive, and less critical information will be on a separate volume that's on a hard drive. Maybe hard drives will still be around, but not everyone will need one. Power users will probably be the first to start using flash drives for computers, but will probably be the last to stop using hard drives too.


RE: Point is?
By TomZ on 3/10/2007 11:34:10 AM , Rating: 2
You have to look at two factors - cost and technology. On the cost side, there is an oversupply in the flash market right now, and so prices are pretty low. But it isn't always that way.

When you look at the technology, HDDs have a couple of orders of magnitude more storage capacity than flash, and R&D continues for HDDs as well as flash. So looking into the future, we would expect both to continue the trends of higher capacity, higher density, and lower cost per unit of storage.

So when flash can replace current HDDs in terms of cost and capacity for a mainstream market segment, HDDs will be also larger and cheaper than now. So I think what you'll see is HDDs continuing to hold marketshare until some other technology can totally dominate, e.g., the PRAM that Intel is talking about.


RE: Point is?
By Oregonian2 on 3/12/2007 1:30:47 PM , Rating: 2
That's not so much the point as to whether you'd still be using a 40MB hard drive now that you thought was tremendously big when you bought it for $400.


hmmm...
By drezilla on 3/10/2007 1:58:36 AM , Rating: 3
In my experience, a typical hard-drive lasts 4-5 years (anything beyond this is a bonus). Most techies are aware of this and generally start replacing drives before they fail as part of preventive maintenance. You would be a fool to take the MTBF literally and use this to predict when your drive goes gaga.




RE: hmmm...
By TomZ on 3/10/2007 5:06:02 PM , Rating: 2
In my experience, a HDD typically lasts between 0 and 20 years, and fails randomly and at an inconvenient moment (is there any convenient moment for a failure?).

For me, proactively replacing older HDDs in order to be able to schedule changeout isn't worth the increased cost. And it doesn't actually solve the problem since a drive could just as likely fail before you change it out.

I personally think it is more prudent to (a) have spares readily available, (b) use RAID mirroring, and (c) back up your data. So basically treat HDDs as though they are unreliable, which they are.


RE: hmmm...
By Pirks on 3/12/07, Rating: 0
RE: hmmm...
By TomZ on 3/12/2007 3:28:48 PM , Rating: 2
I have a couple of problems specifically with SyncToy:

1. It doesn't synchronize files marked "read only" properly, because it refuses to overwrite RO files on the sync target

2. SyncToy, IIRC, won't delete files on the target that have been deleted from the source

Probably other sync utilities don't have this problem.

With mirroring, who cares if it duplicates temp files or not?

If performance is a must, then just have two drive pairs that are mirrored. Mirroring doesn't prohibit performance.

Your use of SyncToy is the same as my description of backup. The benefit of still using mirrored RAID is that you still have a backup (the redundant drive) in between your backup runs. For example, suppose I run SyncToy, WinZip, or a backup utility nightly, I work all day on something, and I have a HDD failure at 4pm; then the day's work is basically lost, since I can't restore it from a backup.

The other benefit of mirroring is the ability to continue to work even when one of your HDDs stops working. This helps your immediate productivity/convenience, and it also keeps you from having to reload the OS, drivers, and apps again. You just put in a new drive, and the RAID controller automatically synchronizes, usually in the background while you do other work.


misleading
By dallastx on 3/10/2007 12:45:08 PM , Rating: 2
If the calculation you guys mention for hard drive MTBF is correct (and it sounds like it is, given the high numbers from manufacturers), then this is pretty misleading. What consumers care about is, "Hey, if I purchase this drive from you today, on average how long will it be until it fails?" However, it seems the MTBF is far from that.

I'm curious though, I recently bought a water pump to cool my system, and the manufacturer listed 50,000 hours for MTBF. I took this to mean an average lifetime of ~5 years, which seems reasonable. So I wonder if their definition of MTBF is different. If so, they should really standardize the meaning of the term.

How the hard drive manufacturers calculate MTBF sort of reminds me of how some companies calculate their reliability. We offer (as does everyone in the industry) 99.999% reliability. However, it's an aggregated total (over all system components), and it is pretty much meaningless if you care about how often you should expect a failure that incurs some form of data loss or data-integrity issue. It's like saying, "Hmm... the mouse never fails, so let's include its uptime in our failure-rate calculation for the entire system. That should minimize the effect of the processor board rebooting every week."
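
A brief sketch of that aggregation trick, with made-up numbers (one board rebooting weekly among nine components that essentially never fail): averaging availability across components hides the weak link, while a series-system calculation does not.

```python
# Made-up numbers: one board with ~10 minutes of downtime per week buried
# among nine components that essentially never fail.
MINUTES_PER_WEEK = 7 * 24 * 60

board_availability = 1 - 10 / MINUTES_PER_WEEK  # ~99.90%
other_components = [0.999999] * 9               # "the mouse never fails"

# Marketing-style aggregate: average the per-component availabilities.
aggregate = (board_availability + sum(other_components)) / 10

# Series-system availability: the system is up only if every part is up.
system = board_availability
for availability in other_components:
    system *= availability

print(f"Averaged across components: {aggregate:.4%}")  # looks like ~99.99%
print(f"Actual system availability: {system:.4%}")     # really ~99.90%
```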




RE: misleading
By bldckstark on 3/11/2007 2:36:14 PM , Rating: 2
I read a while back that the company that invented the blue LED was psyched because their 5 years of testing were up and they were finally going to be able to sell it. An LED's standard accepted life is 5 years minimum, so they had to test them for 5 years in order to market them successfully. There was no statistical inference about the outcome: they lit a bunch of LEDs, then left them on for 5 years. As far as I know some of them are still on. This was a big deal, because first comes the LED, then the laser. That is where the BR-DVD came from.

My point is that not all companies/markets use this form of highly misleading statistical analysis, although many do. You might note that lightbulbs last about as long as they say they will on the box.

Samuel Clemens (Mark Twain) said in the 1800s that there are three kinds of lies -
1. Lies
2. Damned Lies
3. Statistics


RE: misleading
By Oregonian2 on 3/12/2007 1:33:54 PM , Rating: 2
It could also just mean that they didn't understand the failure mechanisms well enough to use mathematical methods to extrapolate failure rates from shorter-term testing.

So they didn't sell ANY until after 5 years?


.
By semo on 3/9/2007 8:35:00 PM , Rating: 2
quote:
...indicating a need for manufacturers to reevaluate the way MTBF ratings are generated.
I always found these ratings to be fishy, but I doubt they are going to change in the near future. The only way I can imagine that happening is if all the manufacturers revised their MTBF ratings at the same time (unlikely, unless some sort of regulator orders them to). If just one of them said, "Hey, our drives have an MTBF of 10,000 hours... no really, they do, they last that long," who's going to take notice and not buy one with a 100,000-hour rating instead?

Now let's all wait for grandpa to tell us how this is all BS and his 40MB HDD still works without a hitch.




RE: .
By mindless1 on 3/12/2007 2:46:23 AM , Rating: 2
The way it would happen is if there were enough class-action lawsuits and a court order and penalty enforced. This is merely one area where consumers are not being protected; it is not enough for us tech-heads to "know" something that contradicts the specs -- that is what the specs were meant to show in the first place.

This doesn't mean we can ignore the alternate testing method for MTBF, which does tend to show exaggerated lifespans, but not THAT inflated. We can safely say that practically none of the drives operate until their MTBF rating, let alone far enough beyond it to offset those failing under it. This discrepancy needs to be resolved; it was a questionable marketing practice all along, but by now it has been stretched to the point of uselessness for some products.


A few good drives
By dare2savefreedom on 3/9/2007 11:39:15 PM , Rating: 4
Col. Hitachi: You want answers?
leet pc user: I think I'm entitled.
noob mac user: I don't ask questions.

Col. Hitachi: You want answers?
leet pc user: I want the truth.
Col. Hitachi: You can't handle the truth.




Platters
By Egglick on 3/10/2007 11:13:06 PM , Rating: 2
While this is slightly off-topic, I've often wondered how the usage of drive platters factors into the equation.

For instance, do drives with fewer platters last longer than those with more?? Does platter density have any effect on failure rate?




RE: Platters
By Oregonian2 on 3/12/2007 1:35:42 PM , Rating: 2
How about what day of the week it was made? Friday-afternoon disks? How about Monday morning ones? How about ones made between Christmas and New Year's (or some equivalent for the society it was made in)?


By Fritzr on 3/10/2007 4:25:37 AM , Rating: 2
Link to full article: http://news.bbc.co.uk/2/hi/technology/6376021.stm

Excerpt from article--
Google employs its own file system to organise the storage of data, using inexpensive commercially available hard drives rather than bespoke systems.

"Lower temperatures are associated with higher failure rates" -- Google report

Hard drives less than three years old and used a lot are less likely to fail than similarly aged hard drives that are used infrequently, according to the report.

"One possible explanation for this behaviour is the survival of the fittest theory," said the authors, speculating that drives which failed early on in their lifetime had been removed from the overall sample leaving only the older, more robust units.

The report said that there was a clear trend showing "that lower temperatures are associated with higher failure rates".

"Only at very high temperatures is there a slight reversal of this trend."

But hard drives which are three years old and older were more likely to suffer a failure when used in warmer environments.

"This is a surprising result, which could indicate that data centre or server designers have more freedom than previously thought when setting operating temperatures for equipment containing disk drives," said the authors.

The report also looked at the impact of scan errors - problems found on the surface of a disc - on hard drive failure.

"We find that the group of drives with scan errors are 10 times more likely to fail than the group with no errors," said the authors.

They added: "After the first scan error, drives are 39 times more likely to fail within 60 days than drives without scan errors."
--end excerpt




By TomZ on 3/10/2007 10:49:06 AM , Rating: 1
If I had to guess, what Google is seeing is that hard drives work best under constant temperature conditions. Thermal cycling causes mechanical stresses throughout the drive assembly, and maybe that contributes to early failures.


By dare2savefreedom on 3/9/2007 10:42:34 PM , Rating: 2
The approach will not be easy. You are required to maneuver straight down this trench and skim the surface to this point. The target area is only two meters wide. It's a small thermal exhaust port, right below the main port. The shaft leads directly to the reactor system. A precise hit will start a chain reaction which should destroy the station. Only a precise hit will set off a chain reaction.




google
By EnzoFX on 3/9/2007 11:07:24 PM , Rating: 2
Google came recruiting at our school and told us much of this same information about hard drives; they're all about minimizing cost...




Huge gap in annual failiure rate
By ATWindsor on 3/10/2007 3:35:59 AM , Rating: 2
2% to 13% -- that's a pretty huge gap. I wonder if disk activity has anything to do with the rate; is a disk going at 100% all of the time more likely to fail than one going at 50%?

AtW




A bit confusing . . .
By Spacecomber on 3/10/2007 8:54:55 AM , Rating: 2
quote:
According to the study, the number one cause of drive failures was simply age: the longer a drive has been in operation, the more likely it is to fail. Drives tended to start showing signs of failure after roughly five to seven years of service, after which there was a significant increase in annualized failure rates (AFR). The failure rates of drives in their first year of service were reported to be just as high as those of drives past the seven-year mark.


This paragraph was a bit confusing to me at first. After re-reading it, I think it is describing the so-called "bathtub" curve, where you see high failure rates at the start of the drive's life cycle and at the end. But after looking at the original article, I got the impression that the results didn't really fit this curve that well. Instead, they described a steadily increasing rate of failure that begins earlier than expected.

quote:
Observation 5: Contrary to common and proposed models, hard drive replacement rates do not enter steady state after the first year of operation. Instead replacement rates seem to steadily increase over time.

Observation 6: Early onset of wear-out seems to have a much stronger impact on lifecycle replacement rates than infant mortality, as experienced by end customers, even when considering only the first three or five years of a system's lifetime. We therefore recommend that wear-out be incorporated into new standards for disk drive reliability. The new standard suggested by IDEMA does not take wear-out into account


Looking at the bar charts representing this data in the study, only one data set shows much evidence of higher failure rates during the first year, followed by a period of lower -- but gradually increasing -- failure rates.

At least, this is how I read the data.




Riiiiight
By Polynikes on 3/10/2007 1:09:08 PM , Rating: 2
I think anyone with half a brain knows it's just a crap shoot. I've got a 20GB WD ATA100 HDD that's at least 8 years old and still works perfectly, yet 3 years ago I had one of two Seagate 80GB SATA1 HDDs fail on me, and the other followed within another year. Granted, those are the only hard drives I've had die on me out of the many I've owned, but it just goes to show that since the MTBF is a MEAN, any given drive could fail tomorrow or not for a decade.




...
By Goty on 3/10/2007 5:11:24 PM , Rating: 1
Somehow the phrase, "No shit, Sherlock," comes to mind.




Just like people
By Hypernova on 3/9/07, Rating: 0
lies, damn lies, and statistics
By codeThug on 3/9/07, Rating: 0
MTBF numbers are a lie
By Beenthere on 3/9/07, Rating: -1
RE: MTBF numbers are a lie
By PandaBear on 3/9/2007 9:15:53 PM , Rating: 2
Agreed. MTBF is usually useful only for things that fail at random or as infant mortality. It is a prediction of how many DOA drives you get if you order a large quantity. Once you power a drive on, it is not a prediction of how many will die within 1, 3, 5, or 7 years.

Every drive is designed differently and eventually fails for different reasons, so there is no universal formula to quantify it. Most large OEMs (e.g. Dell or HP) run their own qualification tests on each new design/model before they take a huge order of millions of drives, so they know what they are getting and always get the best prime drives.

The rest that fail qualification -- sold as 250GB rather than 300GB, or 120GB rather than 160GB -- go into retail for average users. Big OEMs won't accept drives with a quarter of the heads disabled or the outer quarter of the platter locked out.

The ones that do very badly? They go to Fry's as white-box drives. I once saw an IBM Deskstar with a hand-soldered resistor on the PCB -- clearly a reject.


RE: MTBF numbers are a lie
By retrospooty on 3/9/2007 10:30:29 PM , Rating: 3
You don't understand what MTBF is... Zippercow explained it best above.

[short time period]*[number of pieces tested]/[number of pieces tested which failed within that time period]=MTBF

The MTBF rating of our DVR is 449,616.52 hours (~51 years). This means that if 51 DVRs were to be run for 1 year, 1 failure out of those 51 could be expected.

This does not mean the DVR is expected to last 51 years.


RE: MTBF numbers are a lie
By nothingtoseehere on 3/10/2007 4:55:13 PM , Rating: 2
The 'between' from the 'B' in MTBF implies that the number represents a time period between failures... Which two failures are those in your equation? There is no 'failure A time' minus 'failure B time' in your equation; because there are no failures A and B to compare the time between, there is no set of numbers to take the mean of, so your equation cannot lead to the MTBF.

Maybe the manufacturers determine their MTBFs that way, sure, but then they shouldn't call it MTBF, because that is misleading.


RE: MTBF numbers are a lie
By TomZ on 3/10/07, Rating: 0
RE: MTBF numbers are a lie
By Bladen on 3/11/2007 6:56:33 AM , Rating: 2
Actually, mega, kilo, etc. are metric prefixes, and metric works in powers of 10, so they are right about that.

Don't get me wrong though, MTBF is misleading at best, unless they want to change it to "MTBF of 1000 drives for 1000 test hours" -- or whatever sample size and time they use.


RE: MTBF numbers are a lie
By TomZ on 3/12/2007 8:44:08 AM , Rating: 2
Actually, that's wrong. The prefixes are Greek prefixes, and they were not invented for the metric system.

In the computer industry, powers-of-two prefix definitions have been commonly used for about a half-century. You just need to understand that "megabyte" has two meanings -- sometimes 1000^2 and sometimes 1024^2, depending on the context. For example, when I purchase a HDD, a megabyte = 1000^2; however, when I purchase DRAM, a megabyte is 1024^2.

Engineers and computer scientists usually use powers-of-two definitions. It is mainly the marketing literature that uses powers of ten to inflate the apparent capacity of HDDs. Even with HDDs, the fundamental sector size is a power-of-two measure (e.g., 512 bytes), so the inherent design of the drive is powers-of-two, yet it is marketed in powers of ten. This is a relatively recent development -- in the "olden days" (e.g., 10 years ago), most (all?) HDDs used powers-of-two measurements when they stated their capacity.
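
A quick sketch of the gap between the two conventions, using a hypothetical "500 GB" drive as the example:

```python
# The two "gigabyte" conventions discussed above, applied to a hypothetical
# drive marketed as 500 GB (decimal) and reported by an OS in binary units.
marketed_bytes = 500 * 1000**3           # 500 GB, decimal gigabytes
binary_gib = marketed_bytes / 1024**3    # the same bytes in binary gibibytes

print(f"{binary_gib:.1f} GiB")   # ~465.7 -- the familiar "missing" space
print(f"Gap at the GB scale: {1 - 1000**3 / 1024**3:.1%}")  # ~6.9%
```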


RE: MTBF numbers are a lie
By BikeDude on 3/12/2007 9:59:26 AM , Rating: 1
No, I remember drives from twenty years ago that were a tad optimistic in their size estimation. The only thing that changed ten years ago was a dramatic increase in total drive space (making the discrepancy even more apparent).

The communications industry also uses powers of ten. Basically, they count the number of bits transferred and pay no heed to whether a byte (or word) is 5, 6, 7, 8 or 9 bits long.

And of course:
k - kilo (1000)
K - Kilo based on power of two (1024)
m - milli (meaningless as far as 'we' are concerned)
M - Mega (1000k or 1024K depending on context)
b - bit (!)
B - Byte

I.e. if you see someone using "mb" (millibit) as a unit, please hit them hard on the head.

--
Rune


RE: MTBF numbers are a lie
By TomZ on 3/12/2007 3:18:22 PM , Rating: 2
1. HDD size discrepancies in years past were due to misunderstanding the difference between "raw" capacity and "formatted" capacity. That problem still exists today, but the word is out more on that one.

2. The communication industry uses powers-of-ten measures for <prefix>bytes/s because they implement communication standards that use powers-of-ten crystals and signaling frequencies. These crystals are obviously more prevalent than powers-of-two crystals.

3. I don't think that 'k' vs 'K' meaning 1000 vs 1024 is really widely accepted or really even a good idea. It is too subtle of a distinction, since humans are pretty used to ignoring case.


RE: MTBF numbers are a lie
By Hoser McMoose on 3/11/2007 12:53:08 PM , Rating: 2
Obviously it is a bit self-serving of them, but in this case at least they are 100% accurate -- more so than the OS definition of a kilobyte as 1024 bytes, which is simply wrong.

Kilo, mega, giga, etc. are SI prefixes which are well defined. By the very definition of the prefix, "megabyte" = 1,000,000 bytes, and this definition predated computers by a LONG time. It only got subverted to mean 1,048,576 because in computers things usually come in powers of two and 2^20 is "close enough" to a million that we figured it was ok for us lazy folk.


RE: MTBF numbers are a lie
By TomZ on 3/11/2007 10:16:10 PM , Rating: 2
Kilobyte has never meant 1000 bytes - never.


RE: MTBF numbers are a lie
By Chernobyl68 on 3/15/2007 6:01:43 PM , Rating: 2
I'd rather they test 100 drives until all 100 failed, then tell me what the average failure time was.


RE: MTBF numbers are a lie
By mjcutri on 3/10/2007 8:29:52 AM , Rating: 3
Did you even READ the article, or just the headline? They tested all kinds of drives in all kinds of different environments and found that all of them performed about the same, regardless of their type.
"In our data sets, the replacement rates of SATA disks are not worse than the replacement rates of SCSI or FC disks."


RE: MTBF numbers are a lie
By goku on 3/12/2007 12:20:35 AM , Rating: 2
yeah I noticed that, but what about PATA drives? About 5 years ago, weren't they more likely to fail?


RE: MTBF numbers are a lie
By gsellis on 3/12/2007 7:59:10 AM , Rating: 2
I would assume that any PATA drive would perform as a similar SATA drive. The connector type means less than the moving parts.


RE: MTBF numbers are a lie
By PandaBear on 3/12/2007 2:47:30 PM , Rating: 2
SATA and PATA are just interfaces; it is the design and components of the HD that make it good or bad. For example:

Raptors are much more reliable than Maxtor's SATA drives.


RE: MTBF numbers are a lie
By TomZ on 3/12/2007 4:43:10 PM , Rating: 2
Probably all you can say is the PATA connection/connector is more likely to fail than a SATA connection/connector. As the others have pointed out, the interface shouldn't have much bearing on the reliability of the drive itself.


"Can anyone tell me what MobileMe is supposed to do?... So why the f*** doesn't it do that?" -- Steve Jobs











botimage
Copyright 2014 DailyTech LLC. - RSS Feed | Advertise | About Us | Ethics | FAQ | Terms, Conditions & Privacy Information | Kristopher Kubicki