

43 comment(s) - last by Griswold.. on May 3 at 5:13 PM

Testing shows that some single-core Opterons have heat-related problems

It is reported that AMD is trying to track down as many as 3,000 Opteron processors which could experience erratic behavior under high-temperature conditions. Processors affected include a number of single-core Opteron processors manufactured within the past six months.

The chips were shown to experience higher-than-normal core temperatures when running in a high-temperature environment. This caused the chips to flub some floating-point calculations. From InformationWeek:

Because of the tests, AMD has changed the screening process for rating the two product lines as the chips come off the production line, Taylor said. As a result, some chips that would have been rated with clock speeds of 2.8 MHz in the past would be listed at 2.6 MHz, making them less likely to be used in extreme computing environments.

This appears to be a separate issue from the one reported earlier by The Register, which claimed that a bug in a batch of Opteron processors will produce incorrect results in loops that run for millions of iterations. Coupled with high ambient temperatures, the processor will corrupt data. The Register states:

The problem is believed to affect only a fraction - perhaps no more than 3,000 individual CPUs - which managed to slip through AMD's screening net. It is not known how this so-called 'test escape' occurred, but it took place "in part of 2005 and early 2006", an AMD spokesman said.

Although only a few processors are defective, the fact that no one can place an exact bearing on which batch of processors has the problem is troubling at best. AMD claims measures have been put in place to prevent the bug from happening again, but also stresses that the condition is not likely to happen in financial environments. 

Intel made similar claims during the early Pentium days with the now infamous "F00F" bug.



Comments



A big deal over nothing, as usual.
By cornfedone on 4/29/2006 3:14:01 PM , Rating: 2
AMD ships hundreds of thousands of CPUs a month and because a few chips (probably 3000 or fewer) MAY develop a floating point error if over-heated and run in a continuous loop, you'd think the sky was falling.

Typical nonsense. I'd like to have a dollar for every defective CPU, chipset, mobo, etc. that Intel has shipped. I'd be a millionaire many times over.

Give AMD credit for acknowledging a handful of marginal chips were shipped and are being replaced.




RE: A big deal over nothing, as usual.
By ViRGE on 4/29/2006 3:33:52 PM , Rating: 2
It's not nonsense in the least. Intel had a slightly more severe problem with the 1.13GHz Coppermine P3, and the combined efforts of AnandTech/HardOCP/TomsHardware got Intel to can the chip completely, even though the only thing the trio could get it to fail at was some Linux compile test. Any time you can come up with a scenario where a CPU can fail in reasonable environmental conditions, you have a problem. I'm just glad to hear that this isn't an inherent defect in the overall design like it was for the 1.13GHz Coppermine.


By Griswold on 4/30/2006 5:12:48 AM , Rating: 3
So you're comparing 3000 individual chips to an entire line of a processor? Very clever.


RE: A big deal over nothing, as usual.
By brownba on 4/29/2006 3:36:12 PM , Rating: 1
wow, typical fanboy response.

Even though only maybe 3000 units are affected,
that's 3000 too many.

Where do you think opterons are used?
In important 24/7 servers running continuously in hot conditions... that's where the problem is.

'Give AMD credit...'? ummm... right.
Funny how people get credit for correcting a situation that never should have occurred.


RE: A big deal over nothing, as usual.
By BioRebel on 4/29/2006 4:07:53 PM , Rating: 2
And this coming from an intel fanboy...


RE: A big deal over nothing, as usual.
By AaronAxvig on 4/29/2006 4:23:58 PM , Rating: 2
No, he's right. This doesn't help their reputation at all. The server market is not a place to be seen shipping faulty chips that fail under any reasonable usage conditions. Any market, for that matter.


By fikimiki on 4/29/2006 4:46:10 PM , Rating: 2
You're right, but AMD is trying as best it can to manage that. This is only a product made by people. They are not hiding this information, and they deserve credit for that.
I'm a fanboy, but more than that I respect people and companies who can admit to a failure. This gives more credibility than anything else.


RE: A big deal over nothing, as usual.
By Ringold on 4/29/2006 4:50:45 PM , Rating: 3
No, first poster is right.

How many times does this probably happen and not get reported? Or how often does a single chip slip through a crack?

The system isn't 100% effective, never is. This just happened to hit the news waves on a slow day. Almost like any other recall, a very, very, very small fraction of those with the chip will ever have a problem, but will get it replaced anyway.

Common sense says it happens to Intel, and due to their larger share, probably in greater numbers.

Again, grandstanding by Intel fanboys. "It's reached the end of its life!" All this other crap that nobody on this forum (unless they're AMD or Intel engineers) truly knows. No story in this at all.


By Burning Bridges on 4/29/2006 7:15:02 PM , Rating: 2
AIUI AMD knew there was this problem and were screening all their chips for it, and unfortunately 3000 of the chips got through. They are also offering to replace the chip for anyone who has one, free of charge.


By aGreenAgent on 4/29/2006 4:50:46 PM , Rating: 2
Yeah, it's definitely 3000 too many chips. I don't think it says anything about AMD, though. They had a tiny batch screw up, they're fixing it - let's move on.


By saratoga on 4/29/2006 5:13:08 PM , Rating: 3
If you demand a 0% failure rate, it's impossible to produce anything. Saying 3000 is 3000 too many is silly. All processes produce bugs, and no testing procedure is 100% accurate.


RE: A big deal over nothing, as usual.
By Griswold on 4/30/2006 5:15:07 AM , Rating: 2
Opterons usually don't run 24/7 under higher than rated temperatures, especially not for mission critical applications. Did you even read the article? You have to run them at a higher temperature than you should.


RE: A big deal over nothing, as usual.
By Phynaz on 5/1/2006 5:18:31 PM , Rating: 1
Ever been in a computer room in the tropics?
You wouldn't say this if you have.


By Viditor on 5/2/2006 2:53:08 AM , Rating: 2
quote:
Ever been in a computer room in the tropics?

Many! But I've never heard of anyone stupid enough to build one without climate control! Even clothes rot quickly in those climates...without climate control, any server made would last less than a year.


By Griswold on 5/3/2006 5:13:50 PM , Rating: 2
One thing is for sure, you have never been in any server room - that much is for sure.


By segagenesis on 4/29/2006 6:50:13 PM , Rating: 3
Yeah, let's forget about the Pentium FDIV and F00F bugs while we bash AMD for not only owning up to the problem but doing the right thing by offering replacements. Intel didn't even recognize or offer replacement processors for the aforementioned bugs until there were millions of affected processors on the market. This is maybe 3000 processors, of which the problem only surfaces under extreme conditions. Yeah, I can see who has their head in the bong on this one.

AMD must have pretty good QC in place if they managed to find this problem in internal testing which has yet to surface or be reproduced in the real world. Get real guys, at least they are bothering to admit to the problem. As far as the Pentium problems... the FDIV bug was pretty low key in that it would not affect most PC users, but the F00F bug was a Really Bad Thing (tm) where anyone in user mode could hardlock the computer.

And probably the most ignorant comment of the day...
quote:

Even though only maybe 3000 units are affected,
that's 3000 too many.

Where do you think opterons are used?
In important 24/7 servers running continuously in hot conditions... that's where the problem is.


If you have servers that are running continuously hot in a production environment, you are either currently preparing your CV for future employment or have forgotten the basics of data center air conditioning.

And the most sensible
quote:

Yeah, it's definitely 3000 too many chips. I don't think it says anything about AMD, though. They had a tiny batch screw up, they're fixing it - let's move on.

If you demand a 0% failure rate, it's impossible to produce anything. Saying 3000 is 3000 too many is silly. All processes produce bugs, and no testing procedure is 100% accurate.


It would be bad if the problems were showing up in server environments or on hardware review sites and AMD then denied any wrongdoing. This, however, is not the case.




By brownba on 4/29/2006 7:29:54 PM , Rating: 2
quote:
If you have servers that are running continuously hot in a production environment, you are either currently preparing your CV for future employment or have forgotten the basics of data center air conditioning.

oh baloney.
By definition, a server is a machine that runs continuously.
And if you think all servers are located in wonderful climate controlled data centers, you're sadly mistaken.

And,
quote:
AMD must have pretty good QC in place if they managed to find this problem in internal testing which has yet to surface or be reproduced in the real world. Get real guys, at least they are bothering to admit to the problem.

Yes, they have 'pretty good' QC, and it's great that they found this before it possibly did any harm.
but... 'pretty good' QC, IMO, is not good enough.
And you'd better believe I'd be bashing Intel if they had another serious bug too.

(for the record, I'm currently running an XP2500+).


By lemonadesoda on 4/29/2006 7:54:10 PM , Rating: 2
quote:
AMD must have pretty good QC in place if they managed to find this problem in internal testing which has yet to surface or be reproduced in the real world. Get real guys, at least they are bothering to admit to the problem.


WTF nonsense statement is that?! If the QC had caught this problem then AMD wouldn't need to be "tracking down 3000 processors" (in fact it seems the 3000 is a very rough guesstimate, probably on the marketing-damage-limitation low side, based on some analyst making an assumption about the number of possible faulty processors that ARE in fact in hot environments. A hell of a lot of assumptions. I'm sure the summer heat will reveal more).

Clearly, someone in the real world discovered the problem AFTER the CPUs were released. AMD have replicated the error and found the issue to be correlated with CPU core temp and are now in the process of attempting a recall-and-replacement of released CPUs, and a "rebadging" to a lower clock for non-released CPUs. (Will they put a sticker over the top, saying 2.4GHz maximum, warranty void on attempts to overclock this CPU even just 1Hz? I thought AMD was enthusiastically overclockable...)

Remember the intel utility to discover the math-error? They honoured a replacement to any person that found the error on their CPU - which would affect their ability to use the CPU and specific software correctly.

I think AMD better come forward with a do-loop stress test of this issue and likewise honour a replacement CPU.


By Viditor on 4/29/2006 10:20:03 PM , Rating: 2
quote:
If the QC had caught this problem then AMD wouldn't need to be "tracking down 3000 processors"

QC did catch the problem (there weren't any reports from customers), they just caught it after a bad batch had been shipped. When you test chips at those extremes, you never do it with every chip, you only do it with a small sample every few 100k or so.

BTW, this is AMD's first ever recall...Intel has had 5 now.
I'd cut AMD some slack here...


By Phynaz on 5/1/2006 5:22:40 PM , Rating: 1
Bad batch?

More like hundreds of bad batches. They were shipping these bad chips for months.


By Viditor on 5/2/2006 2:38:21 AM , Rating: 2
quote:
More like hundreds of bad batches. They were shipping these bad chips for months

3000 chips is the wafer-out equivalent of ~27 wafers (after accounting for even minimal yields)...
At over 6000 wafer starts per week, it's pretty easy to miss 27 of them when it's an erratum that only occurs under absolute worst-case conditions...
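
As a rough check of the arithmetic in the comment above, here is a minimal sketch. It takes the 3,000 chips, the ~27 wafers, and the 6,000 wafer starts per week straight from the comment; the constant and function names are made up for illustration, and none of the figures come from AMD documentation.

```python
# Figures quoted in the comment above; they are the commenter's estimates.
AFFECTED_CHIPS = 3000          # upper estimate of escaped processors
EQUIVALENT_WAFERS = 27         # commenter's wafer-out equivalent
WAFER_STARTS_PER_WEEK = 6000   # commenter's figure ("over 6000")


def good_dice_per_wafer(chips: int, wafers: int) -> float:
    """Implied number of shippable dice per wafer."""
    return chips / wafers


def share_of_weekly_starts(wafers: int, weekly_starts: int) -> float:
    """Fraction of one week's wafer starts that the affected wafers represent."""
    return wafers / weekly_starts


if __name__ == "__main__":
    per_wafer = good_dice_per_wafer(AFFECTED_CHIPS, EQUIVALENT_WAFERS)
    share = share_of_weekly_starts(EQUIVALENT_WAFERS, WAFER_STARTS_PER_WEEK)
    print(f"Implied good dice per wafer: {per_wafer:.0f}")
    print(f"Share of one week's wafer starts: {share:.2%}")
```

With those inputs the estimate implies roughly 111 good dice per wafer, and the 27 wafers amount to under half a percent of a single week's starts. Since the affected chips reportedly shipped over several months, the share of total production over that period would be smaller still.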


By Griswold on 5/3/2006 5:12:55 PM , Rating: 2
Waste of time, some fellers just can't do the math.


By Viditor on 5/2/2006 3:06:08 AM , Rating: 2
quote:
if you think all servers are located in wonderful climate controlled data centers, you're sadly mistaken

True (though all mission-critical servers are...). However, any server that is in an environment where the ambient temp is over 50C (122 Fahrenheit) I can guarantee is in some form of climate control (even if it's basic room A/C). The problems exist above that ambient temperature...

I agree with your sentiment that all manufacturers should be called to task when there's an erratum, but at the end of the day you just have to ask one question...if you were about to purchase an Opteron server, would this story make you reconsider your choice?


By defter on 4/30/2006 4:16:33 AM , Rating: 2
quote:
If you have servers that are running continuously hot in a production environment


Let's not forget that CPUs DON'T run too hot if they are running within spec.

According to AMD's documentation, maximum case temperature for 2.6-2.8GHz Opterons is 67 degrees Celsius. This means that AMD GUARANTEES that those CPUs will work continuously at this temperature around the clock, under full load. If they don't, then they are buggy and need to be replaced.


It is happening with 248 processors as well
By davecason on 4/30/2006 9:30:59 AM , Rating: 2
Check out the screen-shots in the AMD forums:
http://forums.amd.com/index.php?showtopic=68503

So far, two of us have had unusual heat problems with the newer E4, 0.09 micron Opteron 248 processors as well. The problem may not be limited to the 25x series processors.

I think the 25x series chips were identified because they were sold to major manufacturers, whereas the 248 chips were simply OEM purchases by regular consumers who could not afford the higher-end chips.




RE: It is happening with 248 processors as well
By johnsonx on 4/30/2006 11:41:05 AM , Rating: 2
You've got to be kidding with that thread. You want us to believe that because some dipshit had heat problems with a cheap 1U chassis he bought off eBay, that indicates there's a fundamental problem with Opteron 248s? Please tell me that's not your angle.
The whole setup is wrong in that case. I can't tell for sure from the photos, but it looks like the heatsink fins on the front CPU run perpendicular to the airflow rather than parallel. That case shouldn't be run at all with the cover off, and the HSF on the back CPU is an obvious piece of crap. Don't even get me started on the wiring; it's blocking airflow all over the place. He's lucky he's gotten it working at all; 1U chassis are not DIY projects, kids.


RE: It is happening with 248 processors as well
By davecason on 4/30/2006 5:31:12 PM , Rating: 2
Hey man, take it back a notch.

That thread has two screen shots from my 4U case with the same problem. I have 1 processor that is just much hotter than the other and swapping the position or HSF doesn't change things and the rest of the case is not hot. Check my screen shots at the bottom of his thread (note the board temp under load vs. CPU2).


By davecason on 4/30/2006 5:32:50 PM , Rating: 2
Better yet, check the shots directly:
At idle:
http://www.geocities.com/davcason/heatproblem.jpg
At load:
http://www.geocities.com/davcason/heatproblem_atlo...

This is a real problem and I have gone to extremes to try and solve it.


By Burning Bridges on 4/30/2006 12:04:03 PM , Rating: 2
I agree with the previous reply - WTF r u doin? ;)

Seriously, the problem is that if you run an opteron out of a certain batch at above rated temps for very long running a continuous loop you might get some issues. The problem, as such, IS NOT that opterons run hot. I wish that fellow luck if he wants to make a dual Xeon 1U server....


RE: It is happening with 248 processors as well
By hstewarth on 5/1/2006 12:01:42 AM , Rating: 1
quote:
I wish that fellow luck if he wants to make a dual Xeon 1U server....


No problem, just run Xeon LVs - they actually run at 31 watts. There are SFF machines that generate more heat than this guy's.

http://www.supermicro.com/products/system/1U/6014/...


By Viditor on 5/2/2006 3:12:50 AM , Rating: 2
quote:
No problem, just run Xeon LVs - they actually run at 31 watts. There are SFF machines that generate more heat than this guy's.

Fine...as long as you don't need floating point power or 64 bit.
Of course that's a pre-built (XeonLV isn't available for DIY)...


Has AMD hit end of generation
By hstewarth on 4/29/06, Rating: 0
RE: Has AMD hit end of generation
By peternelson on 4/29/2006 5:42:17 PM , Rating: 2
AMD plans to deliver Opteron in Socket F with DDR2 in Q4 2006.

It *MIGHT* support Hypertransport 3 but I don't know.

AMD also have 65nm as an option to move to, which will help thermally and to scale frequency further. Also AMD has the HE variants of the chips which use less power (and thus are cooler too).

I think AMD are doing the right thing with this announcement and I respect them for it. On occasions Intel has been aware of a bug and kept quiet. If I know there is a possibility of problems I can either

a) use extra cooling
b) stress test the machine in a hot room to compare calculation results with expected. I can thus discover if the chip I have is one of the lame ones before running production calculations.
c) choose to buy a dualcore opteron instead where there is no problem.
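
Option (b) above could be sketched roughly as follows. This is only an illustration of the idea of repeating a deterministic floating-point loop and comparing each result against a reference; it is not AMD's screening test, the function names are hypothetical, and an interpreted Python loop will load the CPU far less than a purpose-built stress tool.

```python
import math
from typing import List


def fp_workload(iterations: int) -> float:
    """Deterministic floating-point loop.

    On healthy hardware the same input always yields a bit-identical result.
    """
    acc = 0.0
    for i in range(1, iterations + 1):
        acc += math.sqrt(i) * math.sin(i) / (1.0 + i)
    return acc


def stress_test(runs: int, iterations: int) -> List[int]:
    """Repeat the workload and return the indices of runs that disagree
    with the reference result computed on the first pass."""
    reference = fp_workload(iterations)
    mismatches = []
    for run in range(runs):
        if fp_workload(iterations) != reference:
            mismatches.append(run)
    return mismatches


if __name__ == "__main__":
    # Small numbers for a quick demo; a real soak test would run for hours
    # with the machine held at the elevated temperature of interest.
    bad_runs = stress_test(runs=5, iterations=200_000)
    if bad_runs:
        print(f"Mismatched runs: {bad_runs}")
    else:
        print("All runs matched the reference result.")
```

One caveat: the reference here is computed on the machine under test, so a chip that miscalculates consistently would go unnoticed. Comparing against a value produced by the same software on a known-good machine would be the safer design.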


RE: Has AMD hit end of generation
By hstewarth on 4/29/2006 8:17:04 PM , Rating: 3
The term for a bug in a CPU is "erratum", and Intel usually provides a list of these on their website.

Before an actual processor is released, there are usually several errata found and several revisions of the chip.

During my first job, I actually discovered an erratum in an IBM clone of Intel's chip. It was a strange problem between 16-bit and 32-bit protected mode.

I don't believe Socket F and DDR2 are actually major architecture changes. So it's still possible that this architecture is speed limited and the only way to raise performance is a new architecture design. This is likely what happened with Intel and Netburst.


RE: Has AMD hit end of generation
By themelon on 4/30/2006 1:41:07 AM , Rating: 2
The RevF/Socket L1 Opterons are coming out sometime during the summer, not at the end of the year.

And they will not support Hypertransport V3 because that spec was just ratified last week. Going to be hard to get that into a chip that has been in development for at least 18 months.


RE: Has AMD hit end of generation
By Viditor on 4/29/2006 10:10:44 PM , Rating: 3
quote:
AMD is coming near end of this generation

Ummm...this is not a design problem but a manufacturing mistake on 3,000 chips.
It didn't get noticed because you have to heat the thing up to a temp WAY past what you'd run it at to find the problem.
So no, it's not yet "near the end of its generation", but I guess you could call it faulty testing (though at those temps I'm inclined to cut them some slack).
The reason they are having a hard time finding them is that nobody has noticed any problems.


Here's what I find scariest
By mikecel79 on 4/29/2006 8:32:03 PM , Rating: 4
quote:
Although only a few processors are defective, the fact that no one can place an exact bearing on which batch of processors has the problem is troubling at best. AMD claims measures have been put in place to prevent the bug from happening again, but also stresses that the condition is not likely to happen in financial environments.


This part bothers me the most. They say they don't know which batch the "bad" processors came from but then claim that it's 3000 CPUs. How can they tell which ones are bad if they don't even know the batch they came from? How can they put a number on the amount of bad processors without this information?




RE: Here's what I find scariest
By Viditor on 4/29/2006 10:23:26 PM , Rating: 4
quote:
They say they don't know which batch the "bad" processors came from but then claim that it's 3000 CPUs. How can they tell which ones are bad if they don't even know the batch they came from? How can they put a number on the amount of bad processors without this information


Excellent question...the answer is that they use a proprietary software called APM which knows exactly what settings are on each die of each wafer (down to the nm). The tough part will be matching the wafer out to the dice that were packaged.


Mistakes
By toattett on 4/29/2006 2:06:23 PM , Rating: 3
quote:
As a result, some chips that would have been rated with clock speeds of 2.8 MHz in the past would be listed at 2.6 MHz,


This is apparently incorrect.




RE: Mistakes
By Trisped on 5/1/2006 12:53:02 PM , Rating: 2
That is their solution to the problem. The chips that escaped did not have this applied to them, but the chips now being produced do.


RE: Mistakes
By oTAL (blog) on 5/1/2006 9:17:53 PM , Rating: 2
He meant it's supposed to read GHz instead of MHz...
And by the way, I upped your rating dude... fixing typos is an important part of a self-regulated community - it allows others to read the article with the typos fixed :)


Amd & the 3000
By crystal clear on 4/30/2006 12:50:21 AM , Rating: 2
Looks like you wrote your article in a rush. Some corrections and additions to your article; this is a portion of the AMD release:
AMD and its partners are directly contacting customers who may have potentially susceptible AMD Opteron x52 and x54 products. These single-core AMD Opteron™ processor models x52 and x54 (2.6 GHz and 2.8 GHz) are limited to 152, 252, 852, 154, 254, and 854 only.
I commented on this subject yesterday on your website (AMD & Intel rush efforts).
Even Sun is involved in AMD's efforts to correct the problem.
Another topic that gets a very good response - fight it out..















Copyright 2010 DailyTech LLC.