


AMD engineers reveal details about the company's upcoming 45nm processor roadmap, including plans for 12-core processors

"Shanghai! Shanghai!" the reporters cry during the AMD's financial analyst day today. Despite the fact that the company will lay off nearly 5% of its work force this week, followed by another 5% next month, most employees interviewed by DailyTech continue to convey an optimistic outlook.

The next major milestone for the CPU engineers comes late this year, with the debut of 45nm Shanghai. Shanghai, for all intents and purposes, is nearly identical to the B3 stepping of Socket 1207 Opteron (Barcelona) shipping today.  However, whereas Barcelona had its HyperTransport 3.0 clock generator fused off, Shanghai will once again attempt to get HT3.0 right.

Original roadmaps anticipated that HT3.0 would be used not only for socket-to-socket communication but also for communication with the Southbridge controllers. Motherboard manufacturers have confirmed that this is no longer the case, and that HT3.0 will only be used for inter-CPU communication.

"Don't be disappointed, AMD is making up for it," hints one engineer.  Further conversations revealed that inter-CPU communication is going to be a big deal with the 45nm refresh.  The first breadcrumb comes with a new "native six-core" Shanghai derivative, currently codenamed Istanbul.  This processor is clearly targeted at Intel's recently announced six-core, 45nm Dunnington processor.

But sextuple-core processors have been done before, or at least we'll see the first ones this year.  The really neat stuff comes a few months later, when AMD will finally ditch the "native-core" rhetoric.  Two separate reports sent to DailyTech by AMD partners indicate that Shanghai and its derivatives will also get the twin-die-per-package treatment.

AMD planned twin-die configurations as far back as the K8 architecture, though it abandoned those efforts.  The company never explained why those processors were nixed, but just weeks later "native quad-core" became a major marketing campaign for AMD in anticipation of Barcelona.

A twin-die Istanbul processor could enable 12 cores in a single package. The cores will communicate with one another via the now-enabled HT3.0 interconnect on the processor.

The rabbit hole gets deeper.  Since each of these processors will contain a dual-channel memory controller, a single core can emulate quad-channel memory functions by accessing the other dual-channel memory controller on the same socket.  This move is likely a preemptive strike against Intel's Nehalem tri-channel memory controller.
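To put rough numbers on the idea, here is a minimal sketch of the peak-bandwidth arithmetic in Python. The article does not give memory speeds, so DDR2-800 for Shanghai and DDR3-1066 for Nehalem are illustrative assumptions, not reported figures:

# Theoretical peak memory bandwidth: channels x transfers/s x bytes per transfer.
def peak_bandwidth_gbs(channels, mt_per_s, bus_width_bits=64):
    return channels * mt_per_s * 1e6 * (bus_width_bits / 8) / 1e9

single_die = peak_bandwidth_gbs(channels=2, mt_per_s=800)    # one die, dual-channel DDR2-800: ~12.8 GB/s
twin_die   = peak_bandwidth_gbs(channels=4, mt_per_s=800)    # both on-package controllers: ~25.6 GB/s
nehalem    = peak_bandwidth_gbs(channels=3, mt_per_s=1066)   # tri-channel DDR3-1066: ~25.6 GB/s

print(f"single die : {single_die:.1f} GB/s")
print(f"twin die   : {twin_die:.1f} GB/s")
print(f"tri-channel: {nehalem:.1f} GB/s")

The aggregate figure ignores the extra latency of crossing the on-package HT3.0 link to reach the other die's memory controller.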
 
Motherboard manufacturers claim Shanghai and its many-core derivatives will be backwards compatible with existing Socket 1207 motherboards.  However, processor-to-processor communication will downgrade to lower HyperTransport frequencies on these older motherboards. The newest 1207+ motherboards will officially support the HyperTransport 3.0 frequencies.

Shanghai has already taped out and is currently running Windows at AMD.


Comments

By kkwst2 on 4/17/2008 10:09:42 PM , Rating: 3
Well, it depends on the application. Fluent (computational fluid modeling software) scales pretty well to well over 100 cores. Most modern computer modeling software packages do.

It costs me around $700 per core to assemble a high-end cluster right now using dual proc nodes. I had to use Xeons because the Quad Opterons just weren't available.

I understand that the Opterons still scale better than Xeons on Fluent and many other HPC applications, I guess because of their better FPU performance. I would have used Opterons if I could have.

With 6 cores per processor and thus 12 cores per node, I could probably cut the cost to around $500 per core. The question of course is whether the bus is going to be fast enough to utilize the increased cores effectively. Even at 8 cores per node and the newer 1600 MHz FSB, the scaling appears to be somewhat limited by the FSB.

If HT3.0 helps solve this and allows better scaling with increased core density in my cluster, this could be huge for my application.
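For what it's worth, the cost arithmetic behind those figures looks roughly like this (a minimal sketch; the node prices below are hypothetical, chosen only to reproduce the ~$700 and ~$500 per-core numbers above):

def cost_per_core(node_cost, sockets=2, cores_per_socket=4):
    # Dual-socket node: total cores = sockets x cores per socket.
    return node_cost / (sockets * cores_per_socket)

today  = cost_per_core(node_cost=5600, cores_per_socket=4)   # quad-core CPUs -> ~$700/core
future = cost_per_core(node_cost=6000, cores_per_socket=6)   # six-core CPUs  -> ~$500/core

print(f"quad-core node: ${today:.0f} per core")
print(f"six-core node : ${future:.0f} per core")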


By tinyfusion on 4/18/2008 12:13:54 AM , Rating: 2
By the time AMD starts shipping its 6-core processors, Intel will have sold millions of Nehalem processors, including 8-core CPUs, which are free of the limitations imposed by the aging FSB.


By spluurfg on 4/18/2008 2:41:33 AM , Rating: 3
To my knowledge, the Nehalem uses an on-die memory controller, which was first implemented on the Opteron... Also, both will need to have a bus to the main memory, which can still serve as a bottleneck...


By spluurfg on 4/18/2008 7:25:08 AM , Rating: 2
I am guessing that you are suggesting that some other processor implemented an on-die memory controller first, though I can't be sure just from your comment -- perhaps you could enlighten me?

At any rate, my point was that, to my knowledge, the Nehalem's use of an on-die memory controller will not transcend bandwidth limitations between the main system memory and the processor, and that the Opteron's implementation of an on-die memory controller was the first in this market (x86 multi-socket server processors).

Though the caveat from my original reply was 'to my knowledge', so I'm welcome to any correction here.


By josmala on 4/18/2008 8:06:44 AM , Rating: 2
Ondie memory controllers?
EV7, 80128, Timna.


By josmala on 4/18/2008 8:08:01 AM , Rating: 2
Typo, I meant 80186, the processor Intel made between the original 8086 and the 80286.


By Amiga500 on 4/18/2008 7:32:28 AM , Rating: 2
Even at 8 cores per node and the newer 1600 MHz FSB, the scaling appears to be somewhat limited by the FSB.

Are you talking about Intel Xeons there?

On the Xeons, the speedup from going from 4 to 8 cores is minimal in CFX (it should be the same in Fluent). I guess you'd already know about allocating your processes to specific cores to reduce cache flushing and get the best out of the architecture.

For 2-thread jobs, allocate to separate sockets to take advantage of both the shared cache and the memory bandwidth - for instance, use CPUs 0 and 4.

For 4-thread jobs, to take advantage of the shared cache, allocate to CPUs 0, 2, 4 and 6 (or a variant).

You'll see a significant speedup doing that - over 30% in some cases.
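A minimal sketch of how that kind of pinning can be done on Linux (CPU numbering is system-dependent, and 0/4 and 0/2/4/6 are just the examples from this thread, not universal mappings):

import os

def pin_to_cpus(cpus):
    # Restrict the calling process (pid 0 = self) to the given logical CPUs.
    os.sched_setaffinity(0, set(cpus))
    print("now restricted to CPUs:", sorted(os.sched_getaffinity(0)))

pin_to_cpus([0, 4])          # 2-thread job: one core per socket
# pin_to_cpus([0, 2, 4, 6])  # 4-thread job: cores chosen to share cache per socket

Most MPI launchers expose the same idea through their own process-binding options, so in practice it is usually set there rather than in code.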


"We don't know how to make a $500 computer that's not a piece of junk." -- Apple CEO Steve Jobs













