AMD engineers reveal details about the company's upcoming 45nm processor roadmap, including plans for 12-core processors

"Shanghai! Shanghai!" the reporters cry during AMD's financial analyst day today. Despite the fact that the company will lay off nearly 5% of its workforce this week, followed by another 5% next month, most employees interviewed by DailyTech continue to convey an optimistic outlook.

The next major milestone for the CPU engineers comes late this year, with the debut of 45nm Shanghai. Shanghai, for all intents and purposes, is nearly identical to the B3 stepping of Socket 1207 Opteron (Barcelona) shipping today. However, whereas Barcelona had its HyperTransport 3.0 clock generator fused off, Shanghai will once again attempt to get HT3.0 right.

Original roadmaps anticipated that HT3.0 would be used not only for socket-to-socket communication, but also for communication with the Southbridge controllers. Motherboard manufacturers have confirmed that this is no longer the case, and that HT3.0 will only be used for inter-CPU communication.

"Don't be disappointed, AMD is making up for it," hints one engineer.  Further conversations revealed that inter-CPU communication is going to be a big deal with the 45nm refresh.  The first breadcrumb comes with a new "native six-core" Shanghai derivative, currently codenamed Istanbul.  This processor is clearly targeted at Intel's recently announced six-core, 45nm Dunnington processor.

But sextuple-core processors have been done, or at least we'll see the first ones this year. The really neat stuff comes a few months later, when AMD will finally ditch the "native-core" rhetoric. Two separate reports sent to DailyTech from AMD partners indicate that Shanghai and its derivatives will also get the twin-die-per-package treatment.

AMD planned twin-die configurations as far back as the K8 architecture, though it abandoned those efforts. The company never explained why those processors were nixed, but just weeks later "native quad-core" became a major marketing campaign for AMD in anticipation of Barcelona.

A twin-die Istanbul processor could enable 12 cores in a single package. These cores will communicate with one another via the now-enabled HT3.0 interconnect on the processor.

The rabbit hole gets deeper. Since each of these processors will contain a dual-channel memory controller, a single core can emulate quad-channel memory functions by accessing the other dual-channel memory controller on the same socket. This move is likely a preemptive strike against Intel's Nehalem tri-channel memory controller.
 
Motherboard manufacturers claim Shanghai and its many-core derivatives will be backwards compatible with existing Socket 1207 motherboards.  However, processor-to-processor communication will downgrade to lower HyperTransport frequencies on these older motherboards. The newest 1207+ motherboards will officially support the HyperTransport 3.0 frequencies.

Shanghai is currently taped out and running Windows at AMD.


Comments



By LumbergTech on 4/17/2008 7:31:26 PM , Rating: -1
How difficult is it to make a program that can actually use 12 cores? It seems like most developers are still trying to learn to deal with quad core properly.




By KristopherKubicki (blog) on 4/17/2008 7:33:21 PM , Rating: 5
Meet every webserver app you've ever heard of :)


By onwisconsin on 4/17/2008 9:59:44 PM , Rating: 5
Sorry if I misunderstood your post, but 32bit (x86) OSes will work fine on 64bit CPUs. Also, 32bit programs will work on a 64bit OS, but 32bit DRIVERS won't.


By Jellodyne on 4/18/2008 9:45:54 AM , Rating: 2
You're technically right, but the performance hit is very slight. But there are also times where the address space from a 64 bit Windows OS will cause a 32 bit program to run faster.

See, in most cases under a 32 bit OS, your program is limited to 2GB of RAM. There's a flag you can set, if the program is written for it, to extend that to 3GB. However, under a 64 bit Windows OS, each 32 bit app can have 4GB of unfettered address space. If you've got more than 2GB of RAM and a memory-hungry 32 bit app, the odds are good you'll run faster on a 64 bit OS despite the very minor hit from the 32-to-64 translation layer.
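
To put numbers on the limits described in the comment above, here is a small arithmetic sketch. The 2GB, 3GB and 4GB figures are taken from the comment; the variable names are made up for illustration, and "GB" is treated as 2^30 bytes.

```python
# Illustrative arithmetic only; the names below are hypothetical labels.
GIB = 2 ** 30

# A 32-bit pointer can address 2**32 distinct bytes in total.
total_32bit_space = 2 ** 32

# Address space available to one 32-bit program, per the comment above.
user_space = {
    "32-bit OS, default": 2 * GIB,
    "32-bit OS, 3GB flag set": 3 * GIB,
    "64-bit OS, 32-bit app": 4 * GIB,
}

for scenario, size in user_space.items():
    share = size / total_32bit_space
    print(f"{scenario}: {size // GIB} GiB ({share:.0%} of the 32-bit range)")
```

The last case is the only one where the program can use the whole range a 32-bit pointer can reach, which is the point the comment is making.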


By darkpaw on 4/18/2008 9:55:42 AM , Rating: 2
On a fully native 64bit processor like Itanium, you have to emulate the whole 32bit architecture, which comes with a huge performance hit.

On the x64 processors, the entire 32bit architecture is still present. Nothing is actually emulated. A software interface to the 32 bit APIs is provided (Windows on Windows), which is more like virtualization than emulation. This does not cause a significant performance hit.


By freeagle on 4/18/2008 6:59:50 AM , Rating: 2
Where do you people get this information?

Programming does not get harder when you try to utilize more than 2 cores. The problem is that people try to do things in parallel that should not be done that way. When you need a lot of data sharing between threads, there is a good chance that you are going the wrong way. Examples from game engines of tasks that do parallelize well are rendering the scene, calculating physical interactions once you sort your objects into disjoint groups, etc.

osalcido, the whole point of having extra cores is to compute appropriate tasks faster, like the rendering mentioned above, or archiving, decoding... Another point is that the whole system gets more responsive, especially when you are used to running multiple applications at once.

Freeagle


By Aikouka on 4/18/2008 9:10:35 AM , Rating: 2
I'm not sure what you mean, because programming can inherently become more complex as you raise the number of threads. Now, the reason why I boldfaced "can" is because not all tasks are hard to break down. A simple computation can typically be broken down into parts and combined later on, but that's just a simple computation. But there are tasks that are harder to break down. I remember taking a parallel processing course back in college and one of the focuses early on in the class was learning how to break down a task into parts that could be split amongst the cluster of machines. Some were pretty simple yet even tasks that looked easy to break down sometimes proved to be a bit tricky.
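
The "break it into parts and combine later" pattern for a simple computation can be sketched as follows. This is only an illustration under assumptions: the function names, the sum-of-squares workload and the worker count are invented here, not taken from the course mentioned above.

```python
from concurrent.futures import ThreadPoolExecutor

def split(data, parts):
    """Cut data into `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks

def sum_of_squares(chunk):
    # Each worker only reads its own chunk, so no locking is needed.
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(data, workers=4):
    chunks = split(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum_of_squares, chunks))
    # Combine step: merge the independent partial results.
    return sum(partials)

if __name__ == "__main__":
    numbers = list(range(1_000))
    print(parallel_sum_of_squares(numbers))
```

Note that in CPython the global interpreter lock keeps threads from speeding up pure-Python arithmetic like this; the sketch shows the decomposition, not a benchmark. Tasks where the parts depend on each other's intermediate results do not fit this shape, which is where the trickiness comes in.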


By freeagle on 4/18/2008 11:45:52 AM , Rating: 2
What I mean is that people tend to "force" parallel programming where it's not appropriate; that's why it seems to be getting harder. If you know how to break a task into parts that can run in parallel, then you know what's the maximum number of threads you can utilize without getting into synchronization nightmares.


By Aikouka on 4/18/2008 12:25:05 PM , Rating: 2
Ahh, sorry then. I must've misunderstood your post. I agree with what you said there about how people tend to push multi-threading or think multi-threading is apt for any application.


By Sulphademus on 4/18/2008 9:16:11 AM , Rating: 2
Multi-Apps will be the biggest one ATM.
Some things just aren't well designed for SMP. But given that my Vista machines are running 50 to 80 processes, spreading the load of single threaded processes helps a lot. However, these types of things will only get faster via more MHz or a more efficient core architecture.

The advantage for programs that do work well in parallelism should be huge this next round.


By freeagle on 4/18/2008 11:49:37 AM , Rating: 2
Some applications can run faster only with increased performance of a single core. That's because the way they execute is extremely hard or outright impossible to do in parallel.


By boogle on 4/18/2008 9:38:27 AM , Rating: 2
I for one hope you're not a programmer working on multithreaded applications. If you're going to randomly spawn off a load of threads because the class looks like it can work on its own, you're going to end up with a scheduling & syncing nightmare. And as anyone who's run into lots of syncing knows, performance drops below that of a single threaded app.


By freeagle on 4/18/2008 12:01:14 PM , Rating: 2
quote:
If you're going to randomly spawn off a load of threads because the class looks like it can work on its own, you're going to end up with a scheduling & syncing nightmare


I have absolutely no idea how you deduced this from my post.

quote:
scheduling .... nightmare

The number of threads in your application has nearly zero effect on the performance of the system scheduler. What can really slow your application below the execution of a single threaded app is when you dynamically create and destroy threads. An example could be matrix multiplication. But I'm sure you, as someone that knows a lot about parallel programming, have heard of something called a thread pool, or the futures concept.
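
A minimal sketch of the thread pool and futures idea, applied to the matrix multiplication example mentioned above. The function names and the one-future-per-row split are assumptions made for illustration; `ThreadPoolExecutor` is Python's standard-library pool.

```python
from concurrent.futures import ThreadPoolExecutor

def multiply_row(row, b):
    """Compute one row of the product: row (1 x n) times b (n x m)."""
    cols = range(len(b[0]))
    return [sum(row[k] * b[k][j] for k in range(len(b))) for j in cols]

def matmul_pooled(a, b, pool):
    # One future per output row; the pool's worker threads are reused
    # instead of being created and destroyed for each row.
    futures = [pool.submit(multiply_row, row, b) for row in a]
    # result() blocks until that row is ready; order matches the rows of a.
    return [f.result() for f in futures]

if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    with ThreadPoolExecutor(max_workers=4) as pool:
        print(matmul_pooled(a, b, pool))
```

The pool is created once by the caller and passed in, so repeated multiplications share the same workers. In CPython the global interpreter lock limits the actual speedup for pure-Python arithmetic, so treat this as a picture of the structure rather than a performance claim.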


By inighthawki on 4/17/2008 7:42:42 PM , Rating: 3
I believe the majority of the hardships for making a program multi-core compatible is simply utilizing multiple process threads in the application, allowing it to work on more than one thing at a time.


By osalcido on 4/18/2008 5:41:45 AM , Rating: 1
You really think there are software programmers out there making core-total tailored software?

I haven't heard of this.


By kkwst2 on 4/17/2008 10:09:42 PM , Rating: 3
Well, it depends on the application. Fluent (computational fluid modeling software) scales pretty well to well over 100 cores. Most modern computer modeling software packages do.

It costs me around $700 per core to assemble a high-end cluster right now using dual proc nodes. I had to use Xeons because the Quad Opterons just weren't available.

Understand that the Opterons still scale better than Xeons on Fluent and many other HPC applications, I guess because of their better FPU performance. I would have used Opterons if I could have.

With 6 cores per processor and thus 12 cores per node, I could probably cut the cost to around $500 per core. The question of course is whether the bus is going to be fast enough to utilize the increased cores effectively. Even at 8 cores per node and the newer 1600 MHz FSB, the scaling appears to be somewhat limited by the FSB.

If HT3.0 helps solve this and allows better scaling with increased core density in my cluster, this could be huge for my application.


By tinyfusion on 4/18/2008 12:13:54 AM , Rating: 2
By the time AMD starts shipping its 6-core processors, Intel will have sold millions of Nehalem processors, including 8-core CPUs, which are free of the limitations imposed by the aging FSB.


By spluurfg on 4/18/2008 2:41:33 AM , Rating: 3
To my knowledge, the Nehalem uses an on-die memory controller, which was first implemented on the Opteron... Also, both will need to have a bus to the main memory, which can still serve as a bottleneck...


By spluurfg on 4/18/2008 7:25:08 AM , Rating: 2
I am guessing that you are suggesting that some other processor implemented an on-die memory controller first, though I can't be sure just from your comment -- perhaps you could enlighten me?

At any rate, my point was that, to my knowledge, the Nehalem's use of an on-die memory controller will not transcend bandwidth limitations between the main system memory and the processor, and that the Opteron's implementation of an on-die memory controller was the first in this market (x86 multi-socket server processor).

Though the caveat from my original reply was 'to my knowledge', so I'm welcome to any correction here.


By josmala on 4/18/2008 8:06:44 AM , Rating: 2
On-die memory controllers?
EV7, 80128, Timna.


By josmala on 4/18/2008 8:08:01 AM , Rating: 2
Typo. I meant the 80186, the processor Intel made between the original 8086 and the 80286.


By Amiga500 on 4/18/2008 7:32:28 AM , Rating: 2
quote:
Even at 8 cores per node and the newer 1600 MHz FSB, the scaling appears to be somewhat limited by the FSB.

Are you talking about Intel Xeons there?

On the Xeons the speedup from going from 4 to 8 cores is minimal in CFX (it should be the same in Fluent). I guess you already know about explicitly allocating your processors to reduce cache flushing and get the best out of the architecture.

For 2 thread jobs, allocate to separate sockets to take advantage of both shared cache and memory bandwidth - for instance use CPUs 0 and 4.

For 4 thread jobs, to take advantage of shared cache allocate to CPUs 0, 2, 4 and 6 (or variant).

You'll see a significant speedup doing that - over 30% in some cases.
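
On Linux, the kind of explicit allocation described above could be sketched like this. It is a hedged illustration, not how CFX or Fluent do it: the helper names are hypothetical, the CPU sets are copied from the comment, and how CPU numbers map to sockets and shared caches depends on the machine.

```python
import os

# CPU sets suggested in the comment above, keyed by thread count. How CPU
# numbers map to sockets and shared caches is machine-specific; check yours.
SUGGESTED_CPUS = {2: {0, 4}, 4: {0, 2, 4, 6}}

def choose_cpus(threads, available):
    """Return the suggested CPU set, restricted to CPUs that exist here."""
    wanted = SUGGESTED_CPUS.get(threads, set(available))
    usable = wanted & set(available)
    # Fall back to every available CPU if none of the suggested ones exist.
    return usable or set(available)

def pin_current_process(threads):
    # os.sched_getaffinity / os.sched_setaffinity are Linux-only.
    if not hasattr(os, "sched_setaffinity"):
        return None
    cpus = choose_cpus(threads, os.sched_getaffinity(0))
    os.sched_setaffinity(0, cpus)  # pid 0 means the calling process
    return cpus

if __name__ == "__main__":
    print(pin_current_process(2))
```

Calling `pin_current_process(2)` before starting a 2 thread job restricts the process to the chosen CPUs; whether that yields the speedup quoted above would have to be measured on the actual hardware.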


By HighWing on 4/17/2008 11:20:22 PM , Rating: 2
I think at some point it's how the OS supports it that matters more. It's my understanding that even if the program is not written to take advantage of a multi-core system, the OS should be able to interpret this and still either split up the processing as needed or be managing background apps with the other cores.


By mindless1 on 4/18/2008 7:02:52 AM , Rating: 2
No, the OS can't split up the load any more than it was already multi-threaded. Yes, background apps could and would run on other cores, but it's often fairly irrelevant as background apps often take up insignificant CPU time.


By Locutus465 on 4/18/2008 12:12:23 AM , Rating: 2
It all depends on the application, but I've written code myself (in my college days) which could scale well past 12 cores. Of course not every workload will benefit from this; my chosen workload just happened to be very conducive to scaling well. I think game developers will have the hardest time utilizing this many cores.


By DarkElfa on 4/18/2008 10:39:33 AM , Rating: 2
I agree, I barely have anything other than rendering that uses the 4 cores I have now, why do I need 12?


By Locutus465 on 4/21/2008 9:54:14 AM , Rating: 2
Oh, I wasn't saying "forget more cores"! Quite the contrary, I think they should bring them on!! In fact I think I might start dabbling in threaded programming again (even though I concentrate on web now) just for fun. I'm just saying the benefit will vary by app, and some genres of applications (i.e. games) may take a little while longer to feel the full effect.


Copyright 2014 DailyTech LLC.