
Sandia simulations reveal memory is the bottleneck for some multi-core processors

Years ago, the hallmark of processor performance was clock speed. As chipmakers hit the wall on how far they could push clock speeds, processor designs moved to multiple cores to increase performance. However, as many users can tell you, performance doesn't always increase as you add more cores to a system.

Benchmarkers know that a quad-core processor often offers less performance than a similarly clocked dual-core processor for some uses. The reason for this phenomenon, according to Sandia, is one of memory availability. Supercomputers have tried to increase performance by moving to multi-core processors, just as the world of consumer processors has done.

The Sandia team has found that simply increasing the number of cores in a processor doesn't always improve performance, and beyond a certain point performance actually decreases. Sandia simulations have shown that moving from dual-core to four-core processors offers a significant increase in performance. However, the team has found that moving from four cores to eight cores offers an insignificant performance gain. When you move from eight cores to 16 cores, performance actually drops.

Sandia team members used simulations with algorithms for deriving knowledge from large data sets for their tests. The team found that when you moved to 16 cores, the performance of the system was barely as good as the performance seen with dual cores.

The problem, according to the team, is the lack of memory bandwidth along with fighting between the cores over the available memory bus of each processor. The team uses a supermarket analogy to better explain the problem. If two clerks check out your purchases, the process goes faster; add four clerks and things are even quicker.

However, if you add eight clerks or 16 clerks, it becomes a problem not only to get your items to each clerk, but the clerks can also get in each other's way, leading to slower performance than using fewer clerks provides. Team member Arun Rodrigues said in a statement, "To some extent, it is pointing out the obvious - many of our applications have been memory-bandwidth-limited even on a single core. However, it is not an issue to which industry has a known solution, and the problem is often ignored."

James Peery, director of Sandia's Computations, Computers, Information, and Mathematics Center said, "The difficulty is contention among modules. The cores are all asking for memory through the same pipe. It's like having one, two, four, or eight people all talking to you at the same time, saying, 'I want this information.' Then they have to wait until the answer to their request comes back. This causes delays."
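
The contention Peery describes can be illustrated with a toy model. The sketch below is not Sandia's simulation. It assumes each core would ideally deliver one unit of work, that the shared memory bus can feed a fixed amount of work (`bus_capacity`), and that every extra core competing for the bus wastes a fixed fraction of that capacity (`contention_cost`). The function name and all numbers are hypothetical, chosen only to reproduce the qualitative shape described above.

```python
def modeled_throughput(cores, bus_capacity=6.0, contention_cost=0.05):
    """Toy estimate of total work per unit time for a memory-bound job.

    Hypothetical model, not Sandia's: each core can do 1.0 unit of work
    if it is fed data fast enough, while the shared bus can feed at most
    `bus_capacity` units, reduced as more cores contend for it.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    # Best case if memory were never the bottleneck.
    compute_limit = float(cores)
    # Every additional core adds arbitration overhead on the shared bus.
    usable_bus = bus_capacity / (1.0 + contention_cost * (cores - 1))
    # The slower of the two limits determines the result.
    return min(compute_limit, usable_bus)


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        print(f"{n:2d} cores -> {modeled_throughput(n):.2f} units of work")
```

With these made-up parameters the model climbs from two to four cores, gains little at eight, and falls back at 16, because the bus term keeps shrinking while the compute term keeps growing.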

The researchers say that today there are memory systems available that offer dramatically improved memory performance over what was available a year ago, but the underlying fundamental memory problem remains.

Sandia and Oak Ridge National Laboratory (ORNL) are working together on a project intended to pave the way for exaflop supercomputing. ORNL currently has the fastest supercomputer in the world, called Jaguar, which was the first supercomputer to break the sustained petaflop barrier.

Comments


This is not all that surprising...
By Motoman on 1/17/2009 12:52:18 PM , Rating: -1
...Intel and AMD shifting to multi-core CPUs as a way to make "better" products is problematic in a couple ways. First, as noted here, you have to deal with too many cores trying to access the same memory - I guess in theory (and I'm not an EE guy, so largely just speculating here), that you could put a dedicated set of RAM slots for each core on a motherboard...but with an 8-core machine you'd have a 3-foot square motherboard. So that's not going to work...

...the other problem is that the vast majority of what most programs do is fundamentally, and inalterably, linear...which is to say, it can't be split up and run across multiple CPUs/cores. To some degree, some programs (like games) can use multiple cores - you can send basic logic to one core, and physics to another, for example. Say you can come up with 4 fundamental threads within the program, so you can leverage 4 cores.

Then you are presented with an 8-core system. Looking at each of your 4 threads from before, you realize they can't be split any further...physics processing, for example, is exceedingly linear: you need the output from the last calc as input to the current calc, which then feeds the next calc, all in series - can't be performed in parallel. So there's no value in the next 4 cores.

Granted, I'm targeting desktop computers. Sure, we can get Oracle and SAP to run across multiple cores, but those are completely different types of applications. From a desktop standpoint, I honestly think that (barring some new magic I can't conceive of), 4 cores is likely to be the maximum that anyone is ever going to get any benefit from. And in all likelihood, 2 cores at X speed (2X total) will be essentially as fast as 4 cores at X speed (4X total) for the majority of applications, probably forever.

I think multicore CPUs are cool just for the sake of the technology. I just think that we are losing the plot in some regards...we need to make the applications we use faster, not just build a 32-core CPU simply because we can.
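
The limit described in this comment is commonly formalized as Amdahl's law: if a fraction p of a program's run time can be parallelized, the best possible speedup on n cores is 1 / ((1 - p) + p / n). A minimal sketch in Python follows. The 50% parallel fraction is an assumed figure for illustration, not a measurement of any real game or application.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speedup predicted by Amdahl's law.

    parallel_fraction: share of single-core run time that can be spread
    across cores (0.0 to 1.0). The rest is assumed strictly serial.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    assumed_parallel_fraction = 0.5  # hypothetical workload
    for n in (1, 2, 4, 8, 16):
        speedup = amdahl_speedup(assumed_parallel_fraction, n)
        print(f"{n:2d} cores -> at most {speedup:.2f}x")
```

With half the work serial, no number of cores can more than double performance, which is the point being made about the extra four cores adding little.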

RE: This is not all that surprising...
By PrinceGaz on 1/17/2009 3:21:53 PM , Rating: 5
"physics processing, for example, is exceedingly linear: you need the output from the last calc as input to the current calc, which then feeds the next calc, all in series - can't be performed in parallel. So there's no value in the next 4 cores."

Actually physics processing isn't like that at all... it almost always consists of doing similar calculations on a large amount of data, and they can all be done in parallel. That's why things like PhysX can be handled so much better by a modern GPU than on any x86 CPU.

RE: This is not all that surprising...
By kkwst2 on 1/17/2009 10:27:26 PM , Rating: 3
As someone who does a lot of computer modeling, I'm certainly biased, but I'd disagree with your point. A large portion of users who really need high performance computers benefit greatly from multiple cores. This includes image processing, video processing, physics modeling, 3D rendering, biological modeling, stochastic modeling, etc.

In my case (fluid modeling) the programs I use scale very nicely well above 50 cores using clustering. In a single node, two 4-core Xeons are nearly twice as fast as a single one, so scaling is quite good all the way up to 8 cores. The number of cores in each node depends on the architecture used and the efficiency of more cores per node certainly is quite dependent on the memory architecture. So, the article has a point but is probably an oversimplification. It seems to assume that memory architectures are not going to advance and scale with increasing cores, which I'm not sure is true.

By masher2 on 1/18/2009 12:12:18 AM , Rating: 2
Back when I did MHD modeling, the massively parallel supercomputer we used supposedly had several times as much silicon devoted to node-to-node communication as it did to actual computing on each node itself. I can see Intel having to make serious architectural changes to get decent performance from a 16+ core cpu.

In the case of your two 4-core Xeons, though, you have to remember that this is slightly different than one 8-core CPU. The 2x4 option gives you twice as much cache bandwidth, which, if your code fits in cache, is going to negate pretty much all the bandwidth crunch from scaling beyond 4 cores.

By Fritzr on 1/18/2009 1:27:03 AM , Rating: 2
When scaling using clusters, each node has its own dedicated memory. The article is talking about multiple cores using a single memory, which is what you get with the current multicore processors.

There is one memory connection reached through the memory controller and each core has to share that connection.

Assuming all cores are busy and reading/writing main memory, then for a dual core the memory is half speed per core, quad core is quarter speed per core, eight core is 1/8 speed per core ... as the number of cores goes up, the average available memory bandwidth per core drops.
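
The arithmetic here is a simple division. A minimal sketch, assuming every core is saturating main memory and the memory controller shares bandwidth evenly. The 12.8 GB/s total is a placeholder figure, not the specification of any particular chip, and the function name is hypothetical.

```python
def bandwidth_per_core(total_gb_per_s, busy_cores):
    """Average memory bandwidth available to each busy core.

    Assumes all busy cores hit main memory constantly and the single
    shared memory connection is divided evenly among them.
    """
    if busy_cores < 1:
        raise ValueError("busy_cores must be at least 1")
    return total_gb_per_s / busy_cores


if __name__ == "__main__":
    total = 12.8  # hypothetical total bandwidth in GB/s
    for cores in (1, 2, 4, 8):
        share = bandwidth_per_core(total, cores)
        print(f"{cores} cores -> {share:.1f} GB/s each")
```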

One workaround is a larger unshared cache. The bigger the cache dedicated to each core, the less likely that core is to need to go to main memory. As new code is written that is optimized to minimize main memory access, the performance of multicore will go up.

For now when comparing multicore CPUs you need to look at per core dedicated cache. Larger cache boosts performance of multicore CPUs by reducing memory contention. This was the original solution used for supercomputers...each processor node has a large dedicated memory.

RE: This is not all that surprising...
By Motoman on 1/18/2009 11:06:43 AM , Rating: 1
...your case is wildly different from the normal desktop usage of the average consumer, which is the point I was trying to make. And I was thinking about applications like yours when I mentioned an ERP and an RDBMS. What you're doing is highly specialized, and is well-suited for multi-core/CPU usage.

For the normal, average consumer, my statement applies. And frankly, so does the guy's post currently at the top of the list, who has now been rated down to zero.

It's like people are already zombified to the idea that more cores is always going to be better.

Pick your favorite game, all you gamers out there. Benchmark it on a single core, dual-core, quad-core, and then 8-core CPU (all of the same family to keep things even, at the same speed, etc.). It's virtually certain that the benchies will fall on their face at the quad-core...and even as newer games come out, there will not be things that can be spread over 8 cores, or 16, or whatever.

...unless, as I've said, some kind of currently inconceivable technomagic can be invented to allow serial processing to occur over multiple (parallel) cores. Which, as far as I know, is impossible.

But look at the benchmarks. *Look at them.* And please don't start with the wonky synthetics that have a PC do 40 things at a time - that proves nothing. Average consumers and gamers don't encode video while ripping MP3 tracks while folding proteins while compiling C# code while typing a letter to Grandma. Bench your favorite individual games and applications.

RE: This is not all that surprising...
By mathew7 on 1/18/2009 3:00:49 PM , Rating: 2
While you are right with the games, there is one point that I have not read until now: the SW will have to adapt.
The games from the last 2 years are all adapted to 2 cores (at least those which need CPU performance). I could even say that they ignored quad-cores, because not many gamers had quad-cores (they were developed while quads were very expensive). Switching from 1 to 2 cores was easy for games. But now splitting the workload again will not benefit as much. So doing this on currently released games would have been a waste of time/resources. Probably the games that are halfway through development now can benefit from 4 cores. But that has to be decided at an early stage.
One of the problems is that current programmers are not used to thinking in parallel algorithms. Also, parallelism cannot be applied to everything.

Current desktop applications do not require much performance. I mean you could have a big Excel file with lots of data, which I'm sure MS designed to benefit from as many cores as you have. But the point is that the file would have to be very big and very complex for you to be limited by current processors (I mean the CPU workload being timed in minutes, not seconds). At those dimensions, you would be better off with a DB application.

By William Gaatjes on 1/19/2009 1:56:22 PM , Rating: 2
True. Since Windows NT version 6 (yes, Vista), Microsoft updated the scheduler, interrupt, and thread handling mechanisms to make use of hardware features modern processors have had since the K7 or the P4 at the least. Windows XP (NT5) uses ancient scheduler, interrupt, and thread handling mechanisms based on software loops together with interrupt timers, while Vista does these things in hardware.

See this link :

The multimedia class service is useless tho in my opinion.
If Microsoft would just use a large enough memory buffer for audio data, with the audio chip DMAing the data from memory and the CPU updating that data before the audio chip runs into the end of the memory region it was assigned to DMA, then you would never notice a glitch.

ReadyBoost is useless as well.

Superfetch seems handy but we need more bandwidth from HDD to main memory before superfetch is really interesting.

RE: This is not all that surprising...
By Reclaimer77 on 1/17/2009 8:11:21 PM , Rating: 3
There is NOTHING surprising here. Sandia must enjoy wasting their time.

This is really no big deal. Intel and AMD have already dealt with this in the real world.

Nice job Sandia. I await your next breakthrough when you inform us of something else equally obvious and meaningless.

RE: This is not all that surprising...
By Motoman on 1/18/2009 11:09:22 AM , Rating: 2
"Intel and AMD have already dealt with this in the real world."

Really? Please elucidate this topic for us.

RE: This is not all that surprising...
By Reclaimer77 on 1/18/2009 11:54:53 AM , Rating: 1
It's a non-topic. You think Intel and AMD are a bunch of idiots who blindly add cores to CPUs without taking memory usage into account?

I'm not sure what you want me to say. The article is simply stating the obvious, and it sure as hell isn't news to Intel or AMD. Why do you think we have on-die memory controllers and dual- and triple-channel memory now?

How do you explain that software WRITTEN for 8 threads runs faster on the i7 than on quad cores?

RE: This is not all that surprising...
By Motoman on 1/18/2009 12:08:10 PM , Rating: 2
...How do you explain that we can expect *all* applications to benefit from an 8-core processor? Or 16-core?

I think that Intel and AMD are geniuses...they ran into a wall and found a way around it. But I think people like you are either far too into specialized niches that *will* benefit from lots of cores, or too far bought into the marketing to actually think about the ramifications for the typical consumer.

Applications and games that are used by the typical consumer are simply not going to be able to spread across a whole lot of parallel cores. They just aren't. So if there is some magic that will allow purely serial processes to run across multiple parallel cores, please let me know. If there isn't, please stop apparently pretending that more cores is better for everything...because it isn't.

By retrospooty on 1/19/2009 8:49:05 AM , Rating: 2
"...How do you explain that we can expect *all* applications to benefit from an 8-core processor? Or 16-core?"

??? I don't... Because we don't. Who expects that?

What we ALL know is that only multithreaded apps benefit from multiple cores, and we ALL know that most games and high-end apps that need extra CPU power ARE being written for multiple threads. Apps that don't need the CPU power are generally left alone.

By Jeff7181 on 1/18/2009 1:57:12 PM , Rating: 3
Ever heard of double data rate memory? Dual memory channels? Quad memory channels? Quad pumped busses?

CPU manufacturers understood a LONG time ago that as the processing power of CPUs increases, the demands on the external bus increase also. All those things mentioned above are designed to provide the CPU with more memory bandwidth to allow the CPU to operate to its potential.

Do you think Intel is using three memory channels for their newest chips because they got sick of seeing either 2 or 4 memory slots on a motherboard and wanted to mix it up a little with 3 or 6? Of course not... it's because they've already identified a problem feeding their new dual and quad core processors with enough data for them to crunch so they increased the memory bandwidth by adding a third channel.

By fri2219 on 1/18/2009 10:01:54 PM , Rating: 2
No kidding, welcome to 1988.

By retrospooty on 1/19/2009 8:46:16 AM , Rating: 2
"This is really no big deal. Intel and AMD have already dealt with this in the real world."

Yup... I don't know if I blame Sandia for saying it though... Kind of not worth posting on this site though. Considering Sandia is a huge govt. funded science lab and this is a consumer site...

By SmartWarthog on 1/19/2009 12:36:15 PM , Rating: 2
"First, as noted here, you have to deal with too many cores trying to access the same memory"

The Phenom's split DRAM controller probably accounts for much of its performance improvements over its predecessor.

RE: This is not all that surprising...
By Oregonian2 on 1/19/2009 3:14:59 PM , Rating: 2
You've had your posting's point score knocked down a few because your posting, although I think written with good honest intention, also showed a rather large, uh, lack of knowledge of processor architecture and usage. I personally think your ideas should be argued against instead, but that's me. And yes, I am an "EE Guy" who professionally designed computer systems starting with the Intel 8008 back when it was hot stuff. In terms of this thread, hunt down photos of CPU dies -- there usually are some when new processors come out. Note how the "CPU proper" usually takes only a minority portion of the chip! Most of the area is usually taken up by cache memory. Think about the implications of that observation in the context of this thread.

By William Gaatjes on 1/19/2009 3:34:22 PM , Rating: 2
For the interested :

and to top it off with some tests :

Why is cache so important ?
Well, triple-channel memory is around 16 times slower than the cache of Intel's fastest offering, the i7 965. Imagine that the execution units inside the Core i7 965 were just waiting for data without cache: the CPU would be terribly slow. And the complete x86 execution units (meaning including decoders, load and store) are still big when compared to other modern architectures. But they need to be, because they have to decode the variable-length x86 instruction set (meaning instructions can be, for example, 8, 16, 32, or 64 bits long). This makes it harder to feed the instructions as easily digestible food to the execution units.
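
The value of cache can be put in numbers with a weighted-average estimate of access time. A minimal sketch, taking the roughly 16x cache-to-memory gap quoted above as given and treating the hit rates as assumed values. Latencies are in arbitrary units where one cache access costs 1, and the model is simplified so that a miss costs only the memory time.

```python
def average_access_time(hit_rate, cache_time=1.0, memory_time=16.0):
    """Average cost of one memory access in a simplified two-level model.

    hit_rate: fraction of accesses served from cache (0.0 to 1.0).
    cache_time and memory_time are relative costs; the 16x ratio is the
    figure quoted in the comment, not a measured value.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    miss_rate = 1.0 - hit_rate
    return hit_rate * cache_time + miss_rate * memory_time


if __name__ == "__main__":
    for rate in (0.0, 0.5, 0.9, 0.95, 1.0):  # assumed hit rates
        cost = average_access_time(rate)
        print(f"hit rate {rate:.2f} -> average cost {cost:.2f}")
```

Even a small miss rate pulls the average well above the cost of a cache hit, which is why so much die area goes to cache.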

Copyright 2016 DailyTech LLC.