
AMD talks Bulldozer  (Source: AMD)

AMD details "Falcon," a mainstream processor for "Copperhead"  (Source: AMD)
AMD talks details of "Bulldozer," the first completely new architecture since K8

AMD plans to launch its third-generation Opteron platform in 2009 with the eight-core Sandtiger processor. Underlying Sandtiger is AMD’s M-SPACE modular approach to CPU design. M-SPACE allows AMD to mix and match CPU features for specific tasks.

The definition for M-SPACE is as follows:
  • Modular: Reconfigurable “building blocks” for design speed/agility
  • Scalable: Linear scaling of multi and single-thread performance
  • Portable: Energy-efficiency for increased mobility/portability
  • Accessible: Ongoing commitment to open innovation
  • Compatible: Backward compatibility and ease of upgrade
  • Efficient: Optimal on-chip and system level I/O efficiency

All eight of Sandtiger’s cores are Bulldozer cores. Bulldozer is the name AMD has given to one of the CPU cores for its M-SPACE architecture. AMD claims dramatic performance-per-watt improvements in HPC applications with Bulldozer cores. Unlike Barcelona and Shanghai, which have evolved from AMD’s K8 architecture, Bulldozer is a completely new design developed from the ground up.

AMD pairs the eight Bulldozer CPU cores in Sandtiger with a memory controller. AMD optimizes the design for servers and raises the performance-per-watt bar for single and multithreaded applications.

The modular M-SPACE technology also finds its way into Fusion. AMD plans to mix and match M-SPACE components for Falcon, a Fusion processor optimized for mobile and mainstream desktops. Falcon forms the basis of AMD’s planned Copperhead mainstream desktop platform. Falcon features four Bulldozer CPU cores with an integrated graphics processor. The integrated graphics processor features DirectX 10, possibly 11, support with AMD’s Universal Video Decoder, or UVD, technology. Falcon also features integrated PCIe.

In addition to Bulldozer, AMD has the Bobcat CPU core for Fusion processors designed for mobile, ultra-mobile and consumer electronics applications. Bobcat is also a completely new design and has greater power scaling capabilities. Bobcat-based processor designs can consume as little as one watt of power. AMD has not announced any details of Bobcat-powered Fusion processors yet.

Expect AMD to introduce Fusion designs based on Bulldozer and Bobcat beginning in 2009.


Comments



hahah
By yacoub on 7/26/2007 7:43:07 PM , Rating: 2
As noted in my comment to the Anandtech AMD article just posted which uses the same image, that top graphic supposedly showing 'how much better' Bulldozer will be just cracks me up because it completely lacks any numbers on the chart. They must want us to measure its improvement in pixels ;)




RE: hahah
By yacoub on 7/26/2007 7:43:54 PM , Rating: 5
"How much better is it, Bob?"

"I dunno Bill, but it's got a much longer arrow so that's gotta be "a lot better" right?"


RE: hahah
By mmarq on 7/26/2007 10:25:07 PM , Rating: 5
quote:
that top graphic supposedly showing 'how much better' Bulldozer will be just cracks me up because it completely lacks any numbers on the chart. They must want us to measure its improvement in pixels ;)


Well, I don't want to get into much speculation, but it seems the sketches have been around since 2001:

http://www.chip-architect.com/news/2001_10_02_Hamm...

It's not uncommon for manufacturers to do prototypes, but that one is really amazing for *2001*.

It seems to be entering the camp of decoupled architectures, with separate dual integer/floating-point execution cores. http://citeseer.ist.psu.edu/rd/69795829%2C445663%2... But it is not an entirely decoupled architecture; it is a clustered one with plenty of capabilities for multi-threading.

It seems to be 5-6 wide (K8/K10 are 3-wide), issuing up to 6 instructions per clock instead of the 3 of K8/K10.

Superpipelined with at least 15 stages instead of the 12 of K8/K10. (like IBM, meaning perhaps more than 5GHz at the 45nm process)

Multi-level cache with an L0 with 1-cycle latency, fetching *4 instruction lines per clock!* from this L0 and one more from the L1. So this beast most likely fetches 80 bytes per cycle into the pipeline, against 16B for K8 and 32B for K10 and Core 2.

Forward Collapse Unit together with the Branch Predictor can increase ILP by effectively removing up to 2 conditional branches per cycle

Branch prediction: "going both ways before deciding on the prediction"... branch and the destination address in the same "run" of code...

The "forward collapse unit" can handle up to two short forward branches per cycle to handle these nested "if-then-else-statements"

A huge 64k-entry branch history table is used for branch prediction. The "taken/not taken" results of the 16 latest conditional branches are used to calculate an index into the 64k table. The table contains 65536 2-bit bimodal counters that hold the predictions: 0: strongly not taken, 1: weakly not taken, 2: weakly taken, 3: strongly taken. Such a large table can store the characteristic branch patterns of many different branches in a larger program without much interference.
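
To make the table mechanics concrete, here is a minimal Python sketch of a predictor of this general kind. It is only an illustration: the class name, the choice to use the 16-bit history directly as the table index, and the initial counter value are assumptions, not details taken from the 2001 write-up (real designs often also mix in the branch address).

```python
HISTORY_BITS = 16                 # outcomes of the 16 latest conditional branches
TABLE_SIZE = 1 << HISTORY_BITS    # 65536 two-bit counters


class GlobalHistoryPredictor:
    """Toy global-history predictor with 2-bit saturating counters.

    Assumption: the raw history register is used as the table index.
    """

    def __init__(self):
        self.history = 0
        # 0: strongly not taken, 1: weakly not taken,
        # 2: weakly taken,       3: strongly taken
        self.counters = [1] * TABLE_SIZE

    def predict(self):
        # Predict "taken" when the selected counter is in a taken state.
        return self.counters[self.history] >= 2

    def update(self, taken):
        # Move the selected counter toward the real outcome, saturating at 0 and 3.
        i = self.history
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the outcome into the history, keeping only the latest 16 bits.
        self.history = ((self.history << 1) | int(taken)) & (TABLE_SIZE - 1)


def accuracy(pattern, repeats):
    """Fraction of correct predictions on a repeating taken/not-taken pattern."""
    predictor = GlobalHistoryPredictor()
    correct = 0
    total = 0
    for _ in range(repeats):
        for taken in pattern:
            correct += predictor.predict() == taken
            predictor.update(taken)
            total += 1
    return correct / total
```

Calling `accuracy([True, True, False], 1000)` exercises a short repeating pattern: once the history register has filled, each distinct history maps to its own counter, so the predictor settles after a brief warm-up.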

Instruction pre-decoding: each byte in the instruction caches has 2 bits of pre-decode information.

ESP look-ahead unit: "pre-executes" some operations while instructions are being decoded, long before the instructions enter the out-of-order execution pipeline. It co-operates with a future (register) file that indicates whether an x86 register will still be valid once all preceding instructions still in the pipeline have executed. The ESP look-ahead unit increases instruction-level parallelism: multiple PUSHes and POPs can be executed simultaneously.

Stack sideband optimization: Instructions that add an immediate value to the stack pointer like PUSH; POP; ADD ESP, IMM; can be handled in parallel. So-called "constant generators" determine the constants to be added to the stack pointer for up to six stack instructions per cycle.
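
As a rough illustration of what such constant generators would compute, the Python sketch below derives, for a group of up to six stack instructions, the stack-pointer offset each one sees relative to the start of the group. The tuple encoding of instructions, the function names and the 4-byte word size are assumptions made for this example. The loop is serial only because this is software; the point is that every offset depends only on immediates known at decode time, so hardware could produce them all at once.

```python
MAX_PER_CYCLE = 6   # "up to six stack instructions per cycle"
WORD = 4            # assumed 32-bit stack slots


def esp_delta(instr):
    """Constant change to ESP caused by one stack instruction."""
    op = instr[0]
    if op == "PUSH":
        return -WORD
    if op == "POP":
        return WORD
    if op == "ADD_ESP":          # ADD ESP, IMM
        return instr[1]
    raise ValueError(f"not a stack-pointer instruction: {op}")


def constants_for_group(group):
    """Return (offsets, total): ESP offset seen by each instruction, and net change."""
    if len(group) > MAX_PER_CYCLE:
        raise ValueError("group exceeds the per-cycle limit")
    offsets = []
    running = 0
    for instr in group:
        offsets.append(running)          # ESP value this instruction starts from
        running += esp_delta(instr)
    return offsets, running
```

For example, `constants_for_group([("PUSH",), ("PUSH",), ("ADD_ESP", 8), ("POP",)])` gives each instruction its starting offset without any of them waiting for the previous one to update ESP.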

Memory Loads can be performed earlier on, meaning pre-fetching.

Relaxed Load / Store Ordering; Loads before stores.

OoO engines.

The most remarkable feature is that the L0 and L1 seem to be not sequential but somehow parallel... and if that deserves more discussion, L0 must hold 'hot code' scanned from a pre-decoded L1, because otherwise how could it have:

" [L0]... simultaneously provides the code that has to 'be' executed when a conditional branch is taken as well as the code that has to be executed if the branch is not taken."

IMO it is pre-scanning the L1 somehow, because the L0 must hold the code for both the 'taken' and the 'not-taken' branch at the same time, and pre-execution makes that design absolutely brilliant... BUT THAT WAS 2001... it surely could see improvements in more than a couple of places.

And with that pre-execution or ahead-of-time execution in the ESP look-ahead unit, there is a remarkable amount of branch unrolling and elimination, which is so important for streaming code. It seems that the designers wanted a chip that never has to see its pipelines flushed because of a wrong guess, or stalled because of a cache miss.

So K8 was a rather pale resemblance of K8-1. From K8-1, only K10 will now introduce sideband stack optimization (like Core 2), loads before stores (like Core 2) and 128-bit SSE units (like Core 2), which K8-1 doesn't have because in 2001 people were only dreaming of it and no one dared to put that on paper.

Roughly as it is, and extrapolating, it seems this beast could surely have more than a 50% advantage over a K10.


RE: hahah
By mmarq on 7/26/2007 10:27:34 PM , Rating: 5
Continued from above, because DailyTech doesn't allow too long a post:

Now if they also go for Clustered Speculative Multithreading, http://citeseer.ist.psu.edu/rd/69795829%2C227934%2...
that is, a mechanism for breaking monolithic workloads into multithreaded ones on the fly, then a Bulldozer could accelerate the big INT applications of today by a factor of up to 1.6x. This forced multithreading, like in the reverse hyperthreading rumor, is what the author at HardOCP seems to indicate (he was there asking questions): http://www.hardocp.com/article.html?art=MTM2NywsLG...

" Bulldozer seems to be able to unite its core to work together on a single threaded application "

Now a Bulldozer along the lines of a Clustered Speculative Multithreading K8-1 could have 80% better performance than a K10 and 2x the performance of a Core 2.

Now everybody can collect signs that CPU manufacturers are working hard at helping software developers multithread, parallelize and stream their workloads: CTM, EXOCHI, TBB and other stuff.

http://arstechnica.com/news.ars/post/20070724-inte...
http://www.hardocp.com/image.html?image=MTE4NTQ1OT...

And as stated here, http://www.edn.com/article/CA6459066.html, with compiler automatic vectorization they could reach over 100% for some benchmarks, a reasonable number on average since they are claiming over 1500% for other hand-tuned benchmarks. That means that in a generic load with 40% of the code "streamable", if improvement can reach an average of 100%, then by Amdahl's law we can get 1/(0.6+0.004) ≈ 1.66x.
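
The Amdahl's law arithmetic is easy to check with a few lines of Python. The function name and the sample factors are only illustrative. Note that the 0.004 term in the formula above corresponds to the streamable 40% running 100 times faster (0.4/100); reading "100% improvement" as a 2x speedup of that portion gives a smaller overall gain, and no factor can push the total past 1/0.6.

```python
def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the work runs `factor` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


# 40% of the code is "streamable"; try several speedups for that portion.
for factor in (2, 16, 100):
    print(f"{factor:>4}x on 40% of the code -> {amdahl_speedup(0.4, factor):.3f}x overall")

# Upper bound as the streamable part becomes infinitely fast.
print(f"limit -> {1.0 / 0.6:.3f}x overall")
```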

Roughly 66% on average is what is expected to be achieved with *2* micro-architecture upgrades... that is a LOT, considering that schemes like CTM could put that number much higher.

So picture a Fusion chip: a really integrated Fusion without a separate CPU and GPU, but one with the streaming GPU as another pipeline of the CPU, like what happened with the x87 FPU in today's CPUs...

" He also described the merging of CPUs and GPUs in detail. His vision sees AMD's GPU technology being totally integrated into the CPU, much like we saw the floating point processor integrated into our current CPUs. "

" As Fusion moves forward we are going to be seeing CPU and GPU sharing transistors and actually becoming “fused” together in a more direct sense or at least that is how Phil Hester has explained his vision to HardOCP."

http://www.hardocp.com/article.html?art=MTM0MCwsLG...
http://www.hardocp.com/article.html?art=MTM2NywsLG...

So a Fusion chip with clustered speculative multithreading along the lines of a K8-1 could have >145% better performance than a K10, and >150% (2.5x) better performance than a Core 2, at the same clock with the same number of cores; especially true for 8 cores and not far off for 4 cores.

All in all, that graph bar is not pixels and it seems not far fetched at all. Depending on the implementation it could even be a little conservative.

So it's not only AMD but Intel too that have always guarded their best designs, trying to squeeze the most money possible out of the market, while the enthusiasts get at each other's throats over their preferences, when in theory they could deliver much, much better.

Of course the only excuse they have is time and money, because radical designs require both.


RE: hahah
By crystal clear on 7/27/2007 5:41:22 AM , Rating: 2
quote:
Of course the only excuse they have is time and money, because radical designs require both.


Yes. To the above I would add: radical designs that really work and are feasible. There is no guarantee of success!

It's a gamble that can backfire.


Even if you take a very optimistic stand: do we have the software for these radical designs?

Software & hardware don't come along at the same time; rather, the software lags far behind the hardware.

Intel & AMD certainly do not talk about their projects that flop & get scrapped altogether.

Great plans are one thing; to deliver is another.

To summarize, I would say it's not only time & money, but the ability to deliver in time.

Can AMD deliver ?


RE: hahah
By mmarq on 7/27/2007 1:20:20 PM , Rating: 1
quote:
To summarize, I would say it's not only time & money, but the ability to deliver in time.

Can AMD deliver ?


From a unilateral point of view... YES

In a sense they all could deliver much more. The decisive point is not a window of opportunity based on the theoretical maximum performance possible against the competition, but profit.

A leapfrogging design is only introduced when the competition is clearly ahead, because no one is about to trash the value of current offerings by introducing a much higher-performing part. So manufacturers only introduce variations that don't do that trashing.

Most of the time they could deliver more and on time; the problem is that they don't want to.

Enthusiasts care about performance; manufacturers care about profit. And in that sense it is absurd to pay more than double for an enhanced part that only delivers a few more FPS or seconds in some benchmarks.

It's akin to squeezing the gullible.

Copyright 2014 DailyTech LLC.