AMD talks Bulldozer  (Source: AMD)

AMD details "Falcon," a mainstream processor for "Copperhead"  (Source: AMD)
AMD talks details of "Bulldozer," the first completely new architecture since K8

AMD plans to launch its third-generation Opteron platform in 2009 with the Sandtiger octal-core processor. Beneath Sandtiger is AMD’s M-SPACE modular approach towards CPUs. M-SPACE allows AMD to mix and match CPU features for specific tasks.

The definition for M-SPACE is as follows:
  • Modular: Reconfigurable “building blocks” for design speed/agility
  • Scalable: Linear scaling of multi and single-thread performance
  • Portable: Energy-efficiency for increased mobility/portability
  • Accessible: Ongoing commitment to open innovation
  • Compatible: Backward compatibility and ease of upgrade
  • Efficient: Optimal on-chip and system level I/O efficiency
Sandtiger’s eight cores consist of eight AMD Bulldozers. Bulldozer is the name AMD has given to one of the CPU cores for its M-SPACE architecture. AMD claims dramatic performance-per-watt improvements in HPC applications with Bulldozer cores. Unlike Barcelona and Shanghai, which have evolved from AMD’s K8 architecture, Bulldozer is a completely new design developed from the ground up.

AMD pairs the eight Bulldozer CPU cores in Sandtiger with a memory controller. The design is optimized for servers and, according to AMD, raises the performance-per-watt bar for both single-threaded and multithreaded applications.

The modular M-SPACE technology also finds its way into Fusion. AMD plans to mix and match M-SPACE components for Falcon, a Fusion processor optimized for mobile and mainstream desktops. Falcon forms the basis of AMD’s planned Copperhead mainstream desktop platform. Falcon features four Bulldozer CPU cores with an integrated graphics processor. The integrated graphics processor features DirectX 10, possibly 11, support with AMD’s Universal Video Decoder, or UVD, technology. Falcon also features integrated PCIe.

In addition to Bulldozer, AMD has the Bobcat CPU core for Fusion processors designed for mobile, ultra-mobile and consumer electronics applications. Bobcat is also a completely new design and has greater power scaling capabilities. Bobcat-based processor designs can consume as little as one watt of power. AMD has not yet announced any details of Bobcat-powered Fusion processors.

Expect AMD to introduce Fusion designs based on Bulldozer and Bobcat beginning in 2009.

Comments

RE: hahah
By mmarq on 7/26/2007 10:25:07 PM , Rating: 5
that top graphic supposedly showing 'how much better' Bulldozer will be just cracks me up because it completely lacks any numbers on the chart. They must want us to measure its improvement in pixels ;)

Well, I don't want to get into much speculation, but it seems that sketches have been around since 2001:

It's not uncommon for manufacturers to do prototypes, but that one is really amazing for *2001*;

It seems to be entering the camp of decoupled architectures, with separate dual integer/floating-point execution cores. But it is not an entirely decoupled architecture; it is a clustered one with a lot of capability for multi-threading.

It seems to be 5- to 6-way wide (K8/K10 are 3-way), issuing 6 instructions per clock instead of the 3 of K8/K10.

Superpipelined, with at least 15 stages instead of the 12 of K8/K10 (like IBM, meaning perhaps more than 5GHz on the 45nm process).

Multi-level cache with an L0 with 1-cycle latency, fetching *4 instruction lines per clock!* from this L0 and one more from the L1. So this beast most likely fetches 80 bytes per cycle into the pipeline, against 16B for K8 and 32B for K10 and Core 2.
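
The 80-byte figure works out if each instruction line is 16 bytes wide. That width is an assumption here, not something stated above; it is simply the line size for which four L0 lines plus one L1 line add up to 80. A quick check, with hypothetical names:

```python
LINE_BYTES = 16  # assumed instruction-line width; not stated in the post


def fetch_bytes_per_cycle(l0_lines=4, l1_lines=1, line_bytes=LINE_BYTES):
    """Bytes delivered to the pipeline per cycle by the L0 and L1 together."""
    return (l0_lines + l1_lines) * line_bytes


total = fetch_bytes_per_cycle()
print(total)        # 80
print(total / 16)   # ratio to the 16 B per cycle quoted for K8
print(total / 32)   # ratio to the 32 B per cycle quoted for K10 and Core 2
```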

The Forward Collapse Unit, together with the Branch Predictor, can increase ILP by effectively removing up to 2 conditional branches per cycle.

Branch prediction: "going both ways before deciding on the prediction"... branch and the destination address in the same "run" of code...

The "forward collapse unit" can handle up to two short forward branches per cycle to handle these nested "if-then-else-statements"

A huge 64k-entry branch history table is used for branch prediction. The "taken/not taken" results of the 16 latest conditional branches are used to calculate an index into the 64k table. The table contains 65536 2-bit bimodal counters that hold the predictions: 0: strongly not taken, 1: weakly not taken, 2: weakly taken, 3: strongly taken. Such a large table can store the characteristic branch patterns of many different branches in a larger program without much interference.
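
To make the table scheme concrete, here is a minimal Python sketch of a predictor of that shape. It rests on assumptions: the 16-bit history is used directly as the table index (the real design may mix in address bits), every counter starts at "weakly not taken", and the class name is made up for illustration.

```python
class GlobalHistoryPredictor:
    """65536 two-bit saturating counters indexed by the last 16 branch outcomes."""

    HISTORY_BITS = 16
    TABLE_SIZE = 1 << HISTORY_BITS  # 65536 entries

    def __init__(self):
        # Assumption: every counter starts at 1 (weakly not taken).
        self.counters = [1] * self.TABLE_SIZE
        # Bit 0 holds the most recent outcome, bit 15 the oldest one kept.
        self.history = 0

    def predict(self):
        # Counter values 2 and 3 mean "taken"; 0 and 1 mean "not taken".
        return self.counters[self.history] >= 2

    def update(self, taken):
        # Move the counter for the current history toward the real outcome,
        # saturating at 0 and 3.
        idx = self.history
        if taken:
            self.counters[idx] = min(3, self.counters[idx] + 1)
        else:
            self.counters[idx] = max(0, self.counters[idx] - 1)
        # Shift the outcome into the history and keep only 16 bits.
        self.history = ((self.history << 1) | int(taken)) & (self.TABLE_SIZE - 1)


# A loop branch that is taken three times and then falls through, repeatedly.
predictor = GlobalHistoryPredictor()
pattern = [True, True, True, False]
for i in range(64):                      # warm-up
    predictor.update(pattern[i % 4])

misses = 0
for i in range(64, 464):
    outcome = pattern[i % 4]
    if predictor.predict() != outcome:
        misses += 1
    predictor.update(outcome)
print(misses)  # 0
```

Once the history register is full of real outcomes, each of the four recurring history values always precedes the same outcome, so the counters stop mispredicting this pattern after a short warm-up.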

Instruction pre-decoding: each byte in the instruction caches has 2 bits of pre-decode information.

ESP Look Ahead unit: "pre-executes" some operations while instructions are still being decoded, long before they enter the out-of-order execution pipeline. It cooperates with a future (register) file that indicates whether an x86 register will still be valid once all preceding instructions still in the pipeline have executed. The ESP Look Ahead unit increases instruction-level parallelism: multiple PUSHes and POPs can be executed simultaneously.

Stack sideband optimization: Instructions that add an immediate value to the stack pointer like PUSH; POP; ADD ESP, IMM; can be handled in parallel. So-called "constant generators" determine the constants to be added to the stack pointer for up to six stack instructions per cycle.
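
The idea behind the constant generators can be sketched in Python: each stack instruction changes ESP by a constant known at decode time, so the ESP value each one sees is just a running sum that can be worked out for the whole group at once. The tuple encoding, the function names and the 4-byte slot size are assumptions for illustration; only the six-per-cycle limit comes from the description above.

```python
from itertools import accumulate

WORD = 4  # assumed 32-bit stack slot size


def esp_delta(instr):
    """Constant that one stack instruction adds to ESP, known at decode time."""
    op = instr[0]
    if op == "push":
        return -WORD
    if op == "pop":
        return WORD
    if op == "add_esp":
        return instr[1]  # the immediate operand
    raise ValueError(f"not a stack-pointer instruction: {op}")


def constant_offsets(group, max_per_cycle=6):
    """Return (ESP offset seen before each instruction, net ESP change).

    Offsets are relative to the ESP value at the start of the group, so no
    instruction has to wait for the previous one to update ESP.
    """
    if len(group) > max_per_cycle:
        raise ValueError("more stack instructions than one cycle can handle")
    deltas = [esp_delta(instr) for instr in group]
    running = list(accumulate(deltas, initial=0))
    return running[:-1], running[-1]


group = [("push",), ("push",), ("pop",), ("add_esp", 8)]
before, net = constant_offsets(group)
print(before)  # [0, -4, -8, -4]
print(net)     # 4
```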

Memory Loads can be performed earlier on, meaning pre-fetching.

Relaxed Load / Store Ordering; Loads before stores.

OoO engines.

The most remarkable feature is that the L0 and L1 seem to be not sequential but somehow parallel, and even if that deserves more discussion, L0 must hold 'hot code' scanned from a pre-decoded L1, because otherwise how could this be true:

" [L0]... simultaneously provides the code that has to 'be' executed when a conditional branch is taken as well as the code that has to be executed if the branch is not taken."

IMO it pre-scans the L1 somehow, because L0 must hold the code for both the 'taken' and 'not-taken' paths of a branch at the same time, and together with pre-execution that makes the design absolutely brilliant... BUT THAT WAS 2001... it surely could see improvements in more than a couple of places.

And with the pre-execution, or ahead-of-time execution, of the ESP Look Ahead unit, there is remarkable branch unrolling and elimination, which is so important for streaming code. It seems the designers wanted a chip that never has to flush its pipelines because of a wrong guess, or stall because of a cache miss.

So K8 was only a pale resemblance of K8-1. Of K8-1's features, K10 will only now introduce sideband stack optimization (like Core 2) and loads before stores (like Core 2), plus 128-bit SSE units (like Core 2), which K8-1 doesn't have because in 2001 people were only dreaming of them and no one dared to put that on paper.

Roughly as it is, and extrapolating, it seems this beast could surely have more than a 50% advantage over a K10.

Copyright 2016 DailyTech LLC.