
Intel's Teraflops Research Chip runs on an LGA socket at 62W
Six months after its initial debut, Intel sheds more light on the massively parallel teraflop-on-a-chip project

With no Spring Intel Developer Forum in the U.S. this year, Intel is showing off its newest technologies this week at the annual International Solid-State Circuits Conference (ISSCC) in San Francisco. At the forefront of Intel's announcements is its success in developing the world's first 80-core processor, which is being presented at ISSCC.

Intel’s chief technology officer, Justin Rattner, states "Our researchers have achieved a wonderful and key milestone in terms of being able to drive multi-core and parallel computing performance forward. It points the way to the near future when Teraflop-capable designs will be commonplace and will reshape what we can all expect from our computers and the Internet at home and in the office."

Until now, the project had been dubbed Tera-scale at public Intel events.  The proper name is now the Intel Teraflops Research Chip -- alluding to the fact that the processor can achieve one trillion FLoating-point Operations Per Second. Tera-scale made its first appearance at the Fall 2006 Intel Developer Forum in September 2006.  The ISSCC agenda published last month shed more details on the architecture, but this past weekend Intel pulled out most of the stops.

The Teraflops Research Chip is composed of 80 independent processing cores, which Intel refers to as tiles. The tiles are arranged in a rectangular grid, 8 tiles across and 10 down, for a total of 80.

Each individual tile features a processing engine (PE) and a 5-port router. The router passes data and instructions to other tiles, while the processing engine, as the name indicates, processes data. To save power, each processing engine can power down independently of its router, meaning a tile can be used purely to pass data along when its processing engine is not needed; the engine can then be turned back on to process data on demand.  Intel's guidance claims the processor can achieve one teraflop of performance on just 62W of power.
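
The power-gating idea described above can be sketched as a toy model. The class below is entirely hypothetical (the names and behavior are invented for illustration, not taken from Intel's design):

```python
# Toy model of a tile whose processing engine (PE) can be power-gated
# independently of its 5-port router. Hypothetical; for illustration only.
class Tile:
    def __init__(self):
        self.pe_on = True  # processing engine power state

    def sleep_pe(self):
        """Gate the PE; the router keeps forwarding traffic."""
        self.pe_on = False

    def route(self, packet):
        # Routing works regardless of the PE's power state.
        return f"forwarded {packet}"

    def compute(self, a, b, acc=0.0):
        # Wake the PE on demand, then do a stand-in FMAC (a*b + acc).
        if not self.pe_on:
            self.pe_on = True
        return a * b + acc

tile = Tile()
tile.sleep_pe()
print(tile.route("hdr|payload"))    # router still works with the PE asleep
print(tile.compute(2.0, 3.0, 1.0))  # PE wakes on demand: 7.0
```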

The chip itself uses an LGA package similar to Intel’s Core 2 and Pentium 4 processors. A clear difference, however, is that it uses 1248 pins in place of 775. Intel's guidance states that 343 pins are used for signaling, while the rest are used for power and ground.

The minimum clock speed the chip needs to run at in order to deliver one teraflop is 3.16 GHz at 0.95V, but Intel's guidance already alludes to frequencies in excess of 5.7 GHz.  Performance, at this time, appears to scale linearly: a 5.7 GHz Teraflops Research Chip has an output of 1.81 teraflops.
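
The teraflop figure works out as straightforward arithmetic if one assumes, per ISSCC coverage, that each tile's processing engine holds two single-precision FMAC units and that a fused multiply-accumulate counts as two floating-point operations (those per-tile details are assumptions here, not stated in the article):

```python
# Back-of-envelope check of the teraflop claim.
# Assumed (from ISSCC coverage, not this article): 2 FMAC units per tile,
# each fused multiply-accumulate counting as 2 FLOPs per cycle.
TILES = 80
FMACS_PER_TILE = 2
FLOPS_PER_FMAC = 2  # one multiply + one add

def peak_tflops(clock_ghz):
    """Theoretical peak throughput in TFLOPS at a given clock."""
    return TILES * FMACS_PER_TILE * FLOPS_PER_FMAC * clock_ghz / 1000.0

print(round(peak_tflops(3.16), 2))  # 1.01 -- the one-teraflop point
print(round(peak_tflops(5.7), 2))   # 1.82 -- in line with the 1.81 figure
```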

Intel has big plans for its Teraflops Research Chip. The primary purpose of the chip and project is less to set performance records, and more to serve as a vessel for testing future Intel technologies. The next major technologies Intel will implement in the Tera-scale research project are 3-D stacked memory and more general-purpose, capable cores.

A major limitation of the current 80-core chip is that it is not based on the x86 architecture. Instead, it uses a 96-bit Very Long Instruction Word (VLIW) architecture, an approach related to the one used in Itanium server processors. A major hurdle Intel hinted at will be moving its 80-core design from VLIW to x86.
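
For readers unfamiliar with VLIW, the core idea is that one wide instruction word bundles several independent operations that all issue in the same cycle, with the compiler rather than the hardware finding the parallelism. A toy model follows; the opcodes and register names are invented, since the chip's actual 96-bit encoding is not public at this level of detail:

```python
# Toy VLIW interpreter: every slot in a bundle issues "simultaneously",
# so no slot can observe another slot's result within the same bundle.
def execute_bundle(bundle, regs):
    results = []
    for op, src_a, src_b, dst in bundle:
        if op == "fmac":   # fused multiply-accumulate: dst = a*b + dst
            results.append((dst, regs[src_a] * regs[src_b] + regs[dst]))
        elif op == "add":
            results.append((dst, regs[src_a] + regs[src_b]))
    # Write all results after all reads, mimicking same-cycle issue.
    for dst, value in results:
        regs[dst] = value

regs = {"r0": 2.0, "r1": 3.0, "r2": 1.0, "r3": 4.0, "r4": 5.0, "r5": 0.0}
# One bundle, two slots issuing together: an FMAC and an add.
execute_bundle([("fmac", "r0", "r1", "r2"), ("add", "r3", "r4", "r5")], regs)
print(regs["r2"], regs["r5"])  # 7.0 9.0
```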

Although Intel currently has no plans to commercialize the 80-core chip, technologies used in it will definitely be making their way into multi-core desktop chips. So how long until these technologies finally come to fruition? Intel estimates it will take 5 to 10 years before we begin seeing the benefits of the Tera-scale research project.

Comments

Itanium is a lost cause...
By notinsane on 2/12/2007 11:28:48 AM , Rating: 2
Interesting that Intel recognizes this is a dead project as long as it's tied to Itanium. X86 will outlive us all.

RE: Itanium is a lost cause...
By Tyler 86 on 2/12/2007 12:39:24 PM , Rating: 2
Where's masher at? He'll love this crap.
... but he's seen it before, this is a re-preview.

It's ferocious sounding, but that's about it - all bark.

Instead of taking the route of more entire processor cores, Intel and AMD should be taking the specialized-core route.

In one x86 core, we have an ALU core (general purpose), MMX/FPU core (64-bit math+), and an SSE core (128-bit math+).

Let's see larger extension cores, with more out-of-order operations per cycle - damn the core count if it doesn't directly translate to performance.

RE: Itanium is a lost cause...
By raven3x7 on 2/12/2007 5:55:11 PM , Rating: 2
I guess you haven't heard about Fusion, or that AMD intends to make its architecture modular.

RE: Itanium is a lost cause...
By Tyler 86 on 2/13/2007 5:29:45 AM , Rating: 3
I stopped investigating Fusion after I read "According to Speed, Fusion will be targeted at mainstream and low-end computing - as long as graphics are concerned - initially"... those mammajammas stole the fire...

RE: Itanium is a lost cause...
By Hoser McMoose on 2/12/2007 4:14:11 PM , Rating: 2
This chip actually has little to no connection with Itanium other than that they are both VLIW chips (much like most DSPs in the world today). Saying that this chip is tied to Itanium because both are VLIW is like saying that a PIC is tied to Alpha because they are both RISC.

From what I can see of the chip it uses only a VERY simplified design, the only real goal seems to be to do a whole lot of FMAC operations. It probably wouldn't be of too much use in general computing. At best it looks more like a co-processor, not entirely unlike the ClearSpeed Accelerator chips.

RE: Itanium is a lost cause...
By Viditor on 2/12/2007 6:44:41 PM , Rating: 2
This chip actually has little to no connection with Itanium other than that they are both VLIW chips (much like most DSPs in the world today). Saying that this chip is tied to Itanium because both are VLIW is like saying that a PIC is tied to Alpha because they are both RISC.
From what I can see of the chip it uses only a VERY simplified design, the only real goal seems to be to do a whole lot of FMAC operations. It probably wouldn't be of too much use in general computing. At best it looks more like a co-processor, not entirely unlike the ClearSpeed Accelerator chips.

Excellent summation and analogy...good post Hoser.

distributed computing
By paydirt on 2/12/2007 8:51:35 AM , Rating: 2
This will be huge for business, science projects, and distributed computing applications like BOINC.

RE: distributed computing
By Master Kenobi on 2/12/2007 9:16:52 AM , Rating: 2
Think of the rendering if they can get this designed right. /drool

RE: distributed computing
By nurbsenvi on 2/12/2007 9:46:50 AM , Rating: 1
How about real-time rendering from Max or Maya
I can't wait!!

RE: distributed computing
By Viditor on 2/12/2007 10:15:53 AM , Rating: 2
How about real-time rendering from Max or Maya

Just 2 problems there...
1. It's not a design for a chip that will ever be released, it's a platform for study
2. It uses VLIW and not x86 (think EPIC or Itanium...).

RE: distributed computing
By Master Kenobi on 2/12/2007 11:20:11 AM , Rating: 2
Hence why I stated "Designed right". They could use this prototype to build one that will render. Just depends on how they build it.

RE: distributed computing
By Viditor on 2/12/2007 6:40:39 PM , Rating: 2
They could use this prototype to build one that will render

They'd have to start from scratch, actually.
Using this test platform as a prototype for an x86 80 core production chip would be like trying to use a model car to help build a real car...
While there have been cases where that can occur (at least in reverse)
it is still basically a separate project and requires starting over.
The thing that this test platform will help with (a lot!) is understanding how to tweak intercore communication and performance.

Perspective
By Operandi on 2/12/2007 2:12:11 PM , Rating: 2
To put this into perspective, a quad-core Core 2 running at the same frequency is capable of how many floating point operations per second?

RE: Perspective
By Tyler 86 on 2/12/2007 4:09:07 PM , Rating: 2
Supposedly 2 Quad Core Xeons (each core being based on Core 2) do between 120 and 210 double-precision GFLOPS... but that's not peak performance, but realizable performance...
I can't find many reliable numbers on it for some reason.

8,000 Opterons + 8,000 Cell processors do >1,000 double-precision TFLOPS, peak...

The R580 (Radeon X1900 core) does 375 GFLOPS, x3 for 1.125 TFLOPS, peak... I think that's double precision, but it can do greater than double precision, and even less than single precision, so I dunno what the full story is with it... but realization of the peak performance is easy with graphics applications.

RE: Perspective
By Tyler 86 on 2/12/2007 4:11:22 PM , Rating: 2
oo.. actually, the 8000 Cell + Opts may be single-precision, uncertain...

RE: Perspective
By Hoser McMoose on 2/12/2007 4:44:42 PM , Rating: 2
Intel's Core architecture contains a 128-bit SSE SIMD engine that can execute one instruction per clock cycle. There is no FMAC instruction though (which this TeraFLOP chip uses to get two FLOPs per instruction), only separate adds and multiplies. So this gets 4 FLOPs per clock cycle per core.

(Note: All FLOPs here refer to only single-precision FLOPs).

At 3.16GHz a quad core Core2 chip would therefore give:

4 FLOPs/Hz core * 4 cores * 3.16GHz ~= 50GFLOPs.

Surprise, surprise, this chip is actually doing the same number of theoretical FLOPs/Hz core as a Core2 chip is. Actually the only difference is that it does 2 FMACs vs. 4 FAdds or 4 FMult instructions on the Core2.

The Core2 is really a poor chip to compare against, though, since it is an actual microprocessor while the TeraFLOP chip is basically just an FMAC co-processor. A better comparison might be something like the Cell processor, which can manage about 200GFLOPs at 3.2GHz (25GFLOPs per SPE with 8 SPEs in current Cell chips).

Similarly a GeForce 8800 GTX can manage a little more than 520GFLOPs theoretical performance, so a pair of these cards in a system has the same theoretical Linpack performance as Intel's TeraFLOP chip. Of course, in reality these designs are ALL heavily dependent on memory bandwidth and latency. 1 TeraFLOP requires at least 4TB/s of memory bandwidth, and that is significantly more than any of these solutions is going to provide.
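
The peak numbers quoted in this thread can be reproduced with the same flops-per-cycle arithmetic (all single precision, theoretical peaks as stated above, not measured results):

```python
# Reproducing the thread's peak-FLOPS estimates (single precision).
def peak_gflops(flops_per_cycle_per_core, cores, clock_ghz):
    return flops_per_cycle_per_core * cores * clock_ghz

core2_quad = peak_gflops(4, 4, 3.16)  # 128-bit SSE: 4 SP FLOPs/cycle/core
cell = peak_gflops(8, 8, 3.2)         # 8 SPEs at ~25 GFLOPS each
teraflop = peak_gflops(4, 80, 3.16)   # 2 FMACs = 4 FLOPs/cycle per tile

print(round(core2_quad, 2))  # 50.56 -- the "~50GFLOPs" above
print(round(cell, 1))        # 204.8 -- the "about 200GFLOPs"
print(round(teraflop, 1))    # 1011.2 -- roughly 1 TFLOP
```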

Single Precision
By Griswold on 2/12/2007 6:13:57 AM , Rating: 2
I think it's worth mentioning that these figures are single precision, to put them in perspective against other technologies, even if Intel has no plans to use this commercially, as mentioned in the last paragraph of the article.

RE: Single Precision
By StevoLincolnite on 2/12/2007 7:41:52 AM , Rating: 2
It was never something meant to break performance records. Still, the possibilities in the future... Just taking a look at this is enough to make any computer hardware enthusiast wet themselves.

RE: Single Precision
By Griswold on 2/12/2007 10:49:37 AM , Rating: 2
I see what you mean. It must be these "enthusiasts" who moderate down a post that states a simple fact missing from the original article, wetting themselves over something they don't understand anyway.

Just imagine a Beowulf Cluster of these....
By jskirwin on 2/12/2007 4:22:43 PM , Rating: 2
Had to say it...

I like how the cores can run independently, thereby keeping the heat down. However, I wonder how hard it would be to go 3D on these things. Instead of 8x10, 4x4x5...

Would heat dissipation be a problem?

RE: Just imagine a Beowulf Cluster of these....
By Tyler 86 on 2/12/2007 4:40:37 PM , Rating: 2
If it's 2 layers, one offset a bit from the other (less concentration), 7x10 over 8x10, I wouldn't imagine it being much of a problem, but imagination is a problem; considering the low power requirements of them... meh. Speculation.

A layer (underside) consisting of cache & interface... that's what's really promising, near future, and stuff.

Heat is always a problem, but we do have phase change coolers...

By Tyler 86 on 2/12/2007 4:41:34 PM , Rating: 2
er, 7x9 over 8x10, or some other similar arrangement... like stacking bricks, interleaving between interconnects and cores per layer...

TeraPC
By rokoroko on 2/12/2007 11:08:56 AM , Rating: 2
I have been waiting for this since the time I registered the TeraPC domain name. And the Intel guys have finally done it! I am looking forward to all the consequences, like AI for example.
How long it will take to reach the real application stage is just a question of time.
My private guess is 10-20 years.


RE: TeraPC
By Tyler 86 on 2/12/2007 12:50:32 PM , Rating: 2
Tera-FLOP/s personal computing won't come with 80 cores; when it comes, it'll be under 5 years.

One R580 GPU can deliver 375 GFLOP/s, two at 750 GFLOP/s, three at 1.125 TFLOP/s... That's a graphics processor released last year.

Both nVidia and ATI are exponentially increasing their performance with each new core release, and the mean-time between release is shrinking as well.

They're branching over from graphics-specific to general purpose, because there's huge demand for their performance.

I hate to burst your bubble, but 10-20 years is ridiculous...
It's closer to sometime this year - and 2 or 3 graphics cards.

pushing the envelope
By lazyinjin on 2/12/2007 9:53:44 AM , Rating: 2
Those seem like some very promising numbers (considering its size). It's nice to see Intel continuing to push the envelope with testbeds like this and 45nm/Penryn tech, rather than reclining on its C2D success.

I like these projects..
By DeepBlue1975 on 2/12/2007 3:13:37 PM , Rating: 2
Just from the R&D point of view.
I think the key to exploit this kind of design has not as much to do with today's applications as with future ones:
Imagine something like "a neural net on a chip" or other types of applications we don't have today simply because we don't have the technology for them, and a design like this could be a door opener.
Here I think it's all about massive parallelism, not just raw performance.
I guess a "perfect" natural speech recognition app, one that CAN stand for you to talk while having quite a bit of ambient noise and other people talking, could benefit so much more from a "networked CPU" which behaves really more like a brain than a simple number cruncher.

By crystal clear on 2/13/2007 7:49:56 AM , Rating: 2
The BBC takes a different angle -- "software programming" (not covered in this article):


"The challenge is to find a way to program the many cores simultaneously. "

"But to take advantage of the extra processing power, programmers need to gives instructions to each core that work in parallel with one another."

"It is going to require quite a revolution in software programming."

Traffic Jam Central
By Acemantura on 2/13/2007 11:21:47 AM , Rating: 2
What is happening with chip makers? It isn't about multiple cores, it's all about parallel computing. If you only have one pipeline going through, wouldn't that just be traffic jam central? Multiple processors, not multiple cores. I guess what I'm really trying to say is processors working in tandem; that is the only efficient way to do anything. Right now the only computing is done through one socket (cylinder). If the cylinder gets larger it is more powerful, but how efficient is it? Multiple processors, parallel architecture, procs working in tandem, that's what it's all about folks, not tera.

This is a very significant chip
By verndewd on 2/13/2007 4:05:39 PM , Rating: 2
I wonder how many ways they can incorporate this type of chip. Even if it never makes it into a CPU, it is still relevant for its throughput/output. I wonder how feasible it is to use the concept in their bus technologies. I am grasping at air, but it's exciting in its level of performance.

Xbox 720----> meet your CPU
By xuimod on 2/12/07, Rating: -1
RE: Xbox 720----> meet your CPU
By Tyler 86 on 2/12/2007 4:34:07 PM , Rating: 2
Great way to show your support of Daily Tech.
Continue contributing such detailed messages.

"A politician stumbles over himself... Then they pick it out. They edit it. He runs the clip, and then he makes a funny face, and the whole audience has a Pavlovian response." -- Joe Scarborough on John Stewart over Jim Cramer
Related Articles

Copyright 2016 DailyTech LLC. - RSS Feed | Advertise | About Us | Ethics | FAQ | Terms, Conditions & Privacy Information | Kristopher Kubicki