



The new e620 brings math to PCI Express

The CSX600's design; 96 dedicated math units each with their own 6KB cache

Comparing a 3.0GHz Intel Xeon-based system with the same system containing between 1 and 4 ClearSpeed Advance X620 boards

Base System: HP DL380 G5, Intel Xeon 5160 x 2 @ 3GHz, 14GB
The new ClearSpeed math processor now comes in PCI Express

ClearSpeed Technology today announced the new Advance e620 PCI Express, an accelerator board for high-performance number-crunching required in financial services, universities and national labs. Also new are enhancements to CSXL software libraries and the Visual Profiler.

Building on the success of ClearSpeed’s current PCI-X based Advance X620 accelerator, the introduction of the smaller form factor PCIe based Advance e620 accelerator brings acceleration technology to the latest generation of multi-core industry standard servers that incorporate the PCIe standard.

The CSX600 processor found on the accelerator boards is composed of 96 processing cores capable of 64-bit double precision and can deliver more than 55 GFLOPS on DGEMM. The 15mm-square die contains 128 million transistors, 47 percent of which are logic (about half of that dedicated to floating-point units), with the remaining 53 percent devoted to memory. IBM manufactures the processor on an eight-layer copper 0.13µm FSG process and Flextronics assembles the board.

ClearSpeed says that its technology can dramatically increase processing speed in a server without significantly affecting power consumption. ClearSpeed claims that the new CSXL libraries deliver 20 times the performance per watt compared with industry standard servers when running the high performance LINPACK benchmark. Each board averages 25 watts power dissipation.

“Large consumers of compute power are looking for ways to improve both their system performance and performance per watt,” said Steve Conway, research vice president of technical computing systems at IDC. “There is strong and increasing interest in acceleration technologies that could deliver improved performance without exceeding power, cooling and facilities constraints. ClearSpeed’s acceleration technology is making advances in this area.”

In addition to the new hardware, the new 2.50 release of CSXL software libraries introduces performance enhancements to the core linear algebra routines for matrix multiplication. Also included in the 2.50 release are the new Vector Math Library and Random Number Generators that support additional functionality such as Monte Carlo simulation for option pricing in the financial services industry.

According to numbers from ClearSpeed, performance comparisons based on benchmark code for European Option pricing provided by a major international bank showed up to 20 times performance speedup using a ClearSpeed Advance accelerator compared with an industry standard server.

For developers, the new ClearSpeed Visual Profiler toolset provides a view at every level of the system, including the interactions between multiple host processors and one or more ClearSpeed Advance accelerator boards.

“The world’s leading financial institutions and research organizations that depend upon the availability of compute power to maintain their competitive edge are struggling with the constraints of facilities space, power and cooling,” said Stephen McKinnon, ClearSpeed’s chief operating officer.  “The enhancements to our product family are delivering three, five or even twenty times the application performance of unaccelerated systems, while adding less than five percent to the total energy bill. Acceleration technology is causing a radical rethink of datacenter design.”

Financial institutions and research organizations are not the only ones looking at ClearSpeed chips. AMD revealed over a year ago that it was interested in reviving the math co-processor, perhaps employing the services of the CSX600 chip alongside its own Opteron CPUs. The latest development on the AMD front is that ClearSpeed intends to create a socket plugin version for Torrenza, the upcoming AMD "accelerated computing" platform.



Comments



Just like old times...
By JonnyBlaze on 5/1/2007 11:12:26 PM , Rating: 2
Back when we needed 8087 math co-processors to run AutoCAD r8 and other professional programs.




RE: Just like old times...
By RyanHirst on 5/2/2007 12:38:26 AM , Rating: 4
I can see the appeal to financial institutions that have workstation clusters crunching FP code 100% of the time... the space and power savings are paramount.

But at $7000, it can ONLY be for space and power savings. Look at their chart. 2 cpus generate 52M samples/s, and 1 Clearspeed620 generates 183 million.

The problem is the quad-core cpu. A pair of quad-core xeons is 8 cores, or 208 million samples according to their chart. That's more than a clearspeed board for 1/4 the cost (you only have to compare CPU price, because you have to buy the motherboard and RAM anyway: you can't run the clearspeed board without them). And in a few months, the cost of the quad Xeons will PLUMMET. And the Xeons are MASSIVELY more flexible-- Clearspeed's site lists 3 programs as 100% compatible with their boards (http://www.clearspeed.com/products/applicationsupp... Even then, half of the check boxes are either x'd (not compatible) or superscripted (limited compatibility).
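
The extrapolation in this comment can be checked with a few lines of arithmetic. The sketch below assumes perfectly linear per-core scaling, which is a best case; the function name and the core counts are illustrative and do not come from ClearSpeed's chart. One caveat: the Xeon 5160 in the base system is a dual-core part, so if the 52M figure was measured on both sockets (four cores), the same extrapolation gives 104M for eight cores rather than 208M.

```python
def extrapolate_samples(baseline_msamples, baseline_cores, target_cores):
    """Best-case (linear) extrapolation of throughput to a new core count."""
    per_core = baseline_msamples / baseline_cores
    return per_core * target_cores


CPU_BASELINE = 52.0   # M samples/s for the CPU-only system, as quoted in the comment
BOARD_RATE = 183.0    # M samples/s with one ClearSpeed board, as quoted in the comment

# Reading 1: the 52M figure came from 2 cores (what the 208M claim implies).
as_two_cores = extrapolate_samples(CPU_BASELINE, 2, 8)

# Reading 2: the 52M figure came from 4 cores (two dual-core Xeon 5160s).
as_four_cores = extrapolate_samples(CPU_BASELINE, 4, 8)

for label, rate in (("2-core baseline", as_two_cores),
                    ("4-core baseline", as_four_cores)):
    verdict = "beats" if rate > BOARD_RATE else "trails"
    print(f"{label}: {rate:.0f}M samples/s on 8 cores, {verdict} one board")
```

Which reading is right depends on how the chart labels its baseline, so the conclusion that eight cores outrun one board holds only under the first reading.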

So... does a dual-socket Xeon tower consume enough power to eat up $5000 in electricity in any reasonable time frame? Um. No. So we're down to space. Space. Space. If the space is worth the cost, you buy one. Now THAT's a limited marketshare.
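
A rough way to put a number on that question is sketched below. The $5000 figure is from the comment; the 500 W draw and the $0.10 per kWh price are assumptions chosen purely for illustration, and cooling overhead is ignored.

```python
HOURS_PER_YEAR = 24 * 365


def years_to_spend(dollars, watts, price_per_kwh):
    """Years of continuous operation needed to spend `dollars` on electricity."""
    kwh_per_year = watts / 1000 * HOURS_PER_YEAR
    return dollars / (kwh_per_year * price_per_kwh)


# Assumed values (not from the article): a 500 W server at $0.10 per kWh.
years = years_to_spend(5000, watts=500, price_per_kwh=0.10)
print(f"{years:.1f} years of 24/7 operation to spend $5000 on power")
```

Under these assumptions the answer is on the order of a decade, which is consistent with the comment's "Um. No." Doubling both the wattage and the electricity price would cut that by a factor of four, so the result is sensitive to the inputs.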

It just ain't like the old times. FP code runs geometrically faster on an FP unit versus an integer unit... which was the old days. Today, every processor has an FP unit. Clearspeed can expect linear speedup at best versus general processors. But they're not even getting that:
A single 620 board has two CSX600 coprocessors, with 96 pipes/FP units each = 192 discrete simultaneous operations. Two C2Quads can handle 32 (4 for each core) simultaneous 64-bit double-precision calculations (in addition to 16 x87 calculations). But 2 C2 quads beat a Clearspeed 620. Better performance with 1/6 the architecture width. Even Clearspeed's raw manufacturing technology within their specialized field is inferior to generalized CPUs: their ONLY advantage is specialization.
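
The "1/6 the architecture width" comparison works out as follows. This is only a sketch using the figures quoted in the comment (192 units, 32 lanes, 183M and the extrapolated 208M samples/s); the variable names are illustrative, and the calculation ignores clock speed, which differs greatly between the two designs.

```python
board_units = 2 * 96   # two coprocessors with 96 FP units each (from the comment)
cpu_lanes = 32         # the comment's figure for two quad-core CPUs

board_rate = 183.0     # M samples/s for one board, as quoted
cpu_rate = 208.0       # M samples/s, the comment's extrapolated 8-core figure

width_ratio = cpu_lanes / board_units        # the "1/6" in the comment
per_unit_board = board_rate / board_units    # throughput per ClearSpeed FP unit
per_lane_cpu = cpu_rate / cpu_lanes          # throughput per CPU FP lane
advantage = per_lane_cpu / per_unit_board

print(f"width ratio: {width_ratio:.3f}")
print(f"per-unit throughput: board {per_unit_board:.2f}M, CPU {per_lane_cpu:.2f}M")
print(f"each CPU lane does about {advantage:.1f}x the work of one board unit")
```

Because the 208M figure is itself an extrapolation, the per-lane advantage here is an upper bound under the comment's own assumptions rather than a measured result.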


RE: Just like old times...
By RyanHirst on 5/2/2007 12:40:11 AM , Rating: 2
agggh.... the close parenthesis got added to the link. Delete the last character and the link will work. Sorry.


RE: Just like old times...
By Rulother on 5/2/2007 7:00:26 AM , Rating: 2
Still, in the long run, getting one of these cards that's dedicated to just performing these functions would be more ideal than a pair of quad-core Xeon chips, since the Xeons would have the OS and other functions interfering with their duties. Plus, if you wanna go that route... you gotta buy all the other hardware, which would add up to equal or even exceed the price of that one card. It's actually a good deal in that it saves the consumer from having to deal with all the other junk they otherwise would. Also think about this: if you ever upgrade that server's processor, all you have to do is move the card to the new server, or nothing at all.


RE: Just like old times...
By GoatMonkey on 5/2/2007 8:32:19 AM , Rating: 2
I wonder if instead of going with dual quad core xeons they could use multiple GPUs with custom software to do this kind of processing. I mean if you got one of those motherboards they were talking about somewhere else today with 4 way SLI and 4 8800 ultras you have some serious processing power for a whole lot less than $7K or $8K.


RE: Just like old times...
By SmokeRngs on 5/2/2007 2:34:29 PM , Rating: 2
Or, you could just buy one of these and toss it into the current system you have (if it has the requisite PCI-e slot) running this. Then you save the total cost of a new dual CPU quad core system not to mention the space and the vast majority of power.

Quad core Xeon systems are not as cheap as you make them out to be and dual processor quad core Xeon systems definitely aren't as cheap as you make them out to be. A single processor motherboard isn't too bad in price but a dual processor motherboard is much more costly. Then you have to add in the price of the FBDIMMs, which aren't cheap. The redundant PSUs aren't the cheapest things out there either.

Your logic is flawed since it does not address some of the relevant options available and the price structure you mention does not exist.

Don't forget that if you need more speed, you can toss another one of these cards in the system if it has a second PCI-e slot which will take it. Your route requires another system to be built if you need more speed and there is no guarantee that what you are doing will be able to properly utilize two different systems, whereas these cards are made to work together in a single system if I read correctly.

One last thing. The numbers you put up for quad core Xeons are the theoretical best they can do. It's not likely they would be able to reach those numbers, much less maintain them 24/7, especially in a multithreaded environment. Scaling with more cores is not a linear jump for each core you add.


RE: Just like old times...
By glitchc on 5/2/2007 3:00:36 PM , Rating: 2
quote:

Quad core Xeon systems are not as cheap as you make them out to be and dual processor quad core Xeon systems definitely aren't as cheap as you make them out to be. A single processor motherboard isn't too bad in price but a dual processor motherboard is much more costly.


More costly than $7000??

quote:

Don't forget that if you need more speed, you can toss another one of these cards in the system if it has a second PCI-e slot which will take it.


Thus raising a cost to $14000. Shoot, I can build at least 4 quad-core Core 2 rigs and have them in a distributed computing setup. As the original poster claimed, the only potential benefit is space.


RE: Just like old times...
By SmokeRngs on 5/3/2007 9:01:21 AM , Rating: 2
quote:
More costly than $7000??


Considering TCO, definitely. The initial hardware purchase of a dual CPU quad core Xeon setup would get pretty damn close to begin with. I have little corporate IT experience but it seems no one else seems to have any since everyone has left out TCO. Since TCO is one of the biggest factors in any type of corporate purchasing decision it cannot be left out of the equation.

quote:
Thus raising a cost to $14000. Shoot, I can build at least 4 quad-core Core 2 rigs and have them in a distributed computing setup. As the original poster claimed, the only potential benefit is space.


You've just made the argument against the multiple systems even stronger. You've just raised the TCO a lot more. It costs a hell of a lot more to maintain two full systems instead of just two PCI-e cards. The cards save space, electricity and manpower.

Businesses don't buy parts for systems off of newegg and slap them together for the cheapest price which I assume is how you are getting your figures of being able to build systems with. A business is going to buy a premade server from an established system builder that comes with warranty and service contracts. This costs a bit more than just buying the parts and putting them in a box.


RE: Just like old times...
By RyanHirst on 5/2/2007 6:56:06 PM , Rating: 2
"scaling for multiple cores is not just a linear jump...."

Do you use mathematica? UltraFractal? Or run Folding at Home?

1. Code that is massively parallel DOES achieve nearly 100% scaling. See 5.
2. I compared Clearspeed's own numbers, starting with a multithread example (2+ cores)
3. You are implying that clearspeed's own p.r. team listed the maximum theoretical performance for the competition, while listing only real-world performance for their own product. I raise an eyebrow and ask you to justify this.
4. hardware is hardware. There is no inherent speed benefit to accessing an FPU via a PCI slot versus the CPU. In fact, while PCI-e IS on the same order-of-magnitude of latency as inter-cpu communications (both for amd and intel), it IS still a bit SLOWER. You pay the overhead penalty with the clearspeed board.
5. I don't know how to emphasize this more: Windows Idle process is 0-1% on any modern, unmanned workstation. Code that is 100%, massively parallel calculations-- mathematics libraries developed specifically to run at 98-100% parallelization over a given workload-- CAN and DO achieve that scaling rate.
6. If that were not true, the clearspeed board would suffer even more than the intel CPU because it requires far more simultaneous calculations to achieve the same performance level. I.e-- you are saying the clearspeed board can be expected to scale fine with multiples of 192 separate FPU channels, but that the Intel cpu will have scaling problems with 32? This kind of unwarranted conclusion needs to be substantiated.

Most of my interests (fractal math, chess, distributed computing) are parallelizable tasks. I experience 98-100% scaling on a daily basis on a wide variety of apps. And the fact is, if you're running FP code that DOESN'T scale well.... the Clearspeed board is going to suffer first.
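
The scaling argument in this comment can be illustrated with Amdahl's law, a standard model that is not mentioned in the thread itself. The sketch below treats the "98%" figure as the parallel fraction of the workload, which is an assumption (the comment may mean observed scaling efficiency instead), and the unit counts 8, 32 and 192 are taken from the numbers discussed above.

```python
def amdahl_speedup(parallel_fraction, n_units):
    """Amdahl's law: speedup over one unit when only part of the work parallelizes."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_units)


# Assumed parallel fraction of 0.98; 8 cores, 32 FP lanes, 192 board FP units.
for n in (8, 32, 192):
    speedup = amdahl_speedup(0.98, n)
    efficiency = speedup / n
    print(f"{n:>3} units: speedup {speedup:6.2f}, efficiency {efficiency:.1%}")
```

Under this model, any serial fraction costs a wide design proportionally more than a narrow one, which is the substance of point 6: code that does not parallelize well penalizes 192 units more heavily than 8 cores.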

Finally. To repeat a point. You cannot compare a clearspeed board to the price of a new system because you NEED a system to run the board in. If you are comparing a NEW system with a plug-in board, your accountant will quickly remind you to credit the value of selling your current computers (into which you plan to plug the clearspeed board) toward the price of the new system. Otherwise you're comparing a functioning system with a pci card with no power, no RAM, no OS and no applications.


RE: Just like old times...
By SmokeRngs on 5/3/2007 9:23:27 AM , Rating: 2
quote:
Do you use mathematica? UltraFractal? Or run Folding at Home?


Folding? Why yes I do. And if you go do some searching you'll see Stanford is not confident on scaling past four cores. In most cases a quad core or dual CPU dual core setup does not finish the SMP work units in half the time of the equivalent dual core processor.

quote:
1. Code that is massively parallel DOES achieve nearly 100% scaling


In a few cases you might start getting near this scaling but it's actually few and far between. Also, the more cores you add, the less efficient the whole setup is. You're going to lose efficiency with eight cores especially compared to a single card.

quote:
3. You are implying that clearspeed's own p.r. team listed the maximum theoretical performance for the competition, while listing only real-world performance for their own product. I raise an eyebrow and ask you to justify this.
4. hardware is hardware. There is no inherent speed benefit to accessing an FPU via a PCI slot versus the CPU. In fact, while PCI-e IS on the same order-of-magnitude of latency as inter-cpu communications (both for amd and intel), it IS still a bit SLOWER. You pay the overhead penalty with the clearspeed board.


You raise an eyebrow? You expect a general purpose CPU to match and exceed the performance of a dedicated processor designed for just this? Hardware is not just hardware, as anyone who knows anything about hardware knows. Otherwise, why are there different CPUs and architectures? Why do some perform better than others? What do you expect me to justify? The assumptions you make are based on flawed logic from the beginning. Fix your logic before raising an eyebrow at me.

quote:
Most of my interests (fractal math, chess, distributed computing) are parallelizable tasks. I experience 98-100% scaling on a daily basis on a wide variety of apps. And the fact is, if you're running FP code that DOESN'T scale well.... the Clearspeed board is going to suffer first.


The Clearspeed board is a dedicated setup specifically designed to run certain tasks and only those tasks. It doesn't need to worry about parallelization like a dual, quad or octal core setup does. It's sending the data through no matter what. It doesn't have to worry about being broken up and sent to different processors so it won't suffer at all. The multi core CPU setup is the one that will suffer. I'd also bet that in most cases latency will not be an issue, as this would probably be a task where bandwidth is the concern and not latency, so the PCI-e bus would not harm performance.

quote:
Finally. To repeat a point. You cannot compare a clearspeed board to the price of a new system because you NEED a system to run the board in. If you are comparing a NEW system with a plug-in board, your accountant will quickly remind you to credit the value of selling your current computers (into which you plan to plug the clearspeed board) toward the price of the new system. Otherwise you're comparing a functioning system with a pci card with no power, no RAM, no OS and no applications.


Again, you're wrong. You don't need a new system to run the board, not when you have a current system already sitting there in which you only have to drop the board in. Why increase TCO when you don't have to? Credit value of selling the current computers? Who says they are being sold and how much do you expect to get out of them? It's unlikely it's anywhere near the savings of just putting the Clearspeed board in a current system.


RE: Just like old times...
By FastLaneTX on 5/2/2007 7:36:16 PM , Rating: 2
There are few places where a product like this makes sense at this price; if you don't already have several thousand nodes in your cluster, don't bother.

However, there are folks with clusters of tens of thousands of nodes, some upwards of a hundred thousand. I know one company that does number-crunching for oil companies, and they have several buildings the size of your average Wal-Mart just to hold all the computers. They spend millions per month on power and millions more cooling all that power. A product like this, even at that price tag, would easily pay for itself in a year from power savings, increased capacity, and possibly even being able to sell off some land. And let me tell you, if they ordered 100,000 of these things, they wouldn't be paying $7k a pop...


What do Normal People use this for?
By Assimilator87 on 5/1/2007 5:39:03 PM , Rating: 2
Are there any consumer applications that can take advantage of these e.g. Folding@Home?




RE: What do Normal People use this for?
By fk49 on 5/1/2007 5:44:51 PM , Rating: 2
Haha if the other poster's figure of $7000 is correct, this is an awfully expensive way to increase your F@H score.


RE: What do Normal People use this for?
By BladeVenom on 5/1/2007 10:10:32 PM , Rating: 2
But if it did make it to the consumer market, prices would come down dramatically.


By Griswold on 5/2/2007 2:54:04 AM , Rating: 3
Chicken and egg scenario. There is no consumer market for it, therefore it will not make its way to the consumer market, and thus the prices won't fall because of that.


By Spyvie on 5/1/2007 5:48:08 PM , Rating: 5
Forget curing disease, Imagine the Super Pi 1m scores!


RE: What do Normal People use this for?
By rsmech on 5/1/2007 5:50:11 PM , Rating: 2
I thought folding@home was for using someone's (consumer) hardware while it was idle. So is folding@home a consumer use of hardware? It's a great use for hardware downtime, but is it a reason someone would buy certain hardware? You buy hardware for your own use & "donate" its downtime. So I don't think folding@home would be a consumer driver behind this.

Just my 2 cents.


RE: What do Normal People use this for?
By bunnyfubbles on 5/1/2007 8:35:36 PM , Rating: 2
Yet I know of guys who build rigs that are pretty much only used for DC stuff like F@H. While it's true that their main rig is used mostly for their own use and thus they "donate its downtime", they'll have another rig (if not multiple ones) set aside just for 24/7 crunching.

I'd agree with the other guy though, unless these things are as fast as their price tag, it'd be much wiser for such guys to just buy a couple of PS3s or Radeon X1900 rigs for something like F@H...