Gordon Moore's prediction of doubling transistor counts every 2 years revolutionized the computer industry and his company, Intel. (Source: New York Times)

An NVIDIA VP is declaring Moore's Law dead and GPUs the only hope for the industry. (Source: TechCrunch)

In NVIDIA's eye the parallelism of the GPU is the only future for computing

NVIDIA
has struggled this time around in the GPU war. Its first
DirectX 11 products were
delivered a full seven months after AMD's.
While its new units are at last trickling onto the market and are
very powerful, they're also hot, loud, and power hogs. However,
NVIDIA is staking much on the prediction that the computer industry
will be ditching traditional architectures and moving towards
parallel designs; a movement which it sees its CUDA GPU computing as
an ideal solution for.

Intel and NVIDIA have long
traded jabs, and Intel's recent failed
GPU bid, Larrabee, does
little to warm to the ice. In a recent op-ed entitled "Life
After Moore's Law", published in Forbes,
NVIDIA VP Bill Dally attacks the very foundation of Intel's business
-- Moore's Law -- declaring it dead.

Moore's Law stemmed from
a paper [PDF]
published by Gordon Moore 45 years ago this month. Moore,
co-founder of Intel, predicted in the paper that the number of
transistors per area on a circuit would double every 2 years (later
revised to 18 months). This prediction was later extend to
predict that computing power would roughly double every 18 months, a
prediction that became known as Moore's Law.

Now with die
shrinks becoming
more problematic, NVIDIA is convinced the end is nigh for Moore's
Law (and Intel). Writes Dally:

Moore's
paper also contained another prediction that has received far less
attention over the years. He projected that the amount of energy
consumed by each unit of computing would decrease as the number of
transistors increased. This enabled computing performance to scale up
while the electrical power consumed remained constant. This power
scaling, in addition to transistor scaling, is needed to scale CPU
performance.

But
in a development that's been largely overlooked, this power scaling
has ended. And as a result, the CPU scaling predicted by Moore's Law
is now dead. CPU performance no longer doubles every 18 months. And
that poses a grave threat to the many industries that rely on the
historic growth in computing performance.

Dally
says that the only near-term hope for the computer industry now that
Moore's Law is "over" is parallel computing -- splitting
workloads up among a variety of processors. However, he derides
multi-core efforts by AMD and Intel, stating, "Building a
parallel computer by connecting two to 12 conventional CPUs optimized
for serial performance, an approach often called multi-core, will not
work. This approach is analogous to trying to build an airplane by
putting wings on a train. Conventional serial CPUs are simply too
heavy (consume too much energy per instruction) to fly on parallel
programs and to continue historic scaling of performance."

He
concludes, "Let's enable the future of computing to fly--not
rumble along on trains with wings."

In other words, he
hopes you will buy NVIDIA GPUs and join the "Moore's Law is
dead" party.

An example in C: for (i=2; i < n; i++) { x[i] = x[i-1] * 2; } This cannot be auto-paralellized by any compiler or pre-processor that I'm aware of. It appears at first to be sequential by nature because x[i] is dependent on the previous iteration. However, the same problem can be solved with: for (i=2; i < n; i++) { x[i] = x[1] * (2 ** (i-1)); } which can be auto-parallelized. The auto-parallelization is not smart enough to find a different solution to the same problem so that it may be parallelized. Thus, it is either up to the programmer to write sequential code that can be auto-parallelized, or it is up to the programmer to write parallel code to begin with. Is one more difficult than the other? I believe it depends on training and mindset.

We can't wait around on some clever computer scientist to solve this sort of AI problem, just as physicists can't wait around for the next Einstein to reconcile quantum mechanics and general relativity. In the meanwhile, it makes more sense to struggle through the learning curve and make the best use of what we have, rather than continuing to train students solely in sequential methods in the hope that someone will soon devise a really smart auto-parallelization tool.

for (i=2; i < n; i++) { x[i] = x[1] * (2 ** (i-1)); }

Actually that is worse than the original, even on a parallel machine, since you still have an implicit loop in the power of two. Using shifts, what you really want to do is evaluate:

for (i=2; i < n; i++) { x[i] = x[1]<<(i-1); }

Of course, you really should put an overflow check in there, since it will overflow pretty quickly if x[1] /= 0.

A better (and eminently practical) example is to try to implement the Simplex algorithm for solving linear programming problems efficiently. Even if you have a very good BLAS, such as the goto BLAS, or a version of the ATLAS BLAS tuned to your system, a good deal of optimization in Simplex code concerns itself with representing the basis as a (square) matrix times a series of vectors.* Every so often you need to "regenerate" the basis by inverting a matrix, and how often you need to do that depends on how stiff the current basis is, so you want to keep the determinant of the basis around from every pivot since the last regeneration. (Or some growing function thereof.)

You cannot, unfortunately, combine multiple changes to the basis or do them in parallel. You can, and must extract some of the parallelism from the problem for any serious sized LP problem, but the best way to do that may be a property of the actual LP problem being solved. For example, transportation problems are a special case of LP problems which allow for faster solutions, and oil companies typically have huge models including both transportation and refinery operations. In that case the current solution is used as a starting point for the next iteration.

In other words, the major job of the programmer is not to produce some correct implementation of the Simplex algorithm, but to produce a valid implementation which is well suited to the actual hardware and problem domain. Writing the final code is probably the easiest part of the job. Or you can get "canned" programs, and play with the optimization settings. It is often easier to write the code yourself, since you need to know it almost by heart to do a decent job on the tuning.

* If you are not familiar with linear programming, it is the problem of finding a maximum value of the objective function over a--usually large--set of variables, subject to a set of linear constraints. It was discovered in the 1940's that the optimal solution to a linear programming problem can will always have at most one non-zero variable for each constraint. Eliminating all the other (zero) variables, you now have a simple linear algebra problem to find the correct values for the "free" variables. The problem, of course, is first determining what that set is.