


More "Penryn" details emerge

Despite the wealth of attention Penryn has received over the last few weeks, Intel's newest roadmaps put the processor launch in Q1'08.  This indicates the launch has not necessarily accelerated, even though the initial tape-out proved extremely successful.

On the other hand, Intel's 2008 roadmap shows every segment simultaneously deploying 45nm products.  Like AMD's recent 65nm Brisbane launch, Intel guidance notes the processors will start shipping Q4'07 but the actual launch will come as a coordinated 2008 event.

The first Intel 45nm parts will come from the quad-core Yorkfield and dual-core Wolfdale desktop processors.  Wolfdale has two physical cores on a single die and up to 6MB of L2 cache.  Yorkfield is then two Wolfdale dice on a single package. Also worth noting: Wolfdale ships with a 1333MHz front-side bus and Yorkfield ships with a 1066MHz front-side bus.  Chipset support will largely come from the Bearlake family that was previously disclosed on DailyTech.

Perhaps the most interesting thing about these two processors is the return of Hyper-Threading.  This, however, does not mean that Yorkfield will appear as eight logical cores, nor does it mean Wolfdale will appear as four logical cores. Intel's internal guidance on the subject specifically claims the processor will ship with Hyper-Threading, but will only utilize 4 threads.  On every Intel roadmap in the past, Hyper-Threading doubles the number of listed threads in the guidance documentation.  Clearly, there is still more of a mystery here.  (Update: Please read the retraction below.)

"The official company policy is that our engineers have left the door open for Hyper-Threading, but we cannot confirm or deny any future plans for the technology," adds Intel Public Relations Manager Dan Snyder.

All Penryn cores also include Intel TXT, previously known as Intel LaGrande Technology.  TXT stands for Trusted Execution Technology and refers to a collection of security components.  The Trusted Platform Module, or TPM, is one component; DMA page protection is another.

Even if 2008 seems like a long time away for the 45nm platform, it's important to note that all Intel platforms will have 45nm SKUs in Q1'08.  Penryn, the family name for Intel's first-generation 45nm consumer CPUs, also refers specifically to the 45nm dual-core mobile CPU.  Intel's current roadmap claims this processor will lead the Q1'08 mobile push, with several low-voltage models coming one quarter later.

For servers, Wolfdale will make an appearance as a dual and single socket Xeon.  It's been long-standing Intel policy to separate desktop, mobile and server chipsets into different products; Conroe was the Core 2 desktop CPU and Woodcrest, though physically nearly identical, was the Xeon counterpart.  Wolfdale as a server and a desktop CPU indicates the chips are electrically identical -- though each will likely receive different packaging for the different sockets. 

Yorkfield will not receive the same codenaming treatment as Wolfdale on the server. Instead, Harpertown will be the quad-core Xeon for two-socket servers.  Yorkfield will still be the company's single-socket quad-core Xeon offering.

Update 01/31/2007:  Channel sources have reached out to DailyTech to emphasize that the addition of Hyper-Threading to Penryn-family processors in 2008 is incorrect and the result of dated channel data.  My feelings and thoughts about the retraction can be read on my blog.





RE: Until there is proper software, HT is over rated
By rqle on 1/30/2007 8:12:36 PM , Rating: 2
It also helps in many cases, with the option to disable it if you don't like it. Free is better than none, IMO.


By Viditor on 1/30/2007 8:35:00 PM , Rating: 2
On a multicore chip, I can't see HT doing very much for performance until CSI and Nehalem are released in 2008/9 (remember that the bottleneck here is the FSB).

On another note though, it's nice that Intel has finally put the wild speculation about an early Penryn release to bed...

Intel's newest roadmaps put the processor launch for Q1'08. This indicates the launch has not necessarily accelerated even though the initial tape-out proved extremely successful.


By saratoga on 1/30/2007 8:38:53 PM , Rating: 2
One of the nice things about HT is that it can help hide slow memory, provided you've got enough cache. Look at what MS/IBM did with the Xenon CPU on the Xbox360. Tons of higher latency memory, combined with P4-like clock speed, but very well done cache and HT to help hide that.


By Viditor on 1/30/2007 9:00:42 PM , Rating: 3
quote:
One of the nice things about HT is that it can help hide slow memory, provided you've got enough cache


But C2D's more intelligent prefetch mechanism and memory disambiguation (along with the huge cache) have already accomplished this quite well. I don't see HT making any improvements here...


By saratoga on 1/30/2007 9:33:30 PM , Rating: 2
You're kidding, right? You really think Conroe has solved the memory latency problem? It's got good OOOE, but still nowhere near good enough to cover missing an L2 cache line. Not to mention the issue of ILP limitations on a wide core like that. There's still enormous room for improvement.


By Viditor on 1/30/2007 10:10:28 PM , Rating: 2
Agreed, but I don't see where HT will help with those issues...
Could you elaborate?


By saratoga on 1/31/2007 3:20:18 PM , Rating: 2
HT lets you fill in unused issue width with OPs from another thread. So if you stall waiting on a cache miss (or even a cache hit) you can continue to work on another thread rather than letting the CPU grind to a halt for 10-100 clock cycles.
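That latency-hiding effect can be sketched with a toy model. Everything here is an assumption made purely for illustration (a single-issue core, a flat 20-cycle miss penalty, free switching between two hardware threads); real SMT hardware shares issue slots in a far more complicated way.

```python
# Toy model of SMT latency hiding. Assumptions: one op issues per cycle,
# a cache miss blocks its thread for MISS_PENALTY cycles, and the core
# can switch to the other hardware thread for free.
MISS_PENALTY = 20

def run_serial(threads):
    """No SMT: stall cycles while waiting on a miss are simply wasted."""
    cycles = 0
    for ops in threads:
        for op in ops:
            cycles += MISS_PENALTY if op == "miss" else 1
    return cycles

def run_smt(threads):
    """Two hardware contexts: while one thread waits on a miss, the
    core issues ops from the other thread instead of idling."""
    queues = [list(t) for t in threads]
    stall = [0, 0]
    cycles = 0
    while any(queues) or any(stall):
        issued = False
        for i in (0, 1):
            if not issued and stall[i] == 0 and queues[i]:
                op = queues[i].pop(0)
                if op == "miss":
                    stall[i] = MISS_PENALTY  # blocked until the data arrives
                issued = True
        cycles += 1
        stall = [max(0, s - 1) for s in stall]
    return cycles

work = [["miss", "op", "op"], ["miss", "op", "op"]]
print(run_serial(work), run_smt(work))  # 44 vs 24 with these numbers
```

With these made-up numbers, the SMT core finishes the same work in roughly half the cycles, because one thread's miss latency is overlapped with the other thread's useful work.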


By Viditor on 1/31/2007 9:48:08 PM , Rating: 3
quote:
HT lets you fill in unused issue width with OPs from another thread. So if you stall waiting on a cache miss (or even a cache hit) you can continue to work on another thread rather than letting the CPU grind to a halt for 10-100 clock cycles


True...and in a single core environment this would (and is) usually a very good thing. But the scheduler (and apps) sees the HT virtual core as just another core. So if you are running 2 simultaneous threads, there is no method by which you can direct them to the actual core instead of the virtual core. I'm sure that you'd agree that only using a single core efficiently instead of splitting the work between 2 cores kind of defeats the purpose of dual (or more significantly quad) core chips.
It's true that you can write affinity code for the OS, but the applications all need to know what that means as well...and that's just not the case at the moment.

In a nutshell, in this case the whole is not the sum of its parts...:)
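The affinity workaround mentioned above can be illustrated with a small sketch. This uses Python's `os.sched_setaffinity`, which exists on Linux rather than the Windows schedulers being discussed; which logical CPU to pin to is an arbitrary choice here, since which logical CPUs share a physical core varies by machine.

```python
import os

def pin_to_one_logical_cpu():
    """Pin the current process to a single logical CPU, so the OS cannot
    spread its threads onto a sibling hyper-threaded context. Returns the
    new affinity set, or None where the API is unavailable (e.g. Windows)."""
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(0)   # logical CPUs we may run on
    target = {min(allowed)}             # arbitrary pick for this sketch
    os.sched_setaffinity(0, target)     # restrict this process to it
    return os.sched_getaffinity(0)

print(pin_to_one_logical_cpu())
```

This only steers the OS scheduler; as the thread notes, the application itself still has to know which logical CPUs are hyper-threaded siblings for the pinning to be meaningful.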


By intangir on 2/1/2007 6:40:23 PM , Rating: 3
The applications don't need to know. Even the process scheduler in Windows XP SP2 will automatically prioritize thread assignments taking into account whether the core is already occupied. Where's the problem?


By saratoga on 2/2/2007 6:13:24 PM , Rating: 2
quote:
True...and in a single core environment this would (and is) usually a very good thing. But the scheduler (and apps) sees the HT virtual core as just another core. So if you are running 2 simultaneous threads, there is no method by which you can direct them to the actual core instead of the virtual core.


Actually, I think Vista can do this; however, it's not really all that useful.

quote:
I'm sure that you'd agree that only using a single core efficiently instead of splitting the work between 2 cores kind of defeats the purpose of dual (or more significantly quad) core chips.


Absolutely not. If you have 8 threads available on a quad chip, you sure as hell want to issue all 8. See my point above. Having more threads means less pain when you have to hit main memory, or when you mispredict a branch. Essentially, if you want to use a wide/deep core efficiently, you basically need to have SMT (as Sun, MS and IBM have shown with their products).

Intel may have botched their SMT with HT thanks to the crappiness of the P4, but that's not generally the case, which is why so many new cores coming out have SMT.



By scrapsma54 on 1/31/2007 6:56:49 PM , Rating: 3
HT is useful. Anybody who is in computer science knows that processor parallelism has always been the best road to take when developing a computer. HT is a simulation of parallelism: it takes 2 threads and splits them up, then the processor processes the code, alternating between each split. This allows near-parallel operation and conserves CPU usage. With the Core 2 Duo series, the CPU usage is split and the second core processes the data. With hyper-threading, I would expect CPU usage when playing a game to be cut down to 1/3 of what it would be without it.


By scrapsma54 on 1/31/2007 7:02:58 PM , Rating: 2
But anyway, the whole hyper-threading thing will probably never happen, since it was incorrect information.


By Thorburn on 2/1/2007 2:41:27 PM , Rating: 2
Not on Core, but it should be back with Nehalem.


By Viditor on 1/31/2007 9:52:13 PM , Rating: 2
quote:
Ht is useful. Anybody who is computer science in knows that processor parallelism has always been the best road to take when developing a computer

I don't think anyone disputes that...
It's just not useful in ALL circumstances. In this case, it's a question of comparative efficiencies. Is it better to have more efficient individual cores, or is it better to have the multicore chip be more efficient as a whole?


By intangir on 2/1/2007 6:32:12 PM , Rating: 3
False dilemma. There is no reason you cannot have both. Having both is better than having either in isolation.


By SacredFist on 1/30/2007 10:09:35 PM , Rating: 4
Problem was, Windows didn't know core 1 and core 2 were technically on the same physical core. So instead of assigning threads to different physical cores (Core 1, Core 3, then Core 2 for HT), it crammed two threads onto one loaded processor, leaving another unused.


By Viditor on 1/30/2007 10:18:01 PM , Rating: 2
quote:
Problem was, Windows didn't know core 1 and core 2 were technically on the same core. So instead of assigning threads to a different core like Core 1 Core 3 THEN core 2 for HT, it crammed in two threads on a loaded processor leaving one unused

That's it exactly...
In TomZ's case, he's running so many threads that it really doesn't matter that much and can enhance performance...but for average use, HT can slow things down in a MC environment.


By intangir on 2/2/2007 11:34:37 AM , Rating: 3
If you're running a modern Windows, that should no longer be a problem. From a May 2003 Microsoft whitepaper:

quote:
To take advantage of this performance opportunity, the scheduler in the Windows Server 2003 family and Windows XP has been modified to identify HT processors and to favor dispatching threads onto inactive physical processors wherever possible.


http://www.microsoft.com/whdc/system/CEC/HT-Window...


RE: Until there is proper software, HT is over rated
By Phynaz on 1/31/2007 11:51:06 AM , Rating: 3
quote:
On a multicore chip, I can't see HT doing very much for performance until CSI and Nehalem are released in 2008/9 (remember that the bottleneck here is the FSB).


Sigh...

It's been shown again and again that there is no FSB bottleneck.

It would be really refreshing if you quit spreading your misinformation.


By Griswold on 1/31/2007 2:08:55 PM , Rating: 2
In multi-socket systems, thanks to the cache coherency traffic that has to go over the FSB, there is your non-existent bottleneck. It materializes as mediocre scalability.

See also:
http://www.anandtech.com/IT/showdoc.aspx?i=2897&p=...

This may not be relevant to somebody like you, but to others it is.


RE: Until there is proper software, HT is over rated
By Phynaz on 1/31/2007 2:13:12 PM , Rating: 2
Back up what you are saying. State how much bandwidth cache coherency traffic takes up on the FSB.


By saratoga on 1/31/2007 3:23:30 PM , Rating: 2
quote:
Backup what you are saying.


He did . . .

Let me guess, you didn't actually read the link?

quote:
State how much bandwidth cache coherency traffic takes up on the FSB.


That depends on the load chosen. There's no one number, so your question doesn't even have a specific answer. You might as well ask how much memory bandwidth is enough, how many cores benchmarks should use, or how much cache is ideal. You need to define your load, since the answer depends on what you're doing.


By Viditor on 1/31/2007 8:55:06 PM , Rating: 2
Phynaz, I don't know if you realize it but you've asked this question before in another thread.
There, we were talking about scalability...
Here was my reply to your request for proof on that thread:

"Here is a first indication that quad core Xeon does not scale as well as the other systems. Two 2.4GHz Opteron 880 processors are as fast as one Xeon 5345, but four Opterons outperform the dual quad core Xeon by 16%. In other words, the quad Opteron system scales 31% better than the Xeon system"
http://tinyurl.com/2cgnj8

Please note that as you add cores and sockets that use the FSB, in other words scale, the relative performance of the Opteron gains dramatically.
If you think about it you'd realize that since the cores haven't changed, only the load on the FSB could account for this.
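The arithmetic behind a "scales N% better" claim is easy to make explicit. The throughput scores below are hypothetical, chosen only to show the calculation; they are not the AnandTech results quoted above.

```python
# Hypothetical throughput scores; only the method matters, not the values.
perf = {
    "opteron": {2: 100, 4: 190},   # score at 2 and 4 sockets
    "xeon":    {1: 100, 2: 145},   # score at 1 and 2 sockets
}

def scaling(platform, lo, hi):
    """Speedup gained from doubling the socket (or core) count."""
    return perf[platform][hi] / perf[platform][lo]

opteron = scaling("opteron", 2, 4)   # 1.90x from doubling sockets
xeon = scaling("xeon", 1, 2)         # 1.45x from doubling sockets
print(f"Opteron scales {opteron / xeon - 1:.0%} better")  # 31% here
```

The point of the comparison is that the per-core hardware is held constant on each platform, so a gap in the doubling speedup isolates the interconnect and memory subsystem rather than the cores themselves.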


By coldpower27 on 2/1/2007 9:03:55 AM , Rating: 2
Interesting; this problem doesn't seem to manifest itself until you reach a 2P system with dual Clovertowns.

Now I can understand how AMD can claim the 40% improvement over Clovertown, with Barcelona "in a wide variety of workloads".

It will be interesting to see how much this increase diminishes, assuming AMD's numbers for Barcelona vs. Clovertown performance are correct, when you make the comparison between the Agena quad-core and the Kentsfield quad in single-socket desktop systems.

The FSB isn't an issue on a single socket; however, it seems Clovertown suffers from poorer scaling as you increase the number of sockets.

So overall, is the FSB an issue? Not in the single-socket arena, but when you are talking two sockets or more, something is weakening Clovertown's scaling ability. It could also be because Clovertown has to work with FB-DIMM technology, which has higher latency than the unbuffered DDR2 used on the desktop.


By Viditor on 2/1/2007 12:33:32 PM , Rating: 2
Good points CP.
I agree that AMD probably chose Clovertown as a comparison very carefully, and 40% really isn't unbelievable for this specific comparison.
One thing I've been saying all along is that if the K10 core is only equivalent to C2D, then AMD will still have a much better spec because of the platform. Of course, my comments were incomplete...
I should have qualified that I was speaking of servers, mainly because of the scaling.

As to FB-DIMMs being an issue, there is something to that...but it seems to me that the latency doesn't come close to accounting for the large difference in scaling.
I am still wondering why Intel went with a high-latency/high-bandwidth memory model instead of a low-latency/low-bandwidth one...


By Phynaz on 2/1/2007 1:14:59 PM , Rating: 2
K10?

Ummmm....This is K8L we're talking about, not some fanciful chip that's always five years away.


By Viditor on 2/1/2007 1:30:11 PM , Rating: 2
quote:
Ummmm....This is K8L we're talking about, not some fanciful chip that's always five years away


Ummm...there is no K8L. The next-gen chip coming out next quarter (Barcelona) is a K10 chip...

"Again, AMD has explicitly told me its native quad-core chips will be K10, not K8. That's from their Technical Director - Sales and Marketing EMEA, so isn't likely to be wrong"
http://forums.hexus.net/showthread.php?t=92137&pag...

But a "Rose by any other name"...let's just call it Barcelona.


By Phynaz on 2/1/2007 10:33:07 AM , Rating: 2
Yes, I have asked this question before.

I hardly consider one article from Anandtech to be proof of anything.

For example, where in the article is empirical evidence that the FSB has become saturated or is a bottleneck?

Answer: There isn't any, it's all conjecture.


By Viditor on 2/1/2007 12:17:51 PM , Rating: 2
quote:
where in the article is empirical evidence that the FSB has become saturated or is a bottleneck?
Answer: There isn't any, it's all conjecture


Well no actually, it's a hypothesis backed by scientific data that corroborates the conclusion.

It might be helpful if, instead of us taking your word for it, you could offer something of substance to support your assertion that:
"It's been shown again and again that there is no FSB bottleneck"


By Phynaz on 2/1/2007 1:13:30 PM , Rating: 2
Yeah, prove a negative, that will work.

Please point me to this scientific evidence you speak about.

Even though I'm not the one making the statements (you are), I'll give you my evidence.

Using Intel analysis tools, running business applications, I see FSB utilization in the 15% area on a dual-core HP system. Running the same on a quad bumps the utilization to 18%-20%. Business applications run out of CPU long before they run out of FSB bandwidth.



By Viditor on 2/1/2007 1:38:14 PM , Rating: 2
quote:
Yeah, prove a negative, that will work

Sigh...it's not proving a negative, it's demonstrating that data throughput on an FSB-based system is equivalent to a HyperTransport-based system.

I remind you that you are the one who said:

"It's been shown again and again that there is no FSB bottleneck"

All I'm asking for is for you to show that...seems reasonable to me.


By saratoga on 2/2/2007 6:24:55 PM , Rating: 2
quote:
Please point me to this scientific evidence you speak about.


What kind of logic is this? You said there was evidence, so you provide it. Don't say it exists and then expect other people to find it for you.

quote:
Even though I'm not the one making the statements (you are)


You mean "making claims". This whole post of yours is a statement.

Anyway, remember when you posted this:

"It's been shown again and again that there is no FSB bottleneck."

So you did make a claim. Now back it up or retract it.

quote:
Using Intel analysis tools, running business applications, I see FSB utilization in the 15% area on a dual-core HP system. Running the same on a quad bumps the utilization to 18%-20%. Business applications run out of CPU long before they run out of FSB bandwidth.


All that proves is that your specific business app isn't constrained. No one doubted that. Rather, your claim that the Core 2 is not limited on any workload is what's in doubt. Unless you've got some reason to think your case is relevant to the rest of the world, there's no sense in even mentioning it.



Related Articles
Recent Intel Tidings, Retractions
January 31, 2007, 9:38 AM
Life With "Penryn"
January 27, 2007, 12:01 AM
Intel 45nm "Penryn" Tape-Out Runs Windows
January 10, 2007, 2:13 AM
AMD Announces "Brisbane" 65nm Processors
December 5, 2006, 1:27 AM












