Contemporary CPU Architectures Compared
  #1 (permalink)  
Old 06-16-2007, 12:53 AM
qcmadness
Super Member
Join Date: Jun 2007
Location: Hong Kong
Posts: 1,143

Introduction
Here are some architectural highlights of current and upcoming Intel and AMD CPUs. The following CPU architectures will be compared:
1. AMD K8 / Hammer (released in 2003) - Hammer
2. Intel Core Architecture (released in 2006) - Core Arch.
3. AMD K8L (?) / K10 (?), the next generation architecture (to be released in 2007) - NGA
4. Intel Core Architecture update, the Penryn / Wolfdale family (to be released in 2007 / 2008) - Penryn

Last updated: 22nd August, 2007

Special thanks to Pippero and Clue69Less for corrections.

The architectural highlights:
1. Processor manufacturing technology:

Hammer: 130nm / 90nm / 65nm SOI, 9 metal layers
Core Arch.: 65nm, 45nm in 2007 H2, 8 metal layers
NGA: 65nm SOI, 45nm SOI in mid-2008, 11 metal layers
Penryn: 45nm with high-K design in 2007 H2, unknown number of metal layers

2. Cache system
Hammer:
L1 cache: 64KB data + 64KB instruction, 2-way, latency: 3 cycles
L2 cache: 512KB, 16-way, 128-bit (32GB/s at 2GHz), latency: 12 cycles (90nm version)
L3 cache: absent
Core Arch.:
L1 cache: 32KB, 8-way, latency: 3 cycles
L2 cache: 2-4MB shared for 2 cores, 16-way, 256-bit (64GB/s at 2GHz), latency: 12-14 cycles
L3 cache: absent
NGA:
L1 cache: 64KB data + 64KB instruction, 2-way, latency: 3 cycles
L2 cache: 512KB, 16-way, 256-bit (64GB/s at 2GHz), latency: unknown
L3 cache: 2MB shared, 32-way, unknown width and latency
Penryn:
L1 cache: 32KB, 8-way, latency: 3 cycles (expected to be the same as Core Arch.)
L2 cache: 3-6MB shared for 2 cores, 24-way (?), 256-bit (96GB/s at 3GHz), latency: slightly lower than Core Arch.
L3 cache: absent
Special feature: "Split Load Cache Enhancement"
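
As a quick sanity check on the bandwidth figures in this section, here is a minimal Python sketch. It assumes the quoted GB/s numbers are simply bus width times core clock, with one full-width transfer per cycle. That assumption and the function name `peak_bandwidth_gb_s` are mine and are not taken from the referenced articles.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, clock_ghz: float) -> float:
    """Peak cache bandwidth in GB/s, assuming one full-width transfer per cycle."""
    bytes_per_cycle = bus_width_bits / 8
    return bytes_per_cycle * clock_ghz


# Bus widths and clocks as quoted in the cache section above.
examples = {
    "Hammer L2 (128-bit at 2GHz)": peak_bandwidth_gb_s(128, 2.0),
    "Core Arch. L2 (256-bit at 2GHz)": peak_bandwidth_gb_s(256, 2.0),
    "Penryn L2 (256-bit at 3GHz)": peak_bandwidth_gb_s(256, 3.0),
}

for name, gb_s in examples.items():
    print(f"{name}: {gb_s:.0f} GB/s")
```

Real sustained bandwidth will be lower than these peak numbers, since the sketch ignores latency and access patterns.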

3. x86 decoding ability
Hammer:
x86 decoders: 3 complex
Out-of-order execution buffer: 72 general instructions, 36 FP instructions and 24 Integer instructions
Core Arch.:
x86 decoders: 3 simple + 1 complex (the complex decoder can decode 2 simple instructions in a pass)
Out-of-order execution buffer: 96 instructions
NGA:
x86 decoders: 3 complex
Out-of-order execution buffer: 72 general instructions, 36 FP instructions and 24 Integer instructions
Penryn:
x86 decoders: 3 simple + 1 complex (the complex decoder can decode 2 simple instructions in a pass)
Out-of-order execution buffer: 96 instructions
(expected to be the same as Core Arch.)

4. ALU, FPU and SSE units
Hammer:
ALU units: 3
SSE units: 2 units, 64-bit
SSE versions supported: SSE, SSE2 (all Hammer versions), SSE3 (for Rev. E and later)
Core Arch.:
ALU units: 3
SSE units: 3 units, 128-bit
SSE versions supported: SSE, SSE2, SSE3, SSSE3 (Supplemental SSE3)
NGA:
ALU units: 3
SSE units: 2 units, 128-bit
SSE versions supported: SSE, SSE2, SSE3, SSE4A (AMD-specific instructions, not a subset of Intel's SSE4)
Penryn:
ALU units: 3
SSE units: 3 units, 128-bit
SSE versions supported: SSE, SSE2, SSE3, SSSE3, SSE4.1

5. Pre-fetch and other tune-ups
Hammer:
Out-of-order loads: absent
Stack manager: absent
Pre-fetchers: 1 data, 1 instruction (to L2 cache)
Instruction fetch width: 16 bytes per cycle
Core Arch.:
Out-of-order loads: present
Stack manager: present
Pre-fetchers: 2 data, 1 instruction (to core), 2 pre-fetchers (to L2 cache)
Instruction fetch width: 24 bytes per cycle
NGA:
Out-of-order loads: present
Stack manager: present
Pre-fetchers: 1 data, 1 instruction (to L1 cache), 1 DRAM pre-fetcher (to dedicated buffer)
Instruction fetch width: 32 bytes per cycle
Penryn:
Out-of-order loads: present, with a shuffle engine to optimize for SSEx
Stack manager: present
Pre-fetchers: 2 data, 1 instruction (to core), 2 pre-fetchers (to L2 cache)
Instruction fetch width: 24 bytes per cycle

6. Memory controller
Hammer: 1x128-bit memory controller (1 operation per cycle)
Core Arch.: absent
NGA: 2x64-bit memory controller with NUMA (max 2 operations per cycle), can change back to 1x128-bit mode
Penryn: absent

7. Power management
Hammer: Cool'n'Quiet (min. x5 multiplier)
Core Arch.: EIST (min. x6 multiplier), switches off transistors when not in use
NGA: improved C'n'Q, two separate power planes for crossbar and cores, separate clocks for each core
Penryn: EIST (?), switches off transistors when not in use, C6 state, separate clocks for each core (the core frequency may exceed the rated frequency)

Reference:
AnandTech: Intel Core versus AMD's K8 architecture
AnandTech: Barcelona Architecture: AMD on the Counterattack
AnandTech: Intel: More Details on Penryn and Nehalem
DailyTech - DailyTech Digest: Intel's "Penryn"
http://www.amd.com/us-en/assets/cont...docs/25112.PDF
http://www.amd.com/us-en/assets/cont...docs/40546.pdf
http://www.intel.com/design/processo...als/248966.pdf
AnandTech: The Penryn Preview - Part I: Wolfdale Performance

Last edited by qcmadness; 02-21-2008 at 06:11 AM. Reason: Major corrections

  #2 (permalink)  
Old 06-16-2007, 01:08 AM
verndewd
Forum Master
Join Date: Jun 2007
Posts: 5,341
suhweet. I swear does it get any better among peers?

  #3 (permalink)  
Old 06-16-2007, 02:16 AM
ColonelCain
Member
50,000 Points
Join Date: Jun 2007
Location: Probably on a football field somewhere in Arizona
Posts: 696
Nice article! Everyone's been trying to compare these architectures in lengthy articles, but here we finally have the info in a comparison table.

I second the suhweet vern.

  #4 (permalink)  
Old 06-16-2007, 12:21 PM
qcmadness
Super Member
Join Date: Jun 2007
Location: Hong Kong
Posts: 1,143
But there is still some uncertainty in this post...

  #5 (permalink)  
Old 06-16-2007, 02:02 PM
JumpingJack
XCPUs.com Moderator
500,000 Points
Join Date: Jun 2007
Posts: 2,885
QC, again, compliments on a great job.

  #6 (permalink)  
Old 06-18-2007, 07:48 AM
Pippero
Member
Join Date: Jun 2007
Posts: 595
Nice job!
But some data in sections 3. and 4. seems inaccurate to me.
I'll get back to this later.

  #7 (permalink)  
Old 06-18-2007, 09:06 AM
qcmadness
Super Member
Join Date: Jun 2007
Location: Hong Kong
Posts: 1,143
Quote:
Originally Posted by Pippero View Post
Nice job!
But some data in sections 3. and 4. seems inaccurate to me.
I'll get back to this later.

Your contribution to the thread is welcome.

  #8 (permalink)  
Old 06-25-2007, 01:12 PM
Pippero
Member
Join Date: Jun 2007
Posts: 595
Ok, so:

Quote:
NGA:
ALU units: 3
Maximum dual-precision FP per cycle: 3
SSE units: 2 units, 128-bit
SSE versions supported: SSE, SSE2, SSE3, SSE4A (part of SSE4 with some Core Arch. specific codes removed)
Max SSE executions per cycle: 2
Penryn:
ALU units: 3
Maximum dual-precision (64-bit) FP per cycle: probably higher than 4 with "Radix 16"
SSE units: 3 units, 128-bit
SSE versions supported: SSE, SSE2, SSE3, SSE4
Max SSE executions per cycle: probably higher than 2 with "Radix 16"
Radix 16 nearly halves the latency of a division, or of any operation that needs the divider (link).
This does not increase the peak number of floating point operations performed per clock, as was reported ("probably higher than 4 with 'Radix 16'").
The peak number of floating point operations is reached when the CPU executes an addition, a multiplication, a load and a store in the same cycle.
Division is usually executed on the multiplication port, takes tens of clocks, and its execution cannot be overlapped with other operations on the same port.
What this means is that a divider with a shorter latency does not improve the peak FP throughput of a CPU, but it does improve real-world performance, because the pipelines are busy for fewer clocks, and the freed clocks can be used to process other (fast) FP operations.
From this old comparison it turns out that K7's division had half the latency of P4's (here we are talking only about the execution latency of the divider, not the general latency of the pipeline, which is even longer on the P4), which makes me think that AMD has been using a Radix 16 divider since the K7 days (but I have no information about the DIV latency on K10).
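
To illustrate the point about divider latency, here is a toy model in Python. It is only a sketch: it assumes multiplies are fully pipelined (one per cycle on the multiplication port) and that each division blocks that port for its whole latency. The latencies of 40 and 20 clocks are made-up round numbers, not measurements of any of these CPUs, and `mul_port_busy_cycles` is a hypothetical helper.

```python
def mul_port_busy_cycles(n_mul: int, n_div: int, div_latency: int) -> int:
    """Cycles the multiplication port is occupied by a mix of MULs and DIVs.

    Assumption: each MUL occupies the port for 1 cycle (pipelined), while
    each DIV blocks the port for div_latency cycles (not pipelined).
    """
    return n_mul + n_div * div_latency


# Pure multiply stream: the divider latency does not matter at all,
# so the peak throughput is the same for both dividers.
peak_slow = mul_port_busy_cycles(n_mul=1000, n_div=0, div_latency=40)
peak_fast = mul_port_busy_cycles(n_mul=1000, n_div=0, div_latency=20)

# Mixed stream with a few divisions: the faster divider frees the port sooner.
mixed_slow = mul_port_busy_cycles(n_mul=1000, n_div=10, div_latency=40)
mixed_fast = mul_port_busy_cycles(n_mul=1000, n_div=10, div_latency=20)

print("no divisions:", peak_slow, "vs", peak_fast, "cycles")
print("10 divisions:", mixed_slow, "vs", mixed_fast, "cycles")
```

In this model the pure multiply stream takes the same number of cycles either way, while the mixed stream finishes sooner with the shorter divide latency.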


Concerning the peak FP throughput of K10 vs. Core2/Penryn, the table incorrectly reports a disadvantage for AMD of 3 instructions per clock vs. 4.
In fact, they can both perform 1 FADD and 1 FMUL in parallel; C2D can also perform one 128-bit load and one 128-bit store in parallel; K10 at peak should be able to perform two 128-bit loads, but data on this is a bit vague.

Supershuffle:
Penryn introduces 128 bit SSE shuffle operations.
K10 can perform 2x 128 bit SSE shuffle operations, in fact it is 4x faster than K8 in this (AMD).

Quote:
3. x86 decoding ability
NGA:
x86 decoders: 3 complex
Out-of-order execution buffer: 72 instructions with improvements
Penryn:
x86 decoders: 3 simple + 1 complex (the complex decoder can decode 2 simple codes in a pass)
Out-of-order execution buffer: 96 instructions
(expected to be the same as Core Arch.)
In fact, the out of order buffers of Intel and AMD are organized in a very different way.
Intel uses a single reorder buffer with 96 entries, while AMD has a general scheduler with 72 entries, plus a dedicated 24-entry buffer for the integer pipelines and a 36-entry buffer for FP.
(link)

Quote:
5. Pre-fetch and other tune-ups
NGA:
Out-of-order loads: present
Stack manager: present
Pre-fetchers: 1 data, 1 instruction (to L1 cache), 1 DRAM pre-fetcher (to dedicated buffer)
Instruction fetch width: 32 byte per cycle
According to Realworldtech, Barcelona has 8 prefetchers per core, for a total of 32.

Last edited by Pippero; 06-25-2007 at 01:27 PM.

  #9 (permalink)  
Old 06-26-2007, 05:49 AM
qcmadness
Super Member
Join Date: Jun 2007
Location: Hong Kong
Posts: 1,143
Quote:
Originally Posted by Pippero View Post
Radix 16 nearly halves the latency of a division, or of any operation that needs the divider (link).
This does not increase the peak number of floating point operations performed per clock, as was reported ("probably higher than 4 with 'Radix 16'").
The peak number of floating point operations is reached when the CPU executes an addition, a multiplication, a load and a store in the same cycle.
Division is usually executed on the multiplication port, takes tens of clocks, and its execution cannot be overlapped with other operations on the same port.
What this means is that a divider with a shorter latency does not improve the peak FP throughput of a CPU, but it does improve real-world performance, because the pipelines are busy for fewer clocks, and the freed clocks can be used to process other (fast) FP operations.
From this old comparison it turns out that K7's division had half the latency of P4's (here we are talking only about the execution latency of the divider, not the general latency of the pipeline, which is even longer on the P4), which makes me think that AMD has been using a Radix 16 divider since the K7 days (but I have no information about the DIV latency on K10).
You are correct. I will update it soon.

Quote:
Concerning the peak FP throughput of K10 vs. Core2/Penryn, the table incorrectly reports a disadvantage for AMD of 3 instructions per clock vs. 4.
In fact, they can both perform 1 FADD and 1 FMUL in parallel; C2D can also perform one 128-bit load and one 128-bit store in parallel; K10 at peak should be able to perform two 128-bit loads, but data on this is a bit vague.
Once I get more information about this, I will update it.

Quote:
Supershuffle:
Penryn introduces 128 bit SSE shuffle operations.
K10 can perform 2x 128 bit SSE shuffle operations, in fact it is 4x faster than K8 in this (AMD).


In fact, the out of order buffers of Intel and AMD are organized in a very different way.
Intel uses a single reorder buffer with 96 entries, while AMD has a general scheduler with 72 entries, plus a dedicated 24-entry buffer for the integer pipelines and a 36-entry buffer for FP.
(link)
You are again correct. I will update it later.

Quote:


According to Realworldtech, Barcelona has 8 prefetchers per core, for a total of 32.
I got the information from this page:
http://www.anandtech.com/cpuchipsets...spx?i=2939&p=8

  #10 (permalink)  
Old 06-26-2007, 05:59 AM
qcmadness
Super Member
Join Date: Jun 2007
Location: Hong Kong
Posts: 1,143
Pippero: you can get the latency numbers at
http://www.amd.com/us-en/assets/cont...docs/40546.pdf

Appendix C

  #11 (permalink)  
Old 06-26-2007, 04:40 PM
Pippero
Member
Join Date: Jun 2007
Posts: 595
Thanks
From a quick look, it seems that the latency of division on K10 is pretty low, but I'll get back to this when I have a bit of time.
For now, there is still something to be fixed in this collection:
Quote:
4. ALU, FPU and SSE units
Hammer:
ALU units: 3
Maximum dual-precision (64-bit) FP per cycle: 3
SSE units: 2 units, 64-bit
SSE versions supported: SSE, SSE2 (all Hammer versions), SSE3 (for Rev. E and later)
Max SSE executions per cycle: 1
Core Arch.:
ALU units: 3
Maximum dual-precision FP per cycle: 4
SSE units: 3 units, 128-bit
SSE versions supported: SSE, SSE2, SSE3, SSSE3 (part of SSE4)
Max SSE executions per cycle: 2
NGA:
ALU units: 3
Maximum dual-precision FP per cycle: 3
SSE units: 2 units, 128-bit
SSE versions supported: SSE, SSE2, SSE3, SSE4A (part of SSE4 with some Core Arch. specific codes removed)
Max SSE executions per cycle: 2
Penryn:
ALU units: 3
Maximum dual-precision (64-bit) FP per cycle: 4 (higher efficiency with "Radix 16")
SSE units: 3 units, 128-bit
SSE versions supported: SSE, SSE2, SSE3, SSE4
Max SSE executions per cycle: 4 (higher efficiency with "Radix 16")
The part highlighted in bold is inconsistent, since Penryn and Merom have exactly the same capabilities in this area (the only exceptions being division latency and shuffle).
Also, K8 and K10 have 3 SSE units, FADD, FMUL and FMISC (which are the same 3 ports of the FP pipeline), and talking about "SSE executions per cycle" is a bit misleading.
In fact, K8 has the same number of "SSE executions per cycle" when executing scalar / 64-bit SSE instructions, but it has half the throughput with vector 128-bit instructions.
The "Maximum dual-precision (64-bit) FP per cycle" figure is again a bit misleading.
All the architectures presented can do only 2 "real" FP instructions per cycle at most (I mean number-crunching stuff like multiplication, addition, division, etc.).
The rest is load/store and move instructions, and here the situation is a bit confusing, because it's still not clear to me whether K10 can perform two 128-bit loads in parallel with 2 FP operations or not. A more in-depth analysis should also take load-execute instructions into account, where K10 is supposed to be more efficient thanks to its 32-byte fetch window.
I know I sound a bit obscure, but at the moment I can't dig deeper into the topic.
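
The scalar vs. vector point can be put into a small Python sketch. It assumes a simplified model in which a unit narrower than the operation needs several passes to finish it. `ops_per_cycle_per_unit` is a hypothetical helper, and the model ignores load/store and scheduling limits entirely.

```python
import math


def ops_per_cycle_per_unit(unit_width_bits: int, op_width_bits: int) -> float:
    """Operations one SSE unit can complete per cycle in this simplified model.

    Assumption: an operation wider than the unit is split into
    ceil(op_width / unit_width) passes, one pass per cycle.
    """
    passes = math.ceil(op_width_bits / unit_width_bits)
    return 1 / passes


for name, unit_width in (("K8 (64-bit units)", 64), ("K10 (128-bit units)", 128)):
    scalar = ops_per_cycle_per_unit(unit_width, 64)
    vector = ops_per_cycle_per_unit(unit_width, 128)
    print(f"{name}: scalar 64-bit {scalar:.1f}/cycle, vector 128-bit {vector:.1f}/cycle")
```

Under these assumptions a 64-bit unit matches a 128-bit unit on scalar work and drops to half the rate on 128-bit vector work.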

  #12 (permalink)  
Old 06-27-2007, 11:19 AM
qcmadness
Super Member
Join Date: Jun 2007
Location: Hong Kong
Posts: 1,143
Quote:
Originally Posted by Pippero View Post
Thanks
From a quick look, it seems that the latency of division on K10 is pretty low, but I'll get back to this when I have a bit of time.
For now, there is still something to be fixed in this collection:

The part highlighted in bold is inconsistent, since Penryn and Merom have exactly the same capabilities in this area (the only exceptions being division latency and shuffle).
Also, K8 and K10 have 3 SSE units, FADD, FMUL and FMISC (which are the same 3 ports of the FP pipeline), and talking about "SSE executions per cycle" is a bit misleading.
In fact, K8 has the same number of "SSE executions per cycle" when executing scalar / 64-bit SSE instructions, but it has half the throughput with vector 128-bit instructions.
The "Maximum dual-precision (64-bit) FP per cycle" figure is again a bit misleading.
All the architectures presented can do only 2 "real" FP instructions per cycle at most (I mean number-crunching stuff like multiplication, addition, division, etc.).
The rest is load/store and move instructions, and here the situation is a bit confusing, because it's still not clear to me whether K10 can perform two 128-bit loads in parallel with 2 FP operations or not. A more in-depth analysis should also take load-execute instructions into account, where K10 is supposed to be more efficient thanks to its 32-byte fetch window.
I know I sound a bit obscure, but at the moment I can't dig deeper into the topic.
I will quote the source later......

Quite busy these few days

  #13 (permalink)  
Old 06-27-2007, 01:53 PM
qcmadness
Super Member
Join Date: Jun 2007
Location: Hong Kong
Posts: 1,143
I think I need to rewrite that part......

Still reading the information in depth

  #14 (permalink)  
Old 07-03-2007, 10:25 AM
qcmadness
Super Member
Join Date: Jun 2007
Location: Hong Kong
Posts: 1,143
I will rewrite that part in a few days......

  #15 (permalink)  
Old 07-03-2007, 09:33 PM
Epsilon84
Member
Join Date: Jul 2007
Posts: 122
That's a treasure trove of valuable technical information QC, thanks for taking the time to compile it! I'm still learning about the technical/architectural aspects of CPUs and this quick 'reference card' is a good place to start!

  #16 (permalink)  
Old 07-19-2007, 09:16 AM
qcmadness
Super Member
Join Date: Jun 2007
Location: Hong Kong
Posts: 1,143
Finally cut the whole thing...

thx Pippero

  #17 (permalink)  
Old 07-19-2007, 09:35 AM
Clue69Less
XCPUs.com Editor
Join Date: Jul 2007
Location: Sunny Colorado
Posts: 7,004