64-bit Architectures

Recap: what is a 64-bit CPU?

Recall that the 'bit-ness' of a CPU actually refers to the size of its general purpose (GP) registers. A 64-bit CPU can therefore hold 64 bit numbers in a single register, meaning that it can store and operate on much larger chunks of data at once. The most obvious advantage of 64 bit computing is the ability to rapidly address massive amounts of memory, since 2 raised to the power of 64 gives us nearly 17 million terabytes of addressability.

Perhaps the most obvious area that would benefit from 64-bit computing is the database arena. It should be noted that 64-bit computing is possible using 32-bit CPUs, but this involves splitting 64 bit data across two registers, so 64-bit computing runs much more efficiently on 64-bit hardware. The take-home message, however, is that 64-bit hardware will yield very little performance benefit unless you are running 64 bit software. Windows XP comes in both 32 bit and 64 bit flavours, but hardware support for the 64 bit OS is still lacking. Similarly, various distributions of Linux come in both 32 bit and 64 bit versions. In fact, under Intel's IA-64 architecture, 32 bit code may well run slower...

Intel and AMD Forge their Plans

Both Intel and AMD have 64-bit offerings. Initially, these two offerings couldn't have been more different. AMD's 64-bit CPU builds on the existing x86 architecture, extending it to 64 bits. Intel, on the other hand, categorically stated that they would not extend the x86 architecture in this way and that their 64 bit CPUs would be built from scratch on a completely new architecture. This new architecture is known generically as IA-64 (Intel Architecture, 64-bit), just as the 32-bit generations of Intel CPUs can all be grouped as IA-32. While IA-64 would allow the processing of x86 instructions, it would do so through an emulation layer rather than natively, resulting in severe performance penalties.

As you will soon see if you keep reading, Intel couldn't afford to keep their promise!

Intel IA-64 and Itanium

The IA-64 architecture took nearly 10 years to develop, and finally arrived in the form of the Itanium chip.

The main architectural advancement in IA-64 is what is known as EPIC - Explicitly Parallel Instruction Computing. In 32-bit architectures, the objective is to perform as many instructions per clock cycle as possible. While this objective has not changed in IA-64, the way of going about it has. In IA-32 superscalar architectures, much processing 'power' is actually devoted to determining which instructions can be executed in parallel and re-ordering them accordingly. Furthermore, a large proportion of the CPU's die area is occupied by circuitry that handles speculative execution and out-of-order execution before any instruction reaches the execution step of the pipeline.

IA-64 attempts to circumvent these phases by allowing the software to tell the CPU directly which instructions should be executed in parallel. Time is thus saved in the pre-execution steps, and pipeline stalls are reduced. One major drawback of this approach is that much more of the burden of code optimisation is passed to the compiler, i.e. the software must be optimised before the CPU ever sees it. This means that existing code (all code written for 32-bit platforms) suffers a performance penalty when run on IA-64.
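To illustrate the kind of work this shifts onto the compiler, here is a hedged C sketch (not real IA-64 code) of a classic EPIC-style transformation: breaking a serial dependency chain into independent accumulators. The function names are illustrative only.

```c
/* Serial version: every addition depends on the previous one, so
 * the additions form a chain and cannot be issued in parallel. */
static long sum_serial(const long *a, int n) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Restructured version: four independent accumulators with no
 * dependencies between them. An EPIC compiler can explicitly mark
 * these additions as parallel and pack them into instruction
 * bundles; an IA-32 out-of-order core would instead have to
 * discover this independence itself at run time. */
static long sum_parallel(const long *a, int n) {
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)     /* leftover elements */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}
```

Both functions compute the same sum; only the second exposes the parallelism explicitly, which is exactly the job IA-64 hands to the compiler.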

Pipeline stalls in the form of missed branch predictions are also reduced. Rather than building on branch-prediction architecture, IA-64 actually executes all possible branches in parallel. This is called predication. The downside is that parallel execution units are used to execute instructions that will be discarded once the CPU determines which branch was the correct one. If an execution unit is 'speculatively' executing a 'predicated' branch that will eventually be ditched, it cannot be executing parallel instructions that will actually be used. It's a double-edged sword: apparently, it's better to waste execution power carrying out predication than to end up with a pipeline stall when it turns out that the CPU predicted the wrong branch.

The other key feature of IA-64 is speculative loading. Once again, this relies heavily on compile-time optimisation. The basic idea is that data is fetched from memory well before it is actually required, eliminating bottlenecks that arise from slow fetches. At compile time, the compiler identifies instructions that will require data to be fetched, then inserts speculative load instructions earlier in the code, so that the data can be fetched before it is actually used. At runtime, it is possible that when a speculative load instruction is reached, the data required does not yet exist in memory. In traditional CPU architectures, a fetch of data that does not exist would end in a runtime exception that, if not handled by the code (i.e. by a clever programmer who anticipated this memory hole), would cause the program to crash. With IA-64, the runtime exception does not occur at the time of the speculative load, but only later, should the data still be unavailable when it is actually required.
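This deferred-exception behaviour can be mimicked in C. The sketch below models IA-64's speculative load / check pair: the speculative load never traps, it just records a 'deferred fault' flag (standing in for Itanium's NaT bit), and the fault is only acted upon if the value is actually consumed. All names here (`spec_load`, `ld_s`, `chk_s`) are illustrative stand-ins, not a real API.

```c
#include <stddef.h>

/* Result of a speculative load: a value plus a deferred-fault
 * flag, mimicking Itanium's NaT ('Not a Thing') bit. */
typedef struct {
    int value;
    int deferred_fault;
} spec_load;

/* 'ld.s'-style speculative load: hoisted early by the compiler.
 * If the address is bad it does NOT trap - it just flags the
 * result as poisoned. (A NULL check stands in for a page fault.) */
static spec_load ld_s(const int *p) {
    spec_load r = { 0, 1 };
    if (p != NULL) {
        r.value = *p;
        r.deferred_fault = 0;
    }
    return r;
}

/* 'chk.s'-style check at the point of use: only now, if the
 * speculative load had faulted, do we take recovery action
 * instead of crashing. */
static int chk_s(spec_load s, int recovery_value) {
    return s.deferred_fault ? recovery_value : s.value;
}
```

In real IA-64 code the compiler hoists the speculative load far above the branch that guards the pointer and places the check at the original load site, so the fetch latency is hidden whenever the load turns out to be valid.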

Reflecting the shift from CPU-determined parallelism to compile-time-determined parallelism, the Itanium has nine execution units (seven integer, two floating point), but only a ten-stage pipeline.

Itanium block diagram

As indicated in the diagram, it also has a whopping 128 64-bit general purpose registers, as well as 128 82-bit floating point registers. These figures really are massive and should, in theory, banish the woes of finding somewhere to store data for a long, long time.

What about x86 instructions? Well, the IA-64 will support x86 instructions, but the new architecture is in no way optimised for them. In fact, the Itanium's switch into x86-compatibility mode is a bit of a kludge, and the result is poor x86 performance.

The first generation of IA-64 was the Itanium, which uses a three-level cache, runs with a 133MHz FSB and tops out at a core clock speed of 800MHz.

But how did it really perform?

When running true 64 bit software designed for the IA-64 architecture, it actually ran pretty well. When running older (well, not that old) 32 bit software, however, it performed woefully: under 15% of the speed of an equivalently clocked 32 bit P4! Oh dear...

Not surprisingly, the Itanium was a complete flop. As such, it earned itself a nickname: The Itanic.

Itanium II

The Itanium II first arrived in 2002, codenamed McKinley. It had a 200MHz FSB which was 128 bits wide and double-pumped (400 million transfers per second), allowing it to achieve a maximum bandwidth of 6.4GB per second. This was matched by the P4 in the desktop arena almost straight after.

The Itanium II still topped out at a disappointing 1GHz. While its performance at processing IA-32 x86 instructions was improved, the Itanium II still couldn't even keep up with a lower-clocked PII. Furthermore, the die area was massive, at 464 square millimetres, partly due to the level 3 cache being on-die. This meant a high production cost for Intel, since they could cut fewer than 50 of these chips from a standard wafer. For comparison, over three times as many Pentium 4 chips could be cut from the same wafer!

Various subsequent incarnations of the Itanium have come and gone, including the Madison, Hondo and Deerfield. They're not very exciting.

AMD x86-64 (Hammer) - the time of the Opteron and Athlon-64

Originally codenamed Hammer but now branded AMD64, x86-64 extends the 32 bit x86 architecture to 64 bits. This is a completely different approach to Intel's, who had claimed they would not be making a 64 bit CPU that natively executes the x86 instruction set.

At a glance, the figures for CPUs using the Hammer architecture appear much less impressive: there are only 16 general purpose registers. However, this partly reflects the totally different approach taken by AMD. The GP registers have been widened from 32 bits to 64, much as they were widened in the move from the 286 to the 386.

Like the Itanium, the Hammer has nine execution units. Of these, three are ALUs, three are address generation units (AGUs) and three are floating point units. The Hammer architecture exhibits more RISC traits than IA-64. It uses a slightly longer pipeline than the Itanium, at 12 stages. This is hardly surprising, since the early stages of the pipeline are employed in typical CISC-to-RISC operations, i.e. decoding CISC instructions into RISC-like micro-ops. Indeed, the Hammer can dispatch nine RISC operations (ROPs) per clock cycle, against the Itanium's six.

The Opteron

The first incarnation of the Hammer to be released was the Opteron (codenamed Sledgehammer), which arrived in April 2003. This CPU was aimed at the high-end server market.

It took off straight away and has proven to be a very powerful CPU. But where the Opteron particularly excels is in multiprocessor environments. Dual Opterons significantly outpace dual Xeons. This is partly because each Opteron has its own integrated memory controller, meaning that in a multiprocessor environment the CPUs do not have to share the same memory pipe.

Athlon 64

The Athlon 64 (originally codenamed Clawhammer) represented AMD's big push of 64 bit processing into the consumer desktop market. Rather than targeting 64 bit processing at servers only (as Intel had done with the Itanium), AMD released the Athlon 64 as a direct successor to the Athlon.

It is a somewhat slimmed-down version of the Opteron, and this is reflected in the price. While roughly the same size as its predecessor and initially using the same 0.13 micron fabrication process, the Athlon 64 used a Socket 754 interface and had a 200MHz FSB double-pumped to 400MHz. Even more impressive was the cache: 128kB of level 1 cache split equally between instructions and data, and a whopping 1MB of level 2 cache (over twice as much as the Athlon XP Barton chip)! All this cache meant that the first Athlon 64 was made up of nearly twice as many transistors as its predecessor, at nearly 110 million!

Another improvement was that the Athlon 64 now fully supports SSE2. Thus this incarnation of the Athlon series is no longer eclipsed by Intel in the streaming multimedia arena.

But more significantly, the Athlon 64 implements an on-die memory controller, just like the Opteron. The memory controller is no longer part of the Northbridge, meaning that the CPU can access memory directly rather than taking a journey through the relatively slow Northbridge. Not only does this reduce latency on memory access, it also means the memory controller can run at the full CPU clock speed. Of course, the memory itself can't run that fast, which is why the memory speed is still set at a fraction of the core clock speed. Early Athlon 64 CPUs used a 64 bit on-die controller.

As if this wasn't enough, the new Athlon 64 makes use of the new HyperTransport bus, which links the CPU with its subsystems. Traditionally, the FSB connected the CPU to the Northbridge, and the Northbridge would coordinate activity with all other subsystems, including the memory subsystem. Furthermore, interfacing with slower hardware subsystems (such as I/O) was carried out through a much slower link between the Northbridge and the Southbridge. But now the HyperTransport bus links the CPU to all other subsystems (except memory, which is connected directly through the on-die controller). When you consider that the bidirectional 16-bit HT bus runs at 800MHz double-pumped to 1600MHz (giving a total bandwidth of 3.2GB/sec each way), you can see that it is a major improvement on previous technology.

AMD also released the FX range, a top-spec version of the Athlon 64 aimed at hardcore gamers and extreme users. The first incarnation was called the FX-51 and plugged into a larger 940 pin socket. This chip doubled the width of the on-die memory controller to 128 bits. Subsequently, this change was also adopted in the 'standard' Athlon 64 range. This tends to be the pattern with the FX range: its new advances trickle down into the non-FX chips a little while later.

Later Athlon 64 CPUs (and the dual core X2 chips, as well as the newer FX range) used a larger Socket 939 interface and sported the dual channel 128 bit on-die memory controller.

Given AMD's decision to carry on down the x86 route, the Athlon 64 is fully backwards compatible with existing software. Hence users could switch from an Athlon to an Athlon 64 without needing any new software, and without having to worry about a performance hit when running their legacy applications. Furthermore, even 32 bit applications run faster on the Athlon 64 than on its predecessor. Given that most desktop Athlon owners are power-users and gamers, there's little reason not to consider the upgrade.

While the Hammer architecture is still based around the old x86 instruction set (albeit with the addition of several 64 bit instructions), optimisation is carried out by the CPU, independent of compile-time optimisation. The real advantage of the x86-64 architecture is that existing x86 code runs blisteringly fast on the Hammer, while it runs poorly on an Itanium.

Intel Resort to Plan B

Astonishingly, in 2004 Intel announced that they would be releasing a 64-bit CPU that natively supports x86. This is interesting, since Intel had claimed such a chip would never be on their roadmap. I don't think anyone - least of all Intel - realised how strongly the Athlon 64 would take off in the consumer market. Intel's Itanium was a complete flop, and they soon realised they would have to do a lot of backpedalling to get back into the 64 bit game... even if it did make them look like a bunch of bumbling fools in the process.

And so, in June 2004, Intel released their first 64 bit x86 chips using the new Extended Memory 64 Technology (EM64T), rather than IA-64. It amuses me that the acronym reads a bit like 'Emulated 64-bit Technology' - I say this because Intel have basically taken AMD64 and ported it to the P4E. Does anyone else find it ironic that Intel said they would never go down this road, and yet here we are with Intel using AMD's technology? Nobody saw that coming ten years ago.

The Future

64 bit computing is now well established. The next evolutionary step up the ladder is dual core CPUs.