Optimisation Principles

CISC vs RISC Architecture

The performance of a CPU is highly dependent on its internal architecture. The goal is to do as much 'work' in a given amount of time as possible. This can be achieved using several different approaches. One approach is to increase the 'power' of each individual instruction in the instruction set. This results in CISC architecture. Another approach is to attempt to reduce the clock cycle time, thereby increasing the clock speed of the CPU. To do this, the complexity of individual instructions must be reduced, leading to RISC architecture.

The x86 instruction set is a predominantly CISC instruction set. CISC stands for Complex Instruction Set Computing. Traditional theory states that CPUs can be made quicker by adding more and more complexity to the instructions in the instruction set. The aim was to perform as much work in a single instruction as possible. Hence instruction sets grew larger and more complicated. Early Intel CPUs such as the 8086 and 286 are considered pure CISC processors.

RISC theory arrived a little later. The Reduced Instruction Set Computing approach states that the best performance can be achieved by reducing the time taken to execute any given instruction. Rather than have complex instructions that require many clock cycles (more on this later) to complete, RISC chips use very simple instructions that can be executed in fewer clock cycles. Performance can then be improved by making the cycles shorter - i.e. by implementing a faster clock.

Thus, with RISC architecture, we increase CPU speed by attempting to keep each instruction simple enough so that it can be performed in a single clock cycle.

How do we make instructions simpler? Well, one approach is to limit the number of memory addressing modes available to instructions. CISC instructions can usually address memory in many different ways. This builds complexity into the instruction and also means that a given instruction op-code can be of variable size. RISC instructions, on the other hand, are usually limited to a single memory addressing mode. In fact, in a full RISC implementation, most instructions cannot access memory at all! Instead, a special set of instructions (called load and store instructions) is designed to read from and write to memory, transferring data to and from registers as required. The rest of the instruction set can only operate on register data or immediate data (i.e. data specified as part of the instruction itself).
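To make this idea concrete, here is a minimal sketch of a toy load-store machine, written in Python. The instruction format is entirely invented for illustration - it is not any real instruction set - but it shows how only load and store touch memory, while arithmetic works purely on registers:

    # Toy load-store machine. Only LOAD and STORE access memory;
    # ADD operates purely on registers. All names are invented.
    memory = {0x10: 5, 0x14: 7, 0x18: 0}   # word-addressed toy memory
    regs = [0] * 4                          # four general-purpose registers

    def run(program):
        for op, *args in program:
            if op == "LOAD":                # LOAD rd, addr : memory -> register
                rd, addr = args
                regs[rd] = memory[addr]
            elif op == "STORE":             # STORE rs, addr : register -> memory
                rs, addr = args
                memory[addr] = regs[rs]
            elif op == "ADD":               # ADD rd, ra, rb : registers only
                rd, ra, rb = args
                regs[rd] = regs[ra] + regs[rb]

    # What a CISC machine might express as one memory-to-memory add
    # becomes an explicit sequence of simple, uniform instructions:
    run([
        ("LOAD", 0, 0x10),
        ("LOAD", 1, 0x14),
        ("ADD", 2, 0, 1),
        ("STORE", 2, 0x18),
    ])
    print(memory[0x18])   # 12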

The result of this so-called load-store RISC architecture is that instructions are less complicated and, as a bonus, tend to be of fixed size. This makes performance-optimising strategies, such as pipelining, easier to implement.

In summary, the result of the RISC approach is that clock speeds are increased. As an added benefit, RISC-based chips require fewer transistors than CISC-based chips of similar performance, and are therefore cheaper to build. Also, since the CPU core die area is smaller for RISC chips, more die area can be devoted to performance-enhancing features, such as extra registers and larger caches. (To see comparisons of the number of transistors used in CPUs, check out the CPU History section.)

Where does CISC Stop and RISC Begin?

It should be noted that CISC and RISC are not clearly defined classes, but should rather be regarded as sets of CPU design principles. Thus, modern CPUs often exhibit traits of both architectures and cannot be categorised as either purely CISC or purely RISC.

Intel have been increasing their use of RISC technology since the Pentium chip. However, these chips must also be able to perform traditional x86 instructions. In this case, the x86 instructions are handled a little differently than on the older CISC chips, which executed x86 instructions natively.

During the decode step, one or more decode units in the RISC chip take the x86 instructions and convert them into simple fixed-length RISC-like micro-instructions, often called micro-ops. These micro-ops can then be handled just like any other RISC instruction; the CPU architecture can therefore utilise standard RISC optimising strategies (these will be described below) to improve performance and increase clock speed.
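As a rough illustration only (the instruction and micro-op formats below are invented, and real decoders are vastly more complicated), the decode step might conceptually work like this:

    # Conceptual sketch of x86-to-micro-op decoding. Formats are invented.
    def decode(x86_instruction):
        # A read-modify-write memory instruction becomes several
        # simple, fixed-format, RISC-like micro-ops:
        if x86_instruction == "ADD [mem], eax":
            return [
                ("uLOAD",  "tmp", "mem"),          # fetch the memory operand
                ("uADD",   "tmp", "tmp", "eax"),   # register-only add
                ("uSTORE", "tmp", "mem"),          # write the result back
            ]
        # A register-only instruction is already 'RISC-like':
        if x86_instruction == "ADD eax, ebx":
            return [("uADD", "eax", "eax", "ebx")]
        raise ValueError("unknown instruction")

    for micro_op in decode("ADD [mem], eax"):
        print(micro_op)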

Defining Performance

Remember that we cannot define the performance of a processor by clock speed alone. CPU manufacturers certainly exploit the ignorance of the consumer by conditioning us to believe that clock speed is the be-all and end-all of performance.

Instead, we should think of performance as the time it takes to perform a task. Let's define the time taken to perform some arbitrary task (such as a complex program) as T. Now, if the number of instructions required to perform this task is N, the average number of clock cycles per instruction is C and the clock speed of the CPU is S, then we can arrive at the following (though simplistic) formula:

T = N * C / S

Clearly then, the goal is to make T as small as possible. We can do this by minimising N and C, while maximising S. (For those of you who are not mathematically minded, please do not be put off by this formula. This really is simple stuff!)
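To see the formula in action, here is a quick Python evaluation. The numbers are made up purely for illustration:

    # T = N * C / S with invented numbers.
    N = 1_000_000_000    # instructions required for the task
    C = 2.0              # average clock cycles per instruction
    S = 500_000_000      # clock speed in Hz (500 MHz)

    T = N * C / S
    print(T)             # 4.0 seconds

Halving C, or doubling S, halves the time taken - which is exactly the lever each design philosophy pulls on.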

RISC-based processors tend to have a smaller value of C, since instructions are simpler than in CISC designs and therefore require fewer cycles to complete. The trade-off is that RISC chips tend to have a higher value of N (compared with CISC), since more instructions are required to perform the same task.

However, certain techniques can be applied which effectively reduce C to 1, such as pipelining, which will be discussed very soon.
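For example, here is a hypothetical comparison at a fixed clock speed. The instruction and cycle counts are invented, but they illustrate why reducing C matters so much:

    # Invented numbers: same task, same 500 MHz clock.
    S = 500_000_000

    # CISC: fewer instructions, but more cycles per instruction.
    T_cisc = 1_000_000_000 * 4.0 / S    # 8.0 seconds

    # RISC: ~30% more instructions, but pipelining brings C down to 1.
    T_risc = 1_300_000_000 * 1.0 / S    # 2.6 seconds

    print(T_cisc, T_risc)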

Although this discussion is very simplistic, it does strongly hint at how a CPU can be optimised. It would seem that if we can employ strategies to minimise C (such as pipelining), then CISC architecture should yield the highest performance, since its lower instruction count N would then dominate. However, it turns out that in practice, many performance-optimising principles (such as pipelining, once again) are more easily implemented using RISC architecture.

The Instruction:Clock Cycle Ratio

In the early days, each step of the instruction process (i.e. fetch, decode, address generate, execute and write back) would need one or more clock cycles. Therefore a single instruction could require five or more cycles to complete. Hence the instruction:clock cycle ratio for such a CPU was only 1/5 = 0.2 or less! It doesn't take a genius to work out that if you can improve this ratio, you will get a faster CPU.

General Approaches

As mentioned earlier, there are a few general approaches to CPU optimisation. These include: increasing the power of instructions (the CISC approach), decreasing the cycle time (predominantly RISC), and, perhaps most importantly, decreasing the number of cycles per instruction. In practice, a good balance of these approaches proves most cost effective.

To begin with, we must move away from the single internal CPU bus view (see Instruction Process) that I've been referring to so far. Clearly, with a single bus, only one data transfer is possible at any given time. Furthermore, this bus design requires the use of temporary storage registers. By implementing a design with multiple internal buses, we can transfer more than one piece of data within the CPU at any given time. This design also circumvents the need for temporary registers.

Secondly, as already mentioned, accessing main memory is slow compared to all other phases of the instruction process. Thus we need to minimise the time taken for this step, or at least limit its presence as a bottleneck.

One way to do this is to overlap the fetch phase with instruction execution. The CPU can be made to fetch instructions from memory while it is executing instructions that have already been fetched. This is a primitive form of pipelining, which will be explained in much more detail later.

This approach is called an instruction prefetch and is typically achieved by way of an instruction unit. The job of this unit is to determine (or predict) which instruction will be executed next, fetch that instruction and queue it up (into the instruction queue), ready for when the execution unit (whether integer unit or FPU) is ready for it.
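Here is a toy sketch of the idea in Python, using an invented six-instruction program and a deliberately simple one-fetch-per-cycle model. The instruction unit fills the queue while the execution unit drains it, so fetching and execution overlap:

    from collections import deque

    program = [f"insn{i}" for i in range(6)]   # invented instruction stream
    queue = deque()                            # the instruction queue
    pc = 0                                     # next instruction to fetch
    cycle = 0

    while pc < len(program) or queue:
        cycle += 1
        # Execute stage: consume an instruction fetched in an earlier cycle.
        if queue:
            print(f"cycle {cycle}: executing {queue.popleft()}")
        else:
            print(f"cycle {cycle}: execution stalls (queue empty)")
        # Fetch stage: in parallel, the instruction unit fetches ahead.
        if pc < len(program):
            queue.append(program[pc])
            pc += 1

    # Six instructions finish in seven cycles; fetching and executing
    # strictly one after the other would have taken twelve.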

This process can be optimised by storing instructions and data in a cache. This will be explained in detail in the next section. But for now, it is sufficient to think of the cache as a small, local store of instructions and data which can be accessed very rapidly. By storing data here, the CPU can avoid main memory interactions.
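As a rough sketch of why a cache pays off (the cycle costs below are invented), compare the cost of a loop that re-reads the same few addresses with and without a small cache in the way:

    MAIN_MEMORY_COST = 50   # cycles per main-memory access (made up)
    CACHE_COST = 1          # cycles per cache access (made up)

    main_memory = {addr: addr * 2 for addr in range(1024)}
    cache = {}              # toy cache: a simple dictionary
    cycles = 0

    def read(addr):
        global cycles
        if addr in cache:                # hit: the fast path
            cycles += CACHE_COST
            return cache[addr]
        cycles += MAIN_MEMORY_COST       # miss: fetch from main memory...
        cache[addr] = main_memory[addr]  # ...and keep a local copy
        return cache[addr]

    for _ in range(100):                 # a loop re-reading three addresses
        for addr in (0, 4, 8):
            read(addr)

    print(cycles)   # 3 misses + 297 hits = 447 cycles, versus 15,000 uncached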

Before we go on to specific optimisation strategies, I will finally mention the use of multiple integer units and FPUs. By having more than one integer unit in combination with one or more floating point units, we can effectively run more than one instruction at once. And when I say 'more than one instruction at once', I'm not simply referring to the overlapping of instruction phases that we see in pipelining, but in fact complete separation of execution processes. This approach is called superscalar architecture.
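To illustrate the idea (with an invented instruction format and a deliberately simplified dependency check), here is a toy dispatcher that issues two independent instructions to two execution units in the same cycle:

    # Invented three-operand format: (op, destination, source, source).
    instructions = [
        ("ADD", "r1", "r2", "r3"),
        ("SUB", "r4", "r5", "r6"),   # independent of the ADD above
        ("MUL", "r7", "r1", "r4"),   # depends on r1 and r4
        ("ADD", "r8", "r8", "r8"),   # independent of the MUL
    ]

    def reads(insn):  return set(insn[2:])
    def writes(insn): return {insn[1]}

    cycle, i = 0, 0
    while i < len(instructions):
        cycle += 1
        issued = [instructions[i]]
        i += 1
        # Issue a second instruction this cycle only if it neither reads
        # nor writes anything the first one writes (no data dependency):
        if i < len(instructions):
            nxt = instructions[i]
            if not (writes(issued[0]) & (reads(nxt) | writes(nxt))):
                issued.append(nxt)
                i += 1
        print(f"cycle {cycle}: issued {issued}")

    # Four instructions complete in two cycles instead of four.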


The following sections look at the various performance-enhancing strategies that can be used to improve the instruction:clock cycle ratio and/or increase clock speed. The next page takes a detailed look at Pipelining.