Chapter 3: Pipelining
Department of Computer Science
The University of Chicago
Last modified: Wed Feb 18 09:49:15 CST
Outline
The information in the text is pretty clear. I have condensed it into
a conceptual outline, in a logical order a bit different from the
text's. I recommend that you read the text in order, but
correlate each section to my outline. You may skim the material in
3.7. Make sure that you know what additional complications arise with
floating point operations, and other relatively slow instructions, but
you don't need to assimilate the material in detail. I will grill you
(in assignment and exams) on 3.1-3.6. I'll expect you to be able to do
the quantitative reasoning in 3.1-3.5, to explain the issues in 3.6
thoroughly, and to show a general appreciation of the added
complications in 3.7. Read 3.8-3.11 to help assimilate the earlier
material, and 3.12 for fun.
- I. Pipelining is a technique for speeding up CPU performance with a
given instruction set. Pipelining is normally transparent to the
functionality of programs, and affects only performance.
- II. How to design a pipeline.
- Start with an unpipelined CPU.
- Divide each instruction execution into segments (``stages'')
taking equal time, using a small number (usually 1) of clock
cycles per segment. The number of segments/stages is called the
``depth'' of the pipeline.
- Divide the circuitry in the CPU datapath into sections
corresponding to the execution stages. Keep the incrementing
of the PC entirely in the first section.
- Add memory (registers or latches) for every wire that goes
from one section to the next.
- Find and correct bugs (deal with ``hazards'') with
additional circuitry.
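The design recipe above can be sketched as a toy model in Python. The stage functions and latch layout here are illustrative stand-ins, not a real datapath: each stage reads the latch before it and fills the latch after it, and all latches update together on each clock edge.

```python
# Toy model of the pipeline-design recipe: split the work into stage
# functions, then add a latch (register) between consecutive stages.

def make_pipeline(stages):
    """stages: list of one-argument functions; returns a clock stepper."""
    latches = [None] * (len(stages) + 1)  # latches[0] is the input latch

    def clock(new_input=None):
        # On a clock edge every stage consumes the latch before it and
        # fills the latch after it; computed right-to-left so each stage
        # sees the value from the *previous* cycle, as real latches do.
        for i in reversed(range(len(stages))):
            latches[i + 1] = stages[i](latches[i]) if latches[i] is not None else None
        latches[0] = new_input
        return latches[-1]  # whatever leaves the last stage this cycle

    return clock

# A 3-deep "pipeline" of arithmetic steps standing in for stages.
clock = make_pipeline([lambda x: x + 1, lambda x: x * 2, lambda x: x - 3])
outputs = [clock(v) for v in [10, 20, 30, None, None, None]]
# outputs -> [None, None, None, 19, 39, 59]: each input takes 3 cycles
# of latency, but once full, one result emerges per cycle.
```

Note that the first result only appears after three cycles (the latency), while later results arrive one per cycle (the throughput), which is the whole point of section IV below.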
- III. How a pipeline executes.
- A. Under ideal conditions, a pipeline of depth d
executes d instructions simultaneously: the dth
stage of instruction 1 is simultaneous with the (d-1)th
stage of instruction 2, the (d-2)th stage of instruction 3,
..., the 1st stage of instruction d.
- B. When full simultaneous execution of d stages doesn't
work (for reasons given below), ``stall'' one or more
instructions in the pipeline, inserting ``bubbles'' amounting
essentially to null operations.
- C. Some pipeline designs execute instructions that may need to
be skipped (e.g., after conditional branches). Such pipelines
occasionally have to cancel a partially executed instruction in
the pipeline. If the cancelled instruction has modified any sort
of memory or machine state, the modification must be
undone.
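A minimal sketch of this execution pattern (illustrative Python, not from the text): under ideal conditions instruction i runs stage s in cycle i + s, and each stall slips an instruction, and all of its successors, by one cycle.

```python
# Sketch of pipelined execution (III.A-B): with depth d, instruction i
# enters stage s at cycle i + s, so up to d instructions overlap.
# A stall delays an instruction and everything behind it by one cycle,
# leaving a one-cycle bubble in the pipeline.

def start_cycles(n_instructions, depth, stalls=None):
    """Return, per instruction, the cycle in which each stage runs.
    stalls maps an instruction index to the number of extra cycles it
    waits before issuing (one bubble per stall cycle)."""
    stalls = stalls or {}
    schedule, delay = [], 0
    for i in range(n_instructions):
        delay += stalls.get(i, 0)
        schedule.append([i + delay + s for s in range(depth)])
    return schedule

ideal = start_cycles(4, 3)
# Instruction 0 runs its stages in cycles [0, 1, 2]; instruction 3 in
# [3, 4, 5]: 4 instructions finish in 4 + 3 - 1 = 6 cycles.
stalled = start_cycles(4, 3, stalls={2: 1})
# Instruction 2 waits one cycle, so it and instruction 3 each slip by one.
```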
- IV. Performance of a pipeline.
- A. Latency, throughput, overhead.
- Pipelining keeps the latency of an individual
instruction execution the same, or makes it worse.
- Pipelining is intended to improve CPU
throughput, which in turn improves the latency
(total execution time) of programs that execute many
instructions.
- ``Overhead'' in pipelining usually refers to the extra
circuitry and delay involved in a pipelined CPU (including
extra clock skew due to the extra circuitry). If CPU work is
conceived as the execution of separate instructions, one
after another, delays from stalling and cancelling
instructions are another sort of overhead.
- B. Calculating pipeline throughput improvement.
The speedup attributed to pipelining depends, of course, on
the unpipelined standard of comparison. We assume that the
unpipelined and pipelined architectures have the same
instruction set and the same number of clock cycles in the
latency of each instruction, and that there is no
cancellation. SCPI stands for the average number of
Stall Cycles Per Instruction in the pipelined architecture.
Throughput pipelined         Clock cycle unpipelined       depth
----------------------  =  -----------------------  *  --------
Throughput unpipelined       Clock cycle pipelined      1 + SCPI
If there are cancellations, the average time per uncancelled
instruction wasted in cancellation must be added in the
denominator along with SCPI.
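The formula transcribes directly into code; the example numbers below are made up for illustration.

```python
# The speedup formula from IV.B, directly transcribed. Both cycle-time
# arguments must use the same units (e.g. nanoseconds).

def pipeline_speedup(cycle_unpiped, cycle_piped, depth, scpi):
    """Throughput(pipelined) / Throughput(unpipelined)."""
    return (cycle_unpiped / cycle_piped) * depth / (1 + scpi)

# Example (illustrative numbers): a 5-stage pipeline whose clock runs
# slightly slower than the unpipelined per-cycle clock (2.2 ns vs
# 2.0 ns, due to latch delay and clock skew), averaging 0.4 stall
# cycles per instruction.
speedup = pipeline_speedup(2.0, 2.2, 5, 0.4)
# speedup is about 3.25x, well short of the ideal 5x.
```

With a perfect clock and no stalls the speedup equals the depth, which is why both the overhead terms and SCPI matter so much in practice.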
- V. Hazards.
- A. Types of hazards.
- 1. ``Structural hazards'' occur when more than one
instruction needs the same circuitry at a given time.
- a. Within the CPU.
The obvious place for structural hazards is in the
CPU, e.g. if the same ALU is used for address calculation
as well as data arithmetic. Conflicts within the CPU are
the result of failure to complete the design step of
dividing the circuitry into sections.
- b. Outside the CPU.
There may also be conflicts outside of the CPU,
e.g. if more than one instruction needs to access memory
at the same time. Such a memory conflict is easy to
create, since the first stage of each instruction fetches
the instruction from memory, and a later stage may do a
load or a store.
- c. Worth removing?
In principle, all structural hazards can be removed
by additional circuitry, but the cost of the extra
circuitry (which may require a longer clock cycle as
well as the space and design costs) is not always
justified.
- 2. ``Data hazards'' occur when the interleaving of
instructions causes a write to be executed out of order. It may
be a write to register or to memory, but register conflicts are
the most common because of register allocation
patterns. Whenever two references to the same storage are
reversed, and at least one of them is a write, we get the wrong
result.
- 3. ``Control hazards'' occur when jumps or branches take
control away from the instructions in the pipeline.
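The condition in A.2, two reversed references to the same storage with at least one write, can be sketched as a small check. The instruction encoding below and the RAW/WAR/WAW names are standard terminology, introduced here for illustration rather than taken from the outline above.

```python
# Sketch of the data-hazard condition from V.A.2. Instructions are
# modeled abstractly as sets of register names read and written.

def data_hazard(earlier, later):
    """earlier/later: dicts with 'reads' and 'writes' register sets.
    Returns the classic hazard names that apply if the pipeline lets
    'later' reference shared storage before 'earlier' has finished."""
    hazards = []
    if earlier["writes"] & later["reads"]:
        hazards.append("RAW")   # read-after-write: read happens too early
    if earlier["reads"] & later["writes"]:
        hazards.append("WAR")   # write-after-read: write overtakes the read
    if earlier["writes"] & later["writes"]:
        hazards.append("WAW")   # write-after-write: writes land out of order
    return hazards

# ADD R1,R2,R3 followed by SUB R4,R1,R5: SUB may read R1 before ADD has
# written it back -- the common RAW case.
add = {"reads": {"R2", "R3"}, "writes": {"R1"}}
sub = {"reads": {"R1", "R5"}, "writes": {"R4"}}
# data_hazard(add, sub) -> ["RAW"]
```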
- B. Detecting hazards.
Hazard detection requires circuitry, and constitutes part
of the overhead of pipelining. I was surprised that all the
equality tests required for hazard detection are considered a
reasonable amount of overhead circuitry for pipelining, but I
guess that's just a failure of my intuitions about circuit
size.
- C. Surviving hazards.
- 1. Stall. A stall is the only solution to a structural
hazard, and a long enough stall works for all hazards.
- a. When a hazard is detected in the Instruction Fetch
stage, the usual way to stall is to insert a null
operation as the next instruction in the pipeline, and
suppress incrementing the PC. The null operation
``bubbles'' through the pipeline. This bubble adds
directly to that unpleasant SCPI term in the
denominator of the speedup formula.
- b. If a hazard is detected after the first stage, then
there are instructions already in the pipeline that must
be delayed. I'm not sure whether this is done in real
machines or not: I haven't found guidance in the text
yet. Besides some sort of disabling/enabling lines
running backwards from the circuitry for each stage to
the previous, this might require more stable interstage
storage. Does anyone know more about late initiation of
stalls?
- 2. Cancel. The impact of a branch control hazard is only
known after the branch condition is tested. Unless we stall
during every branch instruction, which is costly because it
usually requires more than one stall cycle, there are
invalid instructions in the pipeline, and they must be
cancelled. This may be more costly than stalling,
particularly if a cancelled instruction can modify a
register before the cancellation is known.
- 3. Forward. Data hazards may sometimes be resolved by
forwarding data from one stage of the pipeline to another,
bypassing the usual register write/read. This requires extra
circuitry, and it only works when a read and a write occur
in the same machine cycle, but it might be accomplished
without a speed penalty.
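A sketch of the forwarding idea in C.3 (the latch and register names here are illustrative): the ALU operand selection prefers a value still sitting in an interstage latch over the stale copy in the register file.

```python
# Sketch of forwarding: the ALU input mux checks the interstage latches
# for a newer value of the operand register before falling back to the
# register file, bypassing the usual register write/read.

def alu_operand(reg, regfile, ex_mem_latch, mem_wb_latch):
    """Pick the newest available value for register 'reg'. Each latch is
    None or a dict {'dest': reg_name, 'value': result}."""
    if ex_mem_latch and ex_mem_latch["dest"] == reg:
        return ex_mem_latch["value"]   # forwarded from the ALU output
    if mem_wb_latch and mem_wb_latch["dest"] == reg:
        return mem_wb_latch["value"]   # forwarded from the memory stage
    return regfile[reg]                # no hazard: normal register read

regfile = {"R1": 0, "R2": 7}
# An instruction writing R1 just left its ALU stage; its result (42)
# has not reached the register file, but the next instruction gets it.
val = alu_operand("R1", regfile, {"dest": "R1", "value": 42}, None)
# val -> 42, with no stall cycle spent.
```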
- D. Cost of hazards.
- 1. Structural and data hazards are often resolved by
stalling for a single machine cycle.
- 2. Control hazards may easily lead to 3 wasted machine
cycles. Control hazards are usually the biggest performance
worry for pipeline designers and compiler writers.
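A rough model of that cost (the 20% branch frequency below is a made-up illustration): wasted branch cycles land directly in the SCPI term of the speedup formula in IV.B.

```python
# Rough cost of control hazards (V.D.2): if a fraction of instructions
# are branches and each one wastes some cycles, the waste shows up as
# stall cycles per instruction.

def branch_scpi(branch_fraction, wasted_cycles):
    """Average stall cycles per instruction due to branches alone."""
    return branch_fraction * wasted_cycles

scpi = branch_scpi(0.2, 3)          # 20% branches, 3 wasted cycles each
effective_speedup = 5 / (1 + scpi)  # depth-5 pipeline, equal clock cycles
# 0.6 stall cycles per instruction cuts the ideal 5x speedup to 3.125x,
# which is why branches dominate the pipeline designer's worries.
```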
- VI. Impact of pipelining on compilers.
Compilers producing code for pipelined machines typically
re-order instructions to fill delay slots. This is usually an
optimization step after initial code generation. By interleaving
independent arithmetic operations, compilers can avoid most delays
from data hazards and structural hazards. With the delayed branch
in VII E below, compilers can avoid most delays from control
hazards as well. The usual approach is to first compile naive
code, then find the delay cycles, then look for instructions that
can move to cover the delays.
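The compile-then-schedule step can be sketched as follows. The three-field instruction format and the single safety check are simplifications invented for illustration; a real scheduler tracks many more dependences.

```python
# Sketch of delay-slot filling: after naive code generation, scan
# backwards for an instruction that nothing after it depends on, and
# move it into the branch's delay slot (which always executes).

def independent(instr, following):
    """True if no instruction in 'following' reads or overwrites
    instr's destination, and instr reads nothing they write."""
    _, dest, sources = instr
    for _, d2, s2 in following:
        if dest in s2 or d2 in sources or (dest is not None and dest == d2):
            return False
    return True

def fill_delay_slot(code):
    """code: list of (op, dest, sources) ending with a branch followed
    by a 'nop' delay slot; move the nearest safe instruction into it."""
    assert code[-1][0] == "nop"
    for i in range(len(code) - 3, -1, -1):
        if independent(code[i], code[i + 1:-1]):
            return code[:i] + code[i + 1:-1] + [code[i]]
    return code  # nothing safe to move: the nop (bubble) stays

naive = [
    ("lw",   "R1", ("R6",)),   # independent of everything below it
    ("add",  "R4", ("R2", "R3")),
    ("beqz", None, ("R4",)),   # branch reads R4, so add cannot move
    ("nop",  None, ()),        # unfilled delay slot
]
scheduled = fill_delay_slot(naive)
# scheduled order: add, beqz, lw -- the load now covers the delay.
```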
- VII. Impact of pipelining on instruction set design.
If we expect pipelined circuitry when we design the
instruction set, we will choose instructions that favor
pipelining.
- A. Simple and uniform instructions.
- B. 2 operands rather than 3.
- C. Load/store.
- D. Consistent location of operands independent of op code.
- E. Delayed branch instructions, or condition codes (a
delayed branch is a lot like an implicit condition code).
A simple pipeline for DLX
Here is a table showing the
5 stages of execution for pipelining DLX instructions. The arrays Regs
and Mem and the single register PC are global, and all other values
are local to each instruction. Wherever a global variable is written
in stage i, and also read or written in a stage <i, hazards will
occur. They are data hazards if the global variable in question is
Regs or Mem, control hazards if it is PC. There are four different
pairs of entries in the table above that display hazards. Make sure
that you find all four, and understand precisely what the danger
is. One of them is a bit subtle, and I haven't seen it mentioned in
the text.
Think of two different reasons why the idle stage for ALU
instructions is stage 4 instead of stage 5.
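The five-stage breakdown can be recorded as a small Python table. This is a sketch following the usual Hennessy and Patterson DLX conventions; the stage names and read/write sets are my reconstruction, not quoted from the text.

```python
# The standard DLX five-stage breakdown as data: which of the globals
# (Regs, Mem, PC) each stage reads and writes, taken across the main
# instruction classes. Reconstructed from the usual conventions.

STAGES = {
    1: {"name": "IF",  "reads": {"PC", "Mem"}, "writes": {"PC"}},  # fetch; PC+4
    2: {"name": "ID",  "reads": {"Regs"},      "writes": set()},   # decode; read registers
    3: {"name": "EX",  "reads": set(),         "writes": set()},   # ALU / address / branch test
    4: {"name": "MEM", "reads": {"Mem"},       "writes": {"Mem", "PC"}},  # load/store; taken branch sets PC
    5: {"name": "WB",  "reads": set(),         "writes": {"Regs"}},  # write result register
}

def hazard_pairs(stages):
    """Every (global, description) pair where a global written in a late
    stage is also read or written in an earlier stage -- i.e. by a later
    instruction that reaches that stage first."""
    pairs = []
    for wi, w in stages.items():
        for g in sorted(w["writes"]):
            for ri, r in stages.items():
                if ri >= wi:
                    continue
                if g in r["reads"]:
                    pairs.append((g, "written in %d, read in %d" % (wi, ri)))
                if g in r["writes"]:
                    pairs.append((g, "written in %d, written in %d" % (wi, ri)))
    return pairs

pairs = hazard_pairs(STAGES)
# With this reconstruction the search turns up four conflicts, matching
# the count mentioned above; the Mem written-in-4 vs read-in-1 pair (a
# store racing a later instruction fetch) is likely the subtle one.
```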
Miscellaneous variant views of pipelining issues
- Pipelining is a sort of parallel computation. The interleaving
of operations in parallel computation is notorious for causing
surprises. Notice that pipelining interleaves parts of
instructions, so it is not safe to think of individual instructions
as atomic events. Most machines have other sources of timing
complications, such as interrupts, so the precise behavior of the
pipeline is not determined by the instructions in it. Data hazards
are even worse than you think at first. If an unresolved hazard were
guaranteed to do operations in the same wrong order each
time, at least we could discover that order and debug. But, with
interrupts or other timing complications, a data hazard may lead to
different results on different runs, depending on the timing of disk
and network accesses.
- The performance impact of pipelining is more subtle than you
think at first, since compilers will naturally try to take advantage
of the known behavior of the pipeline. Fortunately, this effect
typically acts in the designer's favor.
- The resolution of hazards with stalls and cancellations in
effect reintroduces some of the latency in individual instructions
as a degradation of throughput.
- Recall the four principles for improving performance from the
discussion of Chapter 1:
- Identify and improve the bottleneck.
- Cater to common patterns of behavior.
- Overlap latency of different steps.
- Keep the bottleneck busy.
How do pipelining, forwarding, delayed branches, etc. relate to
these principles?
Maintained by Michael J. O'Donnell, email:
odonnell@cs.uchicago.edu