
Com Sci 222/322
Machine Organization

Lectures

Chapter 8: Multiprocessors

Department of Computer Science
The University of Chicago



Last modified: Wed Mar 10 16:42:06 CST 1999




Reading Guide

Study 8.1 closely. Look over 8.2 briefly to get an idea of the sorts of applications that exploit multiprocessing. Read 8.3 closely, and pay particular attention to the snoopy cache idea. Read only the first subsection of 8.4, pages 677-679, to understand the problems with ``distributed shared memory.'' The key element in DSM as presented here is not the fact that memory is distributed, but rather that addressing is logically centralized. 8.5 is important, but synchronization is also covered in the OS course, so we'll just skim it. 8.6 is extremely important for the future, since the best multiprocessor designs will probably move away from essentially sequential models of memory behavior, but we'll just define the key issues, and skip the details. Read through 8.8-8.10 to help cement the ideas, and 8.11 for fun.

General Comments

The basic ideas of pipelining and memory hierarchy appear likely to remain key components of good computer design for some time, and the details appear to be converging, driven by a few principles of design. By contrast, multiprocessing is in a more confusing and unstable state. The value of multiprocessing is tied up very intricately with the structure of application programs (Section 8.2). There are several substantially different approaches to connecting many processors, and it's very hard to choose among them. There will probably be radical changes as we move from handfuls of processors toward millions.

The relationship between multiprocessor architecture and parallel programming is also confusing. Uniprocessor operating systems are often multithreaded, which means that they simulate parallel computation on a single processor by interleaving work on different processes/threads. This paradoxical situation results from the fact that parallel programs have a better structure than purely sequential ones for representing the computations in operating systems. So, we have parallelism as well as sequential computation in hardware and in software, but the software and hardware structures often don't match well. I expect that there will be several more rounds of adjustment on both sides---hardware design adjusting to fit the latest ideas in operating systems and vice versa---before we converge on a stable structure.

We have already noticed interactions between computer architecture and both compilers and operating systems. Database design has already dealt with issues of data consistency, and there is likely to be some interesting importation of ideas about transactions from database systems into multiprocessor design.

Coherence and Consistency

I find the discussion of coherence and consistency in 8.3 and 8.6 quite confusing and somewhat unsatisfying, so I'll try to reorganize the essential ideas here. It may be that the text assumes you have encountered parallel programming elsewhere, but I expect that many students have not seen parallel programming before.

Why is it so tricky?

In a purely sequential computation, we rely on a simple sounding and intuitively appealing definition of memory behavior:

The value returned by READ M[a] is the value v from the most recent previous execution of WRITE v M[a].
Unfortunately, such a definition has no clear meaning when the READs and WRITEs may be in separate processes executing in parallel. It turns out that the simple sounding notion of ``most recent previous'' depends on purely sequential processing for its meaning.

In order to apply the definition above to parallel computation, we need to assign a precise time to every READ and WRITE operation. But exactly when does a READ occur: when the processor issues the address, when the request reaches the memory, or when the value arrives back at the processor? When does a WRITE occur, as the new value propagates through buffers and caches at different moments? And if a READ in one processor overlaps in time with a WRITE in another, which one is ``previous''?

I'm confident that the questions above have no satisfactory answers, so our notion of correct memory behavior in a parallel computation must not rely on a perfect time sequence of operations. There is an amusing qualitative analogy to the behavior of time and space in the physical theory of relativity.

Sequentializability instead of sequence

The simplest idea for coping with the lack of a sequence is to require sequentializability. When several processes compute in parallel, we require that the final results are the same as some sequential interleaving of the sequences of operations in the individual processes. So, if we have processes P, Q, R executing operations P1, P2, Q1, Q2, etc., then we demand a result equivalent to R1 R2 P1 Q1 Q2 Q3 R3 Q4 R4 P2 ... or to Q1 R1 Q2 Q3 P1 P2 P3 P4 R2 ..., or any other interleaving. We do not require that the operations actually interleave---e.g., Q1 and R1 might go on simultaneously---as long as the visible results (output) are equivalent to some interleaving.

The only subtlety in the idea of sequentializability is the definition of ``operation.'' The particular operations that interleave are called ``atomic'' operations. What operations should be atomic? For computer architecture, it is most natural to view each individual read or write to memory as atomic. An application might like to think that every machine instruction (which might have more than one memory access) is atomic, or even every high-level instruction.

The requirement of ``coherence'' in the text is essentially the requirement that all accesses to a single memory address are sequentializable. Consider the two programs

x = 1;    x = 0;
v = x;    w = x;
A coherent computation can produce v=1 and w=0, by the interleaving
x = 1;
v = x;
          x = 0;
          w = x;
It can also produce v=w=1, by the interleaving
          x = 0;
x = 1;
v = x;
          w = x;
It can produce v=w=0, by the interleaving
x = 1;
          x = 0;
v = x;
          w = x;
But, no coherent computation can produce v=0 and w=1: that result would require either the order x = 1; w = x; x = 0; v = x; or the order x = 0; v = x; x = 1; w = x;, and neither is an interleaving of the two programs (the first runs w = x before x = 0, the second runs v = x before x = 1). Notice that v and w are used just to fix the values of x for clarity in describing the results. It is only the interleaving of READs and WRITEs on x that is restricted by coherence.

Sequentializable consistency applies to the sequencing of operations on different addresses as well. Consider the two programs

x = 1;     y = 2;
y = 1;     x = 2;
w = y;     v = x;
Sequentializability allows v=2 and w=2, by the interleaving
x = 1;
y = 1;
           y = 2;
           x = 2;
           v = x;
w = y;
We can also get v=2 and w=1, by the interleaving
x = 1;
           y = 2;
y = 1;
           x = 2;
           v = x;
w = y;
But, we cannot get v=1 and w=2. For v=1, the write x = 1 would have to execute after x = 2 (and hence after y = 2), since x = 2 precedes v = x in the second program; for w=2, the write y = 2 would have to execute after y = 1. But the first program executes x = 1 before y = 1, so there is no sequential interleaving of the programs that executes y = 2 after y = 1 and also x = 1 after y = 2.
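These claims are small enough to check mechanically. The sketch below (in C; the data representation and names are my own invention, not anything from the text) enumerates every interleaving of two short straight-line programs and records which final values of v and w can arise. Running it lists exactly the combinations described above: v=0, w=1 never appears for the coherence example, and v=1, w=2 never appears for the consistency example.

#include <stdio.h>
#include <string.h>

/* Brute-force check of the sequentializability claims above: enumerate
   every interleaving of two straight-line programs and record the final
   values of v and w.  The representation is invented for this sketch. */

typedef struct {
    char dst;   /* variable assigned: 'x', 'y', 'v', or 'w' */
    char src;   /* 0 for a constant assignment, else the variable read */
    int  val;   /* the constant, used when src == 0 */
} op;

static int seen[3][3];                     /* seen[v][w]; values are 0..2 */

static int  get_var(const int vars[], char c)   { return vars[c - 'a']; }
static void set_var(int vars[], char c, int v)  { vars[c - 'a'] = v; }

static void step(int vars[], const op *o) {
    set_var(vars, o->dst, o->src ? get_var(vars, o->src) : o->val);
}

static void run(const op *P, int np, int ip,
                const op *Q, int nq, int iq, const int vars[26]) {
    if (ip == np && iq == nq) {            /* one complete interleaving */
        seen[get_var(vars, 'v')][get_var(vars, 'w')] = 1;
        return;
    }
    int copy[26];
    if (ip < np) {                         /* take the next operation of P */
        memcpy(copy, vars, sizeof copy);
        step(copy, &P[ip]);
        run(P, np, ip + 1, Q, nq, iq, copy);
    }
    if (iq < nq) {                         /* take the next operation of Q */
        memcpy(copy, vars, sizeof copy);
        step(copy, &Q[iq]);
        run(P, np, ip, Q, nq, iq + 1, copy);
    }
}

static void report(const char *name,
                   const op *P, int np, const op *Q, int nq) {
    int vars[26] = {0};
    memset(seen, 0, sizeof seen);
    run(P, np, 0, Q, nq, 0, vars);
    printf("%s:\n", name);
    for (int v = 0; v <= 2; v++)
        for (int w = 0; w <= 2; w++)
            if (seen[v][w])
                printf("  v=%d w=%d possible\n", v, w);
}

int main(void) {
    /* Coherence example:    x = 1; v = x;        ||  x = 0; w = x;        */
    op P1[] = { {'x', 0, 1}, {'v', 'x', 0} };
    op Q1[] = { {'x', 0, 0}, {'w', 'x', 0} };
    /* Consistency example:  x = 1; y = 1; w = y; ||  y = 2; x = 2; v = x; */
    op P2[] = { {'x', 0, 1}, {'y', 0, 1}, {'w', 'y', 0} };
    op Q2[] = { {'y', 0, 2}, {'x', 0, 2}, {'v', 'x', 0} };

    report("coherence example", P1, 2, Q1, 2);
    report("consistency example", P2, 3, Q2, 3);
    return 0;
}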

Strengthening and weakening consistency

Consistency, in the sense of sequentializability of memory accesses, is both too weak and too strong. It is too weak for a lot of programming needs, and too strong for the best possible parallel performance.

Section 8.5 explains how locks and barriers are used by software to enforce stronger restrictions than sequentializability. The implementation of locks etc. depends on some sort of atomic hardware operations beyond READ and WRITE, such as the atomic exchange of two values. When you study operating systems and/or parallel programming techniques, you will encounter other software synchronization primitives, including semaphores, event queues, critical regions, and monitors. All of these essentially expand atomicity to programmer-specified groups of operations. They also weaken the notion of atomicity somewhat so that an atomic operation on resource A does not have to interact atomically with operations on resource B.
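For a concrete, modern illustration of building a lock on an atomic exchange, here is a minimal sketch using the C11 <stdatomic.h> library; the type and function names are invented for the example, and a real lock would also worry about fairness and about spinning politely.

#include <stdatomic.h>

/* A spin lock built on an atomic exchange: 0 means free, 1 means held.
   Initialize with  spinlock l = { 0 };                                  */
typedef struct { atomic_int held; } spinlock;

void spin_acquire(spinlock *l) {
    /* Atomically swap in 1.  If the old value was already 1, someone
       else holds the lock, so keep trying. */
    while (atomic_exchange(&l->held, 1) == 1)
        ;  /* spin */
}

void spin_release(spinlock *l) {
    atomic_store(&l->held, 0);   /* hand the lock back */
}

The acquire loop provides exactly the programmer-specified atomicity discussed above: everything done between spin_acquire and spin_release on the same lock behaves as one atomic group with respect to other users of that lock, but not with respect to code using some other lock.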

Sequentializability is a crude, and overly strong, requirement in many cases. This should not be surprising, since sequentializability enforces equivalence with some sequential computation, while the whole point of parallelism is to avoid sequential bottlenecks. Perhaps consistency criteria based directly on parallel concepts can support useful software while allowing better parallel performance than we get with sequentializability. Section 8.6 discusses some variant notions of consistency. The database community has long recognized the need for flexible notions of consistency. I predict that general-purpose parallel and distributed computing will eventually need to borrow concepts of transaction processing and very flexible consistency constraints from databases.
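Weaker consistency criteria of this kind are exactly what modern shared-memory programming interfaces expose. As one hedged illustration (again using the C11 atomics library, not anything from the text), a producer can publish data with a ``release'' store and a consumer can pick it up with an ``acquire'' load: the consumer is guaranteed to see the payload once it sees the flag, but no global sequential order of all memory operations is ever constructed.

#include <stdatomic.h>

int payload;                     /* ordinary, non-atomic data */
atomic_int ready;                /* publication flag, initially 0 */

/* Producer: write the data, then publish it with a release store. */
void produce(void) {
    payload = 42;
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Consumer: once the acquire load sees ready == 1, the earlier write
   to payload is guaranteed to be visible, even though unrelated memory
   traffic may be freely reordered. */
int consume(void) {
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;  /* wait for publication */
    return payload;
}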

Optimistic processing with rollback

The natural reaction to a need for consistency is to prevent inconsistent states from ever arising. But, in database systems and networking, we have discovered that it is often better to go ahead with a computation that is highly likely to be consistent, check consistency later, and repair any problems that arise. This strategy wins when the probability of inconsistency times the cost of repair is lower than the cost of preventing inconsistency in the first place. This basic idea has already slipped into computer architecture in branch prediction for pipelines. I predict that it will infiltrate far more in the future.
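The same optimistic pattern shows up in shared-memory programming as a compare-and-swap retry loop. Here is a minimal sketch using the C11 atomics library (the function name is invented): the update is computed on the assumption that nobody interferes, the compare-and-swap is the later consistency check, and retrying is the repair.

#include <stdatomic.h>

atomic_int counter;   /* shared counter, initially 0 */

/* Optimistically add delta: snapshot the counter, compute the new value,
   and install it only if the counter still holds the snapshot.  If some
   other processor changed it in the meantime, the compare-and-swap fails,
   'expected' is refreshed with the current value, and we simply retry. */
void optimistic_add(int delta) {
    int expected = atomic_load(&counter);
    while (!atomic_compare_exchange_weak(&counter, &expected,
                                         expected + delta))
        ;  /* "rollback" is just: try again with the fresh value */
}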

Snoopy cache protocols

Figures 8.11 and 8.12 on pages 664 and 665 of the text show a simplified snoopy cache-coherence protocol. The ``states'' are states of a cache line, not states of a memory block. An invalid line has no memory block associated with it. The shared and exclusive states can be thought of as properties of the memory block in the cache line, as well as properties of the line.

The figures sketch the operations performed by each cache. Since a single read or write instruction execution may require participation of several caches, I've worked out an alternate presentation that traces the actions involved in each read/write. The lines labeled ``other caches'' indicate operations that go on in caches other than the one where the Read/Write is being executed. These informal programs cannot be implemented directly in circuitry, because they combine operations that must be performed in several different CPUs and caches.

Read address a:
    If the block containing a is in the local cache (shared or exclusive),
        return the value: a read hit, with no bus traffic.
    Otherwise (a read miss):
        If the local line holds some other block in the exclusive state,
            write that block back to memory.
        Place a read miss for a on the bus.
        other caches: a cache holding the block in the exclusive state
            writes it back to memory and changes its own line to shared.
        Load the block from memory into the local line, mark the line
            shared, and return the value.

Write address a:
    If the block containing a is in the local cache in the exclusive state,
        write the value locally: a write hit, with no bus traffic.
    Otherwise:
        If the local line holds some other block in the exclusive state,
            write that block back to memory.
        Place a write miss for a on the bus.
        other caches: a cache holding the block in the shared state
            invalidates its line; a cache holding it in the exclusive state
            writes the block back to memory and then invalidates its line.
        If the block is not already in the local line, load it from memory.
        Mark the local line exclusive and perform the write.
The operations outlined above are consistent with the state changes in Figures 8.11 and 8.12, but the figures suggest that a write miss reads in the corresponding block, even if it's already in the local cache. I've optimized away that unnecessary memory operation.

Many snoopy cache designs have one more state, called ``clean and private'' in the text. That is, they keep track separately of whether a block is exclusively held (and thus safely writable), and whether it has actually been written (and thus invalidates reads on the same block in other CPUs). A write to a clean and private line upgrades it to exclusive, without any bus traffic. An operation to the same block from another CPU downgrades a clean and private block to shared.
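To make the state changes concrete, here is a toy software simulation of the three-state protocol traced above (without the clean-and-private refinement). Everything in it is invented for illustration: a single one-word memory block stands in for a whole cache line, and the loops over the other caches stand in for bus snooping. It is a sketch of the bookkeeping, not a hardware design.

#include <stdio.h>

/* Toy simulation of the simplified snoopy protocol: one one-word memory
   block, NCACHES caches, and the three line states discussed above. */

enum state { INVALID, SHARED, EXCLUSIVE };

#define NCACHES 4
enum state line[NCACHES];   /* state of the block's line in each cache */
int cache_data[NCACHES];    /* the block's contents as each cache sees it */
int memory;                 /* the block's contents in main memory */

/* CPU c reads the block. */
int read_block(int c) {
    if (line[c] == INVALID) {                 /* read miss: use the bus */
        for (int o = 0; o < NCACHES; o++)     /* other caches snoop */
            if (o != c && line[o] == EXCLUSIVE) {
                memory = cache_data[o];       /* owner writes back */
                line[o] = SHARED;
            }
        cache_data[c] = memory;               /* fetch the block */
        line[c] = SHARED;
    }
    return cache_data[c];                     /* hit, or freshly loaded */
}

/* CPU c writes value v to the block. */
void write_block(int c, int v) {
    if (line[c] != EXCLUSIVE) {               /* need the bus first */
        for (int o = 0; o < NCACHES; o++)     /* other caches snoop */
            if (o != c && line[o] != INVALID) {
                if (line[o] == EXCLUSIVE)
                    memory = cache_data[o];   /* owner writes back first */
                line[o] = INVALID;            /* every other copy invalidated */
            }
        if (line[c] == INVALID)               /* fetch only if not already */
            cache_data[c] = memory;           /* present, as noted above */
        line[c] = EXCLUSIVE;
    }
    cache_data[c] = v;                        /* write in the local cache */
}

int main(void) {
    write_block(0, 1);                            /* CPU 0 writes 1 */
    printf("CPU 1 reads %d\n", read_block(1));    /* forces CPU 0 to write back */
    write_block(1, 2);                            /* invalidates CPU 0's copy */
    printf("CPU 0 reads %d\n", read_block(0));    /* CPU 0 misses and sees 2 */
    return 0;
}

Note that the write path fetches the block from memory only when the local line was invalid; a write to a block already held shared just invalidates the other copies, which is the optimization mentioned above.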

Snoopy cache performance

Snoopy caches suffer an additional sort of miss: a coherence miss. They also suffer an increased miss penalty because of the extra complexity of misses. And, they increase traffic on the memory bus, which may lead to bus-conflict stalls, a cost we haven't studied. But, there is no increase in the hit cost. Other approaches to cache coherence tend to increase the hit cost, or to make even greater demands on the bus. The snoopy cache appears to be just the right idea for several processors on a shared memory bus, because:

  1. The whole concept of multiprocessing only works if memory conflicts are kept low in the programs. A program that makes a snoopy cache inefficient due to large numbers of coherence misses will probably be inefficient on other multiprocessor architectures as well.
  2. As long as misses are kept infrequent, a greater miss penalty is less harmful than a greater hit cost, as the arithmetic below illustrates.
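To see why, plug in some made-up numbers. Suppose a hit takes 1 cycle, 2% of accesses miss, and coherence traffic raises the miss penalty from 40 to 60 cycles: the average access time grows by only 0.02 x 20 = 0.4 cycles. Adding even a single cycle to the hit time would instead add a full cycle to every access, more than twice as much, and the gap widens as the miss rate shrinks.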



Maintained by Michael J. O'Donnell, email: odonnell@cs.uchicago.edu