CS 222/322 - Machine Organization
Take-Home Midterm Exam, Winter 1998
Due Wednesday, 18 February, 1:30 PM
7 problems on pages 2-6

This is a self-timed exam. Please limit yourself to 2 working hours from the time that you open the paper, until you complete the solutions that you hand in. You may take a break up to 15 minutes during the exam.
You may use the textbook, my class materials, any other published material, and notes of your own. You definitely need the textbook, and the chart that I created on the Web showing the DLX pipeline. You don't need a calculator, but you may use one if you like. Do not discuss the exam with anyone except me.
There are 6 problems for Com Sci 222 and 322. They cover preliminary material, and the three chapters of the text, with one or two problems for each of those sections. There is one additional problem for Com Sci 322 only. Each numbered problem is weighted equally for grading purposes.
The problem statements are a bit wordy, since I will not be nearby to answer questions. The wordiness does not not indicate subtlety.
Please deliver your finished exam to me at the beginning of class.

Com Sci 222 and 322

Some computers have hardware for arithmetic in base 10 to support business software. Base 10 is too bulky for an exam, so we'll look at base 3. Each base 3 (trinary) digit requires two binary bits. A 1-digit base 3 adder has five bits of input: , , , , and , where

The base 3 digits are represented as follows

So, if , , , , and , then we are adding the trinary digit 2 to the trinary digit 1, with no carry coming in. We never have , nor , since those values don't correspond to a trinary digit.
Write a boolean formula in sum of products form, expressing the carry out ( ) in terms of the five input bits. You should get a sum of six products, and each product involves two or three input bits.
The left column below lists some choices that you might make in the design of a computer. The right column lists some types of technology whose qualities might lead to those choices. Make a one-to-one matching between the columns, showing the most likely correspondence of design choices to the types of technology affecting them most strongly.
In problem 1.4 in the text, we found that the proposed change from a pure load/store instruction set to a machine with some register-memory instructions eliminated 10.75 load instructions per 100 load/store instructions, converting 10.75 old register-register ALU instructions to new register-memory ALU instructions, thereby saving 10.75 clock cycles. But, it paid 24 extra clock cycles per 100 load/store instructions because branches take longer, for a net increase of 13.25 cycles. Suppose that we find a way to implement the new register-memory instructions without adding a clock cycle to the execution of branch instructions, but instead by slowing down all clock cycles from 2 nanoseconds to nanoseconds, for some s>1.
1. For what values of s do the new register-memory instructions improve performance?
2. Suppose that we can also add ALU operations that combine two values from registers, but store the result in memory (e.g., add R1 and R2, storing at address 1000 in memory), at the cost of slowing down clock cycles by the same factor of s. The new save-result-to-memory ALU operations allow many store instructions to be eliminated, saving one clock cycle for each. Which kinds of stores can be eliminated, and which cannot? Let n be the fraction of store instructions that can be eliminated ( ). What relationship between s and n causes the new design to beat the original load/store design? If n=1, for what values of s does the new design win? In a program that does mainly number crunching, do you think n is likely to be nearer to 0 or to 1?
Explain very briefly why a machine with a better MIPS rating may be slower than a machine with a worse MIPS rating. 2 or 3 sentences should be enough.
This is a variation on problem 2.3 in the text, comparing a memory-memory instruction set to a load/store instruction set for total bytes of memory traffic.
1. Write a program fragment consisting of a sequence of assignments in a high-level language that performs 5 integer operations. (That is, give a slightly larger example similar to the one in the statement of problem 2.3). You may use any number of variables, and you may write 5 assignments with 1 operation each, 1 assignment with 5 operations, or anything in between. Make the memory-memory instruction set beat the load/store instruction set by the largest factor that you can manage, counting total code and data. Report the total number of code bytes and date bytes transmitted for each instruction set. You do not have to write out the assembly code.
2. Write another program fragment with the same restrictions as in part a, but this time make the load/store instruction set win by the largest factor that you can manage. Report the total number of code bytes and data bytes transmitted for each instruction set.
3. Clearly, the relative efficiency of memory-memory vs.\ load/store depends on the program. Do loops in programs tend to tip the efficiency balance toward memory-memory, or toward load/store? Explain briefly (3 sentences can be enough).
Look at the table describing the DLX pipeline that I provided on the Web. Compare a nonpipelined implementation of DLX with a pipelined version based on this table. Assume that each stage takes one clock cycle in the pipelined version, and that each nonpipelined load, store, ALU, and branch instruction takes a number of clock cycles equal to the nonidle stages in the pipelined implementation. Assume that the clock cycle is 10 nanoseconds unpipelined and 11 nanoseconds pipelined. For hazards:
- All data hazards stall until the write is finished (no forwarding of any sort).
- For branches and jumps, predict they are not taken. Cancel the 3 instructions in the pipeline behind a branch that is taken.
I tried to produce bad pipeline behavior. The following loop decrements R1 from 1000 to 0.

There are plenty of instructions after the loop, but we don't care what they are.
1. Compute and compare the running time for this loop in the pipelined and unpipelined machines. The pipeline loses the race by a very small margin.
2. Point out where data and control values already computed in the simple DLX pipeline could be forwarded to speed up the pipeline on this program, and compute the improved running time.

Com Sci 322 only

Figures 3.22 and 3.23 on pages 163-164 show how to use an adder in the ID stage to reduce stalls due to branches.
1. How does the resolution of branches in the ID stage change the pipeline performance in Problem 6?
2. For a very short loop, every instruction may be in the pipeline simultaneously. Propose a forwarding of information that takes advantage of the presence of both loop instructions in the pipeline to save one more machine cycle in the loop above. Be precise about which stage provides the forwarded information and which stage uses it. How long could a loop be and still have a reasonable chance to benefit from this trick?

About this document ...

The command line arguments were:
latex2html -split 0 midterm.tex.

The translation was initiated by Mike O'Donnell on Thu Feb 18 21:59:16 CST 1999

Mike O'Donnell
Thu Feb 18 21:59:16 CST 1999