THE UNIVERSITY OF MICHIGAN
COMPUTING RESEARCH LABORATORY1

DEVELOPMENT AND ANALYSIS OF A GLOBAL COMPACTION TECHNIQUE FOR MICROPROGRAMS

Jehkwan Lah

CRL-TR-40-84

September 1984

Room 1079, East Engineering Building
Ann Arbor, Michigan 48109 USA
Tel: (313) 763-8000

1 Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the funding agency.

DEVELOPMENT AND ANALYSIS OF A GLOBAL COMPACTION TECHNIQUE FOR MICROPROGRAMS

by
Jehkwan Lah

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy ( Electrical Engineering ) in The University of Michigan, 1984

Doctoral Committee:
Professor Daniel E. Atkins, Chairman
Professor Gideon Frieder
Associate Professor Trevor Mudge
Professor Norman Scott
Associate Professor Toby Teorey

ABSTRACT

DEVELOPMENT AND ANALYSIS OF A GLOBAL COMPACTION TECHNIQUE FOR MICROPROGRAMS

by Jehkwan Lah

Chairman: Daniel E. Atkins

The need for better microprogramming tools has increased considerably as the demand for and support of computer technology have brought about wide use of microprograms. The eventual goal of microprogramming tool development would be a high level microprogramming language and a compiler that generates minimal-execution-time microcode for a variety of machines. In generating minimal-execution-time microcode, one aspect that differentiates microprogramming languages from macroprogramming languages is the need for compaction in highly horizontal microarchitectures. Among the proposed microprogram compaction methods, trace scheduling is the most general and appears to give the fastest execution of compacted microcode. However, the growth of memory size caused by extensive copying of blocks can be enormous, exponential in the worst case, and the complicated bookkeeping stage of trace scheduling has been an obstacle to implementation. A technique called beta compaction, based on trace scheduling, is proposed to mitigate the drawbacks of trace scheduling. Basically, it identifies the junction blocks ( the blocks beginning with a join and ending with a conditional branch ) as the major source of complication, and cuts traces at those junction blocks. It achieves almost all the compaction of trace scheduling except that which causes copying of blocks. Memory size after beta compaction is usually smaller than the original. Even when the memory size grows in rare instances, it

is bounded by O(n^2) in the worst case, and the bookkeeping stage is very much simplified. The variation in compacted microcode size as the source microcode changes is also very small. A loop-free version of both beta compaction and trace scheduling has been implemented. Comparison between the two was done using artificially synthesized microcodes, and the above properties of beta compaction were confirmed. A simple microprogrammable machine based on AMD2900 components was designed and simulated with an interactive user-friendly interface. A realistic application program was written and hand-compiled into microcode. The microcode was executed on the simulator both before and after compaction, which demonstrated the applicability of the compaction technique and the correctness of the implementation.

TABLE OF CONTENTS

DEDICATION ........ ii
ACKNOWLEDGMENT ........ iii
LIST OF FIGURES ........ vi
LIST OF APPENDICES ........ ix
LIST OF ABBREVIATIONS ........ x

CHAPTER

I. INTRODUCTION ........ 1

II. A MODEL OF A MICROPROGRAM ........ 7
    2.1 Data Dependency Analysis ........ 9
    2.2 Host Machine Description ........ 10
    2.3 Microprogram Description ........ 11

III. PREVIOUS RESEARCH ........ 13
    3.1 Local Compaction ........ 14
        3.1.1 First-Come First-Serve ........ 15
        3.1.2 Critical Path ........ 17
        3.1.3 Branch and Bound ........ 20
        3.1.4 List Scheduling ........ 23
        3.1.5 Vegdahl's Thesis ........ 26
    3.2 Global Compaction ........ 28
        3.2.1 Wood ........ 29
        3.2.2 Tokoro ........ 31
        3.2.3 Fisher ........ 33

IV. BETA COMPACTION ALGORITHM ........ 38

    4.1 Algorithm Description ........ 39
        4.1.1 Live register analysis ........ 42
        4.1.2 Pick Trace ........ 43
        4.1.3 List Scheduling ........ 47
        4.1.4 Bookkeeping ........ 54
    4.2 Simple Illustrative Example ........ 58

V. EXPERIMENTATION AND ANALYSIS ........ 68
    5.1 Purpose ........ 68
    5.2 Hypothesis ........ 69
    5.3 Implementation ........ 69
    5.4 Comparison against trace scheduling ........ 70
        5.4.1 Artificial Microcode Model ........ 71
        5.4.2 Program Structure Model ........ 72
        5.4.3 Procedure ........ 76
        5.4.4 Result and Discussion ........ 78
    5.5 Simulator of an AMD2900-based system ........ 81
        5.5.1 System description ........ 82
        5.5.2 Functions and human interface ........ 85
        5.5.3 An Application Program ........ 85
    5.6 Space Complexity Analysis ........ 87
        5.6.1 Trace Scheduling ........ 88
        5.6.2 Beta Compaction ........ 90

VI. CONCLUSION ........ 95
    6.1 Summary ........ 95
    6.2 Future Research ........ 96

APPENDICES ........ 97

BIBLIOGRAPHY ........ 126

LIST OF FIGURES

Figure

3.1 Data dependency graph of example microcode ........ 14
3.2 The working of first-come first-serve ........ 16
3.3 The working of critical path ........ 18
3.4 Result of critical path ........ 19
3.5 Working of branch and bound exhaustive ........ 22
3.6 Weight of MO's for list scheduling ........ 25
3.7 MO's with different data antidependencies ........ 27
3.8 Wood's compaction heuristic ........ 30
3.9 Tokoro's MO movement rules ........ 32
3.10 Fisher's trace scheduling ........ 34
3.11 Copied blocks in the trace scheduling ........ 36
4.1 Junction block ........ 39
4.2 Insertion of dummy blocks ........ 41
4.3 Selecting traces ........ 46
4.4 Sample data dependency graph ........ 48
4.5 Example of building DDG ........ 50
4.6 Example of assign priority ........ 52
4.7 Components of trace ........ 55
4.8 Copying MO's ........ 57
4.9 Sample microprogram ........ 59
4.10 List of MO's ........ 60
4.11 Live registers ........ 61

4.12 Data dependency graphs ........ 62
4.13 Intermediate results ........ 63
4.14 Final result ........ 64
4.15 Trace scheduling on the example ........ 65
4.15 Trace scheduling on the example ( cont'd ) ........ 66
4.16 Comparison summary ........ 67
5.1 Program structure model ........ 73
5.2 Program structure model for 3 conditional branches ........ 74
5.2 Program structure model for 3 conditional branches ( cont'd ) ........ 75
5.3 Example of an artificially synthesized microcode ........ 77
5.4 Summary of experimental result ........ 79
5.5 Distribution of the two junction block case ........ 80
5.6 Block diagram of the simulated system ........ 83
5.7 Microinstruction format ........ 84
5.8 Summary of 2-3 tree insertion program execution ........ 87
5.9 Exponential memory growth in trace scheduling ........ 88
5.10 Worst case in beta compaction ........ 91
5.11 Calculation of memory size increase ........ 92
A.1 Processor initialization frame ........ 100
A.2 Trap/trace frame ........ 101
A.3 Alter/display frame ( datapath subframe ) ........ 102
A.4 Alter/display frame ( pipeline register subframe ) ........ 103
A.5 Alter/display frame ( control store subframe ) ........ 104
A.6 Alter/display frame ( main store subframe ) ........ 105

A.7 Alter/display frame ( trace table subframe ) ........ 106
A.8 Recording frame ........ 107
A.9 Performance data frame ........ 108

LIST OF APPENDICES

Appendix

A. The simulator functions and human interface ........ 98
B. Pascal listing of 2-3 tree insertion algorithm ........ 109
C. AMDASM listing of 2-3 tree insertion algorithm ........ 116

LIST OF ABBREVIATIONS

MO    Micro operation
MI    Micro instruction
DDG   Data dependency graph
SLM   Straight line microcode
CI    Complete instruction
DRS   Data ready set
EP    Early partition
LP    Late partition
CP    Critical partition

CHAPTER I

INTRODUCTION

Since Maurice Wilkes first introduced the concept of microprogramming in 1951 [WIL51], the use of the microprogram has increased continuously as the demand for and support of computer technology have increased. The increasing complexity of digital systems requires more complex microprograms; decreasing high speed memory cost and emerging development tools help meet the requirement.

However, until recently microcode was produced with little help from development tools. The only practical tool available to many microprogrammers was an assembly-like bit stuffer using mnemonics. The microcode was written by someone who had a thorough knowledge of the machine to be programmed. Microprogramming was considered a part of the hardware design, and so it was often done by the hardware designer. And once developed, the microcode was not touched except to fix bugs. Such microcode development was manageable by a few specialists with little help from development tools, mainly because the microcode was relatively small in size and less complex.

However, things have changed. With the introduction of writable control store and user microprogramming, the need for better microprogramming tools is greater. Increasingly complex digital systems and the advent of VLSI are increasing the use of microprograms. Design time for VLSI systems depends on sophisticated design aids for hardware and microprogramming. VLSI hardware and microcode design problems are large and technology-dependent; humans alone cannot handle the detail and explore all the alternatives involved in correct,

optimal design [HAR81]. VLSI technology has also reduced the cost of available high-speed memory, so the size of micromemory has become less of a constraint, and the microprogramming of some special purpose processors involves larger microcodes than conventional macroinstruction emulation. All these changes make it more critical to have better tools for microprogramming.

The eventual goal of microprogramming tool development would be to make a high level language and a compiler to generate minimal-execution-time microcode for a wide variety of machines. There have been various attempts to design and implement higher level languages for microprogramming, but none of these has resulted in the production of a generally available compiler. Several proposed high level microprogramming languages are reviewed in [SIN80]. Even though the same pressure that has led to the widespread acceptance of conventional high level languages now applies to microprogramming [DAV78], the development of microprogramming languages lags far behind that of macroprogramming languages.

There are a couple of factors which complicate compilation to microcode. First, the structure of horizontal1 microcode is much more complicated than that of conventional machine code. In a horizontal microarchitecture, a microinstruction contains several microoperations which directly control parallel hardware resources. The microoperations contained in one microinstruction are executed at the same time. Accordingly, the detection of available parallelism and the scheduling of microoperations are very difficult. Timing can be very complicated: certain microoperations require different clock phases than others or might take more than one clock cycle. If some degree of ( if not total ) machine independence is sought, generating equally fast-executing microcode for different machines is an extremely difficult task.

Second, in most microprogramming environments, minimal-execution-time microcode is very important. In particular, when the microprogram is used to

1 See the definition in chapter II.

emulate a macroarchitecture, the efficiency of the whole system ultimately depends on the efficiency of the microprograms implementing the instruction set of the macromachine. The user microprogram ( of non-emulation application ) may allow less strict efficiency requirements. But the only reason that a user would want to write microcode is to gain speed. If the compiler cannot generate sufficiently fast-executing microcode, it is of no use.

One of the critical issues in developing a high level microprogramming language is how to generate minimal-execution-time microcode. As machines have more concurrency available in the datapath at the register transfer level, it is particularly important to efficiently control all the available parallel resources. It is a common belief that microcode generated by a machine ( or a compiler ) cannot be executed faster than one written by a highly skilled microprogrammer, but as the size of microprograms grows, handling all opportunities to optimize a microprogram goes beyond human capability [PAT76].

In generating minimal-execution-time microcode, one aspect that differentiates microprogramming languages from macroprogramming languages is the need for compaction in highly horizontal microarchitectures. Microprogram compaction is the process of exploiting parallelism to combine microoperations ( MO's ) into microinstructions ( MI's ) to reduce the time and/or space needed for the execution of a microprogram. In solving microprogram compaction problems, optimality is pursued but not necessarily arrived at, since the microprogram optimization ( optimal compaction ) problem has been proved to be NP-complete both in the microword dimension [DEW76] and in the bit dimension [ROB79].

The microprogram compaction problem has been approached in two ways: local compaction and global compaction. Local compaction deals only with a straight line microcode ( SLM ) section, also known as a basic block. A SLM is a sequence of MO's with no jump into the microcode except at the beginning, and no jump out except possibly at the end. Global compaction deals with microprograms which have more than one basic block, with conditional jumps, joins, and

loops.

Several methods have been proposed to solve the local compaction problem, and they are reviewed in an article by Landskov et al [LAN80]. The same group of people ran some experiments with the methods, reported the results in [DAV81], and claimed that the local compaction problem is considered to be essentially solved. However, Vegdahl presented an algorithm in his thesis [VEG83] which showed that the classical compaction problem, which does not consider reordering of source MO's, can be solved in O(n^v), where v is a bounded number of registers in the machine. And so, Vegdahl conjectures, the problem of finding a semantics-preserving partial ordering which yields optimum compaction is a more difficult problem, because the general compaction problem - that is, the problem that considers all semantics-preserving partial orderings - is still NP-complete even with the bounded-register assumption. Here we are dealing with the classical compaction problem and do not consider reordering of source MO's. Even though the problem can be solved in polynomial time, the order, which is equal to the number of registers in the hardware, is in most cases so high that one is justified in using heuristics, which have low order ( usually O(n^2) ) polynomial time complexity. Several methods to solve the local compaction problem, including Vegdahl's thesis results, are reviewed in section 3.1.

Four methods have been proposed to solve the global compaction problem [DAS79, WOO79a, TOK78, TOK81, FIS79, FIS81a], as summarized in [FIS81b]. Also, a recent paper by Isoda [ISO83] has proposed another approach, which uses a generalized data dependency graph. Trace scheduling by Fisher [FIS79, FIS81a] is the most general method and appears to give the fastest execution of compacted microcode.

Although Fisher's trace scheduling procedure for global compaction may produce significant reduction in execution time of compacted microcode, the growth of memory size by extensive copying of blocks can be enormous. In the worst case, the memory

size can grow exponentially2 [FIS81a], and the complex bookkeeping stage of trace scheduling is an obstacle to implementation. In [FIS81a], Fisher suggests modifications to mitigate the memory growth effect. These modifications are based upon selection of a probability threshold below which ( with some exceptions ) MO's are not allowed to move across block boundaries. Both memory size and execution time appear to be very sensitive to changes in the source microcode. See section 3.2.3 for further discussion of the modifications. A few methods to solve the global compaction problem are reviewed in section 3.2.

In this dissertation a technique called beta compaction, based on trace scheduling, is proposed to mitigate the drawbacks of trace scheduling. Basically, it identifies the junction blocks ( the blocks beginning with a join and ending with a conditional branch ) as the major source of complication, divides traces at those junction blocks, and compacts each divided trace separately. It achieves almost all the compaction of Fisher's trace scheduling except that which causes copying of blocks. An earlier effort to correct the drawbacks of trace scheduling, called tree compaction, is reported in [LAH83].

A loop-free version of both beta compaction and trace scheduling has been implemented. Comparison between the two was done using some synthesized microcodes, and it was confirmed that beta compaction generates almost as fast-executing microcode as trace scheduling, but with much less memory. A microprogrammable machine based on AMD2900 components was designed and simulated with an interactive interface. A realistic application program was written and hand-compiled into microcode. The microcode was executed on the simulator both before and after compaction, to demonstrate the applicability of the compaction technique and the informal correctness of the implementation.

2 One such example is shown in section 5.6.1.

Chapter II presents a simple microprogram model and definitions of the terms which are used throughout the thesis. Chapter III reviews some work done previously in the area of microprogram compaction, including local compaction and global compaction. Chapter IV and Chapter V are the core of this thesis. Chapter IV presents a detailed description of the beta compaction technique, which was developed to solve the global compaction problem. It also gives one illustrative example to show the actual working of the heuristic. Chapter V reports two experimental results. First, the comparison of beta compaction against trace scheduling using an artificial microcode model is presented. Second, compaction of an application program and its execution on a simulated machine is described. At the end of the chapter, the exponential memory growth of trace scheduling and the worst-case space complexity analysis of beta compaction are shown. Chapter VI summarizes the dissertation and suggests some future research.

Throughout the dissertation a few abbreviations of terms are used to make sentences short and clear. The abbreviations are listed at the beginning of this document.

CHAPTER II

A MODEL OF A MICROPROGRAM

The purpose of this chapter is to present a microprogram model which will be used in later chapters of the thesis. A part of the model is based on D. Landskov et al [LAN80]. The model is to satisfy the minimum requirement of allowing presentation of the essential concepts of the microprogram compaction heuristics with clarity. Since this is the basic model, many extensions are possible. Some of them will be discussed along with the basic model.

Definition: A microinstruction ( MI ) is an ordered set of all control signals in a machine at a given ( quantized ) time.

Definition: A microprogram is a time-ordered sequence of MI's.

Definition: Each separate machine activity specified in an MI is called a microoperation ( MO ).

Thus an MI can be characterized as a set of MO's, and the MO is the basic unit of operation that we are dealing with. In some literature [MAL78, LAN80], a microbundle ( MB ) is defined as a set of MO's, all of which are coupled to one another. Thus every MO in an MB must go into the same MI. The concept of bundling can be useful in dealing with MO's which operate on transitory data storage.

The most common classification applied to microinstruction formats is to describe them as horizontal or vertical [SAL76]. Horizontal and vertical are relative terms used to indicate the degree of encoding in microinstruction

formats. There are two extremes, totally horizontal and totally vertical, and many variations in between. A highly horizontal microinstruction format would have a sufficient number of separate control fields to exercise simultaneous control over all independent hardware facilities, with encoding limited to mutually exclusive control signals. A highly vertical microinstruction format, on the other hand, would have a high degree of encoding and, in many ways, resembles a conventional machine instruction. We will use the terms horizontal and vertical to indicate highly horizontal and highly vertical, which include not only the extremes but some neighborhood of the extremes.

Definition: Microprogram compaction is the process of exploiting parallelism to combine MO's into MI's to reduce the time and/or space needed for the execution of a microprogram.

Vertical MI's, in a strict sense, specify only a single MO to be performed in each MI. Therefore vertical microprogramming allows no room for microprogram compaction. Horizontal microprogramming is implied throughout the discussion of microprogram compaction.

Definition: A straight line microcode ( SLM ) is an ordered set of MO's with no entry point except at the beginning, and no branches except possibly at the end.

Microprogram compaction which deals only with a single SLM at a time is called local compaction. Microprogram compaction which deals with more than one SLM, where SLM's are connected with each other through conditional branches, joins or loops, is called global compaction.

In local compaction, the execution time of compacted microcode is reduced by reducing the number of compacted MI's. However, in global compaction, simply reducing the total number of compacted MI's does not necessarily reduce the execution time of the compacted microcode, because the probabilities of the SLM's being executed are different. It is possible, in global compaction, that a microcode with larger size is executed faster than a microcode with smaller size for some data set.

2.1. Data Dependency Analysis

In the process of microcode compaction, a given sequence of MO's is placed into a sequence of MI's, where an MI may contain one or more MO's. Although the original order of MO's is invariably changed, we want the original semantics of the microcode retained and the data integrity not violated after the compaction. To accomplish this we need to analyze data dependency between MO's.

The following definitions about data dependency are made only with respect to a SLM or a trace. A trace is a sequence of blocks, and the MO's in the blocks are treated as if they form a SLM for the purpose of data dependency analysis. The trace will be defined in sec. 2.3. The definitions of data dependency can be extended to include any program structure [ISO83].

Definition: In a given sequence of MO's mo_1, mo_2, ..., mo_n, we say mo_i and mo_j ( i < j ) have a data interaction if they satisfy any of the following conditions.

(1) An output resource of mo_i is also an input resource of mo_j.
(2) An input resource of mo_i is also an output resource of mo_j.
(3) An output resource of mo_i is also an output resource of mo_j.

To maintain data integrity, the order of any two MO's cannot be changed if they have a data interaction. Using the definition of data interaction, we can define a partial ordering over the MO's.

Definition: Given two MO's mo_i and mo_j, where mo_i precedes mo_j in the original SLM, mo_j is directly data dependent on mo_i ( written mo_i -> mo_j ) if the two MO's have a data interaction and if there is no sequence of MO's mo_k1, mo_k2, ..., mo_kn, n >= 1, such that mo_i -> mo_k1, mo_k1 -> mo_k2, ..., mo_kn -> mo_j.

The second part of the definition ensures that two directly data dependent MO's will have no other chain of directly data dependent MO's between them.
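To make the interaction conditions and the direct-dependency test concrete, the following is a minimal sketch in modern notation ( Python ). It is an illustration only, not the thesis implementation; the MO representation with explicit read and write sets is an assumption made for this example.

    # A hypothetical MO carrying the sets of resources it reads and writes.
    class MO:
        def __init__(self, label, reads, writes):
            self.label = label
            self.reads = set(reads)    # input resources
            self.writes = set(writes)  # output resources

    def data_interaction(mo_i, mo_j):
        """True if mo_i ( the earlier MO ) and mo_j have a data interaction."""
        flow = mo_i.writes & mo_j.reads     # (1) output of mo_i read by mo_j
        anti = mo_i.reads & mo_j.writes     # (2) input of mo_i written by mo_j
        output = mo_i.writes & mo_j.writes  # (3) both write the same resource
        return bool(flow or anti or output)

    def build_ddg(mos):
        """Return the direct data dependency edges (i, j) for a SLM."""
        n = len(mos)
        reaches = [[False] * n for _ in range(n)]  # transitive reachability
        edges = []
        for j in range(n):
            for i in range(j - 1, -1, -1):  # later interactions seen first
                if data_interaction(mos[i], mos[j]) and not reaches[i][j]:
                    edges.append((i, j))    # no intervening chain: direct edge
                    reaches[i][j] = True
                    for p in range(i):      # everything reaching i reaches j
                        if reaches[p][i]:
                            reaches[p][j] = True
        return edges

The inner check "not reaches[i][j]" implements the second part of the definition: an edge is added only when no chain of direct dependencies already connects the two MO's.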

Data dependency is the transitive closure of the direct data dependency relation.

Definition: The partial ordering over MO's defined by direct data dependency can be represented by a directed acyclic graph. We call this the data dependency graph ( DDG ).

Each node on a DDG corresponds to an MO. An edge on a DDG from a node i corresponding to mo_i to a node j corresponding to mo_j indicates that mo_j is directly data dependent on mo_i.

Vegdahl adopted the definition of data antidependency from Banerjee et al. [BAN79] and used it as a basis for his analysis.

Definition: If an MO may destroy data that is required by another MO, the former is said to be data antidependent on the latter.

This is case (2) of our definition of data interaction. For further discussion of data antidependency and Vegdahl's analysis, see section 3.1.5.

2.2. Host Machine Description

Besides data dependency among MO's, there is another restriction that has to be considered in microprogram compaction, which is imposed by the host machine itself. If two MO's require exclusive use of a single resource, they cannot be placed into the same MI. This situation is called a resource conflict. To handle the resource conflict, we need a model to describe the host machine resources seen by the microprogram.

Even though two MO's require separate resources, they may still not be placeable in the same MI because, as a result of partial encoding, the two MO's use the same MI field. This situation is called an MI field conflict. In the case of a totally horizontal microinstruction, the MI field conflict does not exist because each resource in the hardware has its own control field in the MI. Therefore, the resource conflict and the MI field conflict are equivalent in totally horizontal microcodes. To simplify the discussion, we will assume a

totally horizontal MI format without losing generality. It would then be straightforward to extend our discussion to handle partially encoded microcodes. We now present a model of a machine control word which includes a simple description of the host machine resources.

Definition: An MO is represented by a six-tuple ( label, instruction, next-address, destination-registers, source-registers, resource-vector ).

The label is a number to identify each MO; the instruction is a command to the microsequencer ( microaddress controller ), for example, CONT, CJMP, GOTO, etc.; and the next-address is a target address for a branch. The destination-registers and the source-registers are the registers written and read, respectively, in a given MO. A resource vector is a vector in which each component corresponds to an individual resource available in the hardware. The value of each component is either 1 or 0, depending on whether the corresponding resource is used in a given MO. The resource vector gives a description of the host machine resources sufficient to detect and resolve resource conflicts.

Other models used in the literature [LAN80, DAV81, TOK81], which were intended for local compaction, do not have sequencer instructions and branch addresses. On the other hand, they have the set of clock phases required for MO execution and the set of all MI fields required by the MO. It would be straightforward to add these attributes to our model as necessary.

Definition: A register is alive at some point in a program if its contents will be read in the future prior to being overwritten. Otherwise, the register is dead.
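As an illustration of the six-tuple model, here is a small sketch in Python. The field names, the bit-mask encoding of the resource vector, and the example resources are assumptions made for this example, not part of the original model.

    from collections import namedtuple

    # A hypothetical six-tuple MO; the resource vector is kept as a bit mask.
    MO = namedtuple("MO",
                    "label instruction next_address dst_regs src_regs resources")

    def resource_conflict(mo_a, mo_b):
        """Two MO's conflict if they use any hardware resource in common."""
        return (mo_a.resources & mo_b.resources) != 0

    # Example: an ALU operation and a shift that both need the ALU ( bit 0 ).
    add = MO(1, "CONT", None, ["r1"], ["r2", "r3"], 0b0001)
    shl = MO(2, "CONT", None, ["r4"], ["r4"], 0b0011)  # ALU + shifter
    assert resource_conflict(add, shl)  # the two cannot share one MI

Under the totally horizontal assumption this single test covers both the resource conflict and the MI field conflict, since the two are equivalent there.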

2.3. Microprogram Description

A loop-free program can be represented as a connected directed graph with one node of in-degree zero ( entrance ) and one node of out-degree zero ( exit ). For programs with multiple entry points, a dummy entry block, from which execution branches to the actual entry points, is added to comply with the above rule. Similarly, a dummy exit block is added for programs with multiple exit points.

Definition: A trace is a set of blocks which forms a path of execution and has a higher probability of execution than any other path in a given microprogram or a part of it.

Definition: The weighted execution time of a microprogram is the sum of the execution time of each block multiplied by the probability of that block executing.
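The weighted execution time can be stated as a small computation. The following sketch assumes each block is annotated with its execution time in cycles and its execution probability; this representation is illustrative, not from the thesis.

    def weighted_execution_time(blocks):
        """blocks: iterable of (cycles, probability) pairs."""
        return sum(cycles * prob for cycles, prob in blocks)

    # A two-way conditional whose then-path runs 90% of the time.
    blocks = [(5, 1.0),   # entry block, always executed
              (3, 0.9),   # then-block
              (7, 0.1),   # else-block
              (2, 1.0)]   # join block
    print(weighted_execution_time(blocks))  # about 10.4 cycles

This is the quantity that global compaction tries to minimize: shortening a rarely executed block contributes little, while shortening a high-probability trace contributes a lot.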

CHAPTER III

PREVIOUS RESEARCH

Previous research on the microprogram compaction problem has approached it in two different ways: local compaction and global compaction.

Earlier attempts concentrated on what is now known as local compaction, which schedules MO's within block boundaries. First, it was attempted to find an optimal solution3 [YAU74], only to find out that the algorithm was exponential in complexity. So efforts were directed to finding reasonable heuristics. It was later proved that the microprogram optimization ( optimal compaction ) problem is indeed NP-complete both in the microword dimension [DEW76] and in the bit dimension [ROB79]. Four different heuristics have been proposed to solve the local compaction problem and are reviewed in [LAN80].

More recently, in an effort to get more efficient microcode, compaction was tried not only among MO's within each block boundary but also among MO's beyond block boundaries. This is now known as global compaction. The global compaction of microprograms has been an active research area for the last five years or so, and several heuristics have been proposed.

3 Note that in a straight line microcode, minimizing the microcode size is equivalent to minimizing the execution time of the microcode.

3.1. Local Compaction

In [DAV81], the same group of people who reviewed four local compaction heuristics in [LAN80] reported experimental results on these heuristics and claimed that the local compaction problem was essentially solved. Based on the experimentation, they recommended first-come first-serve ( see Section 3.1.1 ) with some modification, and list scheduling ( see Section 3.1.4 ). However, Vegdahl revisited the local compaction problem in his thesis [VEG83] and proved that the classical local compaction problem, which does not consider reordering of source MO's, is not NP-complete, but polynomial in complexity. He presented such an algorithm.

In this section, four local compaction heuristics are briefly reviewed and Vegdahl's thesis results are discussed. In describing the heuristics, we will use one simple example to enhance understanding. The DDG of the example microcode is shown in Figure 3.1. The circled numbers represent microoperations.

Figure 3.1 Data dependency graph of example microcode ( circles: micro operations; arrows: direct dependency; dotted lines: resource conflict )

The arrows indicate the direct data dependency relation between MO's. The dotted lines are used for convenience to indicate resource conflicts between MO's. We will show the working of each heuristic on the same sample microcode along with the description of the heuristic itself.

3.1.1. First-Come, First-Serve

The first-come first-serve heuristic described by Dasgupta and Tartar [DAS76] operates on a straight line microcode ( SLM ) which is in the form of a list of MO's. MO's from the SLM are added, in the order in which they appear on the list, to an initially empty list of MI's. Here is the outline of the heuristic.

    for each MO do
        determine the rise limit of the MO by searching MI's
            from bottom to top using data dependency;
        place the MO in the first MI which has no resource conflict
            with the MO, starting from the rise limit to the last MI;
        if the MO is not placed, create a new MI at the bottom to hold it;

The detailed procedure follows. The search for an existing MI to which the current MO can be added begins with a data dependency analysis. Starting at the bottom of the MI list and proceeding upward, examine each MI to see whether the MO under current consideration is data dependent on any MO in that MI. If the current MO is data dependent

on an MO in the i-th MI, the current MO cannot be placed in any MI earlier than i, or in i itself. The object of this search is to find the earliest ( highest ) MI in which the new MO can be placed without violating the ordering imposed by the data dependencies in the SLM. This MI is called the rise limit.

The next step in the search for an existing MI is the examination of resource conflicts. To be placed in an MI, the current MO must not conflict with any MO already in that MI. Assuming that a rise limit i was found, search downward in the list, starting with MI i, for some MI in which the new MO can be placed. When such an MI is found, add the new MO. If no such MI is found, add the new MO to the end of the MI list, thus forming a new MI containing only one MO. If no rise limit was found, then the current MO was not data dependent on any MO in the MI list, and the MO can be added to any MI with which it has no conflicts. In this case begin the downward search at the top of the MI list.

Figure 3.2 The working of first-come first-serve
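The procedure, including the no-rise-limit case restated at the start of the next paragraph, can be rendered as a short sketch. This Python version is an illustration under the assumption that data_dependent and resource_conflict tests are supplied; it is not the original implementation.

    def first_come_first_serve(mos, data_dependent, resource_conflict):
        """Place each MO, in source order, into a growing list of MI's."""
        mi_list = []  # each MI is a list of MO's
        for mo in mos:
            # rise limit: one below the last MI holding an MO that mo depends on
            rise = 0
            for i in range(len(mi_list) - 1, -1, -1):
                if any(data_dependent(prev, mo) for prev in mi_list[i]):
                    rise = i + 1
                    break
            # downward search for a conflict-free MI, from the rise limit on
            placed = False
            for i in range(rise, len(mi_list)):
                if not any(resource_conflict(other, mo) for other in mi_list[i]):
                    mi_list[i].append(mo)
                    placed = True
                    break
            if not placed:
                if rise == 0:
                    # no rise limit: a new MI at the top avoids blocking
                    # later MO's that depend on this one
                    mi_list.insert(0, [mo])
                else:
                    mi_list.append([mo])  # new MI at the bottom
        return mi_list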

If there are no MI's to which the MO can be added without conflict, use the MO to form a new MI at the top of all the other MI's. Placing this MO at the top will keep it from blocking any new MO's that depend on it.

Figure 3.2 shows the working of the first-come first-serve heuristic on the sample microcode. Each column shows the placement of one MO, as indicated on top. The boxes represent MI's. The rise limit is shown for each MO. After MO 1 is placed, the rise limit for MO 2 is 2 because of data dependency. Even though the rise limit of MO 3 is also 2, it cannot be placed in the same MI as MO 2 because of the resource conflict between MO 2 and MO 3. The remaining MO's are placed similarly, and the result is five MI's.

3.1.2. Critical Path

The critical path heuristic for microcode compaction was introduced by Ramamoorthy and Tsuchiya [RAM74]. This heuristic attempts to identify MO's that must be executed at a certain time in order for the list of MI's to be optimal. The MO's chosen are those which are on a longest path ( the critical path ) through the DDG. As noted, the minimum possible number of MI's is just the length of the longest path. Here is the outline of the heuristic.

    determine early partition
    determine late partition
    determine critical partition
    revise the critical partition
    place non-critical MO's

The detailed description of each step follows.

The first step is to create an early partition ( EP ). Each time frame of the EP contains those MO's that can be executed in that time, at the earliest. So,

starting from the first time frame, fill in each time frame with all executable MO's, and when there are no more executable MO's, go on to the next time frame. An MO must be placed in the time frame of the EP after the frame of its latest ancestor. Data dependencies alone determine the placement of MO's in the EP; conflicts will be resolved later. The early partition of the sample microcode is shown in Figure 3.3(a). Each different shading indicates a separate time frame.

The next step is the creation of a late partition ( LP ). In the LP the latest possible timings of the MO's are considered. Each time frame of the LP contains those MO's that can be executed in that time, at the latest. It starts from the MO's on which no other MO is data dependent, then goes on to the next earlier time frame and fills it with all MO's on which no other MO is data dependent except those in the later time frames. If the data dependency graph is represented by a matrix, the LP can be created by applying the EP algorithm to the transpose of the graph matrix. The construction of the LP resembles the construction of the EP in reverse. The late partition of the sample microcode is shown in Figure 3.3(b).

Figure 3.3 The working of critical path ( (a) early partition, (b) late partition, (c) critical partition )

The construction of these partitions facilitates the identification of MO's on the critical path of the DDG. These MO's, called critical MO's, are just those with the same timing in both the early partition and the late partition. The result is called the critical partition ( CP ). The critical partition of the sample microcode is shown in Figure 3.3(c) by the shaded nodes.

Since the CP was constructed by considering only data dependencies between MO's, two MO's in the same frame of the CP may conflict. And since the frames of the CP will serve as a basis for MI's in the list of MI's, conflicting MO's in a frame must be separated. The result of separating all the conflicting MO's in each frame is called the revised critical partition ( RCP ). The separated frames form subframes. The revised critical partition of the sample microcode happens to be the same as the critical partition shown in Figure 3.3(c). In other words, in this particular example, the critical partition contains no time frame which contains two conflicting MO's.

Figure 3.4 Result of critical path
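The partition computations just described can be sketched as follows; the placement of the noncritical MO's, which completes the heuristic, is described in the final step below. This Python sketch assumes the DDG is given as predecessor and successor lists and is illustrative only.

    def early_partition(n, preds):
        """EP frame of each MO: one past the frame of its latest ancestor."""
        ep = [0] * n
        for j in range(n):  # nodes in source ( topological ) order
            for i in preds[j]:
                ep[j] = max(ep[j], ep[i] + 1)
        return ep

    def late_partition(n, succs, depth):
        """LP frame of each MO: the EP algorithm on the reversed DDG."""
        lp = [depth - 1] * n
        for i in reversed(range(n)):
            for j in succs[i]:
                lp[i] = min(lp[i], lp[j] - 1)
        return lp

    def critical_mos(ep, lp):
        """Critical MO's have the same timing in both partitions."""
        return [k for k in range(len(ep)) if ep[k] == lp[k]]

    # depth = max(ep) + 1, the length of the longest DDG path,
    # which is also the minimum possible number of MI's.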

The next and final step of the heuristic is to add the noncritical MO's to the RCP, forming the final list of MI's. For each noncritical MO, we search the RCP, from the frame containing the MO in the EP to the frame containing the MO in the LP, for an MI to which the noncritical MO can be added. If the MO cannot be added to any of the frames within this range, a new subframe is created. The final result of the critical path heuristic on the sample microcode, after placing the non-critical MO's 2 and 5, is shown in Figure 3.4.

3.1.3. Branch and Bound

The third method of solving the local compaction problem is branch and bound ( BAB ), a general class of tree-searching scheduling algorithms. Yau, Schowe and Tsuchiya [YAU74] were the first to describe the application of this technique to microcode compaction.

In BAB a tree is built, the nodes of which correspond to microinstructions. A path from the root of the tree to a leaf is an ordering of MI's, and thus a list of MI's. The tree will branch whenever there is more than one MI that can be placed at a point in the list of MI's. A complete tree represents every possible MI ordering.

There are two variants of this algorithm. The first is BAB exhaustive, in which every branch of the tree that can possibly lead to an optimal MO ordering is explored. The other is BAB heuristic, in which the tree is pruned. BAB exhaustive is an optimal algorithm running in exponential time, while BAB heuristic is not guaranteed optimal and can be made to run in polynomial time.

The growth of the tree can be bounded even in BAB exhaustive. As in the other algorithms, calculate the lower bound on the number of MI's in the best possible ordering. ( Remember that this is the longest path through the DDG. )

A path through the BAB tree of this length represents an optimal ordering, and the algorithm can stop once such a path is obtained. The growth of the tree can be further bounded by remembering the length of the best ( shortest ) path found so far. If the length of an incomplete list of MI's is greater than or equal to this length, the current path needs no further consideration.

Definition: The data ready set ( DRS ) is the set of all unplaced MO's that are not data dependent on any unplaced MO.

Definition: A complete instruction ( CI ) is an instruction to which no other MO's in the DRS can be added. Thus a CI is not a subset of any other legal MI given a DRS.

The outline of the BAB exhaustive algorithm is presented here using a recursive procedure.

    form initial data ready set;
    call FormCI;

    procedure FormCI
        form all possible Complete Instructions;
        for each Complete Instruction do
            form new data ready set;
            if data ready set is not empty
                call FormCI;

A more detailed description of the algorithm follows.
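As a concrete rendering of the recursive outline and the bounding rule, here is a Python sketch. It is an illustration under simplifying assumptions ( MO's are hashable, and CI's are enumerated by brute force ), not the thesis implementation; the search remains exponential, as BAB exhaustive is.

    import itertools

    def bab_exhaustive(mos, data_dependent, resource_conflict):
        """Return a shortest list of MI's found by exhaustive search."""
        best = [None]  # best[0] holds the shortest complete schedule found

        def drs(placed):
            # unplaced MO's that depend on no unplaced MO
            return [m for m in mos if m not in placed
                    and not any(data_dependent(p, m)
                                for p in mos
                                if p not in placed and p is not m)]

        def complete_instructions(ready):
            # maximal conflict-free subsets of the DRS, largest first
            cis = []
            for r in range(len(ready), 0, -1):
                for subset in itertools.combinations(ready, r):
                    ok = all(not resource_conflict(a, b)
                             for a, b in itertools.combinations(subset, 2))
                    if ok and not any(set(subset) < set(c) for c in cis):
                        cis.append(subset)  # not a subset of another legal MI
            return cis

        def form_ci(placed, schedule):
            if best[0] is not None and len(schedule) >= len(best[0]):
                return  # bound: cannot beat the best path found so far
            ready = drs(placed)
            if not ready:
                best[0] = list(schedule)  # all MO's placed
                return
            for ci in complete_instructions(ready):
                form_ci(placed | set(ci), schedule + [ci])

        form_ci(frozenset(), [])
        return best[0]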

Like the critical path heuristic, the BAB algorithm gets its information on the data dependencies of the MO's of an SLM through a DDG. The first step of the BAB exhaustive algorithm is the construction of a DRS. The contents of the DRS change as execution of the algorithm progresses. The initial DRS contains just those MO's not data dependent on any other MO.

Figure 3.5 Working of branch and bound exhaustive

The next step is to form MI's from the MO's in the DRS. We wish to form only the largest possible MI's. These are called CI's. CI's make up the nodes of the BAB tree, so that a path through the tree is a list of CI's, or a list of MI's. For the detailed algorithm of forming CI's, see [LAN80] Sec. 3. Now, for each CI, compute a new DRS and form the next CI's, which will make branches from the previous node ( or CI ). Repeat this for all the branches until there are no more MO's left.

Figure 3.5 shows the working of the BAB exhaustive algorithm on the sample microcode in Figure 3.1. It starts with the initial DRS, which contains just MO 1. So there is only one CI, with MO 1, which is shown in the box. The next DRS after the CI is shown on the right side of the CI box. From the DRS of { 2, 3 }, two CI's are formed, namely one with MO 2 and the other with MO 3. The two CI's make two branches in the graph. The process continues similarly until all possible branches are considered. The shortest path from the root node to a leaf, which is indicated with thick lines, represents the optimal solution.

3.1.4. List Scheduling

List scheduling can be considered a special case of the BAB heuristic, but it is important in its own right. The heuristic used in list scheduling is as follows: instead of examining every complete instruction generated from a DRS, examine only the best CI, where the best CI is determined by some metric. The outline of list scheduling, shown below, is very similar to that of the BAB exhaustive shown in Section 3.1.3.

    form initial data ready set;
    call FormCI;

    procedure FormCI
        form best complete instruction;
        form new data ready set;
        if data ready set is not empty
            call FormCI;

The only difference is that inside the procedure FormCI, one best CI is formed and considered instead of examining all possible CI's. This heuristic requires only an amount of time that grows polynomially with the number of MO's in the SLM. The metric used in forming the CI affects the optimality of the compaction. Extensive tests of different metrics were reported by Fisher [FIS79]. In the tests, the metric used by Wood [WOO78] was shown to be one of the best, and is presented here.

Assign a weight to each MO in the DDG. The weight of an MO is the number of levels of descendants of that MO in the DDG. The execution of the heuristic proceeds as follows. Compute a DRS from the DDG. Find the MO in the DRS with the highest weight. Add it to the MI, which is initially empty. Then find the MO with the second highest weight in the DRS. If this MO does not conflict with any MO already in the MI, add it. Otherwise, consider the MO with the next highest weight. Repeat until all the MO's in the DRS have been examined. The resulting MI is the CI with the highest weight. Add this MI to the list of MI's and compute a new DRS. Repeat until all of the MO's in the SLM have been placed.

Figure 3.6 shows the DDG of the sample microcode with the weight of each MO indicated outside the circles. The working of list scheduling is the same as

Figure 3.6 Weight of MO's for list scheduling

the optimal path of the BAB exhaustive shown with dark lines in Figure 3.5. From the DRS of { 2, 3 }, MO 3 is selected, since its weight of 2 is higher than the weight of 0 of MO 2. From the DRS of { 2, 4, 5 }, MO 4 is selected for the same reason. Then MO 5 is placed in the same MI, while MO 2 is not, because of the resource conflict with MO 4. The forming of the last MI is obvious.

One last note on this example microcode. Keep in mind that this is just one very specific example. The fact that the first-come first-serve heuristic did not get the optimal solution while the others did does not necessarily indicate inferior performance of first-come first-serve or superior performance of the others. As noted at the beginning, the test results by Landskov et al. [LAN80] recommended first-come first-serve with some modification, and list scheduling.
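A compact sketch of list scheduling with Wood's weight metric follows; as before, the helper tests and the MO representation are assumptions for illustration, not the thesis code.

    def list_schedule(mos, succs, data_dependent, resource_conflict):
        """Greedy list scheduling; weight = levels of descendants in the DDG."""
        weight = {}
        def levels(m):  # Wood's metric, memoized over the DAG
            if m not in weight:
                weight[m] = 0 if not succs[m] else \
                    1 + max(levels(s) for s in succs[m])
            return weight[m]

        placed, mi_list = set(), []
        while len(placed) < len(mos):
            ready = [m for m in mos if m not in placed and
                     not any(data_dependent(p, m)
                             for p in mos
                             if p not in placed and p is not m)]
            mi = []
            for m in sorted(ready, key=levels, reverse=True):
                # highest-weight conflict-free MO's join the current MI
                if not any(resource_conflict(m, other) for other in mi):
                    mi.append(m)
            mi_list.append(mi)  # the best CI for this DRS
            placed.update(mi)
        return mi_list

Because only one CI is expanded per DRS, the search tree degenerates into a single path, which is what makes the method polynomial.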

3.1.5. Vegdahl's Thesis

One of the contributions of Vegdahl's thesis is that he showed that the classical microprogram compaction problem, which does not consider semantics-preserving reordering of source MO's, can be solved optimally in time O(n^v), where v is the number of registers in the hardware, instead of in exponential time, although the order v of the polynomial may be very high.

Definition: An MO may require data that is produced by another MO. If this is the case, then the former is said to be data dependent on the latter.

Definition: An MO may destroy data that is required by another MO. If this is the case, then the former is said to be data antidependent4 on the latter. [BAN79]

Vegdahl recognized that data antidependency should not be treated as data dependency in solving microprogram compaction problems, as demonstrated by the following example [VEG82]. Figure 3.7(a) shows an example where register x is twice used as a temporary. Figure 3.7(b) shows one valid partial ordering of the MO's with the data dependencies and data antidependencies among them. However, Figure 3.7(c) shows another valid partial ordering of the MO's with the same data dependencies but different data antidependencies. Clearly, the partial ordering shown in Figure 3.7(c) can yield a smaller number of MI's when compacted.

Vegdahl contended that the problem of optimally ordering the MO's - and thereby determining the data antidependencies - is a more difficult problem than compaction. He then presented an algorithm, called a chain-matrix compaction algorithm, to optimally solve the compaction problem once data antidependencies are specified. He concluded that the determination of data antidependencies is likely the more difficult problem, because the general compaction problem is NP-complete.

4 Note that data antidependency was defined as a part of data dependency in chapter II.

    MO 1: b <- a
    MO 2: x <- b
    MO 3: c <- x
    MO 4: x <- d
    MO 5: e <- x
    MO 6: f <- e

Figure 3.7 MO's with different data antidependencies ( (a) the MO's; (b), (c) two valid partial orderings with the same data dependencies but different data antidependencies )

We now present a summary of the chain-matrix compaction algorithm from [VEG83]. For a complete description with an example, refer to [VEG82].

The MO's are first divided into v chains, each MO being placed in the chain corresponding to the register it writes. There is a strict ordering on the MO's in any chain, namely the order in which they appear in the source program. Each chain defines one dimension of a directed acyclic graph that is shaped as a v-dimensional matrix. The edges of the graph correspond to MI's; missing edges denote conflicting MI's. Each node in the graph represents a set of MO's; the node corresponding to matrix element <1,2,...> represents the set of MO's containing the first MO of chain 1, the first two MO's of chain 2, etc. Any node whose MO's violate a data dependency is removed. At this point, the problem is reduced to a single-source shortest path problem that can be solved using dynamic programming in time that is linear in the number of nodes.

3.2. Global Compaction

While the local compaction problem is concerned only with possible MO movements within each block boundary, methods for solving the global compaction problem attempt to achieve even more compaction by moving MO's over the block boundaries. This widened scope opens up a whole new set of issues, to name a few: When is it legal to move an MO to another block? What kinds of provisions have to be made? How far can an MO move: just to adjacent blocks, or over several block boundaries? How do we optimize overall performance, given that some blocks are more frequently executed than others? How much extra space is needed?

The global compaction problem is still considered an active research area. We present some of the previous works here and propose some improvements in the next chapter.

3.2.1. Wood [WOO79a, WOO79b]

This is one of the earlier attempts to compact MO's beyond block boundaries. It is an elegant and clean method, even if somewhat restrictive. The basic idea is to allow MO movements within the same block level in a block structured microprogram. The big precondition of this work is that the source microcode has to be written in a block structured fashion, i.e., only if-then-(else) and (nested) loop structures are allowed.

Given a block structured source microcode, the goal is to allow MO's to migrate over a conditional block, but not to let them land inside the block, nor to allow any MO to migrate beyond the confines of its own block. To achieve the goal, a concept of a hierarchy of levels of MO's within the microprogram is introduced. MO's may be of two basic types: a primitive MO, which is an atomic entity, or a block-type MO, which may be expanded at one level lower to a set of component MO's. These component MO's may, in turn, be either primitive or block-type. A block-type MO represents a whole block of MO's, which may be an if-then-else block or a loop block. Using the MO's of these types, data dependency is defined between MO's at the same level as follows.

Multi-level dependency rule: If A and B are primitive MO's with A data dependent on B, then the outermost block containing A but not B should be marked as dependent on the outermost block containing B but not A.

The basic structure of the heuristic is best described by Wood's own diagram [WOO79a], which is reproduced in Figure 3.8. In the figure, the term DAS ( data available set ) is used, which has the same meaning as the DRS ( data ready set ). In selecting MO's from the DRS, block-type MO's are selected before primitive MO's. So the compaction starts at the lowest level block and works its way up. A separate DRS and waiting list is maintained for

each level of MO's. The waiting list is a list of the MO's at that level which have not yet been made available for packing.

Figure 3.8 Wood's compaction heuristic [WOO79a]
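The multi-level dependency rule can be sketched as follows. The data structures are assumptions made for this illustration ( each MO records its chain of enclosing blocks, outermost first, ending with the MO itself ); the sketch is not Wood's implementation.

    def outermost_disjoint(path_a, path_b):
        """Outermost element on path_a that does not also enclose the other MO."""
        k = 0
        while k < len(path_a) and k < len(path_b) and path_a[k] is path_b[k]:
            k += 1  # skip the enclosing blocks common to both MO's
        return path_a[k] if k < len(path_a) else None

    def mark_multilevel_dependency(mo_a, mo_b, mark_dependent):
        """If primitive A depends on primitive B, lift the dependency edge
        to the outermost blocks separating them ( Wood's rule )."""
        block_a = outermost_disjoint(mo_a.block_path, mo_b.block_path)
        block_b = outermost_disjoint(mo_b.block_path, mo_a.block_path)
        if block_a is not None and block_b is not None:
            mark_dependent(block_a, block_b)  # block_a depends on block_b

When both MO's sit in the same innermost block, the paths diverge only at the MO's themselves, so the dependency stays at the primitive level; otherwise it is lifted to the separating blocks, which is what keeps MO's from landing inside conditional blocks.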

Even though this is a nice clean way of solving the global compaction problem, it is too restrictive and may not perform as well as the other methods. It does not allow many of the legal MO motions which will be discussed in the following sections.

3.2.2. Tokoro [TOK81]

Tokoro's approach to solving the global compaction problem is based on his identification of a set of rules of legal MO motions over block boundaries. It is sometimes referred to as the automated menu method. Even though the set of rules is very comprehensive and almost complete,5 the proposed heuristic procedure to utilize these rules is rather ineffective.

The pictorial representations of Tokoro's MO movement rules are reproduced from his paper [TOK81] in Figure 3.9. Note that the MO's in these figures are drawn at the edges of the blocks just to make the explanation of the concept simpler. In fact, MO's can be transferred from the inside of blocks and/or into MI's within blocks, as long as data dependency relations are maintained. In Figure 3.9, A, B and nop ( no operation ) are MO's; Ss represents a source block and Sd represents a destination block. MO's are transferred from source blocks to destination blocks.

The simple transfer type optimization shown in Figure 3.9(a) and the common MO type optimization shown in Figure 3.9(b) are obvious and self-explanatory. A redundant MO is one in which all resources written by the MO are never referenced afterwards. The redundancy reduction type optimization shown in Figure 3.9(c) simply eliminates a redundant MO to save a cycle. The redundancy insertion type optimization shown in Figure 3.9(d) actually adds a redundant MO to one path, without adding a cycle there, to make it possible to save a cycle on the other path.

5 Since Tokoro's rules deal only with MO motions between adjacent blocks ( he calls them contiguous segments ), MO movements over several blocks, such as the kind of MO migration described by Wood [WOO79b] ( see sec. 3.2.1 ), are not allowed.

[ Figure 3.9: Tokoro's MO movement rules. (a) Simple transfer type optimization; (b) common micro-operation type optimization; (c) redundancy reduction type optimization; (d) redundancy insertion type optimization. ]

The detailed procedure to apply these rules, which will not be repeated here, is basically as follows. For each MO in each block, search for a possible move which can save a cycle by checking against all the rules. If such a move can be found, move the MO; otherwise select the next MO and repeat the entire search.

This automated menu method appears to suffer from the following shortcomings [FIS81a]. Each time a MO is moved, it opens up more possible motions. Thus, the automated menu method implies a massive and expensive tree search with many possibilities at each step. Evaluating each move means recompacting up to three blocks, an expensive operation which would be repeated quite often. To find a sequence of very profitable moves, one often has to go through an initial sequence of moves which are either not profitable or, worse still, actually make the code longer. Locating such a sequence involves abandoning attempts to prune this expensive search tree.

3.2.3. Fisher [FIS79, FIS81a]

The trace scheduling appears to be the best proposed method to solve the global compaction problem. Its basic concept is nice and elegant, and the available indications show that it gives good speed-up of the execution time of compacted microcodes. However, in the actual realization of the elegant concept, it produces some undesirable side effects which need to be corrected. And it is this correction process, called bookkeeping, that is nontrivial and complicated, as will be described later.

In brief, trace scheduling proceeds as follows. To compact a given microprogram, we repeatedly pick the most likely trace from among the uncompacted MO's, build the trace DDG, and compact it. After each trace is compacted, some duplication of MO's into locations off

the trace is done to keep the semantics of the original microprogram unchanged. When no MO's remain, compaction has been completed.

Figure 3.10 shows the basic steps of the trace scheduling. The figures are redrawn from [FIS82] with additional labeling to make them easier to explain. In Figure 3.10(a), each box represents a block of MO's and the lines between blocks indicate possible execution paths. So the blocks with more than one line at the bottom are the blocks which end with a conditional branch. In Figure 3.10(b), the blocks covered by the shaded area are assumed to form the path with the highest probability of execution. In other words, these blocks form a trace. In Figure 3.10(c), after some preprocessing,6 the MO's in the trace are treated as if they belong to a single block and compacted using list scheduling.

After the list scheduling is done on the trace, bookkeeping is necessary, which possibly includes massive copying of blocks. First, let us consider repairing rejoins. A rejoin is a point in a trace at which execution flow from a non-trace block enters the trace. When there were joins to the trace, we now must find a location in the new sequence of MI's to join to. MO's may have moved above and below the old join. We may only rejoin to places that have no MO's at or below them which had been above the old join, since we do not want to execute such MO's. When the highest point for rejoin is found, all MO's which had been below the old join but are now above the new rejoin have to be copied into the joining block. In Figure 3.10(d), blocks denoted by "R" are the blocks which contain the copied MO's because of repairing rejoins, and Na and Nb are two newly found rejoining points.

Second, consider conditional jumps. Some MO's which were originally above a conditional jump on the trace may have been scheduled in an MI below the conditional jump. We must copy these MO's to the blocks that the conditional jump jumps to. If all registers written by the MO are dead at the beginning of the block, then the MO does not need to be copied.

6See conditional source registers.

[ Figure 3.10: Fisher's trace scheduling. (a) A flow graph of blocks A through I; (b) the trace with the highest probability of execution ( shaded ); (c) the trace compacted by list scheduling into MI's including Ma and Mb; (d) the program after bookkeeping, showing copied blocks "R" and "S" and the new rejoining points Na and Nb. ]

In Figure 3.10(d), blocks denoted by "S" are the blocks which contain the copied MO's because of conditional jumps.

Up to this point, the bookkeeping stage is not overly complicated. But here is the big question. What if the rejoin Na happens to be below the conditional jump Ma in Figure 3.10(d)? In the original source microcode, the conditional jump was below the join; but if the conditional jump is above the join, then for the execution flow coming from block B the conditional jump does not exist any more. So the execution path of block B, block F and block H, i.e., the semantics of that path of the microprogram, cannot be realized after the scheduling. To solve this problem, instead of rejoining block B to the trace, the blocks F, G, H and I are duplicated after block B, as shown in Figure 3.11. This kind of block copying is the major source of complication and possible memory explosion.

To prevent possibly exponential growth of memory size, Fisher suggested modifications of trace scheduling [FIS81a]. (1) If the block ends in a conditional jump, we draw a DDG edge to the jump from each MO which is above the jump on the trace and writes a register alive in the branch. (2) If the start of the block is a point at which a rejoin to the trace is made, we draw DDG edges to each MO free at the top of the block from each MO which is in an earlier block on the trace and has no successors from earlier blocks.

Fisher recommended the above be used if the expected probability of a block's being reached is below some threshold. However, we suspect that there are some difficulties. First, the major source of extensive copying of blocks is the initial long traces, which therefore have higher probabilities of being executed.

[ Figure 3.11: Blocks F, G, H and I duplicated after block B instead of rejoining block B to the trace ]

Thus, fixing blocks with a lower probability of being executed, after possibly extensive copying has already been done, helps relatively little. If, however, the threshold is raised to include blocks of high probability of execution, then the method becomes close to compacting each block separately. Second, in long traces, the memory size growth is so sensitive to minor changes in the source code that it is possible for one small change in the source code to almost double the size of the compacted code. The example microcode in section 5.6.1 has such sensitivity. Third, the bookkeeping stage, which is the major source of complication, remains the same.

CHAPTER IV

BETA COMPACTION ALGORITHM

Chapter III has reviewed several local compaction and global compaction techniques for microprograms. It is clear that global compaction gives better performance than local compaction, since some extra compaction over the block boundaries is done. Among the heuristics to solve the global compaction problem, trace scheduling appears to be the best method proposed so far. However, the trace scheduling has some serious drawbacks, as described in section 3.2.3. In particular, the kind of block copying shown in Figure 3.11, which can possibly cause memory explosion, may be unacceptable in many applications. Also, the bookkeeping stage of the trace scheduling is so complex that Fisher avoided a formal description of it in his paper [FIS81a] and later found a major bug during verification [NIC81].

In Figure 3.10 of the previous chapter, it can be seen that the reason for all the block copying is the fact that the rejoin Na happens to be below the conditional jump Ma after the compaction of the trace. This can happen only around a block which has a join at the beginning and a conditional jump at the end, as shown in Figure 4.1. We call such a block a junction block. Having recognized that junction blocks can cause the serious complication, we propose a modification of the trace scheduling to avoid the complication. The modification is to cut traces at each junction block. By making this modification, we not only prevent the kind of block copying shown in Figure 3.11, but also simplify the bookkeeping stage tremendously. Also, the worst case memory growth is reduced from exponential to O(n²). We call this improved trace scheduling beta compaction.
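Stated operationally, a junction block is recognizable from two local properties of a block. The following minimal Pascal sketch makes the test explicit; the record fields and the sample data are hypothetical illustrations, not identifiers from the implementation described in Chapter V.

    program junction(output);
    type
      BlockInfo = record
        joinAtEntry: boolean;       { more than one predecessor }
        condBranchAtExit: boolean   { ends with a conditional jump }
      end;
    var
      b: BlockInfo;

    { A junction block both begins with a join and ends with a
      conditional branch; beta compaction cuts traces at such blocks. }
    function isJunctionBlock(blk: BlockInfo): boolean;
    begin
      isJunctionBlock := blk.joinAtEntry and blk.condBranchAtExit
    end;

    begin
      b.joinAtEntry := true;
      b.condBranchAtExit := true;
      if isJunctionBlock(b) then
        writeln('junction block: cut the trace here')
    end.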

[ Figure 4.1: A junction block: a join at the beginning and a conditional branch at the end ]

The beta compaction has all the above nice properties with very little trade-off in compaction. It performs all the compaction that the trace scheduling does, except that it does not allow MO movement across a junction block. MO's may still move into and out of junction blocks, but MO's may not migrate across junction blocks. We conjecture that this kind of MO movement, which is allowed by trace scheduling, occurs very rarely, as will be shown experimentally in the next chapter.

4.1. Algorithm Description

Informally, the beta compaction algorithm proceeds as follows. An entry block with the highest probability is selected. Starting from that block, a trace is selected by following the descendant block of highest probability until an exit block, a junction block, or an already compacted block is reached. The blocks in the trace are treated as if they are a single

block, and compacted using list scheduling, with copying of MO's as necessary. This process is repeated until all blocks are compacted. A more detailed description follows. In Section 4.2, a working example of beta compaction is given.

Input: Sequence of MO's ( label, instruction, next address, destination registers, source registers, resource vector ).

Output: Sequence of MI's where each MI contains one or more MO's.

Procedure:
(1) Initialization and preprocessing
(2) Repeat until all blocks are compacted
        Pick trace
        List scheduling
        Bookkeeping

( A Pascal sketch of this top-level loop is given below, following the definitions at the end of this section. )

During the initialization and preprocessing, dummy blocks may be inserted wherever a conditional branch and a join directly meet, as shown in Figure 4.2. These points are the only places where extra blocks may have to be created to hold copied MO's in the bookkeeping stage. Insertion of dummy blocks at the beginning totally eliminates the need to create blocks later. Empty dummy blocks remaining after the compaction will simply be deleted.

Loops are handled the same way as they are in trace scheduling. The innermost loop is compacted first, and the compacted loop is represented as a pseudo MO which contains all the data dependency information and serves as a MO in

the compaction of the outer loop. Fisher proposed unwinding of loops in [FIS82] as a way to compact loops. In the beta compaction, similar loop unwinding is possible but, except for small inner loops, is probably too costly in space.

[ Figure 4.2: Insertion of dummy blocks at the points where a conditional branch and a join directly meet ]

The next few sections will describe in detail the major steps of the beta compaction, including live register analysis. Here are a few definitions to help the description.

Definition: A tracehead set is a set of blocks in which each element of the set will make the beginning block of a trace.

Definition: Off-the-trace blocks are non-trace blocks which are the targets of conditional branches.
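For orientation, here is a compilable Pascal skeleton of the top-level loop promised above. All identifiers are ours, and the three major steps are reduced to stubs ( pickTrace below merely claims the next uncompacted block ); the real routines are described in sections 4.1.2 through 4.1.4.

    program betaCompaction(output);
    const
      nBlocks = 10;
    var
      compacted: array [1..nBlocks] of boolean;
      i: integer;
      allDone: boolean;

    { Placeholder for section 4.1.2: the real routine follows
      execution probabilities and stops at exit, junction, or
      already compacted blocks. }
    procedure pickTrace;
    var
      j: integer;
      claimed: boolean;
    begin
      claimed := false;
      for j := 1 to nBlocks do
        if not (claimed or compacted[j]) then
        begin
          compacted[j] := true;
          claimed := true
        end
    end;

    procedure listSchedule; begin { section 4.1.3 } end;
    procedure bookkeep;     begin { section 4.1.4 } end;

    begin
      for i := 1 to nBlocks do
        compacted[i] := false;
      { (1) initialization and preprocessing: dummy block insertion
        and the live register analysis of section 4.1.1 happen here }
      repeat
        pickTrace;       { section 4.1.2 }
        listSchedule;    { section 4.1.3 }
        bookkeep;        { section 4.1.4 }
        allDone := true;
        for i := 1 to nBlocks do
          if not compacted[i] then allDone := false
      until allDone;
      writeln('all blocks compacted')
    end.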

4.1.1. Live register analysis

The live register analysis is the process of determining which registers are alive at some point in a program. This information is necessary to allow as much compaction as possible without allowing illegal MO movement. Here we are interested in the live register information at the beginning and at the end of each block. The live register analysis is done once at the beginning, as a part of the preprocessing, and again after new block boundaries are identified in the bookkeeping stage. It is also done every time a MO is copied from one block to another in the bookkeeping stage. The following summary is taken from [AHO77].

Definition: The depth-first ordering of the nodes of a graph is the reverse of the order in which we last visit the nodes in a preorder traversal. Depth-first numbers DFN(n) assigned to each node indicate the depth-first ordering of the nodes.

Definition: IN[n] is the set of registers alive at the point immediately before block n, and OUT[n] is the same immediately after the block.

Definition: DEF[n] is the set of registers which are assigned values in block n prior to any use of that register in block n, and USE[n] is the set of registers used in block n prior to any definition thereof.

Input: A set of blocks, each containing a sequence of MO's.

Output: A list of live registers at the beginning and end of each block.

Procedure:
(1) Compute a depth-first ordering of the nodes. Let the nodes be n1, n2, ..., nN such that DFN[ni] = i.

(2) begin
        for i := 1 to N do
        begin
            IN[ni] := USE[ni];
            OUT[ni] := empty
        end;
        while changes occur do
            for i := N to 1 by -1 do    /* in reverse depth-first order */
            begin
                OUT[ni] := U IN[s], where s ranges over the successors of ni;
                IN[ni] := ( OUT[ni] - DEF[ni] ) U USE[ni]
            end
    end
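As a concrete illustration, the following self-contained Pascal program runs this iterative computation on a small hypothetical flow graph. The register range 0..15 matches the machine used later in Chapter V, but the USE/DEF sets and the successor lists here are invented for the sketch and are not taken from the dissertation's implementation.

    program liveness(output);
    const
      nBlocks = 3;
      maxReg = 15;
    type
      RegSet = set of 0..maxReg;
    var
      useS, defS, inS, outS: array [1..nBlocks] of RegSet;
      succ: array [1..nBlocks, 1..2] of integer;   { 0 = no successor }
      i, j, r: integer;
      changed: boolean;
      oldIn, oldOut: RegSet;
    begin
      { invented flow graph: block 1 branches to blocks 2 and 3 }
      succ[1,1] := 2; succ[1,2] := 3;
      succ[2,1] := 0; succ[2,2] := 0;
      succ[3,1] := 0; succ[3,2] := 0;
      useS[1] := [1,2]; defS[1] := [3];
      useS[2] := [3,4]; defS[2] := [1];
      useS[3] := [3];   defS[3] := [2];

      for i := 1 to nBlocks do
      begin
        inS[i] := useS[i];
        outS[i] := []
      end;
      repeat
        changed := false;
        for i := nBlocks downto 1 do    { reverse depth-first order }
        begin
          oldIn := inS[i];
          oldOut := outS[i];
          outS[i] := [];
          for j := 1 to 2 do
            if succ[i,j] <> 0 then
              outS[i] := outS[i] + inS[succ[i,j]];
          inS[i] := (outS[i] - defS[i]) + useS[i];
          if (inS[i] <> oldIn) or (outS[i] <> oldOut) then changed := true
        end
      until not changed;

      for i := 1 to nBlocks do
      begin
        write('inreg[', i:1, '] =');
        for r := 0 to maxReg do
          if r in inS[i] then write(' ', r:1);
        writeln
      end
    end.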

4.1.2. Pick Trace

The pick trace routine determines the next set of trace blocks to be compacted. It selects the highest probability block among the non-compacted blocks and iteratively selects the highest probability descendant block until it reaches an exit block, a junction block, or an already compacted block. After selecting the trace blocks, all the MO's from those blocks are converted to a form acceptable to the list scheduling routine, which performs local compaction regardless of block boundaries.

Input: A set of blocks where each block is a 2-tuple. The first element is the probability of execution and the second is a flag indicating if the block has been compacted.

Output: The set of blocks which forms an execution path.

Procedure:
(1) Delete empty blocks from the tracehead set.
(2) Initialize the conditional source registers of each MO to empty.
(3) Select a block with the highest probability of execution in the tracehead set. Call the selected block the current block and add it to the trace.
(4) Repeat until the current block is an exit block, a junction block or an already compacted block
        Update the current block with a block of the highest execution probability among its descendant blocks. If the new current block is not a compacted block, mark it as compacted and add it to the trace.
(5) Update the tracehead set.
(6) Convert the MO's in the trace to the appropriate format for the list scheduling.

( A Pascal sketch of this routine is given at the end of this subsection. )

The first step of the pick trace procedure is to delete empty blocks from the tracehead set. It is possible that some of the blocks in the tracehead set became empty in the previous bookkeeping procedure. So it is necessary to delete them from the tracehead set and instead add their descendant blocks to the tracehead set.

Search through the blocks in the tracehead set to pick the block with the highest probability of execution among them. The selected block ( call it the start block ) will be the first block in the trace selected.

Starting from the start block, repeatedly select the next block in the trace by picking the block with the highest probability among the descendant blocks of the current trace block. Mark all the blocks selected as trace blocks. The selection is stopped when a program exit block, a junction block or a marked ( already compacted ) block is encountered. A program exit block or a junction block is included as a part of the trace, but not a marked block. In the process, the descendant blocks of any trace block, except the next trace block, are collected as the off-the-trace blocks for that trace block. These off-the-trace blocks are necessary to update the tracehead set and to make some necessary copies of MO's in the bookkeeping stage. Delete the start block from the tracehead set and add all the off-the-trace blocks of each trace block to the tracehead set.

Convert the MO's in the trace to the appropriate format for the list scheduling. Note that the list scheduling does not know about the blocks, and our implementation of the list scheduling requires a slightly different MO format, e.g. sequential numbering of MO's. Also, add the live registers at the beginning of the off-the-trace blocks as the conditional source registers of the conditional branch MO in the trace for those off-the-trace blocks. Add a dummy register to all the conditional branch MO's as a conditional source register and destination register to establish a DDG edge between the conditional branch MO's, and thus keep the order of those MO's.

Figure 4.3 shows an example of selecting traces throughout the course of the beta compaction. Each box in the figure represents a block of MO's. Each subfigure shows the result of a call to the pick trace routine, Figure 4.3(a) being the first and Figure 4.3(e) being the last. It is assumed that the execution probabilities of blocks A, F, C, G, B, D and H are from high to low in the order of listing. Trace blocks are indicated with dark lined boxes and compacted blocks are indicated with shaded boxes.

[ Figure 4.3: Selecting traces. Subfigures (a) through (e) show the trace chosen by each successive call to the pick trace routine ]
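The core of steps (3) and (4), growing a trace by repeatedly following the most probable descendant until an exit, junction, or compacted block is met, can be sketched in Pascal as follows. The data layout ( a probability array, at most two descendants per block, and flag arrays ) is an illustrative assumption rather than the dissertation's actual representation.

    program pickTraceDemo(output);
    const
      nBlocks = 8;
    var
      prob: array [1..nBlocks] of real;
      desc: array [1..nBlocks, 1..2] of integer;   { 0 = none }
      isExit, isJunction, compacted: array [1..nBlocks] of boolean;
      trace: array [1..nBlocks] of integer;
      traceLen, cur, next, i: integer;
      stop: boolean;
    begin
      { invented graph: 1 -> {2,3} -> 4 ( a junction ), 4 -> {5,6} -> 7 -> 8 }
      for i := 1 to nBlocks do
      begin
        prob[i] := 1.0; desc[i,1] := 0; desc[i,2] := 0;
        isExit[i] := false; isJunction[i] := false; compacted[i] := false
      end;
      desc[1,1] := 2; desc[1,2] := 3; prob[2] := 0.7; prob[3] := 0.3;
      desc[2,1] := 4; desc[3,1] := 4;
      isJunction[4] := true;            { join at entry ... }
      desc[4,1] := 5; desc[4,2] := 6;   { ... conditional branch at exit }
      desc[5,1] := 7; desc[6,1] := 7; desc[7,1] := 8;
      isExit[8] := true;

      { grow the trace from tracehead block 1 }
      cur := 1;
      traceLen := 1; trace[1] := cur; compacted[cur] := true;
      stop := isExit[cur] or isJunction[cur];
      while not stop do
      begin
        next := desc[cur,1];
        if desc[cur,2] <> 0 then
          if prob[desc[cur,2]] > prob[next] then
            next := desc[cur,2];
        if next = 0 then
          stop := true              { no descendant: treat as exit }
        else if compacted[next] then
          stop := true              { a marked block is not included }
        else
        begin
          cur := next;
          traceLen := traceLen + 1; trace[traceLen] := cur;
          compacted[cur] := true;
          stop := isExit[cur] or isJunction[cur]
        end
      end;

      write('trace:');
      for i := 1 to traceLen do write(' ', trace[i]:1);
      writeln
    end.

With these invented data the program prints "trace: 1 2 4": the junction block 4 ends the trace but is included in it, exactly as steps (3) and (4) prescribe.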

4.1.3. List Scheduling

The list scheduling routine does just local compaction on the list of MO's, without any knowledge of block boundaries, and generates a list of compacted MI's.

Input: The sequence of MO's in a basic block or in a trace

Output: The sequence of compacted MI's

Procedure:
(1) Build DDG on MO's starting from the first MO.
(2) Assign priority value ( number of levels of successors in DDG ) to each MO.
(3) CYCLE := 0
(4) Data Ready Set ( DRS ) is formed from all MO's with no predecessors on the DDG.
(5) While DRS is not empty, DO
    {
        CYCLE := CYCLE + 1
        The MO's in DRS are placed in iteration CYCLE in order of their priority until DRS is exhausted or no more resources are available for CYCLE. All MO's so placed are removed from DRS. All unscheduled MO's not in DRS whose predecessors have all been scheduled are added to DRS.
    }
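Here is a hedged Pascal rendering of steps (3) through (5), using a four-bit resource vector per MO as in the six-tuple model of Chapter II. The DDG is given as an edge matrix and the priorities are assumed to be precomputed; the example MO's are invented.

    program listSchedDemo(output);
    const
      nMO = 5;
    type
      ResSet = set of 0..3;
    var
      res: array [1..nMO] of ResSet;
      prio: array [1..nMO] of integer;
      succ: array [1..nMO, 1..nMO] of boolean;
      inDRS, placed: array [1..nMO] of boolean;
      cycleOf: array [1..nMO] of integer;
      used: ResSet;
      cycle, i, j, best, remaining: integer;
      ready: boolean;
    begin
      { invented DDG: 1 -> 3, 2 -> 3, 3 -> 5, 4 -> 5 }
      for i := 1 to nMO do
      begin
        inDRS[i] := false; placed[i] := false;
        for j := 1 to nMO do succ[i,j] := false
      end;
      succ[1,3] := true; succ[2,3] := true;
      succ[3,5] := true; succ[4,5] := true;
      res[1] := [0]; res[2] := [1]; res[3] := [0,2]; res[4] := [0]; res[5] := [3];
      prio[1] := 2; prio[2] := 2; prio[3] := 1; prio[4] := 1; prio[5] := 0;

      { initial DRS: MO's with no predecessors in the DDG }
      for i := 1 to nMO do
      begin
        ready := true;
        for j := 1 to nMO do
          if succ[j,i] then ready := false;
        inDRS[i] := ready
      end;

      cycle := 0; remaining := nMO;
      while remaining > 0 do
      begin
        cycle := cycle + 1;
        used := [];
        { place DRS members by priority while resources remain free }
        repeat
          best := 0;
          for i := 1 to nMO do
            if inDRS[i] and (res[i] * used = []) then
              if best = 0 then best := i
              else if prio[i] > prio[best] then best := i;
          if best <> 0 then
          begin
            used := used + res[best];
            inDRS[best] := false;
            placed[best] := true;
            cycleOf[best] := cycle;
            remaining := remaining - 1
          end
        until best = 0;
        { update DRS: unplaced MO's whose predecessors are all placed }
        for i := 1 to nMO do
          if not (placed[i] or inDRS[i]) then
          begin
            ready := true;
            for j := 1 to nMO do
              if succ[j,i] and not placed[j] then ready := false;
            if ready then inDRS[i] := true
          end
      end;

      for i := 1 to nMO do
        writeln('MO ', i:1, ' placed in MI ', cycleOf[i]:1)
    end.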

The above procedure may be seen as three stages, i.e., build DDG ( step 1 ), assign priority ( step 2 ) and schedule MO's ( steps 3, 4 and 5 ). A more detailed description of each step follows.

(A) Build DDG

Given a sequence of MO's, a DDG is built based on the source and destination registers of the MO's. Starting from the second MO ( the first MO forms a single node DDG by itself ), one more node is added for each MO to the partially built DDG. The DDG is a directed acyclic graph with a set of source nodes and a set of sink nodes, as shown in Figure 4.4.

[ Figure 4.4: Sample data dependency graph ]

The source nodes are the nodes with in-degree zero and the

sink nodes are the nodes with out-degree zero. A single node not connected to the rest of the graph is considered to be both a source node and a sink node.

Let us call the MO to be added the current MO, and the set of nodes being considered in the partially built DDG the current node set. At first the elements of the current node set are the sink nodes of the partially built DDG; the set is updated as the algorithm progresses. Now take the current MO and see if there is any data interaction between the current MO and the MO's represented by the current node set. If there is any data interaction, i.e., the current MO is directly data dependent on some MO's represented by the current node set, draw edges from the corresponding nodes in the current node set to the new node representing the current MO. Note that since we handle MO's in sequence, the current MO may be data dependent on the MO's of the current node set, but the MO's of the current node set cannot be data dependent on the current MO.

Conditional source registers are used when considering data interaction between a conditional jump MO and the MO's that follow, but not the MO's that precede. This constraint is to prevent movement of MO's above a conditional jump which may overwrite live registers at off-the-trace blocks, while still allowing movement of MO's below a conditional jump. In the case of a join, list scheduling is done on a trace with no constraint, and new joining points are identified at the bookkeeping stage.

For all the MO's of the current node set on which the current MO is data dependent, tag all their ancestor nodes so that those ancestor nodes may not be included in the updated current node set. When all the elements of the current node set are considered, update the current node set and repeat the process until the updated current node set is empty. To update the current node set, consider each node of the set one at a time. First, delete the node from the current node set. If the deleted node is not a source node, add the untagged ancestors of the node to the current node set. And delete any element of the current node set if it is an ancestor of another

node.

An example of building a DDG is shown in Figure 4.5. The figure shows the change of the current node set and how edges are drawn to the newly added node. The black circle is the newly added node, and it is assumed that one data dependency has been found between the node and the current node set in Figure 4.5(a). The node labeled P is tagged because it is the ancestor node of the node that the newly added node is data dependent upon. Another data dependency is assumed to be found in Figure 4.5(b), and there are two tagged nodes labeled P for the similar reason. In Figure 4.5(b), the node labeled A is not included in the current node set because it is an ancestor node of another element of the current node set.

[ Figure 4.5: Example of building a DDG. (a) through (c) show the current node set, the tagged ancestor nodes labeled P, and the edges drawn to the newly added node ( black circle ) ]
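As an illustration of the register-interaction test itself, the sketch below builds a DDG by a simpler textbook scan: an edge is drawn from every earlier MO whose destination registers a later MO reads, or whose source or destination registers the later MO overwrites. This yields the same dependences, possibly with redundant transitive edges, which the current-node-set walk described above exists to avoid. The example register sets are invented.

    program buildDDG(output);
    const
      nMO = 4;
      maxReg = 15;
    type
      RegSet = set of 0..maxReg;
    var
      dest, src: array [1..nMO] of RegSet;
      edge: array [1..nMO, 1..nMO] of boolean;
      i, j: integer;
    begin
      { invented MO's: destination / source register sets }
      dest[1] := [1];  src[1] := [2,3];
      dest[2] := [10]; src[2] := [1,5];
      dest[3] := [7];  src[3] := [3,4];
      dest[4] := [8];  src[4] := [10,5];

      for i := 1 to nMO do
        for j := 1 to nMO do edge[i,j] := false;

      { j depends on i (i < j) when j reads what i writes, or
        j writes what i reads or writes }
      for j := 2 to nMO do
        for i := 1 to j - 1 do
          if (dest[i] * src[j] <> []) or
             (dest[j] * src[i] <> []) or
             (dest[i] * dest[j] <> []) then
            edge[i,j] := true;

      for i := 1 to nMO do
        for j := 1 to nMO do
          if edge[i,j] then
            writeln('MO ', i:1, ' -> MO ', j:1)
    end.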

(B) Assign priority

There are many ways to define priorities on the nodes of a DDG. Here we adopt the best known scheme [FIS79], which is the level of descendants in the DDG. The algorithm to assign priorities to nodes has a flavor similar to the DDG building algorithm. Basically, assign priority 0 to the sink nodes, priority 1 to their immediate ancestors, and so on, until all the nodes are assigned priorities.

Let us call the set of nodes which have the same priority the unipriority node set. The algorithm starts by assigning the set of sink nodes to the unipriority node set. Then assign priority 0 to all the nodes in the unipriority node set. Repeat updating the unipriority node set and assigning a priority one higher than the previous one until the unipriority node set becomes empty. Note that some node may be a member of several consecutive unipriority node sets and be assigned the priority associated with the last unipriority node set.

After assigning the priority to each of its elements, the unipriority node set is updated. A node which has no descendants is tagged and deleted. A node whose descendant nodes are all tagged is also tagged and deleted. All the ancestor nodes of the deleted nodes are added to the unipriority node set. Note that if a node in the unipriority node set has a descendant node which is not tagged, the node is neither tagged nor deleted, nor are its ancestor nodes added to the unipriority node set. Such a node will be assigned a higher priority next time.

An example of assigning priority is shown in Figure 4.6. The figure shows the change of the unipriority node set. Note that some nodes belong to more than one unipriority node set and get the priority of the last unipriority node set.

[ Figure 4.6: Example of assigning priority. (a) through (d) show the successive unipriority node sets and the tagged nodes ]

In Figure 4.6(d), the final priorities are shown on each node.
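Since the priority of a node is in effect the length of the longest path from it to a sink, the unipriority-set walk can be mimicked for a DAG by repeated relaxation, as in the following small Pascal sketch ( the example graph is invented ):

    program assignPriority(output);
    const
      nMO = 5;
    var
      edge: array [1..nMO, 1..nMO] of boolean;
      prio: array [1..nMO] of integer;
      i, j: integer;
      changed: boolean;
    begin
      { invented DAG: 1 -> 3, 2 -> 3, 3 -> 5, 4 -> 5 }
      for i := 1 to nMO do
        for j := 1 to nMO do edge[i,j] := false;
      edge[1,3] := true; edge[2,3] := true;
      edge[3,5] := true; edge[4,5] := true;

      { sink nodes keep priority 0; every other node ends up with
        one more than the highest priority among its descendants }
      for i := 1 to nMO do prio[i] := 0;
      repeat
        changed := false;
        for i := 1 to nMO do
          for j := 1 to nMO do
            if edge[i,j] and (prio[i] < prio[j] + 1) then
            begin
              prio[i] := prio[j] + 1;
              changed := true
            end
      until not changed;

      for i := 1 to nMO do
        writeln('priority of MO ', i:1, ' = ', prio[i]:1)
    end.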

(C) Schedule MO's

This step actually places MO's into MI's using the data dependency, resource usage and priority of the MO's. Let us call the MI where MO's are being placed the current MI. New MI's are created as necessary. Let us start from the first MI, which is the current MI. The initial data ready set ( DRS ) contains all the MO's of the source nodes of the DDG. The DRS always contains those MO's which can be placed into the current MI without violating data dependency. But some of the MO's in the DRS may require the same resource, in which case we have to choose one MO over the others, using their priority.

Pick MO's from the DRS in the order of their priority and place them in the current MI until any of the required resources of an MO to be placed is used by an already placed MO in the current MI. When this process of placing MO's into the current MI is stopped, because either there is a resource conflict or all the MO's in the DRS have been placed, the DRS is updated. To update the DRS, all the MO's which have been placed in the current MI are first deleted. Then all unplaced MO's not in the DRS whose ancestors in the DDG have all been placed are added to the DRS.

There is one exception in updating the DRS. When the trace ends with a conditional branch MO, we want to keep the conditional branch MO in the last MI after the scheduling, which is not guaranteed by the above procedure. To ensure that such a conditional branch MO stays in the last MI, we put all the other MO's in the DRS before we put the conditional branch MO when we update the DRS.

There is another way of handling the placement of such a conditional branch MO. Just schedule MO's without any exception, and if the conditional branch MO happens not to be in the last MI, then the MI's below the MI which contains the conditional branch MO will form new blocks in each branch. Using this

method, we may be able to save a cycle in some rare cases, at the expense of some space. However, we chose not to use it because it changes the structure of the program, which we are against.

When the DRS is updated, another MI is created to hold some MO's in the DRS, and the process is repeated until all the MO's are placed. An example of straightforward list scheduling was shown in section 3.1.4. For an example of list scheduling of a trace, which involves conditional source registers, see the beta compaction example in section 4.2.

4.1.4. Bookkeeping

The result of list scheduling is a list of compacted MI's. And we know the list of blocks to which those compacted MI's belong; the blocks are the trace blocks selected at the pick trace stage. The task of this bookkeeping stage is to put the MI's into appropriate blocks and, if necessary, make some copies of MO's to off-the-trace blocks to keep the original semantics of the microprogram unchanged.

Input: A program in which a trace has been compacted into a sequence of MI's.

Output: A program in which the given trace is partitioned into blocks and the rest of the program modified to maintain the original semantics.

Procedure:
(1) Identify the MI's which contain conditional branch MO's as block boundaries for the first part of the trace.
(2) For the last part of the trace, find rejoining points such that below the new rejoin there are no MO's which were above the old join.

(3) Do necessary copying of MO's for the first part of the trace.
(4) Do necessary copying of MO's for the last part of the trace.

The bookkeeping stage is divided into two major parts. The first part maps MI's into blocks; the second part makes appropriate copies of MO's. Each part handles the first half and the second half of the trace blocks separately. Because of the way that the traces are picked, there is no junction block in any trace, and all the traces in the beta compaction have a certain form. The traces are composed of three components, as shown in Figure 4.7. First, there are zero or more blocks which have conditional branch MO's at the end. Then there is one7 block with one entry point and one exit point. Last, there are zero or more blocks which have a join at the beginning of them.

[ Figure 4.7: Components of a trace: blocks with c-type boundaries ( conditional branches ), a single middle block, and blocks with j-type boundaries ( joins ) ]

7If the dummy blocks are not inserted where a conditional branch and a join directly meet, this block may not exist sometimes.

Depending on a trace, some of the above three components may be empty. However, the order of the components, when they exist, does not change.

The first part of the bookkeeping stage is to map compacted MI's into blocks. In other words, since the MO's may have been moved across the block boundaries during the list scheduling, we need to find the new block boundaries. For the c-type boundaries shown in Figure 4.7, the job is very simple. The MI's which contain conditional branch MO's form natural block boundaries. For the j-type boundaries shown in Figure 4.7, the job is not as easy. We need to find new rejoining points as follows. Below the new rejoin, we want to have no MO's which were above the old join before the list scheduling. We want the earliest possible point which satisfies the above condition to be a new rejoining point. The same order of the original joins is kept among the new joins, except that possibly two or more adjacent old joins become the same point when the new rejoining points are found.

This order-preserving property does not apply to c-type boundaries in general. It is possible that the order of two conditional branch MO's may change after the local compaction. We do not allow this to happen, by drawing artificial DDG edges between conditional branch MO's before local compaction, because allowing a change of order of conditional branch MO's would, we believe, complicate the heuristic unnecessarily with very little or no gain. There is a recent study of some possibilities opened by allowing such a change of order [KLS83].
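The search for a new rejoining point reduces to one scan of the scheduled MO's: the new rejoin is the earliest MI below which no MO appears that had been above the old join. A minimal Pascal sketch, with an invented schedule ( in the real implementation the wasAboveOldJoin flags would come from the pre-scheduling trace ):

    program findRejoin(output);
    const
      nMI = 6;     { compacted MI's, numbered from the top of the trace }
      nMO = 8;
    var
      miOf: array [1..nMO] of integer;             { MI holding each MO }
      wasAboveOldJoin: array [1..nMO] of boolean;
      i, rejoin: integer;
    begin
      { invented schedule: MO's 1..4 were above the old join }
      miOf[1] := 1; miOf[2] := 2; miOf[3] := 2; miOf[4] := 4;
      miOf[5] := 3; miOf[6] := 4; miOf[7] := 5; miOf[8] := 6;
      for i := 1 to nMO do wasAboveOldJoin[i] := i <= 4;

      { the earliest legal rejoin is just below the lowest MI that
        still holds an MO which had been above the old join }
      rejoin := 1;
      for i := 1 to nMO do
        if wasAboveOldJoin[i] and (miOf[i] + 1 > rejoin) then
          rejoin := miOf[i] + 1;
      writeln('new rejoin point: MI ', rejoin:1);

      { MO's that were below the old join but are now above the new
        rejoin must be copied into the joining block }
      for i := 1 to nMO do
        if (not wasAboveOldJoin[i]) and (miOf[i] < rejoin) then
          writeln('copy MO ', i:1, ' into the joining block')
    end.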

The second part of the bookkeeping stage is to copy some of the MO's to off-the-trace blocks to keep the original semantics of the microprogram unchanged. The copying of MO's is done again in two parts; first for the c-type boundaries, then for the j-type boundaries.

Let us first consider c-type boundaries. Recall that MO's are not allowed to move above conditional branches if they write any conditional source registers ( live registers at the beginning of off-the-trace blocks ). And MO's which are moved above conditional branches either write no register or write registers that are dead for the off-the-trace blocks. So such MO's are not candidates for copying. The MO's which need to be copied are the ones which were above conditional branches but are below them after the local compaction. These MO's need to be copied to the beginning of the off-the-trace blocks for each conditional branch which the MO's have moved below.

However, there is an exception, as shown in Figure 4.8. Suppose a MO has been moved from block A to block H as a result of the local compaction. The MO would have been copied to both block C and block E, because block C and block E are off-the-trace blocks. But since block E is in the same execution path with block H, the MO need not be copied to block E. So the MO is copied only to block C.

[ Figure 4.8: Copying MO's: a MO moved from block A to block H is copied to the off-the-trace block C but not to block E, which lies on the same execution path ]

To do this systematically, for each MO which needs to be copied, form a target block set ( block C and block E in the above example ). Then subtract8 all the blocks that can be backtraced starting from the block that the MO has been moved to ( blocks H, D, E, B and A in the above example ). Then copy the MO only to the resultant target block set, which is just block C in the above example. The copying of MO's for the j-type boundaries works much the same way as for the c-type boundaries.

8set subtraction
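The subtraction amounts to a reverse-reachability computation: collect every block from which the MO's new home block can be reached, and remove those blocks from the target set. The sketch below encodes the six blocks of the Figure 4.8 example; the block numbering is our own assumption.

    program copyTargets(output);
    const
      nBlk = 6;    { 1 = A, 2 = B, 3 = C, 4 = D, 5 = E, 6 = H }
    type
      BlkSet = set of 1..nBlk;
    var
      pred: array [1..nBlk, 1..nBlk] of boolean;
      back, target: BlkSet;
      i, j, home: integer;
      changed: boolean;
    begin
      for i := 1 to nBlk do
        for j := 1 to nBlk do pred[i,j] := false;
      { pred[i,j] means block j is a predecessor of block i }
      pred[2,1] := true; pred[3,1] := true;    { A -> B, A -> C }
      pred[4,2] := true; pred[5,2] := true;    { B -> D, B -> E }
      pred[6,4] := true; pred[6,5] := true;    { D -> H, E -> H }

      home := 6;            { the MO was moved down into block H }
      target := [3, 5];     { its off-the-trace blocks C and E }

      { backtrace every block from which the home block is reachable }
      back := [home];
      repeat
        changed := false;
        for i := 1 to nBlk do
          if i in back then
            for j := 1 to nBlk do
              if pred[i,j] and not (j in back) then
              begin
                back := back + [j];
                changed := true
              end
      until not changed;

      target := target - back;    { the set subtraction of footnote 8 }
      write('copy the MO to blocks:');
      for i := 1 to nBlk do
        if i in target then write(' ', i:1);
      writeln
    end.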

4.2. Simple Illustrative Example

Here is an example of how the beta compaction algorithm works. Figure 4.9 is a microprogram with 21 MO's in 7 blocks. The letters A through G are assigned to the 7 blocks, and the numbers 1 to 21 indicate MO's. Figure 4.10 shows a list of the MO's represented by the six-tuple model of Chapter II. MO 3 and MO 12 are conditional jumps. It is assumed that the probability of block B executing is 0.7 and that of block E is 0.6. Accordingly, the probability of block C executing is 0.3 and that of block F is 0.4. The probability of blocks A, D and G executing is 1, naturally.

The first step of the beta compaction algorithm is to perform live register analysis, and the result is shown in Figure 4.11. The first trace blocks are A, B and D, and they are compacted together as if a single block. The compaction is done first by building a DDG with the MO's in the blocks considered. When building the DDG, the live registers at the beginning of block C, or IN[C], are assumed to be read at MO 3, which is a conditional jump. In other words, the registers in the set IN[C] are conditional source registers of MO 3. However, those registers are used only when considering data interaction between MO 3 and the MO's in block B, but not between MO 3 and the other MO's of block A.

[ Figure 4.9: Sample microprogram: blocks A through G holding MO's 1 to 21, with branch probabilities 0.7/0.3 after block A and 0.6/0.4 after block D ]

The DDG is shown in Figure 4.12(a). Note that, even though conditional source register 10 of MO 3 is also a destination register of MO 2, no edge is drawn between MO 2 and MO 3, to allow possible movement of MO 2 below MO 3. Also note that conditional source register 8 of MO 3 and destination register 8 of MO 4 make an edge between MO 3 and MO 4, which prevents the movement of MO 4 above MO 3. Next, list scheduling is done on the MO's in the trace and the result is shown in Figure 4.13(a). Now, the necessary bookkeeping is done on the

sequence of MI's. The point right after conditional branch MO 3 is the boundary between block A and block B. The rejoin from block C is made right before the last MI, which contains no MO's that were above the old join. Since MO 2 is scheduled below conditional jump MO 3, MO 2 is copied to block C. Note that MO 11 has been moved from block D to block A and is not copied, since block A and block D are on the same execution path.

block  name  inst  next_addr  dest_reg  src_reg  resource_vec
       1     CONT             1         2, 3     0 1 1 0
A      2     CONT             10        1, 5     0 0 1 0
       3     CJMP  7          2         3, 4     1 0 1 0
       4     CONT             8         9, 5     0 1 0 0
B      5     CONT             2         2, 5     0 0 0 1
       6     GOTO  10         15        4, 8     0 1 0 0
       7     CONT             9         8, 8     0 1 0 1
C      8     CONT             1         1, 5     0 1 0 0
       9     CONT             8         11, 12   0 0 0 1
       10    CONT             14        2, 13    0 1 0 1
D      11    CONT             7         3, 5     0 1 0 0
       12    CJMP  15         6         4, 10    1 0 0 0
E      13    CONT             9         3, 4     1 0 1 0
       14    GOTO  19         6         3, 6     0 1 0 0
       15    CONT             2         6, 8     0 0 0 1
F      16    CONT             7         2, 14    0 1 0 1
       17    CONT             4         2, 5     0 1 0 0
       18    CONT             14        11, 12   0 0 1 0
       19    CONT             9         6, 10    0 0 0 1
G      20    CONT             13        7, 3     0 0 0 1
       21    STOP             3         4, 11    0 1 1 0

Figure 4.10 List of MO's

inreg[A]  = 2 3 4 5 8 9 11 12 13
outreg[A] = 1 2 3 4 5 8 9 10 11 12 13

inreg[B]  = 2 3 4 5 9 10 11 12 13
outreg[B] = 2 3 4 5 8 10 11 12 13

inreg[C]  = 1 2 3 4 5 8 10 11 12 13
outreg[C] = 2 3 4 5 8 10 11 12 13

inreg[D]  = 2 3 4 5 8 10 11 12 13
outreg[D] = 3 4 5 6 7 8 10 11 12 14

inreg[E]  = 3 4 6 7 10 11
outreg[E] = 3 4 6 7 10 11

inreg[F]  = 3 5 6 8 10 11 12 14
outreg[F] = 3 4 6 7 10 11

inreg[G]  = 3 4 6 7 10 11
outreg[G] =

Figure 4.11 Live registers

Similar compaction is done for the next trace of blocks D, E and G. The DDG is shown in Figure 4.12(b) and the result of list scheduling is shown in Figure 4.13(b). Since MO 20 is above the new rejoin from block F, MO 20 is copied to block F.

The third trace contains a single block, C. It is compacted using straightforward list scheduling. The DDG is shown in Figure 4.12(c) and the result of list scheduling is shown in Figure 4.13(c). The last trace is also a single block, which contains block F. The DDG is shown in Figure 4.12(d) and the result of list

scheduling is shown in Figure 4.13(d). The final result of the beta compaction done on the given example microprogram is shown in Figure 4.14. The execution time of the trace blocks A, B, D, E, G is 7, and the weighted execution time is 7.8. 12 MI's are used to hold the compacted microcode.

[ Figure 4.12: Data dependency graphs. (a) blocks A, B, D; (b) blocks D, E, G; (c) block C; (d) block F ]

MO's in cycle 1 = 1           block A'
MO's in cycle 2 = 3 11
MO's in cycle 3 = 2 4 5       block B'
MO's in cycle 4 = 6
MO's in cycle 5 = 10 12       block D'
(a)

MO's in cycle 1 = 10 12       block D'
MO's in cycle 2 = 13 14 20    block E'
MO's in cycle 3 = 19 21       block G'
(b)

MO's in cycle 1 = 2 7         block C'
MO's in cycle 2 = 8 9
(c)

MO's in cycle 1 = 15
MO's in cycle 2 = 16          block F'
MO's in cycle 3 = 17 18 20
(d)

Figure 4.13 Intermediate results

The trace scheduling done on the same example is briefly described in Figure 4.15. Figure 4.15(a) shows the state after the list scheduling and bookkeeping on the first trace ( blocks A, B, D, E, G ) is done. Since MO 6 has been scheduled below conditional branch MO 12, the join from block C to block D had to be moved down, with blocks D, E and F copied, as shown. The second trace of blocks C', D' and E' is compacted and some more copying is done, as shown in Figure 4.15(b). Next the remaining single blocks F' and F'' are compacted. The final result is shown in Figure 4.15(c), and it is redrawn in Figure 4.15(d) in a more conventional form. Compare the result of trace scheduling in Figure 4.15(d)

with the result of beta compaction in Figure 4.14, and note the change in program structure. After the trace scheduling, the execution time of the trace blocks A, B, D, E, G is 7 and the weighted execution time is 8.1, which is about the same9 as the weighted execution time from the beta compaction. However, the space used is 21 MI's, a 75% increase.

[ Figure 4.14: Final result: the example microprogram after beta compaction, 12 MI's in blocks A' through G' ]

9In this particular example, it just happened that the weighted execution time from the beta compaction is slightly shorter than the weighted execution time from the trace scheduling. Generally, we expect the trace scheduling to give slightly shorter weighted execution time than the beta compaction, as will be shown in the next chapter.

[ Figure 4.15: Trace scheduling on the example. (a) After the first trace ( blocks A, B, D, E, G ) is compacted, with blocks D, E and F copied; (b) after the second trace ( blocks C', D', E' ) is compacted, with more copying ]

[ Figure 4.15 ( cont'd ): Trace scheduling on the example. (c) The final result; (d) the final result redrawn in a more conventional form ]

The space time product of the trace scheduling result is about 80% bigger than that of the beta compaction result. These are summarized in Figure 4.16. This specific example shows the potential advantages of the beta compaction algorithm, i.e., it saves memory space with a similar speedup in execution time. More analysis and experimentation are described in the next chapter.

                     time in   weighted   space    space time
                     trace     time       needed   product
beta compaction      7         7.8        12       93.6
trace scheduling     7         8.1        21       170.1

Figure 4.16 Comparison summary

CHAPTER V

EXPERIMENTATION AND ANALYSIS

5.1. Purpose

A new technique to solve the global compaction problem has been developed. It is called beta compaction and is based on Fisher's trace scheduling. The purpose of this experimentation is to demonstrate its feasibility and practicality, as well as to empirically verify its effectiveness.

For the experimentation, two separate approaches have been taken. First, some artificial microcodes are synthesized in a controlled environment by a microcode model, and are compacted using list scheduling, trace scheduling, and beta compaction. The purpose is to compare the performance of the beta compaction against that of the trace scheduling and the list scheduling. Second, an application program written in Pascal is hand-compiled into microcode and compacted using the same three methods. Both uncompacted and compacted microcodes are run on a software simulator for a two-ALU AMD2900-based system. The purpose is to demonstrate that the beta compaction performs well in compacting real application programs, as well as to informally show that the implementation of the three compaction methods is correct.

5.2. Hypothesis

Fisher's trace scheduling seems to be the best heuristic proposed so far for solving the global compaction problem, in terms of reducing the execution time of the compacted microcode. However, it has some serious drawbacks. First, the resultant microcode size can grow rapidly because of the extensive copying of blocks necessary, exponentially in the worst case. Second, the bookkeeping phase of the heuristic is very complicated and thus has been an obstacle to implementation. Third, the resultant microcode size can change significantly with a small change in the source microcode.

The beta compaction is basically a careful limitation of trace scheduling. With this new heuristic, we claim to achieve almost the same level of execution time reduction, with much less memory required than the trace scheduling. The bookkeeping phase of the heuristic is very much simplified. Also, the size of the compacted microcode is much less sensitive to changes in the uncompacted source microcode. Compared against the local compaction technique ( list scheduling ), the beta compaction should perform better in both execution time and memory size.

5.3. Implementation

A loop-free version of both beta compaction and trace scheduling has been implemented. Implementation of loop handling has been omitted because it is not critical in comparing the two methods, since both methods handle loops the same way.

Standard Pascal was used to implement these global compaction heuristics on a VAX-11/780 running Unix 4.1bsd. Pascal was chosen over C to implement the compaction heuristics mainly because it is generally safer and it has built-in set operations.

However, it turned out that the set operations available in standard Pascal are so minimal that they were not as useful as expected. The most needed construct in dealing with sets was something like:

    for all the elements of this set do
        statement;

Since such a construct is not provided in standard Pascal, what we had to do was:

    for all the elements of the universal set do
        if it is an element of this set then
            statement;

And it was not very efficient.
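Concretely, the workaround looks like the following fragment, where every iteration over a set costs a sweep of the whole base type ( the register numbers are illustrative ):

    program setScan(output);
    const
      maxReg = 15;
    type
      RegSet = set of 0..maxReg;
    var
      live: RegSet;
      r: integer;
    begin
      live := [2, 3, 5, 11];
      { standard Pascal has no "for all elements of live" loop, so
        we must sweep the universal set 0..maxReg and test membership }
      for r := 0 to maxReg do
        if r in live then
          writeln('register ', r:1, ' is live')
    end.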

Implementation of list scheduling and beta compaction is very close to what was described in Chapter IV. The list scheduling is implemented as a part of the beta compaction. The source code size of the beta compaction is about 80k bytes, including some comments. Parts of the beta compaction have been rewritten to implement trace scheduling. The major differences are the pick trace and bookkeeping routines.

5.4. Comparison against trace scheduling

The performance of any compaction method depends on so many variables that it is virtually impossible to accurately model them all. For example, the performance depends on the architecture that the microcode is compacted for, i.e., how much hardware resource is available, what kind of restrictions ( timing, encoding ) are imposed on using the resources, what kind of parallelism is available in the uncompacted source microcode, and how much of such parallelism is available over block boundaries, etc. Hence, we have developed an artificial microcode model where parameters are kept constant whenever possible or are otherwise random. This artificial microcode model is described in Section 5.4.1.

Since we are dealing with the global compaction problem, i.e., programs may contain several conditional jumps, joins, and/or loops, we need a way to represent program structure. In the absence of any available model to represent program structure, we have developed a simple program structure model to be used for our purpose of comparing the two heuristics. We chose to restrict our program structure model to represent the if-then-else structure only. There is no loop or multiway branching. This is sufficient for our purposes and at the same time easy to implement. The program structure model is described in Section 5.4.2.

5.4.1. Artificial Microcode Model

The purpose of this model of a microprogram is to empirically compare the performance of the beta compaction against the trace scheduling. It has been developed to make this comparison as fair as possible. Each MO is represented by a six-tuple as described in Chapter II:

( label, sequencer command, jump address, destination registers, source registers, resource vector )

The purpose of the label is just to distinguish one MO from the others. The sequencer command and the jump address determine the block structure of microprograms. It is not a simple task to generate all possible configurations of microprograms. A somewhat simplified but still systematic way of generating block configurations is presented in the next section. The number of destination registers and the number of source registers are kept constant respectively, and so

are the number of resources. Each block contains a constant number of the MO's described above; the number of blocks is also a constant, and the relationship among the blocks is determined by the program structure model described in the next section.

The selection of register usage and resource usage is done randomly, as follows. One register is selected randomly out of the available register set and assigned as a source or destination register. This is repeated as many times as there are source and destination registers. For resource selection, the probability of each resource being used in a MO is preset. Whether or not each resource is to be used is determined independently using the assigned probability. It is presumed that by using the above random selection of registers and resources, the hard-to-grasp parameters like parallelism in microcodes and resource conflicts are kept random. In the experiment, many microcodes are synthesized for each set of constant parameters and the results are averaged.
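A sketch of this random selection is shown below, with a small linear congruential generator standing in for whatever random number source the original implementation used; the generator constants, the seed, and the printed layout are all illustrative assumptions.

    program synthMO(output);
    const
      nRegisters = 16;
      nSrc = 2;           { source registers per MO }
      nResources = 4;
      resProb = 0.25;     { preset probability of each resource }
    var
      seed, destReg, i: integer;
      srcReg: array [1..nSrc] of integer;
      resVec: array [1..nResources] of boolean;

    { a minimal linear congruential generator in [0,1); any random
      source would do here }
    function rnd: real;
    begin
      seed := (seed * 109 + 89) mod 32768;
      rnd := seed / 32768
    end;

    begin
      seed := 4321;
      { one register chosen at random as the destination register }
      destReg := trunc(rnd * nRegisters);
      { repeated for each source register ( possibly the same one ) }
      for i := 1 to nSrc do
        srcReg[i] := trunc(rnd * nRegisters);
      { each resource used independently with probability resProb }
      for i := 1 to nResources do
        resVec[i] := rnd < resProb;

      write('dest = ', destReg:1, '  src =');
      for i := 1 to nSrc do write(' ', srcReg[i]:1);
      write('  resources used =');
      for i := 1 to nResources do
        if resVec[i] then write(' ', i:1);
      writeln
    end.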

5.4.2. Program Structure Model

In this model, we have made several assumptions:

(1) only two way branching ( if then else ) is allowed
(2) only two way join is allowed
(3) no loop
(4) no goto

A multi-way join can be achieved by allowing some blocks to be empty, but we are not concerned about the size of blocks here. We are only interested in the topology of the block structure, which is represented as a graph where a node is a block and an edge is a possible path of program execution.

First, consider the case where there is no conditional branch. There is only one possible configuration in this case, which is just a single block, or a node in the graph representation. It is shown in Figure 5.1(a). If there is one conditional branch, the possible configuration is still unique, with 4 blocks, and is shown in Figure 5.1(b).

[ Figure 5.1: Program structure model: graph and string representations for (a) 0 conditional branches, (b) 1 conditional branch, and (c) 2 conditional branches ]

If there are two conditional branches, the number of possible configurations becomes three, as shown in Figure 5.1(c).

There is another way of representing these configurations, by using strings. In the string representation, each letter corresponds to a block. The blocks which end with conditional branches are represented by the letter c, and the other blocks are represented by the letter j. The order of these letters is similar to that of a preorder traversal of a tree. String representations are shown below each graph representation in Figure 5.1.

If we look at the string representations in Figure 5.1(c), we can see a pattern. The first letter is always c and the last three letters are always j's. The location of the c in the middle three letters determines the three different configurations. Note that for each c in a string, there cannot be more than two j's until the next c comes, except after the last c in the string. This rule is accumulative, i.e., if there are two consecutive c's, then four j's may follow before another c.

    cccj jjj jjj        cjcj cjj jjj
    ccjc jjj jjj        cjcj jcj jjj
    ccjj cjj jjj        cjcj jjc jjj
    ccjj jcj jjj        cjjc cjj jjj
    ccjj jjc jjj        cjjc jcj jjj
    cjcc jjj jjj        cjjc jjc jjj

    (a) string representations of the 12 possible configurations

Figure 5.2 Program structure model for 3 conditional branches

[ Figure 5.2 ( cont'd ): Program structure model for 3 conditional branches. (b) graph representation of each of the 12 string configurations ]

Described differently, for any letter in a string except the last j, if its prefix contains less than twice as many j's as c's, the letter can be either a c or a j. If its prefix contains exactly twice as many j's as c's, the letter has to be a c. If its prefix contains more than twice as many j's as c's, then the prefix contains a j which represents a premature exit block. Now we can represent all the configurations for the case of three conditional branches using the above rules, as shown in Figure 5.2(a). For each string representation, the corresponding graph representation is shown in Figure 5.2(b).
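The prefix rule translates directly into a short validity check. The following Pascal function is our own illustration of the rule, not code from the dissertation; it accepts exactly the twelve strings of Figure 5.2(a).

    program checkString(output);
    const
      len = 10;    { 10 blocks for 3 conditional branches }
    type
      CJString = packed array [1..len] of char;
    var
      t: CJString;

    function valid(s: CJString): boolean;
    var
      i, cCount, jCount: integer;
      ok: boolean;
    begin
      cCount := 0; jCount := 0; ok := true;
      for i := 1 to len - 1 do    { the final j is exempt from the rule }
      begin
        if (jCount = 2 * cCount) and (s[i] <> 'c') then ok := false;
        if s[i] = 'c' then cCount := cCount + 1
        else jCount := jCount + 1
      end;
      valid := ok and (cCount = 3) and (s[len] = 'j')
    end;

    begin
      t := 'ccjjjcjjjj';   { the string ccjj jcj jjj of Figure 5.2(a) }
      if valid(t) then writeln(t, ' is a legal configuration')
      else writeln(t, ' is not a legal configuration')
    end.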

5.4.3. Procedure

The experiment was done using the following parameters:

    The number of registers: 16
    The number of destination registers: 1
    The number of source registers: 2
    The number of resources: 4
    The probability of each resource being used: .25
    The number of blocks: 10
    The number of MO's in each block: 3

The above parameters are chosen based on the system simulated in Section 5.5.1 and the program structure model in Figure 5.2. For each MO, one register among the 16 is selected randomly and assigned as a destination register, and two ( possibly the same ) registers are selected and assigned as source registers. For each MO, it is assumed that the probability of each resource being used is 1/4. So it is possible that all four resources are used in a particular MO, with probability 1/256, or that no resource is used, with probability 81/256. With 10 blocks, there exist 12 possible configurations, as described in the previous section. For each configuration, 20 microprograms are synthesized and compacted using list scheduling, beta compaction, and trace scheduling. An example of a microcode synthesized using the models described in the previous two sections is shown in Figure 5.3.

    c  1   1  2   15      1 0 0 0   1.000000
    c  2   2  4   12      0 1 0 0   1.000000
    j  7   12 11  10      0 0 1 0   1.000000
    c  4   7  13  5       0 0 1 0   0.800000
    c  5   7  6   4       0 0 1 0   0.800000
    g  10  8  3   8       0 0 0 1   0.800000
    c  7   9  10  10      0 1 0 0   0.200000
    c  8   14 5   7       1 0 0 0   0.200000
    c  9   12 10  5       0 1 0 0   0.200000
    c  10  11 2   4       0 0 0 1   1.000000
    c  11  10 3   0       0 0 1 0   1.000000
    j  16  5  1   13      0 0 0 1   1.000000
    c  13  5  13  5       0 0 1 0   0.800000
    c  14  0  8   14      1 0 0 0   0.800000
    g  28  6  11  4       1 0 0 0   0.800000
    c  16  3  9   11      0 0 0 1   0.200000
    c  17  5  0   0       0 0 1 0   0.200000
    j  22  15 14  4       1 0 0 0   0.200000
    c  19  13 2   3       0 1 0 0   0.160000
    c  20  4  2   12      0 1 0 0   0.160000
    g  25  5  2   11      0 0 1 0   0.160000
    c  22  15 3   14      0 0 1 0   0.040000
    c  23  7  7   13      0 1 0 0   0.040000
    c  24  11 12  12      0 0 0 1   0.040000
    c  25  8  3   10      0 0 0 1   0.200000
    c  26  2  8   9       0 0 1 0   0.200000
    c  27  13 8   7       0 0 0 1   0.200000
    c  28  10 7   8       1 0 0 0   1.000000
    c  29  13 10  9       0 0 0 1   1.000000
    c  30  4  5   13      0 0 1 0   1.000000

Figure 5.3 Example of an artificially synthesized microcode

5.4.4. Result and Discussion

The result of the experiment described above is summarized in Figure 5.4 in terms of the number of junction blocks. We are particularly interested in the performance of the beta compaction compared to trace scheduling as the number of junction blocks varies, since the handling of junction blocks is the key to beta compaction. Figure 5.4(a) shows the weighted execution time of the compacted microcode and Figure 5.4(b) shows the control memory size of the compacted microcode. Since there are 5 configurations with no junction block, 6 configurations with one junction block, and 1 configuration with two junction blocks, as shown in Figure 5.2(b), the numbers in Figure 5.4 for the zero junction block case represent the average of 100 microprograms, the numbers for the one junction block case represent the average of 120 microprograms, and the number for the two junction block case represents the average of 20 microprograms.

As shown in Figure 5.4(a), the weighted execution times of the beta compaction result and the trace scheduling result are almost the same regardless of the number of junction blocks. The weighted execution times of the global compaction results ( beta compaction and trace scheduling ) are about half of the original weighted execution time and about 25% less than the local compaction result ( list scheduling ). The memory sizes of the beta compaction result and the trace scheduling result are almost the same for the zero junction block case. However, the memory sizes of the beta compaction result are about 15% less than the memory sizes of the trace scheduling result for the one and two junction block cases. The memory size difference between the local and global compaction results is not as large as the weighted execution time difference between the local and global compaction results. Note that the memory sizes of the trace scheduling result for the one and two junction block cases are actually larger than the memory size of the list scheduling result.

                       number of junction blocks
                        0        1        2
original               22.6
list scheduling        15.6
beta compaction        10.2     11.9     13.0
trace scheduling       10.1     11.7     12.8

(a) weighted execution time

                       number of junction blocks
                        0        1        2
original               30
list scheduling        21.3
beta compaction        20.4     19.8     20.4
trace scheduling       20.5     22.8     23.9

(b) control memory size

Figure 5.4 Summary of experimental result

[ Figure 5.5: Distribution of the two junction block case. (a) Number of microprograms versus weighted execution time; (b) number of microprograms versus memory size, over the range 16 to 32 ]

The distribution of the 20 microprograms for the two junction block case is shown in Figure 5.5. Figure 5.5(a) shows the distribution for weighted execution time and Figure 5.5(b) shows the distribution for memory size. In the distribution for execution time, a bar of height N above execution time T is interpreted to mean that N microprograms out of the 20 tested had weighted execution time T. The interpretation is similar for the memory size distribution. It is clear that the distribution of weighted execution time is almost the same for the beta compaction and the trace scheduling, as shown in Figure 5.5(a). However, the distribution of memory size is quite different between the two. While most microprograms yielded a memory size of between 16 and 20 with the beta compaction, the trace scheduling result showed a wide distribution, ranging from 16 to 31. This implies the high sensitivity of memory size in the trace scheduling to changes of the source microcode.

5.5. Simulator of an AMD2900-based system

The previous section described the comparison between the trace scheduling and the beta compaction using synthesized microcodes and confirmed the claimed properties of the beta compaction. However, we are interested in the performance of the beta compaction not only on synthesized microcodes but also on some real application programs. Since we do not have access to a microprogrammable machine with parallel resources, we decided to simulate one.

Microprogrammable hardware based on AMD 2900 components is designed and simulated to test both uncompacted and compacted microcodes. The simulator is based on Glenford Myers' paper [MYE81] and has most of the nice menu driven human interface described in the paper, and more. The simulation is done at the register transfer level which is seen by a microprogrammer. The simulator is written using C and the curses screen drawing package on a VAX-11/780 running UNIX 4.1bsd.

5.5.1. System description

The block diagram of the simulated system is shown in Figure 5.6. It is a simple system using the AM2903 ALU and the AM2910 microprogram controller. The system has two ALU's ( AM2903 ) to provide parallel resources for compacted microcodes. In the experiment, an application program is hand-compiled for the target system assuming there is just one ALU, then compacted for the target system with two ALU's.

To make the register file accessible from both ALU's, the internal registers of the AM2903 are ignored and external registers are used. Each ALU has its own register file, and the two files always hold the same values in order to allow parallel access to the registers. So up to four registers can be read by the two ALU's at the same time, addressed by the a and b addresses of each ALU. Any register write operation will write the same value to both register files. Since the b address is used for the register write operation, the proper b address among the two is selected in the hardware and fed to the register files.

All the non-ALU registers, including pipeline registers and status registers, are latched at the rising edge of each clock, i.e., at the beginning of each clock cycle. Therefore, micro memory fetch, main memory fetch and ALU operation are done in the same cycle. Since the CPU and main memory are the only things on the system bus, no bus arbitration is necessary, except that the DOUT register should not be writing onto the data bus in a memory read cycle. Main memory is assumed to be fast enough to allow access in a single microcycle. However, wait states can be inserted if necessary.

When the two ALU's are utilized, some values to be loaded in non-ALU registers may be lost to another operation before they are actually loaded. For example, an address calculated by ALU 0 on its y bus may be changed by the next ALU 1 operation before it is loaded into the memory address register ( MAR ), since the MAR load operation is not guaranteed to follow the address calculation immediately.

[Figure 5.6 Block diagram of the simulated system: data bus, register files RF0 and RF1, da/db multiplexers, ALU 0 and ALU 1 ( AM2903 ), micro memory, pipeline register, and address bus]

[Figure 5.7 Microinstruction format: 96 bits in three 32-bit double words; bits 95-64 hold the fields for the second ALU, and bits 63-0 hold the fields for the first ALU and the rest of the system ( buffer load signals, lmar, lir, ldin, ldout, write, ir mux, cc comp, cc mux, rld/, the multiplexer controls, and immediate data )]

Special buffers are built in for the mar, din, dout and status registers. These buffers are loaded at the end of the micro cycle, controlled by pipeline register signals synchronized with a sub-micro phase clock provided for the purpose. These buffers hold the correct values until they are loaded into the corresponding registers. The microinstruction format is shown in Figure 5.7. The width of the MI is 96 bits ( 3 double words ). For a one-ALU application, only 2 double words ( 64 bits ) are necessary. Note that the selection between register and external data for ALU input is ordinarily controlled by the ea/ and oeb/ signals when the internal registers are used, but in this simulator the selection is controlled externally using da_mux and db_mux, since external registers are used.

5.5.2. Functions and Human Interface

The simulator is an interactive menu-driven software package, and all its functions are controlled by selecting appropriate entries in its menu windows. All the internal states are displayed and can be modified. The control store can be loaded from a file and saved to a file. The user program can be executed either in a free-run mode or in a single-step mode. Break points can be set and a trace is collected. There are many more useful functions and human interface features; Appendix A contains a detailed description.

5.5.3. An Application Program

The insert operation on a 2-3 tree was chosen as the example program to be run on the simulator because it contains many decision points, i.e., conditional branches. It is hypothesized that programs with many conditional branches, i.e., many blocks, have more parallelism across block boundaries that can be exploited in global compaction. The more conditional branches, the more chances there are for global compaction.

The program was written in Pascal and hand-compiled into microcode. The Pascal source listing is attached in Appendix B, and a part of the AMDASM ( a trademark of Advanced Micro Devices, Inc. ) listing of the hand-compiled microcode is in Appendix C. During the hand compilation, all the variables were preassigned to registers, since the number of variables was small. The insertion algorithm is written in Pascal as a recursive procedure, so the recursion is implemented in the microcode using a simple stack in main memory to save and restore local variables. Argument passing on procedure calls is done using call by reference, so the same registers are used to represent argument variables inside procedures. The program builds its 2-3 tree data structure in main memory. A simple get-space routine is written in microcode to allocate the main memory space. The program was first compiled into microcode for the simulated system assuming there is one ALU, and then compacted for the two-ALU system using list scheduling, trace scheduling and beta compaction. Since the compaction program requires a data format different from that of the microcode generated by AMDASM, a conversion program was written which determines the register and resource usage of each MI. After compaction, the result was converted to a format acceptable to the simulator by another conversion program, which also calculates new branch addresses and determines the multiplexer controls. To demonstrate that the compaction procedure generated correct microcode, both the uncompacted and compacted versions of the program were run on the simulator and gave the same result as the compiled object code of the original Pascal program on a VAX-11/780. For a data set which builds a 2-3 tree with 5 leaf nodes, the weighted execution time and the space needed, both before and after compaction, are summarized in Figure 5.8. The probabilities for calculating weighted execution time were measured from an execution profile of the Pascal program on a large data set. Even though the result confirmed the general properties of the three compaction techniques, the difference between beta compaction and trace scheduling is small. The main reason is that the program has little parallelism over the block

boundaries, because main memory read and write operations impose very heavy local data interactions between MO's, and therefore very few MO's can be moved to different blocks without violating data integrity. It would be possible to relieve some of these data interactions due to main memory access by having a dual-port memory with separate address and data buses.

                      weighted time   space ( # of words )   space ( # of bits )
    original              75.15              238                  14042
    list scheduling       43.86              159                  13356
    trace scheduling      40.77              155                  13020
    beta compaction       41.83              150                  12600

    Figure 5.8 Summary of 2-3 tree insertion program execution

5.6. Space Complexity Analysis

The space complexity of the compacted microcode size is analyzed for both trace scheduling and beta compaction. For trace scheduling, a microprogram is analyzed to show that the memory size can grow exponentially after compaction. For beta compaction, the worst case microprogram is presented and informally proved. The worst case space complexity of beta compaction is O(n²).

5.6.1. Trace Scheduling

The microprogram shown in Figure 5.9 is analyzed to demonstrate that the memory requirement of trace scheduling can grow exponentially. For the purpose of the analysis, it is assumed that each block contains a large, unspecified, but equal number of MO's, and that the blocks A1, B1, A2, B2, ..., An, Bn, A0 form a trace. Furthermore, it is assumed that, after the trace blocks are compacted, one MO from each of the blocks B1, B2, ..., Bn has been combined with MO's in block A0, so that it is necessary to make copies of blocks as shown in Figure 5.9(b). Since it is assumed that the number of MO's in each block is equal and larger than one, and the only over-the-block-boundary movements of MO's are the ones from blocks B1, B2, ..., Bn to A0, it is reasonable to assume that the size of each block does not change. Let the size of each block be unity for convenience. Now the rest of the microprogram, excluding the first trace, has to be compacted, and it is assumed that situations similar to the first trace arise again and copies are made. Let f(n) be the size of the microprogram shown in Figure 5.9(a) after compaction, in terms of the number of blocks, where n is the number of conditional branches. Then

    f(n) = 2n + 1 + ( f(n-1) + 1 ) + ( f(n-2) + 1 ) + ... + ( f(0) + 1 )
         = 3n + 1 + f(n-1) + f(n-2) + ... + f(0)

It is obvious that f(0) = 1.

[Figure 5.9 Exponential memory growth in trace scheduling: (a) the microprogram before compaction; (b) the compacted microprogram with copied blocks A'0, A''0, ...]

    f(n)   = 3n + 1 + f(n-1) + f(n-2) + ... + f(0)
    f(n-1) = 3(n-1) + 1 + f(n-2) + ... + f(0)

Subtracting,

    f(n) - f(n-1) = f(n-1) + 3

Therefore,

    f(n) = 2 f(n-1) + 3

This and f(0) = 1 give

    f(n) = 2^(n+2) - 3

i.e., f(n) is an exponential function of n, the number of conditional branches in the microprogram shown in Figure 5.9(a). This is of course a pathological case which is not likely to happen in real microcode. However, the implication that movements of MO's below joins cause extensive copying of blocks is serious. In real microcode, it is possible that movements of MO's below joins happen frequently enough to increase the memory size forbiddingly. Note that a single movement of an MO from block Bn to A0 would approximately double the memory size.

5.6.2. Beta Compaction

In beta compaction, the worst case of memory space growth happens when a given microprogram has the structure shown in Figure 5.10(a). For convenience, it is assumed that at any conditional branch the left branch has a higher probability of execution than the right branch. An informal proof by induction that this microprogram gives the worst space complexity follows. Starting from a single block, or SLM, adding one conditional branch to it yields 4 blocks A0, A1, B1 and A'0; this is the only case, and thus the worst case. Since we are concerned only about the topology of the microprogram, the size of each
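As a quick check on this algebra, the recurrence and the closed form can be evaluated side by side; the following short C program is purely illustrative and is not part of the thesis software.

    #include <stdio.h>

    /* f(n) = 2*f(n-1) + 3 with f(0) = 1: the block count of the
       Figure 5.9(a) microprogram after trace scheduling, as derived
       above.  The closed form is f(n) = 2^(n+2) - 3. */
    int main(void)
    {
        long f = 1;                              /* f(0) */
        for (int n = 0; n <= 10; n++) {
            long closed = (1L << (n + 2)) - 3;   /* 2^(n+2) - 3 */
            printf("n=%2d  recurrence=%6ld  closed=%6ld\n", n, f, closed);
            f = 2 * f + 3;                       /* advance to f(n+1) */
        }
        return 0;
    }

Both columns agree, confirming the exponential growth of f(n).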

[Figure 5.10 Worst case memory growth in beta compaction, parts (a) through (d), as referenced in the text]

block is assumed to be unity, and blocks newly created by adding a conditional branch are also assumed to be of unit size. Let us assume that the set of blocks shown in Figure 5.10(a), except the 3 blocks An, Bn and A'n-1, and with block An-1 connected to block A'n-2, gives the worst space complexity with respect to n-1 conditional branches. At this point, let us consider the microprogram shown in Figure 5.10(b). Given such a program, the beta compaction result which would give the largest memory size is shown in Figure 5.10(c). First, the trace of blocks a, b and d is compacted. Since we are trying to maximize the resultant memory size, we want all the MO's in block a to be copied to block c, which may be caused by the movement of MO's from block a into block b. The MO's moved to block b do not increase the size of block b, however, because they are moved only when they are combined with the MO's in block b to form new MI's. Therefore, the attempt to have the MO's in block a copied to block c does not increase the overall size. It is assumed that an MO from block b has been moved to the bottom of block d as a result of the compaction of the trace, so the MO's in block d are copied to the end of block c, as shown in Figure 5.10(c). Figure 5.10(d) shows a compacted microcode without the copying of the MO's in block a, and its size is the same as the size of the microcode in Figure 5.10(c). So we will use this kind of transformation, from the microcode of Figure 5.10(b) to that of Figure 5.10(d), in our discussion. Now we need to find to which block we should add the n-th conditional branch in order to maximize the resultant microcode size. We claim that adding one more conditional branch to either block An-1 or block Bn-1 increases the space the most. If the conditional branch is added to one of the blocks A0, A1, ..., An-2 or A'n-2, A'n-3, ..., A'0, then a junction block is introduced and the traces become shorter, therefore reducing the space increase. If the conditional branch is added to one of the blocks Bk, 1 ≤ k ≤ n-1, then the increase in the size of the microprogram is 4+2k-k = k+4, which reaches its maximum of n+3 when k = n-1. Or, if the conditional branch is added to block An-1, then the increase in the size of the microprogram is

[Figure 5.11 Calculation of memory size increase: (a) the n-th conditional branch added to block Bk; (b) the n-th conditional branch added to block An-1]

n+3, which is the same as for block Bn-1. To calculate these size increases, refer to Figure 5.11. It shows the case where the n-th conditional branch is added to one of the blocks Bk, 1 ≤ k ≤ n-1. Similar to what is shown in Figure 5.10(d), four blocks are added, and the blocks which follow block Bk are also attached at the end of each branch. So the total size increase is 4+2k minus k, because k blocks were already there before the conditional branch was added. Similarly, the total size increase when the n-th conditional branch is added to block An-1 is 4+2(n-1)-(n-1), as shown in Figure 5.11(b). So

the size increase is maximum when the conditional branch is added to either block An-1 or block Bn-1. Therefore, a microprogram with the topology shown in Figure 5.10(a), where the n-th conditional branch is added to block An-1, requires the largest possible memory space after the beta compaction is performed. If the conditional branch is added to block Bn-1 instead, the topology of the microprogram is different, but the size increase is identical. The size of the whole microcode after the compaction is

    S = 1 + 2n + n + ( 1 + 2 + ... + n )
      = 3n + 1 + n(n+1)/2
      = O(n²)
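To make the contrast with Section 5.6.1 concrete, the two worst-case bounds can be tabulated for small n; this is an illustrative calculation only.

    #include <stdio.h>

    /* Worst-case compacted sizes (in unit blocks) for n conditional
       branches: trace scheduling grows as 2^(n+2) - 3 (Section 5.6.1),
       beta compaction as 3n + 1 + n(n+1)/2 (derived above). */
    int main(void)
    {
        for (int n = 1; n <= 12; n++) {
            long trace = (1L << (n + 2)) - 3;
            long beta  = 3L * n + 1 + (long)n * (n + 1) / 2;
            printf("n=%2d  trace=%8ld  beta=%5ld\n", n, trace, beta);
        }
        return 0;
    }

Already at n = 12, trace scheduling may require 16381 blocks in the worst case, where beta compaction is bounded by 115.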

CHAPTER VI

CONCLUSION

6.1. Summary

The need for a better microprogramming tool has increased considerably as increased demand and support of computer technology has brought about wide use of microprograms. The eventual goal of microprogramming tool development would be to make a high level microprogram language and a compiler to generate minimal-execution-time microcode for a variety of machines. In generating minimal-execution-time microcode, one aspect that differentiates microprogramming languages from macroprogramming languages is the need for compaction in highly horizontal microarchitectures. Among the proposed microprogram compaction methods, trace scheduling is the most general and appears to give the fastest execution of compacted microcode. However, the growth of memory size caused by extensive copying of blocks can be enormous, exponential in the worst case, and the complicated bookkeeping stage of trace scheduling has been an obstacle to implementation. A technique called beta compaction, based on trace scheduling, is proposed to mitigate the drawbacks of trace scheduling. Basically, it identifies the junction blocks ( the blocks beginning with a join and ending with a conditional branch ) as the major source of complication, and cuts traces at those junction blocks. It achieves almost all the compaction of trace scheduling except that which

causes copying of blocks. Memory size after beta compaction is usually smaller than the original. Even when the memory size grows, in rare instances, it is bounded by O(n²) in the worst case, and the bookkeeping stage is very much simplified. The variation of compacted microcode size as the source microcode changes is also very small. A loop-free version of both beta compaction and trace scheduling has been implemented. A comparison between the two was done using artificially generated microcodes, and the above properties of beta compaction were confirmed. A simple microprogrammable machine based on AM2900 components was designed and simulated with an interactive user-friendly interface. A realistic application program was written and hand-compiled into microcode. The microcode was executed on the simulator both before and after compaction, which demonstrated the applicability of the compaction technique and the correctness of the implementation.

6.2. Future Research

The first immediate tasks are to implement loop handling in beta compaction and to test it with more programs of various kinds. To improve beta compaction, it might be possible to incorporate Isoda's extended data dependency graph [ISO83] into the basic scheme of beta compaction. This may serve as a more elegant solution to the global compaction problem. In this broad area of research toward developing better microprogramming tools, more work is called for in the areas of microprogramming language design, the code generation problem including resource allocation, microcode verification, and the integration of available knowledge.

APPENDICES

APPENDIX A

THE SIMULATOR FUNCTIONS AND HUMAN INTERFACE

The simulator is an interactive menu-driven software package, and all its functions are controlled by selecting appropriate entries in its menu windows. So the best way to describe the simulator functions is to describe the entries in the menu windows. The menu system has 5 major frames, among which the user can move freely, and one of the 5 frames ( the alter/display frame ) has 5 subframes of its own.

Processor initialization frame ( Figure A.1 )
Trap/trace frame ( Figure A.2 )
Alter/display frame
    Datapath subframe ( Figure A.3 )
    Pipeline register subframe ( Figure A.4 )
    Control store subframe ( Figure A.5 )
    Main store subframe ( Figure A.6 )
    Trace table subframe ( Figure A.7 )
Recording frame ( Figure A.8 )
Performance frame ( Figure A.9 )

Each frame has a one-line title at the top of the screen and a menu and data display area in the middle. At the lower part of the screen, there are the one-line error message display field, the function field and the frame select field. In the same line as the function field, there may be either a file name field or an address field, depending on the frame selected. The last line is used to display the status of clock, stop, cycle, lcsar ( last control store address register ) and csar ( control store address register ). The clock field displays whether the internal

clock is running or has stopped. The stop field displays the reason the clock stopped, i.e. whether the clock stopped because of an error, a break point, a single step, a microprogram exit, or a time out. The cycle field indicates the time limit in terms of the number of cycles, which can easily be changed. The lcsar displays the address of the control store location containing the micro instruction which has just been executed. The csar displays the address of the control store location containing the micro instruction to be executed at the next cycle. Whenever a new frame is displayed, the cursor is located at the function field. The cursor can be moved around using either the four arrow keys ( up, down, left, and right ) or the h ( left ), j ( down ), k ( up ), l ( right ) keys, following the UNIX vi convention. The cursor can be moved only to those fields whose values a user is allowed to change. Trying to move in a wrong direction yields a 'beep'. The fields where the user is allowed to change the value are indicated by '=>' and those which are mere displays are indicated by '->'. The following keys are not displayed on the menu but are available throughout the simulator and can be very useful.

h or left arrow      Move cursor to the next allowable field to the left
j or down arrow      Move cursor to the next allowable field downward
k or up arrow        Move cursor to the next allowable field upward
l or right arrow     Move cursor to the next allowable field to the right
^L ( control-L )     Redraw the screen
^Z ( control-Z )     Suspend the program and go back to UNIX ( execution can be resumed by the shell command 'fg' ( foreground ) )
break key            Stop the program and go back to UNIX ( equivalent to function 'q' )
: ( colon )          Move cursor to the function field
+ ( plus )           In a numeric field, increment the value by 1

- ( minus )          In a numeric field, decrement the value by 1
. ( period )         In a numeric field, reset the value to zero
? ( question mark )  Help command; displays the above information

A.1 Processor Init Frame

The processor initialization frame is shown in Figure A.1. This is the first frame a user sees when he starts the simulator. Here he can load the control store and the main store from appropriate files. To do this, first move the cursor to the file name field and type the file name followed by <carriage return>, which makes the cursor go back to the function field. Then type the function selection character. The files are required to be in a certain format. One way to make the files with the right format is to store the default values ( all zeros ) of the memories in files using the store functions in the recording frame ( Section A.4 ) and then edit the files.

Figure A.1 Processor initialization frame
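Since the simulator is built on curses, the cursor-movement and field-editing keys described above map naturally onto a single input dispatch loop. The fragment below is a minimal sketch under assumed names ( NFIELDS, dispatch ); it is not the simulator's actual code.

    #include <curses.h>

    #define NFIELDS 8   /* hypothetical number of editable fields */

    /* Minimal sketch of the key dispatching described above: the vi
       keys h/j/k/l and the arrow keys move between editable fields;
       '+', '-' and '.' edit a numeric field; ^L redraws the screen;
       an impossible move yields a beep. */
    void dispatch(int ch, int *field, int *value)
    {
        switch (ch) {
        case 'h': case KEY_LEFT:
        case 'k': case KEY_UP:
            if (*field > 0) (*field)--; else beep();
            break;
        case 'l': case KEY_RIGHT:
        case 'j': case KEY_DOWN:
            if (*field < NFIELDS - 1) (*field)++; else beep();
            break;
        case '+': (*value)++;   break;
        case '-': (*value)--;   break;
        case '.': *value = 0;   break;
        case 'L' & 0x1f:                 /* control-L: redraw */
            clearok(curscr, TRUE);
            wrefresh(curscr);
            break;
        default:
            beep();
        }
    }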

Basic hardware timings can be set up. The currently settable parameters are the machine cycle time ( micro-cycle time ), the main store read cycle time and the main store write cycle time. If either the read or the write cycle time is greater than the machine cycle time, the proper number of wait states is generated and added to the total elapsed time. See Section A.5 for the performance data frame.

A.2 Trap/Trace Frame

The trap/trace frame is shown in Figure A.2. Here traps ( break points ) and traces can be set up or cleared. The trace can be turned on or off, and the trace table can be emptied using the appropriate functions.

Figure A.2 Trap/trace frame

There are two different kinds of traps: one for the control store and the other for the main store. For the control store traps, a user can activate or deactivate each trap. For the main store traps, he can activate each trap in three different ways: fetch, store, and reference ( fetch or store ). All traps have ranges; so, for example, any attempt to execute a micro instruction at an address within the range makes the execution of the user microprogram stop. Assigning the same value to both the lower and upper limits is equivalent to setting up a single trap ( break point ) in the conventional way.
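Because every trap is an address range with an access mode, a trap hit reduces to a bounds check plus a mode match. A small illustrative sketch in C ( hypothetical structure and names, not the simulator's actual code ):

    /* Hypothetical trap record: mode is 'f' (fetch), 's' (store),
       'r' (reference = fetch or store), or 'x' (off); each trap
       covers the inclusive address range [from, to]. */
    struct trap { char mode; unsigned from, to; };

    /* Return 1 if an access of type 'f' or 's' at addr hits the trap. */
    int trap_hit(const struct trap *t, char access, unsigned addr)
    {
        if (t->mode == 'x' || addr < t->from || addr > t->to)
            return 0;
        return t->mode == 'r' || t->mode == access;
    }

With from == to, this degenerates to the conventional single break point.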

A.3 Alter/Display Frame

Figure A.3 Alter/display frame ( datapath subframe )

This is considered the major frame of the simulator. It is the only frame where a user can single step or free run the microprogram. It displays, and allows the user to change, the values of various parts of the simulated machine. It has five subframes, which are shown in Figure A.3 through Figure A.7. The default subframe is the datapath subframe, which is displayed whenever the alter/display frame is entered. For displaying the control store, the main store or the trace table, this frame has an additional field called address. The address is used to indicate the first entry to be displayed. The address field is also used to indicate the index of the trace table entry to be displayed on top. The address ( index ) zero of the trace table subframe has a special meaning ( see the description of the trace table subframe ).

Figure A.4 Alter/display frame ( pipeline register subframe )

When running ( or single stepping ) through the microprogram, once a runtime error occurs, the remaining simulation of the cycle is meaningless. So the simulator returns immediately when a runtime error is detected, without going through the rest of the cycle.

(1) Datapath Subframe

The datapath subframe is shown in Figure A.3. It displays all 16 registers in the AM2903 and the other registers, and allows a user to change them. It also displays the inputs and outputs of the ALU, which are not registers and thus cannot be changed by the user ( indicated by '->' ).

(2) Pipeline Register Subframe

Figure A.5 Alter/display frame ( control store subframe )

Figure A.6 Alter/display frame ( main store subframe )

The pipeline register subframe is shown in Figure A.4. It displays all the fields in the pipeline register and allows the user to change them. It is also possible to load the pipeline register from an arbitrary location of the control store, or to store the pipeline register to an arbitrary location of the control store. Loading the pipeline register is done by changing the CSAR value at the bottom right of the screen. Storing the pipeline register is done by function 'a', which stores the current value of the pipeline register in the control store location addressed by the address field on the screen.

(3) Control Store Subframe

The control store subframe is shown in Figure A.5. This simply displays part of the control store on the screen, starting from the address specified by the address field on the screen.

Figure A.7 Alter/display frame ( trace table subframe )

A user is not allowed to change the control store contents on the screen, but there are two ways to change them: either set up the pipeline register and store it in the control store as explained above ( a far more sensible way than dealing with hexadecimal numbers ), or save the contents of the control store in a file from the recording frame, suspend the execution of the simulator, edit the file, resume execution with 'fg', and load the file back into the control store.

(4) Main Store Subframe

The main store subframe is shown in Figure A.6. This displays part of the main store on the screen, starting from the address specified by the address field on the screen. The user is not allowed to change the contents on the screen. To change the contents of the main store, the latter procedure described for the control store subframe has to be followed.

Figure A.8 Recording frame

(5) Trace Table Subframe

The trace table subframe is shown in Figure A.7. This displays part of the trace table on the screen, starting from the index specified by the address field on the screen. When the address field is zero and the trace table is not empty, this subframe displays the last entry ( the most recently updated entry ) at the bottom, so that the most recent 12 entries are shown. This feature eliminates the need to search for the last entry in the buffer of 256 entries. The last entry of the trace is clearly marked.
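Keeping a running write index into the circular 256-entry buffer is what makes this possible: the most recent entries can be printed directly, with no search for the tail. A minimal sketch with assumed names, not the simulator's actual code:

    #include <stdio.h>

    #define TRACE_SIZE 256

    /* Hypothetical circular trace buffer: next is the slot that will
       be written next, count is the number of valid entries
       (saturating at TRACE_SIZE).  Prints the most recent 'rows'
       entries, oldest first, ending at the last entry. */
    void show_recent(const unsigned short csar[TRACE_SIZE],
                     int next, int count, int rows)
    {
        if (rows > count)
            rows = count;
        for (int i = rows; i > 0; i--) {
            int idx = (next - i + TRACE_SIZE) % TRACE_SIZE;
            printf("%4x  %3x\n", idx, csar[idx]);
        }
    }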

Figure A.9 Performance data frame

A.4 Recording Frame

The recording frame is shown in Figure A.8. This frame allows a user to save the contents of the main store and the control store in files. To do this, he has to type the file name first and then type the appropriate function character.

A.5 Performance Data Frame

The performance data frame is shown in Figure A.9. This frame displays statistics collected during the simulation. Currently, the elapsed micro cycles, the number of wait states and the total elapsed time are displayed. All of these performance data can be reset to zero.
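The displayed figures are mutually consistent under a simple model in which each wait state costs one machine cycle: with the values shown in Figure A.9, ( 621 + 119 ) cycles x 200 ns = 148.0 usec. A sketch of that bookkeeping ( the rounding rule for wait states is an assumption, chosen to match the displayed data ):

    #include <stdio.h>

    /* Illustrative elapsed-time bookkeeping: each wait state costs
       one machine cycle (an assumed rule, consistent with Figure A.9),
       and a 350 ns memory read against a 200 ns machine cycle needs
       one wait state.  Times are in nanoseconds, as in the
       initialization frame. */
    int main(void)
    {
        long cycle_ns = 200, mem_read_ns = 350;
        long cycles = 621, waits = 119;      /* values from Figure A.9 */

        /* Wait states per access: excess memory time, rounded up to
           whole machine cycles. */
        long per_access = (mem_read_ns > cycle_ns)
            ? (mem_read_ns - cycle_ns + cycle_ns - 1) / cycle_ns : 0;

        double total_us = (double)(cycles + waits) * cycle_ns / 1000.0;
        printf("%ld wait state(s) per read, total = %.1f usec\n",
               per_access, total_us);        /* prints 148.0 usec */
        return 0;
    }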

APPENDIX B

2-3 TREE INSERTION ALGORITHM

program tree( input, output, data );

type
  elementtype = record
    key: integer;  (* other fields as warranted *)
  end;
  nodetypes = ( leaf, interior );
  nodeptr = ^twothreenode;
  twothreenode = record
    case kind: nodetypes of
      leaf: ( element: elementtype );
      interior: ( firstchild: nodeptr;
                  secondchild: nodeptr;
                  thirdchild: nodeptr;
                  lowofsecond: integer;
                  lowofthird: integer )
  end;

var
  root: nodeptr;
  newp: elementtype;
  numofdata, i: integer;
  data: file of char;

procedure Printtree( pnode: nodeptr );
begin
  writeln;
  if pnode^.kind = leaf then
    writeln( 'element = ':10, pnode^.element.key:3 )
  else
    begin

      writeln( 'lowofsecond = ':14, pnode^.lowofsecond:3 );
      writeln( 'lowofthird = ':14, pnode^.lowofthird:3 );
      if pnode^.firstchild <> nil then
        begin
          write( 'First child ' );
          Printtree( pnode^.firstchild );
        end;
      if pnode^.secondchild <> nil then
        begin
          write( 'Second child ' );
          Printtree( pnode^.secondchild );
        end;
      if pnode^.thirdchild <> nil then
        begin
          write( 'Third child ' );
          Printtree( pnode^.thirdchild );
        end;
    end;
end;

procedure insert1( node: nodeptr;
                   x: elementtype;
                   (* x is to be inserted into the subtree of node *)
                   var pnew: nodeptr;
                   (* pointer to new node created to right of node *)
                   var low: integer );
                   (* smallest element in the subtree pointed to by pnew *)
var
  pback: nodeptr;
  lowback: integer;
  child: 1..3;  (* indicates which child of node is followed in recursive call *)
  w: nodeptr;   (* pointer to the child *)

begin
  pnew := nil;
  if node^.kind = leaf then
    begin
      if node^.element.key <> x.key then
        begin
          (* create new leaf holding x.key and return this node *)
          new( pnew, leaf );
          pnew^.kind := leaf;
          if node^.element.key < x.key then
            begin
              (* place x in new node to right of current node *)
              pnew^.element := x;
              low := x.key;
            end
          else
            begin
              (* x belongs to left of element at current node *)
              pnew^.element := node^.element;
              node^.element := x;
              low := pnew^.element.key;
            end;
        end;
    end
  else
    begin
      (* node is an interior node *)
      (* select the child of node that we must follow *)
      if x.key < node^.lowofsecond then
        begin
          child := 1;
          w := node^.firstchild;
        end
      else if ( node^.thirdchild = nil ) or ( x.key < node^.lowofthird ) then

        begin
          (* x is in second subtree *)
          child := 2;
          w := node^.secondchild;
        end
      else
        begin
          (* x is in third subtree *)
          child := 3;
          w := node^.thirdchild;
        end;
      insert1( w, x, pback, lowback );
      if pback <> nil then
        (* a new child of node must be inserted *)
        if node^.thirdchild = nil then
          (* node had only two children, so insert new node in proper place *)
          if child = 2 then
            begin
              node^.thirdchild := pback;
              node^.lowofthird := lowback;
            end
          else  (* child = 1 *)
            begin
              node^.thirdchild := node^.secondchild;
              node^.lowofthird := node^.lowofsecond;
              node^.secondchild := pback;
              node^.lowofsecond := lowback;
            end
        else
          begin
            (* node already had three children *)
            new( pnew, interior );

            pnew^.kind := interior;
            if child = 3 then
              begin
                (* pback and third child become children of new node *)
                pnew^.firstchild := node^.thirdchild;
                pnew^.secondchild := pback;
                pnew^.thirdchild := nil;
                pnew^.lowofsecond := lowback;
                (* lowofthird is undefined for pnew *)
                low := node^.lowofthird;
                node^.thirdchild := nil;
              end
            else
              begin
                (* child <= 2; move third child of node to pnew *)
                pnew^.secondchild := node^.thirdchild;
                pnew^.lowofsecond := node^.lowofthird;
                pnew^.thirdchild := nil;
                node^.thirdchild := nil;
              end;
            if child = 2 then
              begin
                (* pback becomes first child of pnew *)
                pnew^.firstchild := pback;
                low := lowback;
              end;
            if child = 1 then
              begin
                (* second child of node is moved to pnew;
                   pback becomes second child of node *)
                pnew^.firstchild := node^.secondchild;
                low := node^.lowofsecond;

                node^.secondchild := pback;
                node^.lowofsecond := lowback;
              end;
          end;
    end;
end;  (* insert1 *)

procedure INSERT( x: elementtype; var S: nodeptr );
var
  pback: nodeptr;    (* pointer to new node returned by insert1 *)
  lowback: integer;  (* low value in subtree of pback *)
  saveS: nodeptr;    (* place to store temporary copy of the pointer S *)
begin
  (* checks for S being empty or a single node should occur here,
     and an appropriate insertion procedure should be included *)
  insert1( S, x, pback, lowback );
  if pback <> nil then
    begin
      (* create new root; its children are now pointed to by S and pback *)
      saveS := S;
      new( S, interior );
      S^.kind := interior;
      S^.firstchild := saveS;
      S^.secondchild := pback;
      S^.lowofsecond := lowback;
      S^.thirdchild := nil;
    end;
end;  (* INSERT *)

begin
  new( root, leaf );
  root^.kind := leaf;
  root^.element.key := 5;
  Printtree( root );
  newp.key := 7;
  INSERT( newp, root );
  writeln( 'INSERT', newp.key );
  Printtree( root );
  newp.key := 3;
  INSERT( newp, root );
  writeln( 'INSERT', newp.key );
  Printtree( root );
  newp.key := 2;
  INSERT( newp, root );
  writeln( 'INSERT', newp.key );
  Printtree( root );
  newp.key := 4;
  INSERT( newp, root );
  writeln( 'INSERT', newp.key );
  Printtree( root );
end.

APPENDIX C

AMDASM ASSEMBLY LISTING OF 2-3 TREE INSERTION ALGORITHM

The following is the AMDASM definition source listing for the 2-3 tree insertion algorithm.

TITLE   Test Program 1 for Microprogram Simulator
WORD    96

WD:       EQU  B#1                                    ;Write disable
WDDEF:    DEF  19X, WD, 12X, 43X, WD, 20X             ;Write disable definition

; Datapath Control Signals
LMAR:     DEF  19X, WD, 12X, 6X, B#1, 36X, WD, 20X    ;Load MAR
LIR:      DEF  19X, WD, 12X, 7X, B#1, 35X, WD, 20X    ;Load IR
LDIN:     DEF  19X, WD, 12X, 8X, B#1, 34X, WD, 20X    ;Load Data In
LDOUT:    DEF  19X, WD, 12X, 9X, B#1, 33X, WD, 20X    ;Load Data Out
WRITE:    DEF  32X, 10X, B#1, 53X                     ;Memory Write
IRAM:     DEF  19X, WD, 12X, 11X, B#1, 31X, WD, 20X   ;IR RAM Select

; Definitions for AM2910 Sequencer
JZ:       DEF  19X, WD, 12X, 12X, H#0, 6X, B#1, 1VB#1, 8X, 11X, /
               WD, 20X                                ;Jump Zero
CJS:      DEF  19X, WD, 12X, 12X, H#1, 1X, 1VB#0, 4VH#0, B#0, 1VB#1, 8X, 11X, /
               WD, 8X, 12VH#000                       ;Cond JSB PL
JSB:      DEF  19X, WD, 12X, 12X, H#1, 6X, B#1, 1VB#1, 8X, 11X, /
               WD, 8X, 12VH#000                       ;Unconditional JSB PL
JMAP:     DEF  19X, WD, 12X, 12X, H#2, 6X, B#1, 1VB#1, 8X, 11X, /
               WD, 20X                                ;Jump Map
CJP:      DEF  19X, WD, 12X, 12X, H#3, 1X, 1VB#0, 4VH#0, B#0, 1VB#1, 8X, 11X, /
               WD, 8X, 12VH#000                       ;Cond Jump PL
JMP:      DEF  32X, 12X, H#3, 6X, B#1, 1VB#1, 8X, 20X, /
               12VH#000                               ;Unconditional Jump PL
PUSH:     DEF  19X, WD, 12X, 12X, H#4, 2X, 4VH#0, B#0, 1VB#1, 8X, 11X, /
               WD, 20X                                ;Push/Cond Load CNTR
HLC:      DEF  19X, WD, 12X, 12X, H#4, 6X, B#1, 1VB#1, 8X, 11X, /
               WD, 20X                                ;Push and Load CNTR
JSRP:     DEF  19X, WD, 12X, 12X, H#5, 6X, B#1, 1VB#1, 8X, 11X, /
               WD, 20X                                ;Cond JSB R/PL
CJV:      DEF  19X, WD, 12X, 12X, H#6, 2X, 4VH#0, B#0, 1VB#1, 8X, 11X, /
               WD, 20X                                ;Cond Jump Vector
JMPV:     DEF  19X, WD, 12X, 12X, H#6, 6X, B#1, 1VB#1, 8X, 11X, /
               WD, 20X                                ;Unconditional Jump Vector
JRP:      DEF  19X, WD, 12X, 12X, H#7, 6X, B#1, 1VB#1, 8X, 11X, /
               WD, 20X                                ;Cond Jump R/PL
RFCT:     DEF  19X, WD, 12X, 12X, H#8, 6X, B#1, 1VB#1, 8X, 11X, /
               WD, 20X                                ;Repeat Loop, CNTR <> 0
RPCT:     DEF  19X, WD, 12X, 12X, H#9, 6X, B#1, 1VB#1, 8X, 11X, /
               WD, 20X                                ;Repeat PL, CNTR <> 0
CRTN:     DEF  19X, WD, 12X, 12X, H#A, 1X, 1VB#0, 4VH#0, B#0, 1VB#1, 8X, 11X, /

               WD, 20X                                ;Cond RTN
RTN:      DEF  32X, 12X, H#A, 6X, B#1, 1VB#1, 8X, 32X ;Unconditional RTN
CJPP:     DEF  19X, WD, 12X, 12X, H#B, 6X, B#1, 1VB#1, 8X, 11X, /
               WD, 20X                                ;Cond Jump PL & Pop
LDCT:     DEF  19X, WD, 12X, 12X, H#C, 6X, B#1, 1VB#1, 8X, 11X, /
               WD, 20X                                ;Load CNTR & Cont
LOOP:     DEF  19X, WD, 12X, 12X, H#D, 6X, B#1, 1VB#1, 8X, 11X, /
               WD, 20X                                ;Test End Loop
CONT:     DEF  32X, 12X, H#E, 6X, B#1, 1VB#1, 8X, 32X ;Continue
TWB:      DEF  19X, WD, 12X, 12X, H#F, 6X, B#1, 1VB#1, 8X, 11X, /
               WD, 20X                                ;Three-way Branch

; Register Load ( AM2910 )
RLD:      EQU  B#0                                    ;Register Load

; Datapath Definition
DATAPATH: DEF  19X, WD, 12X, 24X, 4VH#0, 4VH#0, 4VH#C, 4VH#0, /
               3VQ#0, 1VB#1, 1VB#0, 19X

; Register Definition
R0:       EQU  H#0
R1:       EQU  H#1
R2:       EQU  H#2
R3:       EQU  H#3
R4:       EQU  H#4
R5:       EQU  H#5
R6:       EQU  H#6
R7:       EQU  H#7
R8:       EQU  H#8
R9:       EQU  H#9
R10:      EQU  H#A
R11:      EQU  H#B
R12:      EQU  H#C
R13:      EQU  H#D
R14:      EQU  H#E
R15:      EQU  H#F

RX:       EQU  H#1
RNODE:    EQU  H#2
RPNEW:    EQU  H#3
RLOW:     EQU  H#4
RPBACK:   EQU  H#5
RLOWBACK: EQU  H#6
RCHILD:   EQU  H#7
RW:       EQU  H#8
RNODEBEG: EQU  H#9
RFREENOD: EQU  H#A
RSTKBEG:  EQU  H#B
RTOPSTK:  EQU  H#C
RNEWPTR:  EQU  H#D
RNODETYP: EQU  H#E
RSAVE:    EQU  H#F

; Equates for ALU Destination Control
ADR:      EQU  H#0   ;Arithmetic Shift Down, Results into RAM
LDR:      EQU  H#1   ;Logical Shift Down, Results into RAM
ADRQ:     EQU  H#2   ;Arithmetic Shift Down, Results into RAM and Q
LDRQ:     EQU  H#3   ;Logical Shift Down, Results into RAM and Q
RPT:      EQU  H#4   ;Results into RAM, Generate Parity

LDQP:     EQU  H#5   ;Logical Shift Down Q, Generate Parity
QPT:      EQU  H#6   ;Results into Q, Generate Parity
RQPT:     EQU  H#7   ;Results into RAM and Q, Generate Parity
AUR:      EQU  H#8   ;Arithmetic Shift Up, Results into RAM
LUR:      EQU  H#9   ;Logical Shift Up, Results into RAM
AURQ:     EQU  H#A   ;Arithmetic Shift Up, Results into RAM and Q
LURQ:     EQU  H#B   ;Logical Shift Up, Results into RAM and Q
YBUS:     EQU  H#C   ;Results to Y Bus only
LUQ:      EQU  H#D   ;Logical Shift Up Q
SINEX:    EQU  H#E   ;Sign Extend
REG:      EQU  H#F   ;Results to RAM, Sign Extend

; Equates for ALU Functions
HIGH:     EQU  H#0   ;Fi = 1
SUBR:     EQU  H#1   ;Subtract R from S
SUBS:     EQU  H#2   ;Subtract S from R
ADD:      EQU  H#3   ;Add R and S
PASS:     EQU  H#4   ;Pass S
COMPLS:   EQU  H#5   ;2's Complement of S
PASSR:    EQU  H#6   ;Pass R
COMPLR:   EQU  H#7   ;2's Complement of R
LOW:      EQU  H#8   ;Fi = 0
NOTRS:    EQU  H#9   ;Complement R, AND with S
EXNOR:    EQU  H#A   ;Exclusive NOR R with S
EXOR:     EQU  H#B   ;Exclusive OR R with S
AND:      EQU  H#C   ;AND R with S
NOR:      EQU  H#D   ;NOR R with S
NAND:     EQU  H#E   ;NAND R with S
OR:       EQU  H#F   ;OR R with S

; ALU Operand Sources
NOTSQ:    EQU  Q#5   ;S = not Q
SQ:       EQU  Q#7   ;S = Q
SAB:      EQU  B#00  ;R = RAM A, S = RAM B
XADB:     EQU  B#01  ;R = RAM A, S = DIN
XDAB:     EQU  B#10  ;R = PL reg, S = RAM B
XDADB:    EQU  B#11  ;R = PL reg, S = DIN

; AM2903 Control Signals
WE:       EQU  B#0   ;Write Enable
DISY:     EQU  B#1   ;Disable Y

; Immediate Data
IMMD:     DEF  19X, WD, 12X, 43X, WD, 4X, 16VH#

; Load buffers
LMARBUF:  DEF  32X, B#1, 63X
LDINBUF:  DEF  33X, B#1, 62X
LDOUTBUF: DEF  34X, B#1, 61X
LSTATBUF: DEF  35X, B#1, 60X

; Various Instructions
IMMDLOAD: DEF  19X, WD, 6X, XDADB, 4X, 24X, 4X, 4VH#0, YBUS, PASSR, SQ, /
               WE, B#0, 3X, 16VH#
CALADREG: DEF  19X, WD, 6X, SAB, 4X, 24X, 4X, 4VH#0, YBUS, PASS, NOTSQ, /

               WD, B#0, 3X, 16X
CALADIMD: DEF  19X, WD, 6X, XDAB, 4X, 24X, 4X, 4VH#0, YBUS, ADD, NOTSQ, /
               WD, B#0, 3X, 16VH#0000
LDNTESTI: DEF  19X, WD, 6X, XDADB, 4X, 8X, B#1, 15X, 4X, 4X, YBUS, SUBR, NOTSQ, /
               WD, B#0, B#1, 2X, 16VH#
LDNSUBR:  DEF  19X, WD, 6X, XADB, 4X, 8X, B#1, 15X, 4VH#0, 4X, YBUS, SUBR, NOTSQ, /
               WD, B#0, B#1, 2X, 16X
REGLOAD:  DEF  19X, WD, 6X, XADB, 4X, 24X, 4VH#0, 4VH#0, YBUS, PASSR, SQ, /
               WE, B#0, 3X, 16X
ADDNLOAD: DEF  19X, WD, 6X, XDAB, 4X, 24X, 4X, 4VH#0, YBUS, ADD, NOTSQ, /
               WE, B#0, 3X, 16VH#0000
PASSIMD:  DEF  19X, WD, 6X, XDADB, 4X, 24X, 4X, 4X, YBUS, PASSR, SQ, /
               WD, B#0, 3X, 16VH#
PASSREG:  DEF  19X, WD, 6X, XADB, 4X, 24X, 4VH#0, 4X, YBUS, PASSR, SQ, /
               WD, B#0, 3X, 16X
PASSDIN:  DEF  19X, WD, 6X, XADB, 4X, 8X, B#1, 15X, 4X, 4X, YBUS, PASS, NOTSQ, /
               WD, B#0, 3X, 16X
DINLOAD:  DEF  19X, WD, 6X, XADB, 4X, 8X, B#1, 15X, 4X, 4VH#0, YBUS, PASS, NOTSQ, /
               WE, B#0, 3X, 16X
REGTESTI: DEF  19X, WD, 6X, XDAB, 4X, 24X, 4X, 4VH#0, YBUS, SUBR, NOTSQ, /
               WE, B#0, B#1, 2X, 16VH#0000

; Constants
NOT:      EQU  B#1
IFZERO:   EQU  H#1
1'Nt:     EQU  H#2
ONE16:    EQU  16H#0001
NEGONE16: EQU  16H#FFFF
ZERO16:   EQU  16H#0000
HIGH16:   EQU  16H#FFFF
LOW16:    EQU  16H#0000
NIL:      EQU  16H#FFFF
NODETYPE: EQU  16H#0000
ELEMENT:  EQU  16H#0001
LEAF:     EQU  16H#000F
INTERIOR: EQU  16H#00F0
FSTCHILD: EQU  16H#0001
SNDCHILD: EQU  16H#0002
TRDCHILD: EQU  16H#0003
LOWOFSND: EQU  16H#0004
LOWOFTRD: EQU  16H#0005
QUIT:     EQU  12H#3FF   ;highest legal cs address

END

The following is a partial AMDASM assembly listing of the 2-3 tree insertion algorithm. The AMDASM assembly source code was written by hand for each line of the Pascal program. For each line of AMDASM source code, a 96-bit pattern is generated, in which X indicates a "don't care".

[ The generated 96-bit patterns, garbled in the scan, are omitted from the listing below. ]

 989  000EF  ;procedure INSERT( x: elementtype; var S: nodeptr );
 991  000EF  ;var pback: nodeptr;    (* pointer to new node returned by insert1 *)
 992  000EF  ;    lowback: integer;  (* low value in subtree of pback *)
 993  000EF  ;    saveS: nodeptr;    (* place to store a temporary copy of the pointer S *)
 995  000EF  ;begin
 996  000EF  ;  (* checks for S being empty or a single node should occur here,
 997  000EF  ;     and an appropriate insertion procedure should be included *)
 998  000EF  ;  (* for our purposes, we assume that S has a 'legal' 2-3 tree *)
1000  000EF  ;  insert1( S, x, pback, lowback );
1002  000EF  INSERT:  JSB, INSERT1
1004  000F0  ;  Link return argument
1006  000F0  REGLOAD RPNEW, RPBACK &
1007  000F0  / CONT
1009  000F1  REGLOAD RLOW, RLOWBACK &
1010  000F1  / CONT
1012  000F2  ;  if pback <> nil
1014  000F2  REGTESTI RPBACK, NIL & LSTATBUF &
1015  000F2  / CONT
1017  000F3  CRTN, IFZERO
1019  000F4  ;  then
1020  000F4  ;    begin
1021  000F4  ;      (* create new root; its children are now pointed to by S
1022  000F4  ;         and pback *)
1024  000F4  ;      saveS := S;
1026  000F4  REGLOAD RNODE, RSAVE &
1027  000F4  / CONT
1029  000F5  ;      new( S, interior );

1031  000F5  REGLOAD RFREENOD, RNEWPTR &
1032  000F5  / CONT
1034  000F6  ADDNLOAD RFREENOD, H#0006 &           ;increment node storage offset
1035  000F6  / CONT
1037  000F7  REGLOAD RNEWPTR, RNODE &              ;RNODE <- pointer from 'new'
1038  000F7  / CONT
1040  000F8  ;      S^.kind := interior;
1042  000F8  PASSIMD INTERIOR & LDOUTBUF &         ;DOUT <- interior
1043  000F8  / CONT
1045  000F9  LDOUT &
1046  000F9  / CONT
1048  000FA  CALADREG RNODE & LMARBUF &            ;calculate address
1049  000FA  / CONT
1051  000FB  LMAR & WRITE &                        ;write to &RNODE
1052  000FB  / CONT
1054  000FC  ;      S^.firstchild := saveS;
1056  000FC  PASSREG RSAVE & LDOUTBUF &            ;DOUT <- saveS
1057  000FC  / CONT
1059  000FD  LDOUT &
1060  000FD  / CONT
1062  000FE  CALADIMD RNODE, FSTCHILD & LMARBUF &  ;calculate address
1063  000FE  / CONT
1065  000FF  LMAR & WRITE &                        ;write to &RNODE
1066  000FF  / CONT

1068  00100  ;      S^.secondchild := pback;
1070  00100  PASSREG RPBACK & LDOUTBUF &           ;DOUT <- RPBACK
1071  00100  / CONT
1073  00101  LDOUT &
1074  00101  / CONT
1076  00102  CALADIMD RNODE, SNDCHILD & LMARBUF &  ;calculate address
1077  00102  / CONT
1079  00103  LMAR & WRITE &                        ;write to &RNODE
1080  00103  / CONT
1082  00104  ;      S^.thirdchild := nil;
1084  00104  PASSIMD NIL & LDOUTBUF &              ;DOUT <- NIL
1085  00104  / CONT
1087  00105  LDOUT &
1088  00105  / CONT
1090  00106  CALADIMD RNODE, TRDCHILD & LMARBUF &  ;calculate address
1091  00106  / CONT
1093  00107  LMAR & WRITE &                        ;write to &RNODE
1094  00107  / CONT
1096  00108  ;      S^.lowofsecond := lowback;
1098  00108  PASSREG RLOWBACK & LDOUTBUF &         ;DOUT <- RLOWBACK
1099  00108  / CONT

1101  00109  LDOUT &
1102  00109  / CONT
1104  0010A  CALADIMD RNODE, LOWOFSND & LMARBUF &  ;calculate address
1105  0010A  / CONT
1107  0010B  LMAR & WRITE &                        ;write to &RNODE
1108  0010B  / RTN
1110  0010C  ;    end;
1112  0010C  ;end;  (* INSERT *)
1119  0010C  ;begin
1121  0010C  ;  Initialize 'start node' address and 'start stack' address
1123  0010C  START:  IMMDLOAD RNODEBEG, H#0000 &
1124  0010C  / CONT
1126  0010D  IMMDLOAD RSTKBEG, H#00F0 &
1127  0010D  / CONT
1129  0010E  IMMDLOAD RTOPSTK, H#00F0 &
1130  0010E  / CONT
1132  0010F  ;  reset( data );
1134  0010F  ;  new( root, leaf );
1136  0010F  REGLOAD RFREENOD, RNEWPTR &
1137  0010F  / CONT
1139  00110  ADDNLOAD RFREENOD, H#0002 &           ;increment node storage offset

1140  00110  / CONT
1142  00111  REGLOAD RNEWPTR, RNODE &              ;RPNEW <- pointer from 'new'
1143  00111  / CONT
1145  00112  ;  root^.kind := leaf;
1147  00112  PASSIMD LEAF & LDOUTBUF &             ;DOUT <- leaf
1148  00112  / CONT
1150  00113  LDOUT &
1151  00113  / CONT
1153  00114  CALADREG RNODE & LMARBUF &            ;calculate address
1154  00114  / CONT
1156  00115  LMAR & WRITE &                        ;write to &RNODE
1157  00115  / CONT
1159  00116  ;  root^.element.key := 5;
1161  00116  PASSIMD H#0005 & LDOUTBUF &           ;DOUT <- 5
1162  00116  / CONT
1164  00117  LDOUT &
1165  00117  / CONT
1167  00118  CALADIMD RNODE, ELEMENT & LMARBUF &
1168  00118  / CONT
1170  00119  LMAR & WRITE &                        ;write to element of &RNODE
1171  00119  / CONT
1173  0011A  ;  Printtree( root );
1175  0011A  ;(*
1176  0011A  ;  read( data, numofdata );
1177  0011A  ;  for i := 1 to numofdata do
1178  0011A  ;    begin
1179  0011A  ;      readln( data, newp.key );
1180  0011A  ;*)
1181  0011A  ;  INSERT( 7, root );
1183  0011A  IMMDLOAD RX, H#0007 &
1184  0011A  / CONT
1186  0011B  JSB, INSERT
1188  0011C  ;  INSERT( 3, root );
1190  0011C  IMMDLOAD RX, H#0003 &
1191  0011C  / CONT

1193  0011D  JSB, INSERT
1195  0011E  ;  INSERT( 2, root );
1197  0011E  IMMDLOAD RX, H#0002 &
1198  0011E  / CONT
1200  0011F  JSB, INSERT
1202  00120  ;  INSERT( 4, root );
1204  00120  IMMDLOAD RX, H#0004 &
1205  00120  / CONT
1207  00121  JSB, INSERT
1209  00122  ;(*
1210  00122  ;  writeln; writeln( 'INSERT', newp );
1211  00122  ;  Printtree( root );
1212  00122  ;  end;
1213  00122  ;*)
1215  00122  ;end.
1217  00122  JMP, QUIT & WDDEF                     ;Stop the program
1218  00123  ;                                      by jumping to the highest address

BIBLIOGRAPHY

[AGE76] T. Agerwala, "Microprogram Optimization: A Survey," IEEE Transactions on Computers, Vol. C-25, No. 10, October 1976, pp. 962-973.

[AHO74] A. V. Aho and J. D. Ullman, "Principles of Compiler Design," Reading, MA: Addison-Wesley, 1974.

[BAN79] U. Banerjee, D. J. Kuck, and R. A. Towle, "Time and Parallel Processor Bounds for Fortran-Like Loops," IEEE Transactions on Computers, Vol. C-28, No. 9, September 1979, pp. 660-670.

[DAS76] S. Dasgupta and J. Tartar, "The Identification of Maximal Parallelism in Straight Line Microprograms," IEEE Transactions on Computers, Vol. C-25, No. 10, October 1976, pp. 986-991.

[DAS79] S. Dasgupta, "The Organization of Microprogram Stores," ACM Computing Surveys, Vol. 11, March 1979, pp. 39-65.

[DAS80] S. Dasgupta, "Some Aspects of High Level Microprogramming," ACM Computing Surveys, Vol. 12, No. 3, September 1980, pp. 295-323.

[DAV78] S. Davidson and B. D. Shriver, "An Overview of Firmware Engineering," Computer, Vol. 11, No. 5, May 1978, pp. 21-33.

[DAV81] S. Davidson, D. Landskov, B. D. Shriver, and P. W. Mallett, "Some Experiments in Local Microcode Compaction for Horizontal Machines," IEEE Transactions on Computers, Vol. C-30, No. 7, July 1981, pp. 460-477.

[DEW76] D. J. DeWitt, "A Machine Independent Approach to the Production of Optimal Horizontal Microcode," Ph.D. Dissertation, The University of Michigan, August 1976.

[FIS79] J. A. Fisher, "The Optimization of Horizontal Microcode within and beyond Basic Blocks: Application of Processor Scheduling with Resources," Ph.D. Dissertation, New York University, New York City, October 1979.

[FIS81a] J. A. Fisher, "Trace Scheduling: A Technique for Global Microcode Compaction," IEEE Transactions on Computers, Vol. C-30, No. 7, July 1981, pp. 478-490.

[FIS81b] J. A. Fisher, D. Landskov, and B. D. Shriver, "Microcode Compaction: Looking Backward and Looking Forward," Proceedings of the National Computer Conference, 1981, pp. 95-102.

[FIS82] J. A. Fisher, "Very Long Instruction Word Architectures and the ELI-512," Research Report #253, Yale University, Department of Computer Science, December 1982.

[ISO83] S. Isoda, Y. Kobayashi, and T. Ishida, "Global Compaction of Horizontal Microprograms Based on the Generalized Data Dependency Graph," IEEE Transactions on Computers, Vol. C-32, No. 10, October 1983, pp. 922-933.

[LAH83] J. Lah and D. E. Atkins, "Tree Compaction of Microprograms," Proceedings of the 16th Annual Microprogramming Workshop, October 1983, pp. 11-22.

[LAN80] D. Landskov, S. Davidson, B. Shriver, and P. W. Mallett, "Local Microcode Compaction Techniques," ACM Computing Surveys, Vol. 12, No. 3, September 1980, pp. 261-294.

[LIN83] J. L. Linn, "SRDAG Compaction - A Generalization of Trace Scheduling to Increase the Use of Global Context Information," Proceedings of the 16th Annual Microprogramming Workshop, October 1983, pp. 11-22.

[MAL78] P. W. Mallett, "Methods of Compacting Microprograms," Ph.D. Thesis, University of Southwestern Louisiana, December 1978.

[MYE81] G. J. Myers and D. G. Hocker, "The Use of Software Simulators in the Testing and Debugging of Microprogram Logic," IEEE Transactions on Computers, Vol. C-30, No. 7, July 1981, pp. 519-523.

[NIC84] A. Nicolau, Personal communication.

[PAR81] A. C. Parker and W. T. Wilner, "Microprogramming - The Challenge of VLSI," Proceedings of the National Computer Conference, 1981, pp. 63-68.

[PAT76] D. A. Patterson, "STRUM: Structured Microprogram Development System for Correct Firmware," IEEE Transactions on Computers, Vol. C-25, No. 10, October 1976, pp. 974-985.

[RAM74] C. V. Ramamoorthy and M. Tsuchiya, "A High-Level Language for Horizontal Microprogramming," IEEE Transactions on Computers, Vol. C-23, No. 8, August 1974, pp. 791-801.

[ROB79] E. L. Robertson, "Microcode Bit Optimization is NP-complete," IEEE Transactions on Computers, Vol. C-28, No. 4, April 1979, pp. 316-319.

[SAL76] A. B. Salisbury, "Microprogrammable Computer Architecture," American Elsevier Publishing Co., Inc., 1976.

[SIN80] M. Sint, "A Survey of High Level Microprogramming Languages," Proceedings of the 13th Annual Microprogramming Workshop, November 1980, pp. 141-153.

[TOK78] M. Tokoro, T. Takizuka, E. Tamura, and I. Yamaura, "A Technique of Global Optimization of Microprograms," Proceedings of the 11th Annual Microprogramming Workshop, 1978.

[TOK81] M. Tokoro, E. Tamura, and T. Takizuka, "Optimization of Microprograms,"

IEEE Transactions on Computers, Vol. C-30, No. 7, July 1981, pp. 491-504.

[VEG82] S. R. Vegdahl, "Local Code Generation and Compaction in Optimizing Microcode Compilers," Ph.D. Thesis, Carnegie-Mellon University, December 1982.

[VEG83] S. R. Vegdahl, "A New Perspective on the Classical Microcode Compaction Problem," ACM SIGMICRO Newsletter, Vol. 14, No. 1, March 1983, pp. 11-14.

[WIL51] M. V. Wilkes, "The Best Way to Design an Automatic Calculating Machine," in Manchester University Computer Inaugural Conference, Ferranti, London, July 1951.

[WOO78] G. Wood, "On the Packing of Micro-operations into Micro-instruction Words," Proceedings of the 11th Annual Microprogramming Workshop, 1978.

[WOO79a] W. G. Wood, "The Computer Aided Design of Microprograms," Ph.D. Thesis, University of Edinburgh, Scotland, 1979.

[WOO79b] W. G. Wood, "Global Optimization of Microprograms Through Modular Control Constructs," Proceedings of the 12th Annual Microprogramming Workshop, 1979, pp. 1-6.

[YAU74] S. S. Yau, A. C. Schowe, and M. Tsuchiya, "On Storage Optimization of Horizontal Microprograms," Proceedings of the 7th Annual Workshop on Microprogramming, October 1974, pp. 98-106.
