THE UNIVERSITY OF MICHIGAN
SYSTEMS ENGINEERING LABORATORY
Department of Electrical and Computer Engineering
College of Engineering

SEL Technical Report No. 59

OPTIMAL SELECTION OF FUNCTIONAL HARDWARE UNITS FOR MICROPROGRAM CONTROLLED CENTRAL PROCESSING UNITS

by David Leary Hinshaw
under the direction of Professor Keki B. Irani

April 1972

Under contract with Rome Air Development Center, Research and Technology Division, Griffiss Air Force Base, New York. Contract No. F30602-69-C-0214

ACKNOWLEDGMENTS

The research presented in this dissertation has been made possible through the efforts of many people. Professors K. B. Irani, B. W. Arden, B. A. Galler, A. W. Naylor, and F. H. Westervelt as members of my doctoral committee provided both guidance and encouragement. I would particularly like to thank Professor K. B. Irani who acted as chairman of my doctoral committee. Special thanks must also go to Dr. M. P. Ristenbatt who provided needed support and encouragement during my first year at the University. I wish to thank Ms. Kass Barker who cheerfully typed several final versions of the dissertation, Ms. Ann Fulmer who typed Chapter IV, and Ms. Joyce Doneth who did the final corrections. Most importantly I wish to thank Carole. Her encouragement, patience, and understanding made this work possible. This work was supported by the Systems Engineering Laboratory of the University of Michigan through Rome Air Development Center Contract F30602-69-C-0214. David L. Hinshaw Ann Arbor, Michigan April 1972

To Carole

TABLE OF CONTENTS

LIST OF TABLES vi
LIST OF FIGURES viii
LIST OF SYMBOLS xi
ABSTRACT xvii

Chapter Page
I Introduction 1
1.1 Related Research Areas 4
1.2 CPU Design Method 7
1.3 Two Methods of Computer Design 12
II Program Type Language 16
2.1 Characteristics of PTL 17
2.2 Data Representation 31
2.3 Program Type Language Statements 40
2.4 Differences Between PTL and Other Languages 75
2.5 Translation of PTL Programs 83
III A Model of Microprogram Control 100
3.1 The Model 102
3.2 Cost Performance Equations 105
3.3 Base Hardware 128
IV Optimization 135
4.1 General Problem 136
4.2 General Approach 141
4.3 Optimization Procedure 153
V Hardware Selection Example 186
5.1 Design Method 187
5.2 Base Hardware and Hardware Options 188
5.3 Cost Performance Curves 209

TABLE OF CONTENTS (Continued)

Chapter Page
VI Conclusions 238
Appendix A Implementation of High Level Language Instructions in PTL and IBM 360 Model 25 Microcode 243
Appendix B Program Type Implementations 268
Appendix C What Is Microprogramming? 279
Appendix D Automatic Generation of B Tables 300
Bibliography 304

LIST OF TABLES

Table Page
2.2.1 Internal Data Representation Attributes 38
2.2.2 Internal Data Organization Attributes 39
3.2.1 Left Shift 1 End Around 134
5.2.1 Base Hardware 197
5.2.2 Base Hardware Instruction Format 199
5.2.3 Microinstruction Examples 203
5.2.4 Base Hardware Programming Considerations 204
5.2.5 Summary of Hardware Options 205
5.2.6 Hardware Option Costs 207
5.2.7 Added Microbits 208
5.3.1 Program Types Used in Examples 211
5.3.2 Time, Storage, and Hardware Option Requirements of Different Implementations for Program Types of Table 5.3.1 212
5.3.3 Frequency of Occurrence and Frequency of Execution of the Program Types of Table 5.3.1 216
5.3.4 Base Hardware Values 217
5.3.5 Optimum Cost Performance and Required Hardware Options for Standard Values 220
5.3.6 Optimum Cost Performance and Required Hardware Options - Control Storage Page Size 64 Words 223

LIST OF TABLES (Continued)

Table Page
5.3.7 Optimum Cost Performance and Required Hardware Options - Control Storage Page Size 32 Words 225
5.3.8 Optimal Cost Performance and Required Hardware Options - Frequency of Execution of Program Types 1,2,3 at 3.0 229
5.3.9 Optimal Cost Performance and Required Hardware Options - Frequency of Execution of Program Types 1,2,3 at 10.0 230
5.3.10 Optimum Cost Performance and Required Hardware Options - Frequency of Execution of Program Types 19,20,21 at 3.0 232
5.3.11 Optimum Cost Performance and Required Hardware Options - Frequency of Execution of Program Types 19,20,21 at 10.0 233
5.3.12 Optimum Cost Performance and Required Hardware Options - Control Storage Bit Cost $.05 236
5.3.13 Optimum Cost Performance and Required Hardware Options - Control Storage Bit Cost $.20 237

LIST OF FIGURES

Figure Page
1.2.1 CPU Design Method 11
2.5.1 Block Diagram of Process Required to Translate PTL Programs into Microcode 84
2.5.2 Function Performed by Blocks 2 and 3 of Figure 2.5.1 87
3.3.1 Minimum Base Hardware 131
4.1.1 Input Data for Optimization Program 140
4.2.1 Fastest Implementation Method 143
4.2.2 Exhaustive Search Technique - Control Memory Free 147
4.2.3 Exhaustive Search Technique 149
4.2.4 Find Minimal Element of QUE 152
4.3.1 Optimization Procedure 157
4.3.2 Algorithm for Calculating x' from x 160
4.3.3 Fastest Implementation and Alternates for Hardware Point X = 01011 167
4.3.4 Graphical Representation of Cost Performance Tradeoff 170
4.3.5 First Difference Selection Example 170
4.3.6 Slope Selection Method 173
5.2.1 Base Hardware Configuration 195
5.2.2 Base Hardware Plus Hardware Options 196

LIST OF FIGURES (Continued)

Figure Page
5.3.1 Optimum Cost Performance Curve - Standard Configuration 219
5.3.2 Optimum Cost Performance Curve - Control Storage Page Size 64 Words 222
5.3.3 Optimum Cost Performance Curve - Control Storage Page Size 32 Words 224
5.3.4 Optimum Cost Performance Curves - Control Storage Page Size 128, 64, 32 Words 226
5.3.5 Optimum Cost Performance Curves - Varying Frequency of Execution of Program Types 1,2,3 228
5.3.6 Optimum Cost Performance Curves - Varying Frequency of Execution of Program Types 19,20,21 231
5.3.7 Optimum Cost Performance Curves - Control Storage Bit Costs $.05, $.10, $.20 235
A1 Model 2025 Processor Simplified Block Diagram 249
A2 Word Type 250
A3 Typical IBM 360 2025 Processor Microcode 252
C1 Wilkes' Microprogram Control 280
C2 Section of a CPU 293
C3 Microoperations for Figure C2 294
C4 Some Possible Register to Register Transfers for the CPU of Figure C2 and the Microoperations of Figure C3 295

LIST OF FIGURES (Continued)

Figure Page
C5 Simplified Block Diagram of Interdata Model 3 296
C6 Microoperations Interdata Model 3 297
C7 PDP-8 Microoperations 299
D1 Program Type Implementation 303

LIST OF SYMBOLS

This list contains, in roughly chronological order, the symbols which are used with some frequency in this dissertation.

Symbol Reference Significance
P 3.2 List of program types.
H 3.2 The set of hardware options.
Bj 3.2 The implementation matrix; specifies which hardware options are required by the implementation methods for the jth program type.
R' 3.2 Some subset of H.
Ci 3.2 Cost vector for the ith hardware option.
Uij 3.2 The inclusion vector; specifies which options must be present if the jth element of Ci is used for the cost of the ith hardware option.
Xij 3.2 The exclusion vector; specifies which options must not be present if the jth element of Ci is used for the cost of the ith hardware option.
p(Rk, Uij, Xij) 3.2 The p function has a value of 1 if Rk satisfies the requirements specified by Uij and Xij.

Symbol Reference Significance
O 3.2 The occurrence vector; gives an estimate of the number of occurrences of each program type in a typical program.
F 3.2 The frequency vector; gives an estimate of the frequency of execution of each occurrence of each program type in a typical program.
Vj 3.2 The execution time vector; the microcycles required to perform the jth program type function for each implementation method.
Mj 3.2 Microbit vector for the jth hardware option.
w 3.2 The number of word forms used by the base hardware.
Wj 3.2 The word form vector; specifies to which of the word forms the microbits required by the jth hardware option are to be added.
cb 3.2 Cost of the base hardware.
cs 3.2 Cost of a bit of control store.
mb 3.2 Base microbits.
k 3.2 Maximum number of bits per field (field encoding limit).

Symbol Reference Significance
E 3.2 The subroutine selection vector; indicates which program types will be implemented as in-line code and which will be implemented as subroutines.
sB 3.2 Number of microinstructions required to initiate a branch.
sR 3.2 Number of microinstructions required to initiate a return from a subroutine.
Sj 3.2 The storage vector; lists the number of microinstructions required for each implementation of the jth program type.
N' 3.2 Selection vector; lists the current choice of implementation method for each program type.
CH 3.2 Cost of hardware options and base hardware.
M(R) 3.2 Width of the microinstruction word.
CNS 3.2 Cost of control store for nonsubroutine implementations.
CSS 3.2 Cost of control store for subroutine implementations.
TNS 3.2 Execution time, nonsubroutine implementations.
TS 3.2 Execution time, subroutine implementations.

Symbol Reference Significance
T(N) 3.2 Average execution time for an implementation set.
P 3.2 Performance, a ratio of execution times.
Pr 4.3 Required performance.
CMT 4.3 Cost of hardware options and minimum control storage.
X 4.3 Hardware option set.
CT 4.3 Cost of hardware options and storage for fastest implementations.
C(x) 4.3 Minimum cost of hardware point x.
n(x) 4.3 Numerical ordering of x.
h(x,x') 4.3 Hardware options of hardware option points x'', n(x) < n(x'') < n(x').
NP(x) 4.3 Computes x', the next hardware option point after x in numerical order which could have a cost less than x.
dTljk 4.3 Increase in execution time if the lth implementation method for the jth program type is replaced by the kth implementation method for the jth program type.
dSljk 4.3 Reduction in control storage if the lth implementation method for the jth program type is replaced by the kth implementation method for the jth program type.

Symbol Reference Significance
dT 4.3 Total increase in execution time.
dS 4.3 Total reduction in storage.
aji 4.3 The ith implementation of the jth program type.
Dj 4.3 The total number of implementations of the jth program type.
a 4.3 Set of all program type implementations.
Bj 4.3 Bj = {aji : aji in a}; the set of implementations of the jth program type.
I set 4.3 An I set contains one and only one element from each Bj set.
I set 4.3 Set of fastest implementations for a hardware point.
Tmax 4.3 Maximum acceptable increase in execution time.
Sreq 4.3 Required savings in control storage.
IN 4.3 I set after the Nth iteration of the slope selection algorithm.
S(IN) 4.3 Number of words of control storage required by implementation set IN.
T(IN) 4.3 Number of microcycles to execute implementation set IN a number of times defined by F and O.

Symbol Reference Significance AT(IN) 4.3 AT(IN) = T (IN) - T(I1) AS(IN) 4.3 AS(IN) = S(I1 ) - SN) xvi

ABSTRACT

Optimal Selection of Functional Components for Microprogrammable Central Processing Units

by David Leary Hinshaw
Chairman: Keki B. Irani

The introduction and the wide acceptance of both microprogram control and medium to large scale integrated circuitry have drastically altered the manner in which central processing units are implemented. The use of microprogram control and the tendency to acquire hardware in functional units allow a microlevel hardware software tradeoff to occur during the design of the central processing unit. The combined effect of microprogram control and the availability of powerful functional units has been to alter the CPU design environment. The overall goal of this research is to develop a design philosophy which is applicable to the new CPU design environment. The work proceeds in four steps: 1) the development of the general design method; 2) the initial development of the Program Type Language (PTL); 3) the development of the CPU model; and 4) the development of an efficient optimization procedure. The Program Type Language is an initial version of a machine independent intermediate language. The language is designed to meet two principal objectives. First, the language allows valid statistics to

be gathered on the functional requirements of the computer user. Each program type is selected to provide a basic function which is a major building block of a highly used computer area. Second, the language is machine independent and it meets the requirements of a UNCOL language. The CPU is modeled as a base hardware set plus a set of hardware options. The base hardware set defines a general architecture and the hardware options allow major design variations about that architecture. The hardware options consist mainly of functional units, and the use of the hardware option concept in the model is desirable in view of the rapid acceptance of large and medium scale integration. The cost of a hardware option is dependent upon the choice of the other hardware options, as is the number of microbits required to control each hardware option. The specification of microinstruction word size is very flexible, allowing both field encoding and word encoding to be modeled. The design method is user directed. The user writes or translates his programs into PTL. Since PTL is constructed to allow valid statistics to be gathered on user requirements, PTL provides a mechanism for determining user requirements. The design method makes optimum use of the human designer. The connection between the program types and the CPU model is the implementation methods. There is

one or more implementation method for each program type. Each implementation method specifies a CPU configuration. The human designer specifies the implementation methods and is concerned with performance cost (hardware) tradeoffs at the program type level. The optimization program performs the performance cost tradeoffs for the entire set of program types using the individual implementation methods supplied by the designer. The designer does not have to attempt to evaluate the combined effect of adding an optional piece of hardware on 100 or more program types. He can focus his attention on each program type one by one. The optimization technique involves three distinct parts. First, the use of bounding techniques allows the search procedure to be applied to the hardware option space (typically H space < 10) instead of the implementation space (typically I space > 10^12). Second, a modification of the Lawler-Bell technique is combined with a bounding procedure to eliminate large sections of the hardware option space. Third, a slope selection technique was developed to select the optimal implementation set for any fixed hardware option set.

Chapter I
INTRODUCTION

This study concerns itself with the design of central processing units. It is directed at the CPU design environment which is rapidly coming into existence. This new environment is influenced by four developments. The four developments are the increasing use of microprogram control, the decreasing cost of microcontrol storage, the increasing availability of both medium and large scale integration, and the increasing ratio of software development to hardware development costs. The changing environment will have an effect on CPU design. The effects caused by the developments mentioned above are interrelated but can be broadly classified as follows. The effect of medium and large scale integration is to force the design of computer systems which are composed of powerful functional blocks. The effect of microprogram control and inexpensive control storage is to allow hardware software tradeoffs to be made at the control storage level. The effect of the increased ratio of software development to hardware development will be twofold. First, more powerful instruction sets will be provided. This tends to move software functions into the CPU design area. Second, different approaches will be developed to allow software to be transferred

from computer to computer. The work presented here is directed toward the development of the tools required to optimally design computers in the new environment. The effort is extended over a broad area (languages, computer models, optimization) and culminates in a computer aided design method for the optimal selection of functional components for microprogrammable central processing units. The work is divided into several interrelated parts. First, a general design method is proposed. The design method requires an accurate method of predicting the operational requirements to be placed upon the microprogrammed computer. The operational requirements are the functions which the computer user wishes to perform. Second, a new language is proposed. The language is designed to allow accurate measurement of the operational requirements of programs written in the new language. Also, the language is designed to be both easily used and efficiently converted into microcode. Third, a cost performance model of a microprogrammed central processing unit is developed. The model views the central processing unit as a base hardware unit plus a set of optional hardware units which are used to implement a set of program types. The program types are the operations of the new language. Each program

type may have several implementations with each implementation requiring a different set of hardware options. Fourth, an optimization procedure is developed. The procedure allows the very large state space resulting from combinations of implementation methods and hardware options to be searched efficiently. The procedure selects the minimum cost set of hardware options and implementation methods which achieve a desired performance level. Last, several examples are presented.
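The selection task just described can be made concrete with a small brute-force sketch in modern notation. The data layout, function name, and option names below are illustrative assumptions, not the dissertation's notation; Chapter IV develops the actual procedure, which avoids this exhaustive enumeration.

```python
from itertools import product

def cheapest_configuration(program_types, option_cost, time_limit):
    """Enumerate one implementation per program type; return the cheapest
    (cost, choice) whose total execution time meets the performance limit.
    Each implementation is a tuple (required_options, microcycles, storage_cost)."""
    best = None
    for choice in product(*program_types):
        total_time = sum(impl[1] for impl in choice)
        if total_time > time_limit:
            continue  # performance requirement not met
        options = set().union(*(impl[0] for impl in choice))
        cost = sum(option_cost[o] for o in options) + sum(impl[2] for impl in choice)
        if best is None or cost < best[0]:
            best = (cost, choice)
    return best

# Two program types, each with a fast (option-hungry) and a slow (cheap) method.
program_types = [
    [({"shifter"}, 1, 2), (set(), 8, 6)],
    [({"shifter", "mask"}, 2, 3), (set(), 5, 4)],
]
cost, choice = cheapest_configuration(program_types, {"shifter": 50, "mask": 30},
                                      time_limit=10)
print(cost)  # cheapest configuration that meets the 10-microcycle budget
```

Note that a hardware option is paid for once even if several implementations share it, which is exactly why the selection cannot be made one program type at a time.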

1.1 Related Research

The areas of microprogram control and automated design are both related to the work presented in this report. There has been significant interest in both subjects and the history and advantages of each have been reported in numerous articles. Some excellent references on the subject of microprogram control are Tucker (T2, introduction to microprogramming; T1, emulation of IBM 360 series of computers), Flynn (F2, a key paper which stimulated interest in user microprogramming), Wilkes (W3, a historical review by the inventor of microprogramming), Rosin (R3, a good explanation of microprogramming and a strong supporter of user microprogramming), Davis (D3, a recent article with an annotated bibliography - IBM viewpoint), and Husson (H5, a book on the subject with excellent material on existing microprogrammable computers). Throughout the literature the terms vertical and horizontal microprogram control have been used to describe two broad classes of microprogram control. In addition, various authors have introduced "special terms" to describe particular concepts or physical implementations of microprogram control. Appendix C provides an introduction to microprogram control, gives a definition of vertical and horizontal microprogram control, and places the "special terms" in perspective.

The optimal selection of functional components can be viewed as a hardware (functional components) firmware (microcontrol) tradeoff. It is apparent that hardware firmware tradeoffs are made any time a microprogrammable computer is designed. However, no research reported in the literature examines the question of hardware firmware tradeoffs. The reported work has been mainly concerned with the optimization of performance by moving machine language routines into microcode (F2, H1, H2, R1, R3, S7, T3), the specification of the type of microprogram control which would achieve a speed cost tradeoff for a fixed architecture (G1, R2, R4), or the improvement of diagnostics via microprogram control (B1, C1, J1). The general area of design automation is very large and expanding. The most comprehensive survey of the area has been given by Breuer in two papers (B5, B6), the most recent printed in January 1972. It is interesting to note that Breuer predicted, as one of a number of predictions, that synthesis programs would have to be developed which would work with standard complex functions (i.e., functional units). There are no published papers that deal with the problems which are studied in this dissertation. Many of the papers are concerned with automated wire routing, card placement, card layout, etc. The papers

which deal with CPU design at a higher level are mainly concerned with the development of computer description languages (B3, D2, D5, G3, S2, S6), and the development of languages to describe the input output behavior of computers (B3, D6, F5, G2, M3, P1). None of the studies attempts to optimize the computer design in the sense of performing performance cost tradeoffs. The hardware design routines supporting the languages tend to provide a direct translation of the language descriptions.

1.2 CPU Design Method

The design method considered here attempts to combine the human designer and computer design aids in the most efficient fashion. The method is composed of four steps. First, a set of operations called program types is determined. The program types are the operations which the computer will be designed to implement. Second, statistics on the usage of each program type are gathered. The statistics on program type usage allow the computer to be designed with the performance calculations weighted according to program type usage. Third, the computer designer provides several different hardware combinations, each with different cost and performance figures, for each program type. Fourth, an optimization program selects and combines the multiple program type implementations provided in step three. The output from the optimization program is the minimum cost computer hardware configuration which achieves some desired average performance. The design philosophy (Figure 1.2.1) is implemented in the following fashion. The program types are provided as part of a proposed language. The language, which is called the Program Type Language (PTL), is described in Chapter II.
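The weighting of the performance calculation by program type usage (step two) can be sketched as follows. The vectors mirror the O (occurrence), F (frequency of execution), and V (execution time) quantities of the symbol list; the function name and sample values are assumptions for illustration only.

```python
def weighted_time(O, F, V, N):
    """Microcycles charged to one 'typical program': program type j occurs
    O[j] times, each occurrence executes F[j] times, and the selected
    implementation N[j] costs V[j][N[j]] microcycles per execution."""
    return sum(o * f * v[n] for o, f, v, n in zip(O, F, V, N))

O = [10.0, 4.0, 1.0]           # occurrences per typical program
F = [3.0, 1.0, 5.0]            # executions per occurrence
V = [[2, 6], [1, 4], [8, 20]]  # microcycles, one entry per implementation method
print(weighted_time(O, F, V, [0, 0, 0]))  # the all-fastest implementation set
```

A heavily executed program type (large O[j]*F[j]) dominates the sum, which is how the statistics steer hardware resources toward the operations that matter most.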

A new language was proposed instead of using an existing language because it was felt that no existing language combined the features (see Chapter II) which we required. The language is designed to allow valid statistics to be taken on operations which a user wishes to perform. It is not expected that all programs will be written in the new language, just as not all programs are currently written in machine languages. However, compilers, interpreters, and other heavily used programs will probably be written in PTL, with the remaining programs compiled into PTL. The statistics which are gathered specify the frequency with which an operation occurs in a typical program and the frequency with which each program occurrence is executed. The statistics would be gathered on programs for a particular computer installation or a particular set of users. The statistics are used by the optimization program to allocate resources (hardware options) to maximize average operation performance. As was stated before, the computer designer is not concerned with the efficient implementation of the Program Type Language as a whole. Instead, the designer is concerned with the implementation

of each individual operation of the language. The computer designer works with a hardware base and a set of hardware options. Such a model (the model is developed in Chapter III) seems reasonable in the light of the rapid acceptance of medium and large scale integration. The designer specifies several different implementations for each program type operation. Each different implementation of the same operation uses different sets of hardware options and would have different cost and performance figures. Thus the designer is working at the operation level and does not have to be concerned with tradeoffs affecting the implementation of the entire language. This elimination of one level of complexity should enable the designer to work more efficiently. The implementation methods, their required hardware options, and performance figures form the second input to the optimization program. The final input into the optimization program is a performance requirement. The performance requirement is given in terms of the average execution time of the operations of the Program Type Language. After accepting all inputs the optimization program picks the minimum cost system which will satisfy the performance requirements. The CPU design method is a part of the overall system design

philosophy. The language is machine independent; thus, programs written in the language (in particular, compilers, interpreters, etc.) may shift from machine to machine requiring only a retranslation. The use of the implementation method concept is very useful in the translation of the programs into microcode. The implementation methods are actually microcode routines. Therefore, the operations may be replaced in-line by the implementation microcode. The assignment of variables to storage presents some problems. The estimated execution time of each program type assumes that the operands are located in high speed storage. Thus, any extra data movement from core storage to high speed storage would not be accounted for in the performance computation unless an estimate of the number of such data transfers was known a priori. Thus the optimum assignment of data to high speed registers is important. However, the extension of existing register assignment algorithms should be sufficient (see Section 2.5). It is hoped that the design philosophy presented in this section will come into better focus as the language chapter (II), the model chapter (III), the optimization chapter (IV), and the examples chapter (V) are read.
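The in-line replacement of operations by their implementation microcode resembles macro expansion. A minimal sketch, with invented register-transfer mnemonics standing in for real microoperations:

```python
# Hypothetical microcode bodies for two program types; the mnemonics and
# register-transfer strings are illustrative, not actual microoperations.
MICROCODE = {
    "SHIFTL": ["A <- A shl 1"],
    "ADD":    ["B <- B + A", "set condition flags"],
}

def expand_inline(ptl_ops):
    """Replace each program type operation in-line by its microcode routine,
    as opposed to branching to a single shared subroutine copy."""
    microprogram = []
    for op in ptl_ops:
        microprogram.extend(MICROCODE[op])
    return microprogram

print(expand_inline(["SHIFTL", "ADD"]))
```

In-line expansion avoids the branch and return overhead of a subroutine call at the price of duplicating the routine's control storage for every occurrence, which is the subroutine/in-line tradeoff captured by the E vector in Chapter III.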

Figure 1.2.1 CPU Design Method. [Block diagram: the operations, statistics on operation usage, operation implementations with cost and performance figures, required hardware options, and required performance feed the optimal selection program; its outputs are the minimum cost system, the required hardware options, and the implementation methods.]

1.3 Two Methods of Computer Design

A description of the traditional method of computer design will be presented and compared with the computer design method presented in this dissertation. Because the computer field is changing very rapidly, and because there are many different groups of individuals designing computers, there are exceptions to what we describe as the traditional approach to computer design. However, most of the computer systems in existence today were designed in a manner similar to what will be called the traditional approach. The two methods arise from different hardware environments. The traditional approach developed when instructions were implemented via hardwired circuitry. The approach we are advocating assumes instructions are implemented via microprogram control. Thus the cost constraints imposed upon the two methods are different. The two computer design methods differ in the choice and implementation of instruction sets. In the traditional approach, the computer is designed to implement a limited set of machine or assembly language instructions. The instruction set usually includes arithmetic, logical, branching, and I/O instructions. The implementation of more complicated operations (for instance, list manipulation or square root) is usually accomplished in software at or above the assembly language level. The hardware implementation (assembly language) of

a more complicated instruction occurs only if the instruction has extraordinarily high usage. The most obvious examples of such "special" instructions are floating point instructions. The more complicated instructions were implemented infrequently for two reasons. First, the cost of implementation was prohibitive. Second, the requirements of the user were not well known. Thus, the problem is one of measuring or in some fashion asserting what the user wishes to do. The user's real desires are often masked because he has been forced to modify his methods to conform to a limited instruction set. In the method which we are suggesting, the first step is to try to determine what the user would truly wish to do. This decision is arrived at by studying higher level languages. Since higher level languages are designed specifically to allow a user to solve his problems in a natural fashion, the languages should represent a good measure of what people would like to do. In addition, attention will be paid to studies of compilers, sorting routines, graphic displays, etc. to determine what operations are required to implement these functions. The net result is a set of operations (program types) which are combined into an intermediate language. Each program type is a function which would be useful to a large user class. The set of program

types covers the set of operations typically offered by an assembly language and much more. The set of program types which will be proposed in Chapter II is not expected to be the ideal set or the final set of program types. It is expected that the set of program types will evolve and grow through feedback from users. Since program types are implemented via microprogram control, the modification and addition of program types should not be as difficult as it would be if they were implemented via hardware control. The two approaches evolve as follows. In the traditional approach, a machine language is developed and a computer is designed to efficiently implement the machine language. However, the machine language is not powerful in the sense of performing in one instruction most operations common to higher level languages, compilers, or execution systems. Therefore, each higher level language must be translated into machine language by a compiler unique to the computer's machine language. In the method we are suggesting a powerful intermediate language is used. The language is developed with the user in mind and has many operations common to higher level languages. For each computer, one program is developed which translates programs coded in the intermediate language into microcode. Since

the computer was designed to efficiently implement the operations of the intermediate language, such programs should run efficiently. Higher level languages are translated into the intermediate language and then into microcode. The match between the higher level language and the intermediate language should be good because of the power of the intermediate language. The total number of high level language compilers should also be reduced (this is the UNCOL (S7, S8, S9) concept and is explored further in Chapter II). With the traditional approach the assembly language is defined, and the higher level languages are defined. However, there is often a mismatch between the higher level languages and the assembly language, and thus a mismatch between the user and the computer. In our approach the intermediate language (program types) is chosen. This language contains those operations which a large set of users wish to do. Then the computer is designed to efficiently implement the intermediate language. Since the intermediate language is well matched to the user, there is a good match between the user and the computer.
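The compiler-count reduction behind the UNCOL concept is combinatorial: m higher level languages on n machines require m x n direct compilers, but only m + n translators when every language passes through a common intermediate language. A toy calculation, with assumed counts:

```python
def compilers_needed(m_languages, n_machines, via_intermediate):
    """Count translators: direct compilation needs one compiler per
    (language, machine) pair; a common intermediate language (the UNCOL
    idea) needs one front end per language plus one translator per machine."""
    if via_intermediate:
        return m_languages + n_machines
    return m_languages * n_machines

print(compilers_needed(6, 5, False))  # direct: 6 * 5 = 30 compilers
print(compilers_needed(6, 5, True))   # via intermediate: 6 + 5 = 11 translators
```

The advantage grows with both m and n, which is why the argument becomes stronger as languages and machines proliferate.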

Chapter II
PROGRAM TYPE LANGUAGE

The Program Type Language (PTL) is presented in this chapter. Actually, what is presented in this chapter is a language concept. PTL is the first solid representation of the concept. It is felt that PTL will change and grow, molding the language concept to the user's needs, through actual usage. The language concept and the resulting restrictions placed upon PTL are discussed in Section 2.1. In Sections 2.2 and 2.3 the Program Type Language is presented. In Section 2.4 the differences between PTL and other languages are discussed. In particular, the necessity for a new language instead of an existing language is discussed. In Section 2.5 the problems associated with the translation of PTL programs into microcode are discussed.

2.1 Characteristics of PTL

Programs written in PTL will be utilized for two different purposes. First, the programs, program statistics, and statistics on the expected use of the programs will be used to select optimal hardware configurations for central processing units. Second, the programs will be compiled into microcode. The first use of PTL will allow the design of CPU's (see Section 1.2) which are optimized for certain groups of programs or are optimized for heavily used operations. The second use of PTL will allow programs to execute with greater efficiency since programs written in microcode have been shown (T3, H4) to run faster than similar programs coded in machine languages. It is expected that compilers, interpreters, and other frequently used programs will be written in the Program Type Language and then will be translated into microcode. Thus, the language is designed for users working below the level of problem or procedure oriented languages. The Program Type Language must be designed to enhance the two major uses of the language. First, PTL must allow realistic statistics to be gathered on those functions a programmer is actually trying to implement. Second, the language should facilitate the translation of PTL programs into efficient microcode. What other features should be embodied in the language? The language should be easily used. The language should emphasize

the basic, but often powerful, operations required by the user. Thus, the new language (PTL) should embody the following features:
(1) allow valid functional statistics to be gathered,
(2) be easily translated,
(3) be machine independent,
(4) be easily used,
(5) emphasize basic operations.
We will discuss each requirement in more detail in the following paragraphs.

1. Statistics

The statistical information on functional usage (add, and, link list, etc.) within PTL programs is utilized to optimally select CPU hardware components. Therefore, the language must be designed to allow valid functional statistics to be taken on PTL programs. Otherwise, the selection of hardware options will not be optimal. For example, if the contents of a data quantity are to be shifted left one place and shift operations are not available in the language, the shift may be performed by a multiplication by two. However, programs written in such a language would not provide a good measure of those operations a programmer wishes to implement. Any time a

programmer implements a shift operation, the language masks his intent, since the operation must be implemented in an indirect manner. PTL was developed both by studying other programming languages and by studying the potential applications of the language. The basic operations required for arithmetic functions, string manipulation, bit manipulation, linked list manipulation, binary tree manipulation, stack manipulation, iteration control, branch control, and other operation types are included in the Program Type Language. This implies that the basic operations required to write scientific programs, compilers, interpreters, graphic programs, and many other program types (we believe most other program types) are included within PTL. Therefore, programs written in the Program Type Language should provide good statistics on the basic functions required by the program. There is a limit to the number of program types which will be provided by PTL. The problem is one of limiting the number of program types while still retaining the ability to gather realistic statistics on functional usage. To understand the problem better, consider what effect incorrect functional statistics might have on the selection of hardware components. First, suppose that shift operations are not provided as PTL operations. Then, if a programmer requires shift operations, he will write them in some manner using combinations of adds, multiplies, etc. Thus, the optimization program would not select a hardware configuration which included a

hardware shift option, even if the programs required a large number of shifts, because the shifts were done via adders, multipliers, etc. This is an example of not including as a PTL operation a function which could be implemented directly and easily in hardware. Such a hardware option (function) would not be chosen in the optimization process because there is no way to deduce from the programmer's expansion that the function would be utilized. As a second example, consider the function multiply, and let us suppose that multiply hardware is not available. The function will then have to be implemented using adders and shifters (assuming now that both operations are available). Thus, if no PTL multiply operation were available, the selection of hardware would not be affected, since the multiply operations would be implemented by the programmer using shifts and adds, and their frequency of occurrence would be measured and utilized in the optimization. In that sense the optimization would be acceptable. However, two problems occur. First, since the programmer specifies the implementation of multiply by utilizing other PTL operations, any inefficiency introduced by the selection of an inappropriate algorithm cannot be removed by the optimal hardware selection procedures and will remain. Second, additional overhead may be introduced in coupling the individual microcoded shift and add routines. That is,

if each operation (shift and add) were implemented as a microsubroutine, the branching overhead would be increased, since several calls to shift and add routines would be made instead of one call to the multiply routine. As pointed out in Section 2.5, the use of microsubroutines will turn out to be limited, and therefore the inefficiencies introduced by additional microsubroutine linkages will be minimal. With the preceding examples in mind, the selection criteria for program types will be explained. The program types attempt to span all functions which could be implemented as one-microcycle hardware options. Thus such functions as add, shift, logical disjunction, complement, etc., are provided. This selection rule relates to the first example above, where the effects upon hardware selection of not providing a shift operation were explored. However, some common sense must be used. One could specify an endless number of functions which perform Boolean operations on bits i, j, and k, and store the result in bit n of some word or words. All such functions could be done in one microcycle, but most would have a low utility and would clutter the language. Thus, a second criterion of utility is coupled with the first criterion. If an operation was expected to have a high usage (i.e., high utility) it was included in the language even if it was felt that

the operation could not be provided as a hardware option. If an operation could be provided as a hardware option but was felt to have zero utility or expected use, it was left out of the language. One final point should be made. The user is allowed to add operations to the language. The addition of operations can be accomplished in at least two ways. The language could be made extendable by the addition of a definition facility. The addition of a definition facility should be feasible if the statement formats are limited to those currently available in PTL. The definition facility would allow the definition of a new operation NAME and a list of operands. The operation would then be defined in terms of existing PTL operations using the list of operands specified in the format definition statement. Such a feature would allow a user to make PTL conform to his particular application. The definition facility would not affect the statistics and/or the hardware optimization, since the frequency of occurrence of a newly defined operation could be reflected back to its component PTL operations. It is true, as was discussed in the previous paragraphs, that any inefficiency introduced by the selection of an inappropriate algorithm cannot be removed by the optimal hardware selection procedure. The language can also be extended by providing a new implementation facility. The new implementation facility uses both the

programmer and the program type implementation designer. With the new implementation facility the programmer specifies a new operation, and the implementation designer provides a microcode implementation for the particular machine the programmer has available. The addition of a new implementation facility should be feasible if the statement formats are limited to those currently available in PTL. The method requires that the PTL translator be expandable: the translator must be able to accept a new operation and its implementation microcode without being rewritten. The advantage of using a new implementation facility would be the guarantee of an efficient implementation for the new operation. Unfortunately, a PTL program with a new operation could be executed only on computers for which the new program type had been implemented. Thus, the ability to move PTL programs from one computer to another would be reduced. The definition facility and the new implementation facility are applicable to different problems. The definition facility is useful to define operations which are efficiently implemented using combinations of existing PTL operations. The operations defined by the definition facility allow a programmer to write his program in a simpler form. The new implementation facility is useful when an operation is required which cannot be efficiently implemented (in terms of microcycles to execute) using combinations of existing PTL operations and should be implemented directly in microcode.
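The shift-and-add expansion of multiply discussed in the examples above can be sketched as follows. This is an illustrative model written in a modern notation, not PTL or microcode; the function name and the 16-bit word width are our assumptions.

```python
def shift_add_multiply(a, b, width=16):
    """Multiply two unsigned integers using only shifts and adds,
    the way a programmer without a multiply operation would."""
    product = 0
    for bit in range(width):
        if (b >> bit) & 1:       # is this bit of the multiplier set?
            product += a << bit  # add the correspondingly shifted multiplicand
    return product
```

Each set bit of the multiplier costs one shifted add, so a software multiply issues many shift and add calls where a hardware multiplier would need a single operation; this is exactly the branching overhead the text attributes to microsubroutine linkage.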

2. Translation

The language should aid, not obstruct, the translation of PTL programs into efficient microcode. PTL does not have any features, except named data quantities and condition sets, which require special effort to translate; both problems are solvable. The translation problem is discussed in Section 2.5. PTL is an intermediate language, and the programmer is required to formulate his problems in a rather strict fashion. For example, the language does not provide automatic conversion of data types. Thus, the programmer is forced into an awareness of any conversions which occur in his program. Automatic conversion of data types is a convenient feature and should be provided in higher level languages. However, PTL is not intended for people who write in higher level languages. Therefore, since data conversions are time consuming, they are not done automatically but must be specified by the programmer, in the hope that fewer will be used. The language allows the use of temporary storage locations which do not have to be declared. The temporary storage variables are therefore easily used and provide an effective and natural method of informing the translator of data items which are required only temporarily.

3. Machine Independence

The language should be machine independent for two reasons. First, if the language is machine independent, only one new compiler (a PTL compiler) will have to be written for each new machine. Second, since statistics on program utilization and statistics on PTL operation usage are used to select hardware, the language should be free of any machine dependence. The concept of reducing software development costs through the use of a machine independent intermediate language is not new. Probably the best documented arguments for the use of such a language are those presented to justify the UNCOL concept. The Universal Computer Oriented Language was reported upon [S10, S11] in 1958 by the Share Ad-Hoc Committee on Universal Languages. Their concept consists of three levels of languages. The highest level are the POL's (Problem Oriented Languages), the second level is the UNCOL language, and the lowest level is the machine language level (ML's). The idea is to provide only one translator for each higher level language. Each POL translator would be written in UNCOL [or PTL in our case] and produce its output in UNCOL [PTL]. In addition, only one translator would be required for each computer: that translator would compile the UNCOL language [PTL] into machine language [ML].

Thus, a compiler for a POL is written in UNCOL [PTL] and produces code in UNCOL [PTL]. The compiler can be run on any machine which has an UNCOL [PTL] to ML translator. If a new computer is developed, existing software can be used by developing a translator [UNCOL [PTL] to the new machine's ML] and recompiling existing software into the new ML. The potential savings in software development costs are significant. For example, if there are N higher level languages (POL's) and M computers with M different machine languages (ML's), then using the machine independent intermediate language (PTL), i.e., the UNCOL concept, only N + M translators must be developed, versus N x M translators using the traditional higher-level-language-to-machine-language approach. There is a cloud of doubt hanging over the previous discussion. If the UNCOL concept is a valid idea, why aren't people using the UNCOL language? The answer is simple: the UNCOL language was never fully developed. There were at least two attempts [S9, C2] to start an UNCOL language, but neither was a significant effort, and both produced languages considerably below the level of sophistication required for an UNCOL-type language. In discussions with T. J. Steel, one of the authors of the Share report on UNCOL and the developer of one of the preliminary versions of UNCOL, the

question of the demise of the UNCOL language was discussed. Two major reasons for the failure of UNCOL were given. First, at about the time the UNCOL language was under consideration, people became very optimistic about new techniques for compiler generation. Many people felt compilers would be produced very cheaply and very rapidly. Thus, part of the economic motivation for UNCOL was lost, as was funding for its development. Second, at the same time there was an interest in higher level languages, in particular the idea of the development of one higher level language. If one higher level language was going to be developed, there would be no reason to develop one universal intermediate language. Is there a need for a PTL-like language? Jean Sammet in her book on programming languages [S1] states that the motivation for UNCOL still exists. She calls UNCOL a significant but unimplemented concept. The economic motivation for such a language exists. Software is an ever increasing portion of the cost of computer system development. Higher level languages and problem oriented languages are appearing in ever increasing numbers. The need for a PTL-like language, that is, a machine independent intermediate language, does exist. If the language has machine dependent features, any hardware selection procedure would be biased toward the machine dependent features of the language. For example, if the language contained

integer operations but did not contain floating point operations, the hardware selection procedures would never select floating point hardware. In this sense, the language would be machine dependent, since it would be efficiently implemented only on machines with integer hardware units. These features are operation features and have been considered in the section on statistics. Thus, we are considering in this section only those features which allow PTL programs to be easily moved from one machine to another. The following features have been incorporated into PTL for machine independence. The language uses named data quantities, and there are no references to machine dependent features such as registers, local storage, or even main storage. The microtranslator will be responsible for the assignment of named data quantities to machine storage areas for a particular machine. The I/O commands are simple and cause the transmission of either data or I/O signals. This simplicity should allow reasonable I/O device flexibility and still provide independence. There is no standard precision assigned to floating or integer variables. The required maximum precision for each variable or set of variables is assigned by the programmer. Thus, the language does not assume some standard word length for data.
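The N + M versus N x M translator-count argument above can be made concrete with a quick calculation; the particular values of N and M here are hypothetical.

```python
def translators_needed(n_pols, m_machines, via_intermediate):
    """Count translators to develop: one per (POL, machine) pair when
    compiling directly, or one POL-to-intermediate translator per
    language plus one intermediate-to-ML translator per machine."""
    if via_intermediate:
        return n_pols + m_machines
    return n_pols * m_machines

# Hypothetical counts: 10 problem oriented languages, 20 machines.
print(translators_needed(10, 20, via_intermediate=False))  # 200 direct translators
print(translators_needed(10, 20, via_intermediate=True))   # 30 with a PTL level
```

The savings grow multiplicatively: every new language adds one translator rather than M, and every new machine adds one rather than N.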

4. Easily Used

The language should be easy to use. What constitutes a convenient, easy to use language means different things to different people and depends heavily upon which problems are being solved. Fortran is easy to use if mathematical expressions of the form (x² + y)/2 are being written. Fortran is more difficult to use if expressions of the form Σᵢ₌₁ʲ x(i) are being written. Fortran is very difficult to use if one wishes to interchange the 3rd and 4th bits of an integer format word in the computer memory. PTL is not easily used in the sense of writing FORTRAN-type arithmetic statements. An expression such as (x² + y)/2 would require several PTL instructions to implement. However, the DO operations of FORTRAN are provided in PTL. In addition, there are operations which manipulate linked lists, binary trees, and character strings. Therefore, the language could be said to be useful in the following sense: a large set of basic operations is provided, where each subset provides a powerful set of instructions for a particular problem area.
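The bit-interchange task mentioned above (swap the 3rd and 4th bits of a word) is exactly the kind of basic operation PTL aims to expose directly. A sketch, assuming bits are numbered from 1 at the least significant end; the function name is ours:

```python
def swap_bits(word, i, j):
    """Return `word` with bits i and j exchanged (bit 1 = least significant)."""
    bi = (word >> (i - 1)) & 1
    bj = (word >> (j - 1)) & 1
    if bi != bj:
        # The bits differ, so exchanging them means flipping both.
        word ^= (1 << (i - 1)) | (1 << (j - 1))
    return word
```

In a language with shift and logical operations this is a handful of one-microcycle steps; coded via arithmetic alone (divides, multiplies, remainders), the programmer's intent would be invisible to the statistics gathering described in this section.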

5. Basic Operations

The language should employ the basic but often powerful operations required by the user. This is essentially a restatement of 1 (statistics), 2 (easily translated), and 4 (easily used). What is really being implied is a limit on 1, 2, and 4. We do not want to supply every function possible; this would result in too many different operation types, which would complicate both the hardware selection process and the optimization process. Thus, the suggested set of operations tries to cover each area with a trade-off between user convenience and a limited number of operations. The string manipulation operations of PTL serve to illustrate the point we have been trying to make. The string manipulation instructions have evolved from a study of the SNOBOL language. The instructions are the basic instructions required for string manipulation. They allow a search for a pattern match, deletion, insertion, replacement, and concatenation of strings. The instruction set is not as full as SNOBOL's and hence not as easily used. Such SNOBOL operations as alternation and conditional value functions are not present in PTL. However, the functions performed by alternation and conditional value instructions can be accomplished by using two or three PTL instructions. Thus, the PTL language is useful for string manipulation, since the string manipulation operations

performed in SNOBOL can be performed using PTL with a moderate increase in effort. The stack operations, bit operations, iteration control, and the other operations of PTL were generated in a similar fashion.

2.2 Data Representation

The Program Type Language (PTL) is a machine independent language. There are no references to registers, memory locations, etc. Data in the PTL language is represented as named quantities. Named data quantities may have as a name any combination of 10 or fewer alphanumeric characters. The first character of a name must be an alphabetic character. Each named quantity (data variable) has one or more attributes. The attributes are assigned to a named data quantity by a compile statement (2.3.0). Compile statements will be discussed in Section 2.3. For the present we will discuss the attributes and the use of named quantities without considering how the named quantities are generated. The set of attributes divides naturally into two classes, named type I and type II attributes. Type I attributes are concerned with the internal representation of data. Type II attributes are concerned with the internal organization of data collections. The data representation attributes (type I) are listed in

Table 2.2.1. Each named data quantity may be assigned at most one type I attribute; a named data quantity will accept data of only one internal representation. The length or precision (whichever is appropriate) of a named quantity must be specified concurrently with a type I attribute. The maximum length or maximum precision should be specified. Precision is specified by the maximum number of decimal digits. Length is specified by the maximum number of bits (binary strings) or characters (character strings) which will be stored in the named data quantity. Care should be exercised in the specification of length or precision for two major reasons. First, if the maximum length of character strings, bit strings, or decimal data is overestimated, the storage requirements will be exaggerated. Second, the estimates of precision will be used to assign either single or double precision operations upon floating and integer data, and therefore affect not only the storage requirements but also the execution speed. The previous statement presupposes that PTL programs are run on a conventional machine which has single precision and double precision floating hardware. The translator would use the precision assigned to each variable to select either single precision or double precision operations. If the PTL program were translated for a machine which had no floating point hardware, then the translator would select microroutines with varying precision.

One of the problems associated with the design of a machine independent language is the specification of the length or precision of variables. The length (precision) assignment for each named data quantity in a PTL program is made by the programmer. The programmer assigns a maximum length (precision), but the PTL translator is allowed to select a length (precision) which is greater than that requested. If a PTL program which specified a precision of 35 bits were to be translated for a machine which had operations with a single precision of 32 bits and operations with a double precision of 64 bits, the named data quantities would be assigned by the translator to be double precision. If the program had a named data quantity with a precision of 65 bits, special routines would be required, since the machine could not directly perform operations with a precision of 65 bits. The method which the PTL translators will use to assign an internal length or precision violates machine independence in the strict sense: a PTL program which is run on different machines would produce different results in the least significant digits. However, if the programmer has specified the maximum required lengths (precisions) correctly, the answers should be correct to the desired significance on both machines.
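The translator's precision rule described above (select the smallest hardware precision that covers the request; fall back to special routines when none does) can be sketched as follows. The function name is ours, and the 32/64-bit widths are simply the example values from the text.

```python
def assign_precision(requested_bits, hardware_widths=(32, 64)):
    """Map a programmer-requested precision onto a machine:
    return ('hardware', width) for the narrowest covering operation,
    or ('software', requested_bits) when special routines are needed."""
    for width in sorted(hardware_widths):
        if width >= requested_bits:
            return ('hardware', width)
    return ('software', requested_bits)   # e.g. a 65-bit request on this machine
```

With these widths, a 35-bit request is carried in double precision (64 bits), while a 65-bit request triggers the software fallback, matching the two cases in the paragraph above.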

The data location (DL) attribute is used to identify a named data quantity in which the address of other named data quantities may be stored. This allows the use of indirect addressing in accessing data. If a named data quantity with the DL attribute is used in an operation statement, indirection is assumed: the contents of the named data quantity will be used as the address of data and not as data. The internal data organization attributes (type II) are listed in Table 2.2.2. A named data quantity may or may not be assigned a type II attribute. If no such attribute is assigned, the named data quantity is assumed to represent a single data item. If a named quantity is assigned the array attribute, it must also be assigned size and dimension values. This is done with an assign statement. A named data quantity which is assigned an array or stack attribute will also be assigned a type I attribute of either integer, floating, single length character, or single length decimal. Thus, named data quantities which are organized into an internal data organization of either an array or a stack must contain homogeneous data. The linked list and tree attributes are handled differently from stack and array attributes. The two linked list attributes are linked list item (LLI) and linked list head (LLH). The linked list item attribute is assigned to a named quantity which may have another type II attribute. The LLI attribute reserves additional storage

within the named data quantity to be used for linked list pointers. If the named quantity is to be inserted into more than one linked list, LLI(N) may be written, and space for N linked list pointers is reserved. Thus each linked list item must contain information relating where its pointer information ceases and its data begins. Also, if an item is linked in more than one list and is removed from one of the linked lists, its position in the other lists is not affected. Thus a linked list with multiple pointers will incur more overhead than a singly linked list. The linked list head (LLH) attribute is assigned to a named data quantity which is to be the head of a linked list. The named data quantity with an LLH attribute can have other attributes, including an LLI attribute. Thus a named data quantity could be the head of one list, a member of another list, and also contain data. The manner in which linked lists are manipulated is explained in Section 2.3. The tree attributes are tree node and tree root. The trees are assumed to be binary trees. The tree root attribute is used to indicate that a named data quantity is the root of a tree. The named data quantity can also have other attributes. The root name of a tree is used in some tree operations and therefore requires additional storage. The tree node attribute causes the assignment of additional storage within the named data quantity to be used for the pointers required by a binary tree. A data item can be a node of more than one

tree, and therefore the TN(N) notation may be used to reserve storage for insertion into N binary trees. A data item may be referenced by either its name or by its group name and displacement. For example, a single data item which is named F1 would be referenced as F1, while the third data item of a vector (one dimensional array) named F2 would be referenced as F2(3), or as F2(F1) if F1 had the attributes integer and single data item and also had the value 3. The elements of linked lists or binary trees are accessed differently than arrays or single data items. Each linked list or binary tree has an associated key (K) which contains the location of one of its elements. The choice of which element is addressed by the key is controlled by linked list or binary tree operations (Section 2.3). An element of a linked list LL1 would be referenced as LL1(K) if the element were a single data item. If the element were an array, the third member of the array would be referenced as LL1(K, 3). It is possible to transfer the location (K) of an element into a named data quantity with attribute DL. In this case the item would be addressed using indirection. In place of named quantities it is possible to use constants. A constant in a PTL statement is indicated by the constant value placed inside quotation marks. The attributes of the constant values can be assigned in two ways. The attribute can be specified explicitly,

immediately after the first quotation mark and separated from the constant value by a comma. If the attribute is not explicitly defined, it is implied by the constant value. If the constant value contains only numeric characters, it is assumed to be a base 10 integer constant. If the constant value contains only numeric characters and a single period, the constant value is assumed to be a base 10 floating point constant. All other constant values which are without explicit attribute definition are assumed to be character strings. To eliminate the need for the declaration of temporary storage areas and to provide the compiler with the knowledge that a value is required only temporarily, temporary storage names are provided. The temporary storage areas are named !T1 through !TN. Temporary data stores are used to hold temporary and intermediate results and have no explicit attributes; the attributes of a temporary data store match the data currently in storage. The special named quantity !DUMP is used whenever an operation statement requires an information sink (named quantity) but the logic of the program does not require the data to be saved. The attributes of !DUMP match the data to be dumped into it. Thus, !DUMP acts as an infinite garbage can.
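The implied-attribute rules for unannotated constants can be sketched directly; the attribute codes I, F, and CS follow Table 2.2.1, and the function name is ours.

```python
def constant_attribute(text):
    """Infer the implied attribute of a PTL constant written without an
    explicit attribute: all digits -> integer (I); digits plus exactly
    one period -> floating point (F); anything else -> character string (CS)."""
    if text.isdigit():
        return 'I'
    if text.count('.') == 1 and text.replace('.', '').isdigit():
        return 'F'
    return 'CS'
```

So a constant written "123" is taken as a base 10 integer, "3.14" as a base 10 floating point value, and "HELLO" (or a malformed number such as "1.2.3") as a character string.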

Data Location     DL      The address of other data is stored in the named data quantity.
Character String  CS(N)   Character representation; the maximum number of characters in the string is N.
Binary String     BS(N)   Binary string representation; the maximum number of bits in the string is N.
Integer           I(N)    Fixed point data representation; the maximum length integer to be represented has N decimal digits.
Complex           C(N)    Complex floating point data representation; the maximum precision is N decimal digits.
Floating Point    F(N,M)  Floating point data representation; the maximum precision is N decimal digits; the maximum range is 10^M to 10^-M.
Decimal           D(N)    Decimal representation; the maximum number of decimal digits is N.

Table 2.2.1. Internal Data Representation Attributes (Type I)

Array             A(i1, i2, ..., iN)  An N-dimensional array with 1-origin indexing is defined. The length of the 1st dimension is i1, of the 2nd is i2, etc.
Stack             S(N)    A push down stack (LIFO) is defined; the maximum capacity of the stack is N items.
Link List Head    LLH     The head of a linked list is defined.
Link List Item    LLI(N)  Space is provided so the named data quantity may be inserted into N lists.
Tree Root         TR      The root of a binary tree is defined.
Tree Node         TN(N)   Space is provided so the named data quantity may be inserted into N binary trees.

Table 2.2.2. Internal Data Organization Attributes (Type II)
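The LLI(N) attribute of Table 2.2.2 — an item carrying space for N independent list pointers ahead of its data — can be modeled as a sketch. The class and method names are ours; PTL itself manipulates such items through its linked list operations, not through methods.

```python
class ListItem:
    """Model of a named data quantity with the LLI(N) attribute:
    N link fields precede the data, one per list the item may join."""
    def __init__(self, data, n_lists):
        self.links = [None] * n_lists   # one forward pointer per list
        self.data = data

    def unlink(self, slot, prev):
        """Remove this item from list number `slot`, given its predecessor.
        Membership in the other lists is untouched."""
        if prev is not None:
            prev.links[slot] = self.links[slot]
        self.links[slot] = None
```

Because each list uses its own pointer slot, removing an item from one list leaves its position in every other list intact, at the storage and bookkeeping cost the text notes for multiply linked items.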

2.3 Program Type Language Statements

There are two basic statement types in the Program Type Language (PTL). The first is the compile statement, which is used to supply the microcompiler with information (data type, array size, etc.) needed to compile the operational statements. The second is the operational statement, which is used to specify the CPU control and microprogram sequencing information. The compile statements will be presented first; then the operation statements will be given. The compile statements are used to supply information which is required to interpret the operation statements. The two major functions of the compile statements are the description of named quantities and the specification of data storage requirements. In addition, the initial value of a named quantity may be specified using compile statements. The dimensions of a named quantity must be specified concurrently with any attribute specification. The dimension may be specified implicitly, explicitly, or require a combination of both. For example, the length given a named data quantity possessing the character attribute explicitly specifies the required storage. However, a named data quantity possessing the attributes of array and floating point will require both an implicit and an explicit declaration. The size and dimension declaration explicitly specifies the number of floating point values which

must be stored. The precision declaration implicitly implies either a single or double precision floating point specification; the variable would be assigned to whichever precision is greater than or equal to its specified precision. If the requested precision is greater than that provided by the hardware, the required precision will be implemented using special routines. It would be reasonable to flag the occurrence of such "large" precision requirements to notify the user of the inefficiencies in execution speed which are being introduced. The specification of dimension, precision, or any other numerical value associated with an attribute of a named quantity follows the appropriate attribute (Section 2.2) and is enclosed in parentheses. The named data quantities may have as a NAME any combination of 10 or fewer alphanumeric characters. The first character of a NAME must be an alphabetic character. The compile statements are listed below. Examples will be given after each statement type to clarify the explanation.

2.3.0a BEGIN NAME
This declares the beginning of a main program with name NAME.

2.3.0b DECLARE SUB NAME (LIST)
This declares a subprogram named NAME. LIST is a list of parameters which will be passed to the subroutine by the calling program.

2.3.0c END SUB
This declares the end of a subroutine.

2.3.0d END PROG
This declares the end of a program.

2.3.1 DECLARE NAME ATTRIBUTES
A named data quantity with attributes and name as specified is created.

Examples
a) Create an integer vector named L1. The length of the vector is to be 10, and each element of L1 is to have a precision of at least 5 decimal digits.
DECLARE L1 A(10), I(5)
b) Create a 10 by 10 matrix of floating point numbers. Each number is to have a precision of 10 decimal digits, and the name of the matrix is to be MATR.
DEC MATR A(10,10), F(10)
c) Create a character string of length 10 named CH1.
DEC CH1 CS(10)

d) Create a linked list named LIST1. LIST1 is to serve as the head of the list and contain no other information.
DEC LIST1 LLH
e) Create a linked list named LIST2. LIST2 may be attached to 2 other linked lists, and is to contain 4 floating point numbers of precision 5.
DEC LIST2 LLH, LLI(2), A(4), F(5)

2.3.2 DECLARE NAMEFORM ATTRIBUTES
A group of named quantities is assigned the same attributes. Instead of listing the names, the NAMEFORM specifies the general class of names which are to be assigned the specified attributes when encountered within the program. NAMEFORM is a set of alphanumeric groups enclosed in parentheses. Each group starts with an alphabetic character and may have up to 10 alphanumeric characters. Any NAME which starts with one of the groups of the NAMEFORM set is assigned the attributes of the corresponding DECLARE statement.

Examples
a) All named data quantities starting with A1, A2, AB, or D are to be single valued integer quantities with a precision of 10 decimal digits.
DEC (A1,A2,AB,D) I(10)

b) All named data quantities starting with X are to be floating point vectors of length 6 with a precision of 5 decimal digits.

DEC (X) A(6), F(5)

2.3.3 VALUE NAME = N1(V1), ..., Nm(Vm)

Data is assigned to a named data quantity. If the named data quantity is a single item only V1 is specified. If multiple items are required (NAME is an array) V1 through Vm are specified. If any Vi is to be assigned to K consecutive positions in storage, then Ni is set to K; otherwise Ni is omitted. The data is read from left to right, and the rightmost array dimension changes most rapidly.

Examples

a) L1 of example 2.3.1a is to be assigned all 1's.

VALUE L1 10(1)

b) MATR of example 2.3.1b is to be set to all 0's.

VALUE MATR 100(0.0)

c) MATR(7, 1) through MATR(7, 5) are to be set to 1.0, 2.0, 3.0, 4.0, 5.0, respectively.

VALUE MATR(7, 1) (1.0, 2.0, 3.0, 4.0, 5.0)
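The fill rule of the VALUE statement, data read from left to right with the rightmost dimension changing most rapidly, is ordinary row-major assignment. The sketch below is an illustration of that rule only (the function name and the modeling of the Ni(Vi) groups as (count, value) pairs are assumptions, not part of PTL).

```python
# Hypothetical sketch of the VALUE fill rule: data is read left to
# right, and the rightmost array dimension changes most rapidly
# (row-major order).  The Ni(Vi) repetition groups are modeled as
# (count, value) pairs.
def value_fill(shape, groups):
    flat = []
    for count, value in groups:
        flat.extend([value] * count)
    total = 1
    for d in shape:
        total *= d
    if len(flat) > total:
        raise ValueError("more data items than array positions")
    flat += [None] * (total - len(flat))  # unassigned positions
    it = iter(flat)

    def build(dims):
        if not dims:
            return next(it)
        return [build(dims[1:]) for _ in range(dims[0])]

    return build(list(shape))
```

Under this model, example (b) above, VALUE MATR 100(0.0), corresponds to value_fill((10, 10), [(100, 0.0)]).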

The general form of the operation statement is shown below (2.3.4).

2.3.4 LABEL OPERATION BRANCH

The three components of the operation statement are:

LABEL: The LABEL (1 to 10 alphanumeric characters with no embedded blanks) is optional. If present, the LABEL is used to identify the operation statement. Any reference to an operation statement of a program (in particular a branch to an operation statement) is made using the LABEL identifier.

OPERATION: The OPERATION is optional, but either the OPERATION or the BRANCH must be present. The OPERATION specifies the operation to be performed by the CPU. The form of the OPERATION depends upon the OPERATION type. The OPERATION types will be listed below.

BRANCH: The BRANCH is optional, but either the OPERATION or the BRANCH must be present. The BRANCH can cause a break in the sequential execution of control statements. The form of the BRANCH depends upon the BRANCH type. The BRANCH types will be listed after the OPERATION types.

The set of OPERATIONs is divided into general categories such as arithmetic, bit manipulation, etc. Each category is further subdivided into OPERATION subsets. An OPERATION subset may have

1 or more elements. The elements of each OPERATION subset have the same format and operate on the same data types. To aid in the understanding of the OPERATIONs, a detailed discussion of OPERATION (S1 ← S2 y S3) (2.3.5a) will be given. OPERATION (S1 ← S2 y S3) is the first OPERATION subset and is part of the general category BIT MANIPULATION. The information is presented as follows. First the format of the OPERATION statement is shown. The OPERATION specified in 2.3.5a has several elements in the OPERATION subset and therefore the OPERATION identifier is replaced by y. The list of OPERATIONs of the OPERATION subset (S1 ← S2 y S3) is given on the same line as the format. If any of the OPERATIONs requires additional explanation, the information is given on the lines below the format statement. The named quantities in the format statement are represented by S1, S2, and S3. In an actual statement each such item would be replaced by named data. The use of different data items in the format (S1 ← S2 y S3) is not meant to imply that S1, S2, and S3 could not be the same named quantity. In fact all three data items in the format (S1, S2, S3) could be replaced by the same variable. At the end of the format line are subsets of the attributes of tables 2.2.1 and 2.2.2. They indicate which types of data items the OPERATION statement may have as operands. The list of attributes is not included if the choice is obvious.

Many of the PTL functional operations have one or more associated condition sets. Condition sets are two-state devices which are set or reset depending upon the results of PTL operations. The sequencing of PTL programs can be conditioned upon the state of one or more condition sets.

The setting of PTL condition sets can be handled in two ways. The programmer can be forced to indicate, in the PTL statement, that the result of an operation should modify the appropriate condition sets, or the language can be defined so that each time a PTL operation is executed, the associated condition sets are appropriately modified. The second method was chosen for PTL. The automatic control of condition sets relieves the programmer of one more detail. The language is cleaner and more easily read without the notation required to specify the control of condition sets. Therefore, the automatic control of condition sets should simplify the generation of PTL programs.

There are some potential problems. The programmer should be cognizant of condition sets. The automatic control of condition sets tends to push them into the background. However, condition sets are specified directly in any branch statement which uses one or more condition sets.

A branch can be conditioned only upon the current state of a condition set. Using a PTL logical operation the status could be saved in a logical variable and tested later. However, the necessity to perform branching upon the previous state or states of a condition set would seem to occur infrequently.

For OPERATION (S1 ← S2 y S3), the condition sets are delayed, i.e., the states of the condition set switches are not changed until the end of the cycle. This has two effects. First, a branch can be made on the previous state of the condition sets during the present OPERATION statement. Second, a branch cannot be taken until the following microcycle. There are certain OPERATION types for which a branch can occur dependent upon the results during the same cycle (or at least appear to be dependent upon the current OPERATION). These cases will be explained when they occur.

OPERATION SUBSETS

2.3.5 General category bit manipulation and logical operations

2.3.5a S1 ← S2 y S3    y ∈ [∧, ∨, ⊕, NAND, NOR]    B, I, F
condition sets ZERO, ALL ONES

The named quantities S2, S3 are logically combined using the specified operation and the result is placed in named quantity S1. The condition set ZERO is set if the result is all zeros. The condition set ALL ONES is set if the result is all 1's.
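The automatic condition-set update can be made concrete with a small sketch (the 8-bit width, the operator names, and the dictionary representation are assumptions of this illustration, not part of PTL): every logical OPERATION yields its result together with freshly computed ZERO and ALL ONES switches.

```python
WIDTH = 8                      # assumed word width for illustration
MASK = (1 << WIDTH) - 1

OPS = {
    "AND":  lambda a, b: a & b,
    "OR":   lambda a, b: a | b,
    "XOR":  lambda a, b: a ^ b,
    "NAND": lambda a, b: ~(a & b) & MASK,
    "NOR":  lambda a, b: ~(a | b) & MASK,
}

def logical_op(op, s2, s3):
    """Model of S1 <- S2 y S3: combine S2 and S3, then automatically
    update the ZERO and ALL ONES condition sets from the result."""
    result = OPS[op](s2 & MASK, s3 & MASK)
    conditions = {"ZERO": result == 0, "ALL ONES": result == MASK}
    return result, conditions
```

For instance, ANDing two disjoint bit patterns sets ZERO, while ORing complementary patterns sets ALL ONES.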

2.3.5b S1 ← y S2    y ∈ [NOT]    B, I, F
condition sets ZERO, ALL ONES

The complement of S2 is placed in S1. If the result is zero, ZERO is set. If the result is all 1's, ALL ONES is set.

2.3.5c S1 ← S2 TR ADDRESS S3 LABEL    (S3; A)(S2; I)(S1: S3; I, F, CS, D)
condition set OUT BOUNDS

The Nth item of Table S3 is placed in S1, where N is the integer valued S2. If N is larger than the length of Table S3 a branch to LABEL is executed.

2.3.5d S1 ← S2 TR CONTENT S3 S4 LABEL    (S3: S4; A)
condition set NO MATCH

A translation of S2 is made by matching its contents with Table S3. If a match is found the contents of the corresponding location of S4 is transferred into S1. The data attributes of (S2, S3) and (S1, S4) are required to be the same. If no match between S2 and Table S3 is found a branch to LABEL is executed.
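The two table-translation operations can be sketched as follows (a hypothetical model; returning None stands in for the branch to LABEL, and table positions are taken to be 1-based as in 2.3.5c):

```python
def tr_address(s2, table_s3):
    """S1 <- S2 TR ADDRESS S3 LABEL: fetch the Nth item of table S3,
    where N is the integer value of S2 (1-based).  None models the
    OUT BOUNDS branch to LABEL."""
    if 1 <= s2 <= len(table_s3):
        return table_s3[s2 - 1]
    return None

def tr_content(s2, table_s3, table_s4):
    """S1 <- S2 TR CONTENT S3 S4 LABEL: match S2 against table S3 and
    return the corresponding entry of table S4.  None models the
    NO MATCH branch to LABEL."""
    for i, entry in enumerate(table_s3):
        if entry == s2:
            return table_s4[i]
    return None
```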

2.3.5e S1 ← SHIFT S2 y1 y2 S3    (S3; I)(S1: S2; B, I, F)
y1 ∈ [LEFT, RIGHT]
y2 ∈ [END AROUND, CARRY OUT, ARITHMETIC]
condition sets ZERO, ALL ONES

The contents of S2 are shifted S3 places as specified by y1 y2 and then stored in S1.

2.3.5f S1 ← S2 ← SHIFT S2 y1 y2 S3    (S3; I)(S1: S2; B, I, F)
y1 ∈ [LEFT, RIGHT]
y2 ∈ [END AROUND, CARRY OUT, ARITHMETIC]
condition sets ZERO, ALL ONES

Same as 2.3.5e, except that S2 is itself modified.

2.3.6 General category arithmetic conditions

2.3.6a S1 ← S2 y1 S3    y1 ∈ [+, -, *, /]    F, C, D
condition sets ZERO, LESS ZERO, GREATER ZERO, OVERFLOW, UNDERFLOW

2.3.6b S1 ← y2 S3    y2 ∈ [e^S3, log_e S3, log_10 S3]    F
condition sets OVERFLOW, UNDERFLOW

2.3.6c S1 ← S2 y3 S3    y3 ∈ [S2^S3] (exponentiation)    I, F
condition sets OVERFLOW, UNDERFLOW

2.3.6d S1 ← y4 S2    y4 ∈ [sin, cos, arc sin, arc cos, tan, cot, arc tan, arc cot]
condition sets ZERO, LESS ZERO, GREATER ZERO, UNDERFLOW, OVERFLOW

2.3.6e S1 ← y5 S2    y5 ∈ [⌊ ⌋, ⌈ ⌉, ABS]    I, F
condition sets ZERO, LESS ZERO, GREATER ZERO

The symbols ⌊ ⌋, ⌈ ⌉ represent the floor and ceiling of S2 respectively. ABS produces the absolute value.

2.3.6f FLOAT S1 S2

Named data quantity S1 (with attribute I) is converted to an internal floating point representation and placed into named data quantity S2 (with attribute F). S1 is not modified. S1 and S2 cannot be the same named data quantity.

2.3.6g INTEGER S1 S2
condition set OVERFLOW

Named data quantity S1 (with attribute F) is converted to an internal integer representation and placed into named data quantity S2 (with attribute I). S1 is not modified. If S1 is too large to be contained in S2 the condition set OVERFLOW is set. S1 and S2 cannot be the same named data quantity.

2.3.7 General category stack operations

Several of the stack operation types carry a branch LABEL as part of their format. These operation types have an implied branch which is tested before the execution of the desired stack operation occurs. If the branch condition is matched the branch occurs and the operation is not performed. The conditions being tested represent the state of the stack after completion of the last stack operation. There is one common stack (ST) which does not have to be set up, but is always available.

2.3.7a S1 ← ST LABEL    I, F, C, D
condition sets STACK EMPTY, STACK FULL

Branch to LABEL if STACK EMPTY, otherwise place the top item of the STACK in S1.

2.3.7b ST ← S1 LABEL    I, F, C, D
condition sets STACK EMPTY, STACK FULL

Branch to LABEL if STACK FULL, otherwise place data item S1 into the STACK.

2.3.7c CLEAR ST
condition sets STACK EMPTY, STACK FULL
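The implied-branch rule of 2.3.7a and 2.3.7b can be illustrated with a minimal sketch (the class and method names are assumptions; True/False and None merely stand in for "operation performed" versus "branch to LABEL taken"): the STACK EMPTY and STACK FULL switches reflect the state after the last stack operation and are tested before the next one executes.

```python
class BoundedStack:
    # Sketch of a PTL stack with a fixed length and the
    # STACK EMPTY / STACK FULL condition sets.
    def __init__(self, length):
        self.items = []
        self.length = length
        self.empty = True    # condition set STACK EMPTY
        self.full = False    # condition set STACK FULL

    def _update(self):
        # condition sets reflect the state after the last operation
        self.empty = not self.items
        self.full = len(self.items) == self.length

    def push(self, item):
        """ST <- S1 LABEL: the implied branch fires (returns False)
        when STACK FULL was set by the previous operation."""
        if self.full:
            return False
        self.items.append(item)
        self._update()
        return True

    def pop(self):
        """S1 <- ST LABEL: the implied branch fires (returns None)
        when STACK EMPTY was set by the previous operation."""
        if self.empty:
            return None
        item = self.items.pop()
        self._update()
        return item
```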

2.3.7d CREATE ST (S1) S2 y    y ∈ [I, F, C, D]
condition sets STACK EMPTY, STACK FULL

A STACK of type y with name S1 and length S2 is created.

2.3.7e DESTROY ST (S1)

A stack named S1 is destroyed.

2.3.7f S2 ← ST (S1) LABEL
condition sets STACK EMPTY, STACK FULL

If STACK S1 is empty a branch to LABEL occurs, otherwise the top item of STACK S1 is placed in S2.

2.3.7g ST (S1) ← S2 LABEL
condition sets STACK EMPTY, STACK FULL

If STACK S1 is full a branch to LABEL occurs, otherwise S2 is placed into STACK S1.

2.3.7h CLEAR ST (S1)
condition sets STACK EMPTY, STACK FULL

2.3.8 General category linked lists

A few general points should be made before presenting the linked list operations. Each linked list has a key associated with it. The key contains the address of some member of the linked list unless the list is null. Most of the linked list operations use the value of the key, modify the key, or transfer the value of the key into a named

data quantity. Each element of a linked list must contain space for a linking area. The linking area contains pointers to the next and to the preceding element of the linked list. (Some implementations might use other schemes.) A named data quantity may be a member of more than one list. The attribute LLI(N) specifies the number of linked lists (N) which may contain the named data quantity simultaneously. The linking area size is a function of N. The insertion or deletion of a data item from one list does not affect its membership in any other linked list. The linking area also contains information to delineate the end of the linking area and the beginning of the data area.

Most of the linked list operations include a LABEL and have an associated group of condition sets. A branch to LABEL is executed during the time span of the linked list operation if any of the branch conditions occur. The state of the condition sets (after the branch) can be tested to determine which condition caused the branch. A branch statement can be contained on the same line as a linked list operation. However, any branch caused by the linked list operation will have precedence over the branch

statement. Thus, a branch to LABEL 1 on the result of a previous add operation being zero could be placed on the same line as a linked list operation which branches to LABEL 2 if the linked list is empty. If the linked list was not empty and the result of the previous add was zero, a branch to LABEL 1 would result. If the list was empty and the result of the previous add was zero, a branch to LABEL 2 would result.

2.3.8a SET LINK LIST S1

Named data quantity S1 (having attribute LLH) has its linked list key initialized.

2.3.8b DESTROY LINK LIST S1

Named data quantity S1 (having attribute LLH) has its linked list key initialized, and the elements of the linked list have their associated pointers reinitialized.

2.3.8c REMOVE S1 LINK LIST S2 β LABEL    β ∈ [+1, 0, -1]
condition set OUT BOUNDS

The element of linked list S2 addressed by the key (β = 0), or the preceding element (β = -1), or the next element (β = +1) is removed from the linked list. The element's linking area is reinitialized and the address of the element is

placed in S1. The key is updated to point to the element preceding the removed element. If the key with the β offset points outside the list a branch to LABEL is executed.

2.3.8d S1 ← LINK LIST S2 β LABEL    β ∈ [+1, 0, -1]
condition set OUT BOUNDS

The address of the element of linked list S2 addressed by the key (β = 0), or the preceding element (β = -1), or the next element (β = +1) is placed into S1. If the key with the β offset points outside the linked list a branch to LABEL is executed.

2.3.8e REPLACE S1 IN LINK LIST S2 β LABEL    β ∈ [+1, 0, -1]
condition set OUT BOUNDS

The data item pointed to by the key with an offset of either +1, 0, or -1 is removed and replaced by data item S1. The key is set to the value of the key with offset. If the key with offset points outside the linked list, a branch to LABEL is executed.

2.3.8f INSERT S1 IN LINK LIST S2 β    β ∈ [+1, -1]

Data item S1 is placed in linked list S2 either before (β = -1) or after (β = +1) the data item pointed to by the key. The key is updated to point to the inserted data item.
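The key-relative operations above can be sketched with a simple model (an illustration under stated assumptions: the list is held as a Python list, the key as a 0-based index, and None again stands in for the OUT BOUNDS branch):

```python
class LinkList:
    # Illustrative model of a PTL linked list with its key.
    def __init__(self):
        self.items = []
        self.key = 0

    def fetch(self, beta):
        """S1 <- LINK LIST S2 beta LABEL: element at key + beta,
        or None for the OUT BOUNDS branch."""
        i = self.key + beta
        return self.items[i] if 0 <= i < len(self.items) else None

    def insert(self, item, beta):
        """INSERT S1 IN LINK LIST S2 beta: place item before
        (beta = -1) or after (beta = +1) the keyed element; the key
        is updated to point to the inserted item."""
        pos = 0 if not self.items else self.key + (0 if beta == -1 else 1)
        self.items.insert(pos, item)
        self.key = pos

    def remove(self, beta):
        """REMOVE S1 LINK LIST S2 beta LABEL: remove the element at
        key + beta; the key then points to the element preceding the
        removed one.  None models the OUT BOUNDS branch."""
        i = self.key + beta
        if not (0 <= i < len(self.items)):
            return None
        item = self.items.pop(i)
        self.key = max(i - 1, 0)
        return item
```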

2.3.8g INCREMENT POINTER LINK LIST S1 LABEL
condition set OUT BOUNDS

The key associated with linked list S1 is incremented by 1.

2.3.8h DECREMENT POINTER LINK LIST S1 LABEL
condition set OUT BOUNDS

The key associated with linked list S1 is decremented by 1.

2.3.8i KEY (S1) ← S2 LINK LIST LABEL
condition set OUT BOUNDS

The key is updated to point to the Nth item of linked list S1 where N is the value of S2. If the number of elements of S1 is less than S2 a branch to LABEL is executed.

2.3.8j S2 ← KEY (S1) LINK LIST

The key associated with linked list S1 points to the Nth element of list S1. The value of N is placed into S2.

2.3.8k DO LABEL LINK LIST S1

The statements from the DO statement to the statement LABEL are iterated N times where N is the number of elements in linked list S1. The key is set to the first element on the first iteration, to the second element on the second iteration, etc.

2.3.8l DO LABEL LINK LIST KEY S1

Same as 2.3.8k except the iteration is done from the current value of the key to the last element of the list.

2.3.9 General category string manipulation

2.3.9a CONCATENATE S1 TO S2

String S1 is concatenated to string S2.

EXAMPLE
S1 = 'AB'  S2 = 'CD'
after CONC S1 TO S2 execution
S2 = 'CDAB'

2.3.9b MATCH S1 IN S2 S3 LABEL

A pattern match search in string S2 using string S1 is performed. If a match is found S3 contains the position of the first character of the match. The search is performed from left to right. If no match is found a branch to LABEL is executed.

EXAMPLE
S1 = 'ABC'  S2 = '123ABCD'
after MATCH S1 IN S2 S3 LABEL execution
S3 = 4

2.3.9c MATCH P S1 IN S2 S3 LABEL

This operation is the same as 2.3.9b except that the search is started in S2 at the character position specified by S3. If no match is found, or S3 points outside S2, a branch to LABEL is executed.

EXAMPLE
S1 = 'AB'  S2 = 'ABCDABF'  S3 = 2
after MATCH P S1 IN S2 S3 LABEL execution
S3 = 5

2.3.9d INSERT S1 IN S2 AT S3 LABEL

String S1 is inserted into string S2 after the character position specified by S3. If the combined length of S1 and S2 exceeds the maximum length of S2 a branch to LABEL is executed.

EXAMPLE
S1 = 'BC'  S2 = 'ADEF'  S3 = 1
after INS S1 S2 S3 LABEL execution
S2 = 'ABCDEF'
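The MATCH semantics (1-based positions, left-to-right search) translate directly into a few lines of Python; this is an illustrative sketch in which None models the branch to LABEL.

```python
def match(s1, s2):
    """MATCH S1 IN S2 S3 LABEL: 1-based position of the first match
    of pattern s1 in s2, or None for the branch to LABEL."""
    pos = s2.find(s1)
    return None if pos < 0 else pos + 1

def match_p(s1, s2, s3):
    """MATCH P S1 IN S2 S3 LABEL: same, but the search starts at the
    1-based character position s3; s3 outside s2 also branches."""
    if not (1 <= s3 <= len(s2)):
        return None
    pos = s2.find(s1, s3 - 1)
    return None if pos < 0 else pos + 1
```

Both worked examples in the text check out under this model: match('ABC', '123ABCD') yields 4, and match_p('AB', 'ABCDABF', 2) yields 5.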

2.3.9e GET S4 FROM S1 AT S2 THROUGH S3 LABEL

The string of characters in string S1 between the character positions specified by S2 and S3 is placed into string S4. If S2 or S3 points outside of S1 a branch to LABEL is executed.

EXAMPLE
S1 = 'ABCD'  S2 = 2  S3 = 4
after GET S4 S1 S2 S3 execution
S4 = 'BCD'  S1 = 'ABCD'

2.3.9f DELETE S2 THROUGH S3 FROM S1 LABEL

The string of characters between the character positions specified by S2 and S3 is deleted from string S1. If S2 or S3 points outside of S1 a branch to LABEL is executed.

EXAMPLE
S1 = 'ABCD'  S2 = 2  S3 = 4
after DEL S2 S3 S1 execution
S1 = 'A'

2.3.9g REPLACE BY S4 BETWEEN S2 AND S3 IN S1 LABEL

The string S4 replaces the characters in string S1 between S2 and S3. If S2 or S3 points outside S1 a branch to LABEL is taken.

EXAMPLE
S1 = 'ABCDE'  S2 = 3  S3 = 5  S4 = '2'
after REP S4 S2 S3 S1 LABEL execution
S1 = 'AB2'

2.3.9h S1 ← STRING LENGTH S2

The length of string S2 is placed in S1.

2.3.10 General category - iteration control

2.3.10a DO LABEL S1

The statements from the 'DO' statement to the statement LABEL are iterated S1 times. The iterations can be terminated by a branch out of the scope of the DO.

2.3.10b ADO LABEL ARRAY SO INDICES S1, S2, ..., SN ORDER

The statements from the 'ADO' statement to the statement LABEL are iterated. The named quantities specified by S1, S2, ..., SN are treated as the indices of array SO, with S1

the first index (the rightmost), etc. The indices specified by ORDER (which is a named data quantity) are counted from 1 to the dimension of the indices. The counting is done so that all elements of array SO which may be specified by the selected indices are counted. ORDER is treated as a hexadecimal variable. The low order hexadecimal digit of ORDER specifies the index which is counted most rapidly. The second hexadecimal digit specifies the index which is counted the second most rapidly, etc.

EXAMPLE
ADO LABEL AR A1 IN S1, S2, S3, OR1

Assume A1 has dimensions 2, 3, 4 and OR1 is 31 (hexadecimal). Then S1, S2, S3 would be counted as follows:

S3  S2  S1
start iteration
1   S2  1
1   S2  2
1   S2  3
1   S2  4
2   S2  1
2   S2  2
2   S2  3
2   S2  4
end iteration

Notice that the value of S2 is unchanged during the execution of the loop instruction.
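The ADO counting rule can be modeled directly. This is a sketch under assumed conventions (dims[i-1] holds the length of index Si, with S1 the rightmost array dimension; indices not named in ORDER keep their current values):

```python
def ado_indices(dims, order_digits, start=None):
    # dims[i-1]    : length of index Si (S1 = rightmost dimension)
    # order_digits : ORDER's hexadecimal digits, lowest (fastest) first
    # start        : current index values; counted indices restart at 1
    current = list(start) if start else [1] * len(dims)
    counted = [d - 1 for d in order_digits]   # 0-based positions
    out = []

    def count(pos):
        # pos walks from the slowest digit down to the fastest
        if pos < 0:
            out.append(tuple(current))
            return
        idx = counted[pos]
        for v in range(1, dims[idx] + 1):
            current[idx] = v
            count(pos - 1)

    count(len(counted) - 1)
    return out
```

For the example above (dimensions 2, 3, 4, so S1 has length 4 and S3 has length 2; ORDER = 31 hexadecimal, i.e. digits 1 then 3) this reproduces the table: S1 cycles 1 through 4 most rapidly, S3 counts 1 through 2, and S2 keeps its current value.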

2.3.10c ADO1 LABEL ARRAY SO INDICES S1, S2, ..., SN ORDER

This operation statement is the same as 2.3.10b except that none of the indices are set to 1. The indices are counted from their current values.

2.3.10d IDO LABEL S1 S2 a    (S2: I)(S1: A)    1 ≤ a ≤ (DIMENSION ARRAY S1)

The statements from the 'IDO' statement to the LABEL statement are iterated. The named quantity S2 is set for the first iteration and incremented by 1 before each of the succeeding iterations. The named quantity S1 is an array of order N with a specified length for each of the N dimensions. The value of a selects which dimension controls the iteration. If a is i and the length of the ith dimension of S1 is j, then j iterations will be executed.

EXAMPLE
Assume A1 has order 4 with dimension lengths (3, 4, 2, 5). Then
IDO LABEL A1 S2 '3' will cause 2 iterations to occur.
IDO LABEL A1 S2 '1' will cause 3 iterations to occur.

2.3.11 General category binary trees

A few general comments should be made before

presenting the binary tree operations. Each binary tree has a key associated with it. The key contains the address of some member of the tree unless the tree is null. Most of the binary tree operations use the value of the key or modify the key. Each item inserted into a binary tree must provide space for linking information. The space is provided by assigning the TN attribute to each named quantity which is to be inserted into a binary tree. The binary tree operations are involved with inserting, deleting, and sequencing through binary trees to locate the data items of the tree. Thus, addresses are the result of binary tree operations. The address specifies the location of some member of the tree.

2.3.11a SET TREE ROOT S1

Named data quantity S1 (having attribute TR) has its binary tree key initialized.
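The linking area of a tree node and the key movement described in this section (up, down left, down right) can be sketched with a minimal model; the class names and the three-pointer layout are assumptions of this illustration, not a specified implementation.

```python
class TreeNode:
    # Model of a named quantity with the TN attribute: a data value
    # plus a linkage area (up, down-left, down-right pointers).
    def __init__(self, value):
        self.value = value
        self.up = self.left = self.right = None

class TreeKey:
    # The key holds the address of (here: a reference to) some node.
    def __init__(self, root):
        self.node = root

    def move(self, direction):
        """MOVE TREE KEY S1 a LABEL with a in {UP, DL, DR}: returns
        False (modeling the OUT BOUNDS branch) if no such node
        exists."""
        target = {"UP": self.node.up,
                  "DL": self.node.left,
                  "DR": self.node.right}[direction]
        if target is None:
            return False
        self.node = target
        return True
```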

2.3.11b DESTROY TREE ROOT S1

Binary tree S1 (having attribute TR) has its binary tree key reinitialized and any elements (nodes) of tree S1 have their tree linkage areas reinitialized.

2.3.11c MOVE TREE KEY S1 a LABEL    a ∈ [UP, DOWN LEFT, DOWN RIGHT]
condition set OUT BOUNDS

Binary tree S1 has its associated key moved UP, DOWN LEFT, or DOWN RIGHT as a = UP, DL, or DR. The key contains the address of the corresponding node of tree S1. If the command would cause the key to move off the tree (the corresponding node does not exist) a branch to LABEL is executed.

2.3.11d MOVE TREE KEY S1 ROOT

The key associated with binary tree S1 is reset to point to the root of the tree (S1).

2.3.11e REMOVE TREE S1 a LABEL    a ∈ [UP, CURRENT, DOWN LEFT, DOWN RIGHT]
condition sets NO NODE, BINARY VIOLATION

The node of tree S1 relative (UP, CURRENT, DL, DR) to the position of the key is removed from the tree. The data quantity removed from the tree has its tree linkage area reset.

If a node does not exist at the position specified by a relative to the key, a branch to LABEL is executed and condition set NO NODE is set. If removal of the node and the subsequent regrouping of the tree would produce a nonbinary tree, the node is not removed, a branch to LABEL is executed, and condition set BINARY VIOLATION is set.

2.3.11f REPLACE TREE S1 a BY S2 LABEL    a ∈ [UP, CURRENT, DOWN LEFT, DOWN RIGHT]
condition set NO NODE

The node of tree S1 relative (UP, CURRENT, DL, DR) to the position of the key is replaced by named data quantity S2. The data quantity removed from the tree has its linkage area reset. If a node does not exist at the position specified by a relative to the key a branch to LABEL is executed.

2.3.11g INSERT TREE S1 a S2 LABEL    a ∈ [UP, DOWN LEFT, DOWN RIGHT]
condition set NO LEAF

Named data quantity S2 is inserted into the binary tree S1. S2 is inserted between the node currently addressed by the key and the node specified by a relative to the key. The orientation of nodes is kept constant. If S2 is inserted

between two nodes with orientation down right, the three nodes will have orientation down right. If there is no node at the position (pointer + a) where a is DL or DR, S2 is simply added to the tree at that position. If a is UP and no node exists, a branch to LABEL is executed.

2.3.11h DO LABEL TREE S1 a    a ∈ [PREORDER, POSTORDER, ENDORDER]

The statements from the DO statement to the statement LABEL are iterated N times where N is the number of nodes in the tree. The key is set to point to each node of the tree in either preorder, postorder, or endorder according to a.

2.3.11i S2 ← TREE KEY (S1)

The address of the node of tree S1 currently contained in tree S1's key is transferred to S2.

2.3.12 General category input/output

The I/O instructions represent a large set of operations. For example, the I/O READ instruction may be an I/O READ TAPE, I/O READ DRUM, or I/O READ DATA LINE, etc., operation. It is doubtful that CPU hardware could be provided which would perform these operations in one microcycle or less. Therefore, each I/O instruction in a PTL

program will probably be replaced by a branch to a large I/O subroutine.

2.3.12a SENSE I/O S1 S2

Status information from the I/O device specified by S1 is stored in location S2.

2.3.12b COMMAND I/O S1 S2

Control information located in area S2 is transmitted to the I/O device specified by S1. If S2 is enclosed in quotes the S2 name represents a standard command which is known to the translator.

2.3.12c READ I/O S1 S2 S3

Data from I/O device S1 is read into area S2. Location S3 is optional, but if present it is used to store status information at the end of the read operation. Location S3 is set to null status at the beginning of the read operation.

2.3.12d WRITE I/O S1 S2 S3

Data from area S2 is written to the I/O device specified by S1. Location S3 is optional, but if present it is used to store status information at the end of the write operation. Location S3 is set to null status at the beginning of the write operation.

2.3.12e CONVERT OUTPUT S1 FORMAT S2

The single data quantity S1 is converted to an output form according to FORMAT. The output form is stored in S2.

FORMAT:
1, N    INTEGER to N place BINARY
2, N    INTEGER to N place OCTAL
3, N    INTEGER to N place DECIMAL
4, N    INTEGER to N place HEXADECIMAL
5, N    DECIMAL to N place DECIMAL
6, N    DECIMAL to N place DOLLARS, CENTS
7, N    FLOATING POINT to N place INTEGER
8, N, M FLOATING POINT to N place DECIMAL with M significant digits
9, N, M FLOATING POINT to N place SCIENTIFIC with M significant digits

2.3.12f CONVERT INPUT S1 FORMAT S2    S1 ∈ C; S2 ∈ I, F1, F2, D

The character string S1 is converted into an internal data form according to FORMAT. The data item is stored in S2.
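A few of the FORMAT codes can be sketched in Python. The exact output conventions here (zero padding, truncation to the low-order N places, right justification) are assumptions of this illustration; the table above does not pin them down.

```python
def convert_output(value, code, n, m=None):
    # Sketch of CONVERT OUTPUT for a few FORMAT codes from the table.
    if code == 1:                       # INTEGER to N place BINARY
        return format(value, "b").zfill(n)[-n:]
    if code == 3:                       # INTEGER to N place DECIMAL
        return format(value, "d").zfill(n)[-n:]
    if code == 4:                       # INTEGER to N place HEXADECIMAL
        return format(value, "X").zfill(n)[-n:]
    if code == 8:                       # FLOATING to N place DECIMAL,
        return f"{value:.{m}g}".rjust(n)  # M significant digits
    raise ValueError("FORMAT code not covered by this sketch")
```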

FORMAT:
1    DOLLAR, CENT to DECIMAL
2    DECIMAL to DECIMAL
3    BINARY to INTEGER
4    OCTAL to INTEGER
5    DECIMAL to INTEGER
6    SCIENTIFIC to FLOATING
7    DECIMAL to FLOATING

2.3.13 General category - special functions

2.3.13a INCR S1

Named data quantity S1 (with attribute I) is incremented by 1.

2.3.13b DECR S1

Named quantity S1 (with attribute I) is decremented by 1.

2.3.14 General category - interrupts

The processing of an interrupt is visualized as a three step process. The first step is to turn on the interrupt. Once an interrupt is turned on, the occurrence of the interrupt can be accepted by the CPU. The second step is the acceptance of the interrupt. An interrupt is accepted by being placed in the interrupt queue (I queue). The interrupts in the I queue are ordered on a basis of priority. Each interrupt is assigned a priority number. The interrupt with the highest priority is placed first in the I queue. Elements

in the I queue with the same priority number are ordered on a first-in first-out basis. The third step is to acknowledge the interrupts. To acknowledge an interrupt the currently executing process (program) is stopped, and its status is stored. The stored status must be sufficient to restart the interrupted program without error. After the currently executing program is stopped and its status stored, a branch to the proper interrupt routine is taken. Once the interrupt is handled, the status of the previously executing program is restored and the program is restarted. Therefore, provision must be made to form a queue of interrupted program status information. The interrupted program queue is called the R queue. The R queue contains both the status of the interrupted program and the information about the interrupt condition which caused the interrupt.

2.3.14a SET INTERRUPTS ON S1, ..., SN K LABEL

This instruction turns on the interrupts for device conditions S1, S2, ..., SN. The interrupts specified by S1, ..., SN may be a particular device interrupt condition, or an I/O device name, in which case all the interrupts associated with the device are turned on. The value K assigns a priority to S1, ..., SN. If one of the interrupts S1, ..., SN is acknowledged a branch to LABEL occurs.

2.3.14b SET INTERRUPTS OFF S1, S2, ..., SN

This instruction turns the interrupts off for device conditions S1, S2, ..., SN. The interrupts specified by S1, ..., SN may be a particular device interrupt, or an I/O device name, in which case all the interrupts associated with the device are turned off.

2.3.14c HOLD INTERRUPTS K

All interrupts which occur with priority K or lower are not acknowledged but are held in a queue for later processing. The interrupts are arranged in the queue with those interrupts with the highest K first. Within the same K class the interrupts are ordered on a first-in first-out basis.

2.3.14d RELEASE INTERRUPTS K

All interrupts K or higher which are in the interrupt queue are released for acknowledgment. The queue is processed from the top. If a new interrupt occurs before the interrupt queue is empty, the new interrupt is placed in the queue and ordered with the highest K first. Within the same K class the interrupts are ordered on a first-in first-out basis.

2.3.14e EMPTY I QUEUE K

All interrupts in the I queue with priority K or less are deleted.
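The I-queue ordering rule, highest priority K first with first-in first-out order within a priority class, maps naturally onto a heap with a tie-breaking arrival counter. This is an illustrative sketch of the queueing discipline only, not of the original hardware mechanism.

```python
import heapq
import itertools

class InterruptQueue:
    # Model of the I queue: highest K pops first; equal-K interrupts
    # come out in arrival (first-in first-out) order.
    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()

    def accept(self, k, interrupt):
        # negate K so the largest priority is at the top of the heap;
        # the arrival counter breaks ties in FIFO order
        heapq.heappush(self._heap, (-k, next(self._arrival), interrupt))

    def acknowledge(self):
        return heapq.heappop(self._heap)[2] if self._heap else None
```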

2.3.14f INTERRUPT RETURN S1

If S1 is not present the top process in the interrupted process queue (R queue) is restarted. If S1 is present the top process in the R queue is deleted and a branch to S1 is taken.

2.3.14g GET INTERRUPT STATUS S1

The interrupt information of the last acknowledged interrupt (top of R queue) is placed in S1. The R queue is not affected.

2.3.15 General category - branch operations

The branch types are few in number and rather basic. The power of the branching operations stems mainly from the condition sets. The condition sets in total allow an effective check upon the operations which have been performed by the CPU. The branch types are listed below.

BRANCH Types

2.3.15a BR LABEL y F(CS, ∧, ∨, ¬)

y is the identifier for some set of OPERATION types. F(CS, ∧, ∨, ¬) is a function of the CONDITION SETS associated with y and the operations of AND, OR, and NEGATION. The branch to LABEL occurs if the present states of the condition sets cause F(CS, ∧, ∨, ¬) to be true.

2.3.15b BR LABEL S1 y S2    y ∈ [=, ≠, <, ≤, >, ≥]

2.3.15c BR LABEL S1 y C    y ∈ [=, ≠, <, ≤, >, ≥]

C is a constant.

2.3.15d BR LABEL I/O COND S1

A branch to LABEL occurs if the I/O status which is stored in S1 agrees with COND.

2.3.15e BR SUB (NAME) (LIST)

A branch to the subroutine named NAME is executed. LIST is a set of parameters which must be passed between the subroutine and the calling program.

2.3.15f RETURN

A return from a subroutine to the calling program is made.

2.4 Differences Between PTL and Other Languages

We would like to compare and contrast PTL with both higher level languages and assembly level languages. Only one assembly level language will be used for comparison: the IBM 360 assembly language. The purpose of the comparisons is to defend the decision to develop a new language instead of using an existing language. The comparison will indicate that PTL lies somewhere between higher level languages and assembly level languages in complexity; not equidistant, but closer to the high level languages.

Let us compare PTL and IBM 360 assembly language first. What advantages does PTL have over the assembly language? First, the instruction set is much more powerful. The string manipulation, convert, stack, etc., operations are not present in the IBM 360 assembly language. Therefore, it should be easier to write PTL programs than assembly language programs if stack, list, etc., type operations are required. The PTL program should also be faster for these complex operations, since direct implementation of complex operations in microcode is usually faster than using several machine language instructions. At the very least, one level of interpretation is eliminated for these operations.

The PTL language is architecture independent since it does not refer to registers, local storage areas, or other machine dependent entities. A program written in an assembly language is architecture

dependent and can be run only on one machine or a limited class of machines unless the program is run under emulation or simulation. A program run under simulation always makes inefficient use of computer resources and generally runs at a slow rate. A program run under emulation tends to execute reasonably fast but makes less than optimal use of the computer resources.

It could be argued that a machine language could be designed which would contain the PTL operations as its base. This approach would present some problems. Some PTL operations require as many as four operands. The operands can be located in registers or main storage, and therefore each instruction can have several format types. Thus the instruction code length would necessarily be quite long to accommodate the many different instruction types. As a practical matter, it would seem reasonable to restrict the proliferation of format types by binding some of the operand locations to either main core or registers. The binding of operands can reduce the program execution speed. This occurs for two reasons. If an operand is in main core but should be in a register (or high speed local storage) for instruction execution, a main core memory cycle will be required. If such an instruction is executed several times within a short loop all its operands should be maintained in registers, but this could be impossible since the format binds some of the instruction's operands to main core. The opposite

77 effect can also happen. It may occur that a sequence of instructions requires more registers than are available. Therefore, instructions which move operands back and forth between main core and register storage will be required. The amount of operand movement will be increased if the formats restrict the location of some operands to registers. The major inefficiency occurs because the machine language has to fix the location of operands and cannot make efficient use of high speed local storage on a global basis. The use of local storage will be optimized only on an instruction by instruction basis. It should be pointed out (see Section 2.5) that the assignment of data quantities to storage locations is one of the major tasks of the PTL translator. A comparison between PTL and other higher level languages will be made on a function by function basis. Most of the comparison will be made between PTL and FORTRAN, with references to PL/1, SNOBOL, APL, and others. The equivalent of NAMELIST in FORTRAN or DATA in PL/1 would require several PTL statements. The PTL statements would implement the NAMELIST operation using string search, string get, and input convert operations. The input output capabilities of APL and PTL cannot be compared on an equitable basis. APL is an interpretive language with

78 rather limited I/O capabilities. More sophisticated I/O features could be added to APL if desired. The testing and control of I/O devices is more complete in PTL than in FORTRAN and PL/1. In PTL the program can specify exact I/O conditions to branch on and can send specific I/O control signals. (APL has no explicit I/O control, so no comparison can be made.) The arithmetic operations in PTL are individually very easy to use. The operands can be subscripted in a manner similar to FORTRAN. However, each arithmetic operation requires all the operands to have the same internal data representation. Thus any mode conversion must be specified by the programmer. In FORTRAN and PL/1 mode conversion is accomplished automatically without any programmer effort. A difference between PTL and many of the higher level languages is the inability to express multiple arithmetic operations in one line. Thus in FORTRAN x = (y * 2 * z)/w can be expressed in one line. In PTL the same expression requires three lines of code. This problem could be resolved with the use of a precompiler which converted FORTRAN-like arithmetic statements into sequences of single-operation statements. APL has the most sophisticated arithmetic operations of any language of which we are aware. The generalized matrix product, outer product, rotation, summation of vector components, etc.,

79 allow very powerful programs to be written in a few statements. The APL statements mentioned above can be simulated by using PTL statements. The simulation is made easier by the array iteration facilities of PTL. FORTRAN and PL/1 would require more statements than would PTL to emulate those APL operations. The iteration control in PTL is somewhat limited in scope, since the index variable can only be incremented in steps of 1 and the index is always positive. The iteration facilities for arrays and lists provided in PTL are very powerful and can be very useful. The string manipulation operations of PTL are the basic set of concatenation, insert, delete, get, replace, and match. The more complicated string manipulation operations which are available in SNOBOL can be accomplished in PTL by combining several PTL operations. String manipulation is harder to do in PTL than in SNOBOL. This is to be expected, since SNOBOL is a special purpose string manipulation language. The list operations in PTL are basic but adequate to handle simple list type operations. The operations one normally performs on lists are reasonably simple. The capability to search a list for a match on a particular subelement of each list element is provided by an iterate operation which controls a loop and iteratively selects the location of each list element. Whatever search is required is then done within the loop. A limited set of list operations is

80 provided in PL/1 but is not available in FORTRAN or SNOBOL. The full complement of stack operations is provided in PTL. There are very few stack operations that are required anyway. However, stack operations are not provided in the other languages at all. The branch operations in PTL cannot really be compared with those in higher level languages; the manner in which they are specified is quite different. The comparison between PTL, 360 assembly language, and higher level languages has not been made to assert the superiority of one or another of the languages. The comparison has been presented to place PTL in perspective relative to the other languages. Why design a new language? What can't an existing language such as APL, FORTRAN, PL/1, or SNOBOL do as well as PTL? First, the languages mentioned above, and all other languages with which we are familiar, mask the user's true intent. Thus, functional statistics gathered on programs written in these languages would not be valid. Such statistics would not provide for truly optimal hardware selection. For example, FORTRAN does not have list operations, bit manipulation operations, or string operations. PL/1 does not have binary tree operations or stack operations. The list operations which are provided in PL/1 are not sufficient for even minimum list manipulation. SNOBOL does not

81 have floating point arithmetic operations. APL does not have explicit floating point or integer arithmetic operations. PL/1 does not have sufficient stack, linked list, or binary tree operations. Thus, the languages are not amenable to the estimation of valid statistical information on functional usage. Second, the languages would be difficult to compile into efficient microcode. In particular, APL and PL/1 have a large number of basic constructs. Thus, the number of in-line microcode subroutines which would have to be supplied to the compiler would be excessively large. Some of the constructs of PL/1, APL, SNOBOL, and FORTRAN are so complicated that the microcode for such routines would be excessively large. Such routines might require more temporary storage and variable storage than can be supplied by local storage. This would imply that some storage would have to be assigned before translation and could result in inefficient code. The problem of storage assignment is explained more fully in Section 2.5. Third, it seems reasonable to write compilers, interpreters, and other heavily used programs in just "one language". Then, as was mentioned earlier, only one new translation is required for each new computer system. PTL is designed to compile efficiently and contains stack operations and other operations which are used in writing interpreters and compilers. Thus PTL seems a good candidate for the "one language".

82 The preceding paragraphs have considered the languages (SNOBOL, PL/1, APL, FORTRAN) as a group. It might be well to consider their limitations on an individual basis. SNOBOL is a special purpose language applicable to only one problem area and is thus not a suitable candidate for a PTL-type language. FORTRAN is more general purpose than SNOBOL. However, FORTRAN is very scientifically oriented and is missing a number of operations which are necessary for a PTL-type language. APL is a very general language but has several problems. The variables in APL require dynamic storage allocation since they can vary in size from instruction to instruction. The dynamic allocation of storage would add unnecessary and unwanted overhead to both the translation process and the execution process. The remaining objections to APL reflect the author's bias. I think that APL is too concise. It is possible to write APL code which is practically unreadable. Also, the concise notation carries over to branch conditions and can cause end conditions within the branching operations which are sometimes hard to foresee. PL/1 is a very general language. As was pointed out earlier, some desirable program types are not available in PL/1; however, the biggest problem with PL/1 would be the generation of efficient microcode. The ability to dynamically allocate memory would be a serious problem, reducing both the efficiency of the compiler and the efficiency of the final microcode. Just the massive

83 size of PL/1 would make the generation of the PL/1 program type implementations very difficult. 2.5 Translation of PTL Programs The purpose of this section is not the design of a PTL translator. Instead, a block diagram (Figure 2.5.1) of a PTL microtranslator is presented. The block diagram represents the major steps required to translate a program written in PTL into microinstructions suitable for execution upon a microprogrammable CPU. The functions performed by each block of Figure 2.5.1 will be discussed below. In particular, those problems introduced by the use of microprogramming will be elaborated. The first block is a scanner. It performs a standard set of operations. The symbol table is formed. Each program statement is placed in a standard form. The operation name is changed into an identifier code, and the operands are ordered, with each operand replaced by its symbol table entry. If any of the operands are arrays, indexing variables and indexing constraints are noted. If a statement includes a branch operation, it is also placed into a standard form, with the branch operation name replaced by an identifier code. Any errors in data assignment, instruction labeling, or instruction formatting are flagged. The operations performed by the first block are basically the same as those performed by a non-microprogram scanner analyzer.

84 [Figure 2.5.1: Block Diagram of Process Required to Translate PTL Programs into Microcode. Block 1 reads the input PTL program, creates the symbol table, and reformats the instructions, flagging data type errors and invalid instructions. Block 2 creates the flow graph. Block 3 optimizes the branch tests. Block 4 inserts microcode, with operands in symbolic form. Block 5 optimizes the data assignment and inserts operand movement microcode, yielding the final microcode.]
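The scanner operations described above (symbol table formation, operation names changed to identifier codes, operands replaced by their symbol table entries) can be sketched in a few lines. This is only an illustration, not part of the dissertation: the opcode table, the statement representation, and all names are invented for the example.

```python
# Hypothetical identifier codes for a few PTL operation names.
OPCODES = {"ADD": 1, "SHIFT": 2, "BRANCH": 3}

def scan(statements):
    """Put each (operation-name, operand-list) statement into standard form:
    the operation name becomes an identifier code, and each operand is
    replaced by its symbol table entry (here, its index in the table)."""
    symtab, standard = {}, []
    for op, operands in statements:
        if op not in OPCODES:
            raise ValueError(f"invalid instruction: {op}")
        entries = [symtab.setdefault(name, len(symtab)) for name in operands]
        standard.append((OPCODES[op], entries))
    return symtab, standard
```

For example, scanning ADD S1 <- S2 + S3 followed by SHIFT S1 <- S2 yields the table {S1: 0, S2: 1, S3: 2} and the standard forms (1, [0, 1, 2]) and (2, [0, 1]).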

85 Therefore, the design of the scanner analyzer portion of the translator should present no new problems. The second and third blocks of Figure 2.5.1 are used to minimize the overhead associated with the setting and resetting of condition sets (see Section 2.4.1). If a CPU does not have hardware which checks and sets a flip flop or other two-state device for a desired condition, such hardware must be simulated by the microprogram. For example, the add operation has an associated condition set which is set if the result of the addition is all zero. If the hardware does not perform the test, then several microinstructions are required to set the condition set. In the worst case, the result of the addition must be compared with zero. Then the condition set word must be read out of storage, set or reset depending upon the result of the previous compare operation, and then rewritten. If these additional microoperations are required for every add operation that appears in a PTL program, and similar microoperations are required to handle other PTL operations, the bookkeeping associated with condition sets becomes very restrictive. However, even though most PTL operations produce a condition set or condition sets, in most cases the status of the condition sets for an operation in a given position in the program is never tested. The function of blocks two and three is to eliminate the modification of condition sets for operations whose results are not tested. If the setting of condition

86 sets requires no overhead, that is to say, if the condition sets are handled by built-in hardware, there is no need to eliminate unused settings of condition sets, because they do not result in additional microcycles. For such CPU's, block 2 and block 3 would be eliminated from the microcompiler. The function of block 2 is to produce a flow diagram of the program. This is simply a graph in which each node is either a program label or a conditional branch. The arcs of the graph are directed arcs, and each arc represents a conditional branch from a branch operation to a program statement. The function of block 3 is to determine which statements within the program require the modification of their associated condition sets. The method used to select the appropriate statements is quite simple. The program flow diagram produced by block 2 contains all conditional branches. For each conditional branch, a search is made using both the PTL program (output from block 1) and the program flow diagram to determine which PTL operations could affect the appropriate condition set. Only those operations whose results are tested by a branch statement via a condition set will add microcode for the modification of the appropriate condition sets. Most PTL statements, whose results are not being tested, will not modify condition sets. The examples of Figures 2.5.2a and 2.5.2b indicate the function performed by blocks 2 and 3.
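As a sketch of block 2, the flow diagram can be built directly from the branch structure of a program. The representation below (a statement is a label plus a list of (condition, target) branches, with None marking an unconditional branch) is an assumption made for this illustration, not the translator's actual data structure.

```python
def build_flow_graph(program):
    """program: list of (label, branches) per statement; branches is a list
    of (condition, target-label) pairs, condition None = unconditional.
    Returns the directed arcs of the flow diagram as (from, to) index pairs."""
    index = {lab: i for i, (lab, _) in enumerate(program) if lab}
    arcs = set()
    for i, (_, branches) in enumerate(program):
        falls_through = True
        for cond, target in branches:
            arcs.add((i, index[target]))
            if cond is None:          # unconditional branch: no fall-through
                falls_through = False
        if falls_through and i + 1 < len(program):
            arcs.add((i, i + 1))      # control falls into the next statement
    return arcs
```

Applied to Program 1 of Figure 2.5.2a (statements indexed from 0), the arcs include the loop arc 5 -> 1 created by BR A and the conditional arc 3 -> 6 created by BR(OV) LABEL.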

87 Assume the following operations:

ADD     S1 <- S2 + S3            Condition sets: ALL ZERO, OVERFLOW
SHIFT   S1 <- S2 SH S3           Condition sets: CARRY OUT
BRANCH  BR (Condition True) LABEL

Program 1

Statement #
1  START   S1 <- S2 + S4
2  A       S2 <- S2 SH S1
3          S1 <- S2 + S3
4          S3 <- S2 SH S2, BR(OV) LABEL
5          S1 <- S2 + S4
6          BR A
7  LABEL   S1 <- S2 + S4
8          STOP

For this program, only the status of the add overflow condition set must be modified, and it need be modified only after execution of statement 3.

(a) [Flow graph of Program 1: START (1) -> A (2) -> (3) -> (4); BR(OV) at (4) -> LABEL (7) -> (8) STOP; (4) -> (5) -> (6); BR A at (6) -> (2).]

Function Performed by Blocks 2 and 3 of Figure 2.5.1
Figure 2.5.2

88 Program 2

Statement #
1  START   S1 <- S2 + S4
2          S1 <- S1 SH S3, BR A
3  B       S1 <- S2 + S3
4          S1 <- S1 SH S3, BR A
5  C       S1 <- S2 + S5
6          S1 <- S1 SH S4, BR(OV) B
7  A       S2 <- S1 SH S4, BR(OV) C
8          STOP

For this program, every add operation must be checked to determine if the add overflow condition set must be modified.

(b) [Flow graph of Program 2: START (1) -> (2); BR A at statements 2 and 4 -> A (7); B (3) -> (4); BR(OV) at (6) -> B (3), falling through to A (7); BR(OV) at (7) -> C (5), falling through to (8) STOP.]

Function Performed by Blocks 2 and 3 of Figure 2.5.1 (Cont.)
Figure 2.5.2

89 The only statements which modify condition sets are primary predecessor statements. A predecessor statement of a conditional branch is a statement which is executed before the conditional branch and whose execution could affect one of the condition sets tested by the conditional branch. A primary predecessor statement is a predecessor statement of a conditional branch for which there is a path from the statement to the conditional branch which contains no other predecessor statement for the same conditional branch. Thus, if there were no labels (entry points) between a conditional branch and its predecessor statements, there could be only one primary predecessor statement. It is true that one primary predecessor may be primary for several condition sets tested by a conditional branch. Only primary predecessor statements need to modify the state of condition sets. Besides branch statements which are conditional upon the state of a condition set, PTL allows branch statements which are conditioned upon a named data quantity being less than, greater than, or equal to zero. In some cases, those branch tests can be transformed into tests on the state of condition sets. In such cases, the operations performed by blocks 2 and 3 can be applied to those branch tests also. It is difficult to estimate the amount of computer time required to execute the operations performed by blocks 2 and 3. The operations

90 performed are complex. However, the FORTRAN H compiler [L5] performs similar operations in the sense of producing flow diagrams and searching for predecessors. The definition of predecessors in FORTRAN H is completely different, although the search procedure would be somewhat similar. Since the compilation time of the FORTRAN H compiler is not excessive, we expect that the operations specified by blocks 2 and 3 are feasible in the sense of computation time. How effective would blocks 2 and 3 be at eliminating unnecessary overhead resulting from condition set status modification? The process should be very effective. Most conditional branches which test condition sets occur within a very few statements of the operation which is being tested via the condition set mechanism. Thus, it will rarely be the case that a branch can be made to a statement between the operation being tested (via condition sets) and the conditional branch which specifies the test. Thus, in the typical case there will be only one primary predecessor (the operation statement) and the condition set overhead is minimal. There can be more than one primary predecessor for a condition set of a conditional branch statement. If all the primary predecessors are actually tested, no unnecessary overhead is generated. The situation depicted in Figure 2.5.2b is an example for which there is more than one primary predecessor for a conditional branch.
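The search for primary predecessors can be sketched as a backward walk over the flow graph: starting from the conditional branch, follow arcs in reverse, and the first predecessor statement met on each backward path is primary (the walk stops there, since any earlier predecessor is screened off by it). The sketch, including the graph encoding, is illustrative only.

```python
def primary_predecessors(arcs, setters, branch):
    """arcs: directed (from, to) statement pairs of the flow graph;
    setters: statements that could affect the tested condition set;
    branch: the conditional branch statement.
    Returns the set of primary predecessor statements."""
    preds = {}
    for u, v in arcs:
        preds.setdefault(v, []).append(u)
    primary, seen, frontier = set(), {branch}, [branch]
    while frontier:
        node = frontier.pop()
        for p in preds.get(node, []):
            if p in seen:
                continue
            seen.add(p)
            if p in setters:
                primary.add(p)      # first setter on this backward path
            else:
                frontier.append(p)  # keep walking backward past non-setters
    return primary
```

On the two programs of Figure 2.5.2 (statements numbered as in the figure), Program 1 yields the single primary predecessor {3} for the BR(OV) at statement 4, while Program 2 yields {1, 3, 5}, matching the conclusions stated in the figure.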

91 There is no way to determine from the flow graph of the program whether all the primary predecessors are actually to be tested. As was pointed out earlier, such cases should be in the minority. Thus, we feel that the operations specified by blocks 2 and 3 can be implemented with a reasonable amount of computer time. We also feel that the use of blocks 2 and 3 will eliminate most of the unnecessary overhead associated with the use of condition sets. The fourth block represents the code insertion part of the translator. In this section each PTL operation is replaced by an in-line section of microcode. The microcode is designed to operate on fast storage, either registers or high speed local storage. Therefore, the microcode is in a symbolic form with the operands represented by their symbolic names. If there are array or string references and calculations on the dope vector are required, the microcode to perform those calculations is also inserted. The output from block 4 would be the microprogram, if all data were located in fast storage and were accessible by symbolic name. The fifth block represents the data assignment and data optimization section of the translator. Each section of in-line microcode which implements a particular PTL function was designed to work upon data held in fast storage. Therefore, if the operands required

92 by the microcode are not contained within fast storage, two things occur. First, microcode must be inserted to move the operands from main storage into fast storage. Second, the total execution time of the operation will be increased, since time must be spent to read operands out of and possibly back into main storage. The purpose of the fifth block of Figure 2.5.1 is to assign named data quantities to storage such that the execution time of the program will be minimized. Both microtranslators and traditional compilers must perform register assignment. However, the problems of register assignment are somewhat magnified in a microtranslator for two reasons. First, the use of microcode magnifies any loss in execution speed resulting from an inefficient data assignment. The difference in execution speed between a register to register instruction and a register to storage instruction on a conventional machine might be 3 to 1. However, the difference in speed between performing an operation in microcode with the operands in fast storage and performing it with the operands in main storage can be 10 to 1 or even larger. Second, the assignment of named data quantities by a translator whose target language is a microlanguage will have to be more of a global assignment than that required by a traditional compiler. The local storage space available to a traditional compiler is usually

93 limited in size. For example, on the IBM 360 computers there are 16 general purpose registers. Since the assignment space is limited, local assignment algorithms tend to be reasonably successful. Most local storage areas for microprogrammable computers contain 64 words or more. The larger the fast storage area, the more chance that a global assignment scheme will be significantly better than a local assignment scheme. Microprogrammable computers appear to have three levels of storage. There are the registers which are connected to the arithmetic logic units, and the local storage areas which can be read into and out of the registers in a microcycle. The third storage area is main core. Thus, the assignment problem is somewhat complicated because there are two high speed storage areas (registers and local storage). However, the registers are usually limited in number and must be assigned locally and not globally. The fifth block has special significance with respect to the integration of subroutines with a main program. It would be very nice if the assignment of named data quantities could be done globally, even across subroutine linkages. This can be accomplished if the assignment of named data quantities is done for both the main program and its subroutines at the same time. Thus, the fifth block

94 starts to take on the characteristics of a very sophisticated loader. The procedure would work as follows. Subroutines would be translated up to the beginning of block 5 and stored until the main program is translated. Then a global assignment is accomplished by combining the main program and its subroutines as entries into the fifth block of the translator. There is reason to question the necessity of a "main program and subroutines named data quantity assignment" procedure. The importance of the procedure will depend upon five factors: the number of operands within the subroutine; the amount of high speed local storage; the location of subroutines relative to loops; the frequency of execution of the subroutines; and the number of independent calls of each subroutine. The number of operands within the subroutine and the amount of local storage determine the usefulness of a global assignment scheme. If the number of operands and the space required for their storage are larger than the amount of local storage, the use of global assignment would be nearly equivalent to an assignment scheme local to the subroutine. If the number of operands and the space required for their storage are small and the amount of local storage is large, the use of global assignment across subroutine

95 boundaries would be beneficial. If a subroutine is located within a tight loop, or if a subroutine has a high frequency of execution, the effect of data assignment for the subroutine will be more pronounced. The number of independent calls of each subroutine affects the difficulty of a global assignment of named data quantities. The operands of a subroutine are bound. Therefore, if a global assignment is attempted, the procedure must work outward, starting with the subroutines and then assigning operands for each of the calling areas. Such assignments would be tricky at best. Is it possible to develop a sophisticated global data assignment procedure? The answer is yes. Several register assignment schemes have already been implemented. The IBM FORTRAN H compiler attempts to optimize register assignment. There are only 16 general purpose registers and 4 floating point registers, so the optimization must be somewhat localized in the FORTRAN H compiler. There is a FORTRAN compiler which optimizes register assignment for the Univac 1108. The 1108 has 48 registers, so the assignment must be somewhat global. It is interesting to note that this compiler can optimize for the main program and its subroutines as a unit. There are several register assignment procedures [D4] which have

96 been proposed by Day. The speed of the algorithms developed by Day is heavily dependent upon the profit vector, which associates a profit with the assignment of an operand to a register. Therefore, it is difficult to predict the value of his assignment algorithms. However, he has established a theoretical base from which to proceed. Thus, there are in existence one or more compilers which perform global assignment, and there is a theoretical base from which new global assignment algorithms may be developed. The discussion up to now has assumed that PTL programs would be translated into microcode. However, the terms microprogramming and microcode refer to two different types of microprogramming and microcode. The two types of microprogramming are vertical microprogramming and horizontal microprogramming. A microprogram system is seldom purely vertical or purely horizontal, but is usually of a type lying between the two. (See Appendix C.) The Program Type Language (PTL) can be translated into either vertical or horizontal microcode. However, the choice of one or the other will affect both the translation process and the resultant microcode. We wish to compare and contrast some of the differences between the translation of PTL programs into vertical microcode and horizontal microcode.
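The profit-vector idea behind Day's assignment procedures, mentioned above, can be suggested with a deliberately naive greedy sketch: each named data quantity carries an estimated profit (microcycles saved by holding it in fast storage), and the highest-profit quantities are assigned to the available local storage words. The names, profits, and capacity are invented, and real procedures must also handle interference between live ranges, which this sketch ignores.

```python
def assign_fast_storage(profits, capacity):
    """Greedy global assignment: keep in fast storage the `capacity`
    operands with the largest profit; the rest stay in main core."""
    ranked = sorted(profits, key=profits.get, reverse=True)
    return set(ranked[:capacity]), set(ranked[capacity:])
```

With profits {X: 120, W: 60, Z: 40, Y: 5} and a two-word local store, X and W are kept in fast storage while Z and Y remain in main core.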

97 Horizontal microprogramming allows more complete control of the CPU than vertical microprogramming. In particular, parallelism can be exploited more easily by using horizontal microprogramming rather than vertical microprogramming. Thus, the coding of the in-line section of microcode for each program type can on the average be done more efficiently using horizontal microprogramming. If horizontal microprogramming is available, the translation program should attempt to take advantage of parallelism when the in-line sections of microcode (program types) are merged to implement a program. The translator needs to test the last microinstruction of the current in-line section of microcode and the first microinstruction of the next section of in-line microcode to determine if parallelism would allow the two instructions to be merged into one microinstruction. Any conditional branching associated with the two microinstructions must also be checked for compatibility. Both horizontal and vertical microprogramming allow access to high speed local storage. Thus, for both types of microprogramming, operand assignment is very important. For horizontal microprogramming, it may require the equivalent of 10 microcycles to transfer an operand from main core into high speed local storage.

98 Therefore, the difference in execution speed between maintaining an operand in high speed local storage and maintaining it in main core may be very large. For vertical microprogramming, it may require the equivalent of only 1 microcycle to transfer an operand from main core into high speed local storage. This situation occurs when each vertical microinstruction resides in main core. Even for vertical microprograms which reside in main core, the assignment of operands can make a significant difference in the overall execution speed. Both vertical and horizontal microprogramming, when used to implement PTL programs, eliminate one level of interpretation. Thus, PTL program types are implemented directly in microcode, instead of being implemented in machine language instructions which would then be implemented in microcode. If the vertical microinstructions are really miniinstructions, then less than a full level of interpretation is eliminated. The miniinstructions provide finer control of the CPU than machine instructions, but they do not operate at the gate level as do horizontal microinstructions. From the previous discussion, it is apparent that the translation of PTL programs into horizontal microcode will produce faster executing programs than translation into vertical microcode for equivalent machines. However, the length of each control memory word for horizontal microcode is

99 two or three times greater than that for vertical microcode. Thus, the cost of control memory for large programs is somewhat prohibitive if horizontal microprogramming is utilized.
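The merging test described above, applied where one in-line microcode section meets the next under horizontal microprogramming, can be sketched as follows. Representing a horizontal microword as a mapping from microorder fields to values, and blocking the merge whenever the first word carries a conditional branch, are simplifying assumptions made only for this illustration.

```python
def try_merge(first, second):
    """Attempt to merge the last microword of one in-line section with the
    first microword of the next.  Succeeds only when the two words use
    disjoint microorder fields and the first word does not branch."""
    if "branch" in first:             # operations after a branch cannot fold in
        return None
    if set(first) & set(second):      # some microorder field is used by both
        return None
    merged = dict(first)
    merged.update(second)
    return merged
```

Two words that drive different functional units, such as {alu: ADD, dst: R1} and {shift: L1}, fold into one microword; two words that both need the alu field do not.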

Chapter III A MODEL OF MICROPROGRAM CONTROL One of the objectives of this study is the optimal selection of components for central processing units. Therefore, a cost performance model of the central processing unit must be devised. We will view the CPU as a collection of hardware units which are controlled by a microprogram. The functions to be performed by the CPU will consist of a set of program types. Each program type (Chapter II) is expected to be a frequently used computer operation. The CPU hardware configuration will be modeled as a base hardware unit and a set of optional hardware units. Each program type will have one or more implementations. A program type implementation is a sequence of microinstructions which will cause the CPU hardware to perform the desired function. Each of the implementations for a particular program type will use different hardware options and will have a different execution time and a different hardware cost. The execution time and cost of a program type implementation can be computed, and from these the optimal cost performance curve for a set of program types can be derived. The preceding paragraphs have been a brief description of what will be developed in this chapter. The development will proceed as 100

101 follows. First, the general model will be presented. Second, the appropriate definitions will be given. Third, the equations of cost and execution time will be developed. Fourth, the problems associated with the selection of a base hardware set will be discussed.

102 3.1 The Model The model is composed of two parts: the operations model and the CPU model. The operations model describes the basic functions to be performed by the CPU. The CPU model describes the available hardware with which operations may be implemented. The operations model is the Program Type Language (PTL) with the addition of statistics on usage. Each PTL operation is one function to be performed by the CPU. Since any non-PTL operation is described using PTL operations, and hence is executed using PTL operations, PTL operations are constrained to represent the basic functions performed by the CPU. This emphasizes the importance of selecting PTL operations wisely. The statistics on usage are required to calculate performance and are described by values for both frequency of occurrence and frequency of execution. These quantities will be defined formally in Section 3.2. The information supplied by the statistics on usage allows the performance calculations to weight each program type as a function of its frequency of execution. The statistics on usage are also used in the storage cost calculation, since they contain information concerning the number of occurrences of each program type for a general class of microprograms. The CPU model is used to describe the CPU, or rather a collection of possible CPU's. The CPU is modeled as a base hardware

103 and a set of possible hardware options. The hardware options are functional units which may be added to the CPU to improve its performance. The base hardware may be as primitive or as sophisticated as desired. However, it is generally true that a more primitive base hardware will induce more hardware options than would a sophisticated base hardware. Each hardware option has a cost. The hardware option cost is composed of two parts. First, there is the actual cost to physically construct the unit and add the unit to the base hardware. Second, there is the additional cost of control memory resulting from any additional microorders required by the hardware option. The additional microorders result in a lengthening of the microword. The CPU is modeled by specifying a base set of hardware, the cost of the base hardware, and the number of microorders required to control the base hardware. Also specified are a set of hardware options, the cost of each hardware option, and the number of microorders required to control each hardware option. The two parts of the model are connected by the implementation methods. The operations model specifies what functions are to be performed, how often each function occurs in a program, and how often each function is executed. The CPU model specifies the available hardware and the cost of the hardware.

Each implementation method specifies the microinstructions required to implement a particular operation using some subset of the hardware. Therefore, an associated cost, in terms of control storage cost and hardware cost, and an execution time, in terms of microcycles, can be computed for each implementation. In the next section formal definitions will be given for the terms discussed above. The cost equations will also be given in the same section.
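To make the two-part model concrete, its inputs can be sketched as a small data structure. This is a Python sketch, not part of the report; all names and numbers are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class HardwareOption:
    # One functional unit that may be added to the base CPU.
    name: str
    cost: int        # dollars to construct and attach the unit
    microbits: int   # extra microorder bits it adds to the microword

@dataclass
class Implementation:
    # One microinstruction sequence realizing a program type.
    options_required: frozenset  # subset of hardware-option names
    words: int                   # control-storage words required
    cycles: int                  # execution time in microcycles

@dataclass
class ProgramType:
    name: str
    occurrences: int         # occurrences in a "typical program"
    exec_frequency: float    # average executions per occurrence
    implementations: list = field(default_factory=list)

# Every program type must carry one base-hardware-only implementation,
# plus optional "improved" implementations using hardware options.
add = ProgramType("ADD", occurrences=20, exec_frequency=5.0)
add.implementations.append(Implementation(frozenset(), words=6, cycles=9))
add.implementations.append(Implementation(frozenset({"fast-adder"}), words=2, cycles=3))
assert any(not impl.options_required for impl in add.implementations)
```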

3.2 Cost Performance Equations

In this section, the cost performance equations will be developed. The equations represent the cost in dollars versus the execution time in microcycles for the implementation of a specified set of program type operations. Before developing the cost performance equations, a set of definitions will be given. The first group of definitions involves program type implementations.

The program type vector P is a list of all program types. The hardware option vector H is a list of all available hardware options. Each hardware option is a piece of circuitry which may be added to the basic CPU configuration. The designers will supply one or more implementation methods for each program type P_j. Each implementation method is a sequence of microinstructions which will cause the CPU to perform the desired function. It is required that the designers specify for each program type one implementation method that uses only the base hardware. The hardware designer may then specify "improved" sequences of microinstructions that require one or more of the available hardware options. The improved sequences are expected to either use fewer control memory words or execute faster than the basic implementation method, or both.

The specification of the hardware options required for each implementation method is given by the implementation matrix B^j. One row of each implementation matrix B^j will contain all zeroes, since one implementation method uses only the base CPU configuration.

Program Type Vector; P: The program type vector P is the list of statements allowed in the algorithm language. P_j = the jth program type.

Hardware Option Vector; H: The hardware option vector H is the list of available hardware options. H_j = the jth hardware option.

Implementation Matrix; B^j: The implementation matrix B^j specifies which hardware options are required for the implementation methods of the jth program type. The ith row of matrix B^j specifies which hardware options are required for the ith implementation method of the jth program type. B^j_ik = 1: the kth hardware option is required for the ith

implementation method of the jth program type. B^j_ik = 0: the kth hardware option is not required for the ith implementation method of the jth program type.

Extensive use of subsets of the hardware option set H will be made throughout the remainder of this work. The hardware option select vector R^i provides a convenient way of specifying such a subset. The superscript i is used to order any set of select vectors and is used for identification purposes only. Even though R^i is a vector, it will be convenient to occasionally refer to the set of hardware options specified by R^i as the set R^i.

H Option Select; R^i: The H option select vector R^i specifies a choice of hardware options. R^i_k = 1: the kth hardware option (H_k) is selected. R^i_k = 0: the kth hardware option (H_k) is not selected. The superscript i is used to order any set of select vectors.

The cost of the hardware is the cost of the base hardware plus the cost of any hardware options (denote that subset of H by R^k)

which are chosen to complement the base hardware. The cost of an individual hardware option may be dependent upon the other elements of R^k. For example, there may be two hardware options which can be combined into one package if both are chosen. Thus, the cost of both hardware options would be less than the sum of their individual costs by the amount of the reduction in packaging costs.

The ability to express dependent cost relationships among the hardware options is achieved by providing a vector of costs for each hardware option. Each element of the cost vector C^i of the ith hardware option has two associated vectors. The inclusion vector U^i_j specifies which hardware options must be present in R^k if the jth element of C^i (C^i_j) is to be used as the cost of the ith hardware option, and the exclusion vector X^i_j specifies which hardware options must not be present in R^k if the jth element of C^i (C^i_j) is to be used as the cost of the ith hardware option. The sets of inclusion vectors and exclusion vectors required by C^i are denoted by U^i and X^i respectively. Both U^i and X^i are also used with the microbit vector M^i, which is defined later. The hardware option cost is given by Equation 3.2.1, which follows the definitions of C^i, U^i, X^i, and p(R^k, U^i_j, X^i_j) given below.

Hardware Option Cost; C^j: The hardware option cost vector C^j lists the set of possible costs for the jth hardware option.

The Inclusion Matrix; U^j: The ith row of the inclusion matrix U^j specifies which hardware options must be included with the jth hardware option if the ith cost and the ith microbit value for the jth hardware option are to be used. U^j_ik = 1: the kth hardware option must be included with the jth hardware option if the C^j_i cost value and the M^j_i microbit value are to be used. U^j_ik = 0: the kth hardware option may or may not be included with the jth hardware option if the C^j_i cost value and the M^j_i microbit value are to be used.

The Exclusion Matrix; X^j: The ith row of the exclusion matrix X^j specifies which hardware options must not be included with the jth hardware

option if the ith cost value and the ith microbit value for the jth hardware option are to be used. X^j_ik = 1: the kth hardware option must not be included with the jth hardware option if the C^j_i cost value and the M^j_i microbit value are to be used. X^j_ik = 0: the kth hardware option may or may not be included with the jth hardware option if the C^j_i cost value and the M^j_i microbit value are to be used.

The p function p(R^k, U^i_j, X^i_j) has a value of 1 if the set of hardware options specified by R^k satisfies the requirements of hardware option inclusion and exclusion specified by U^i_j and X^i_j, and is 0 otherwise. The p function is used to select the cost of a hardware option from its cost vector (C^i) and the required microbits from its microbit vector (M^i, to be defined later). It is natural to expect that p(R^k, U^i_j, X^i_j) will be zero if the ith hardware option is not specified (R^k_i = 0). This can be accomplished by requiring that every row of U^i have a 1 in position i, that is, U^i_ji = 1 for all j. Also, the set U^i and the set X^i must satisfy the condition

that there is at most one value of j for each possible R^k such that p(R^k, U^i_j, X^i_j) is 1. For example, if there were four hardware options and the cost of hardware option 3 depended upon the inclusion or exclusion of hardware options 1 and 2 but was independent of hardware option 4, the U^3 and X^3 matrices might appear as below.

          | 0 0 1 0 |          | 1 1 0 0 |
    U^3 = | 0 1 1 0 |    X^3 = | 1 0 0 0 |
          | 1 0 1 0 |          | 0 1 0 0 |
          | 1 1 1 0 |          | 0 0 0 0 |

The cost value C^3_1 would be selected if R^k were (0010) or (0011), C^3_2 would be selected if R^k were (0110) or (0111), etc.

p function; p(R^k, U^i_j, X^i_j): The p function gives a value of 1 if R^k satisfies the requirements specified by U^i_j and X^i_j.

    p(R^k, U^i_j, X^i_j) = 1  if (U^i_j AND R^k) = U^i_j and (X^i_j AND R^k) = 0
                         = 0  otherwise
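The selection mechanism can be made concrete with a short Python sketch (not from the report). It uses the four-option example above together with the hardware option cost sum of Equation 3.2.1; the cost values in C3 are hypothetical, but the row selections reproduce the text's statement that C^3_1 is chosen for R^k = (0010) or (0011), C^3_2 for (0110) or (0111), and so on.

```python
def p(R, U_row, X_row):
    # 1 iff R includes every option the inclusion row demands
    # and none the exclusion row forbids.
    incl_ok = all(r == 1 for r, u in zip(R, U_row) if u == 1)
    excl_ok = all(r == 0 for r, x in zip(R, X_row) if x == 1)
    return 1 if incl_ok and excl_ok else 0

# Rows of U^3 and X^3 from the four-option example in the text.
U3 = [(0,0,1,0), (0,1,1,0), (1,0,1,0), (1,1,1,0)]
X3 = [(1,1,0,0), (1,0,0,0), (0,1,0,0), (0,0,0,0)]
C3 = [100, 120, 150, 130]    # hypothetical cost vector C^3

def option_cost(R, C, U, X):
    # Equation 3.2.1: the sum picks out the single row with p = 1.
    return sum(c * p(R, u, x) for c, u, x in zip(C, U, X))

assert option_cost((0,0,1,0), C3, U3, X3) == 100  # C^3_1: neither 1 nor 2
assert option_cost((0,1,1,1), C3, U3, X3) == 120  # C^3_2: option 2, not 1
assert option_cost((1,1,1,0), C3, U3, X3) == 130  # C^3_4: both 1 and 2
assert option_cost((0,0,0,0), C3, U3, X3) == 0    # option 3 not selected
```

Note how requiring a 1 in the option's own column of every U^3 row makes the cost vanish whenever option 3 itself is not in R^k.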

3.2.1   Cost of hardware option i = Sum over j of C^i_j p(R^k, U^i_j, X^i_j), for option set R^k

Note that p(R^k, U^i_j, X^i_j) will be 1 for only one value of j.

Occurrence Vector; O: The occurrence vector O is the estimated number of occurrences of each program type in a typical program. O_j = the number of occurrences of P_j in a "typical program."

Execution Frequency Vector; F: The execution frequency vector F is the estimate of the percentage of program type executions which will be spent in each occurrence of a program type. F_j = estimate of the percentage of program type executions which will be spent in each occurrence of the jth program type.

Execution Time Vector; V^j: The execution time vector V^j is the estimated number of microcycles required to perform the jth program type function

for each implementation method. V^j_i = the estimated number of microcycles required to perform program type j using the ith implementation method.

The cost of the CPU includes the base hardware cost (c_b), the cost of hardware options, if used, and the cost of the control memory. The cost of the control memory is a function of the cost of each bit of control memory (c_s), the length of the microprogram, and the length of the microinstruction word.

The microinstruction word length is determined by the number of bits required by the base hardware and the number of bits required by the hardware option set (R^k). The base hardware bits are specified by m_b and include the microorders required to control the base hardware and to sequence the microprograms. The number of microbits required by an individual hardware option may be dependent upon the other elements of R^k. For example, there could be one hardware option (the addition of a data path bus) which would force the addition of data path selection circuitry in other hardware options, which would require an additional microorder.

The ability to express dependent microbit relationships among the hardware options is achieved by providing a vector (M^i) of microbit values for each hardware option. The inclusion matrix (U^i), the exclusion matrix (X^i), and the p function are used to determine which element of M^i to use as the microbit value. The microbits for the ith hardware option are given by Equation 3.2.2. The concepts employed are identical to those used to select the cost value for the ith hardware option.

Hardware Option Microbits; M^j: The hardware option microbit vector M^j lists the set of possible microbits for the jth hardware option.

3.2.2   Microbits for option i = Sum over j of M^i_j p(R^k, U^i_j, X^i_j), for option set R^k

In specifying the microinstruction word length it is possible to include both field encoding and word encoding (Appendix C). Field encoding is handled by allowing the hardware option designers to specify the number of microbits required for each hardware option in place of the number of microorders. Thus the designers can field encode the information required to control a hardware option in any manner that they desire. Normally a limit, k, is set on the maximum number of bits per field. This requirement results from speed considerations; the more bits per field, the more gate

delays in the decoding process.

The word encoding is handled by allowing the designer to specify a value for w, which is called the word form variable. If the value of w is greater than 1, there is more than one word type. If there is more than one word type, a vector W^j of 1's and 0's for each hardware option (j) is supplied. The W^j vector specifies to which (there may be more than one) of the w word forms the microbits required by the jth hardware option should be added. Thus if w were 2, and the base microbits were 10, and hardware option 1 added 3 bits to word form 1, and option 2 added 2 bits to word form 2, the length of the microword for the base hardware plus options 1 and 2 would be 13 bits. However, if the microbits required by hardware option 1 were to be added to both word forms 1 and 2, the length of the microinstruction for the base hardware plus options 1 and 2 would be 15 bits.

Word Form; w: The word form, w, specifies the number of different word types used by the base hardware.

Word Type; W^j: The word type, W^j, specifies to which of the word types the microbits required by the jth hardware option are to be added. W^j_i = 1: add the microbits of the jth hardware option to word type i.

W^j_i = 0: do not add the microbits of the jth hardware option to word type i.

Base Hardware Cost; c_b: The base hardware cost c_b is the cost of the basic CPU hardware configuration, including the hardware required for the basic microsequencing.

Control Memory Bit Cost; c_s: The control memory bit cost c_s is the cost per binary storage element of the control memory storage array.

Base Microbits; m_b: The number of microbits required to control the base set of hardware and to sequence the microprograms.

Field Encoding Limit; k: The field encoding limit k is the maximum number of bits per field.

The concept of the microsubroutine needs to be explored briefly. The microprogram implementation of individual PTL operations required by a program written in the program type language can be accomplished in two ways. Each operation can be replaced by an in-line section of microcode, or each operation can be replaced by linkages to the proper microsubroutine. One or the other must be

done for each occurrence of an operation within the program.

The usage of microsubroutines results from a desire to reduce the control storage requirements. If only one copy of the microinstructions required to implement each operation is maintained in the control store, and if the number of microinstructions required to branch to a subroutine is small, a major reduction in control storage requirements is possible.

The replacement of in-line microcode by microsubroutines will cause a decrease in the execution speed of the program. The decline in execution speed results from two major factors. First, each linkage to or from a microsubroutine requires one or more microcycles. Second, the use of microsubroutines makes it a more difficult task to optimally allocate high speed memory (register and scratch pad memory) for either temporary or permanent storage of data. Therefore, additional microcycles may be required to move data from core storage to high speed memory for processing if sufficient high speed storage is not available. The two points mentioned above will be discussed in the next few paragraphs.

If microsubroutines are used to implement a PTL operation, each occurrence of the operation within a program will be replaced by a branch to the microsubroutine. The microsubroutine will effect the performance of the required operation, and a return branch to

the next PTL statement is then executed. Thus, an additional overhead of subroutine linkage occurs when microsubroutines are used in place of in-line microcode. It is possible to add sophisticated microsubroutine branching hardware to reduce the number of microcycles required to branch to and from a microsubroutine. However, the addition of hardware offsets the saving in control storage resulting from the use of microsubroutines.

If a microroutine is to be efficient in terms of performance, any data required by the microroutine should be located in the highest level of storage. Microroutines are written under the assumption that data is in either high speed registers or scratch pad memory. If the microroutine is written to be used as in-line code, the exact specification of storage locations for data can be accomplished when the PTL program is compiled into microcode. In fact, an important part of the compilation process would be the optimal allocation of high speed storage registers and scratch pad memory to achieve efficient execution.

However, if the microroutine is written as a microsubroutine, the location of the data must be fixed without regard to the intended PTL program. It would be difficult, if not impossible, to make efficient use of high speed storage if in-line microcode is replaced by microsubroutines. Therefore, the average execution time of

PTL operations might deteriorate because of excessive references to main core. Nevertheless, the savings in control storage that result from microsubroutines might be significant, or in certain cases the reduction in performance might not be significant.

We sum up with the following observations. Many of the PTL operations are easily implemented and would require very few lines of microcode. For these operations it would seem that implementation should be in the form of in-line microcode. One of the costs associated with microcode, in terms of control storage and execution time, is the need to move data into and out of the high speed storage areas. Since microroutines are written to work on data already contained in high speed storage, the storage costs and the execution time attributed to microroutines do not reflect the cost of data movement between main core and high speed storage. It should be remembered that a PTL program compiled into in-line microcode can optimize the location of data so as to minimize data movement between main core and high speed storage.

Some operations may be so complicated and occur so often in a program that very large amounts of control storage can be saved. If these operations are few in number it may be the case that a slight degradation in performance is acceptable, or it may be

possible to improve the performance by reserving some of the high speed storage for the exclusive use of microsubroutines. In general, we believe that the use of microsubroutines should be avoided if possible.

The use of microsubroutines to implement program type operations should not be confused with PTL subprograms. A PTL subroutine would be composed of many program type operations, any of which could be implemented as a microsubroutine. Thus, microsubroutines are a method of implementing program type operations, not PTL subroutines.

The decision to use in-line microcode or microsubroutines to implement each occurrence of a program type is made a priori for each program type. The choice is indicated by the subroutine selection vector.

Subroutine Selection Vector; E: The subroutine selection vector E indicates which program types will be implemented as in-line code and which program types will be implemented as subroutines. E_j = 1: the jth program type will be implemented as a microsubroutine.

E_j = 0: the jth program type will be implemented as in-line sections of microcode.

The calculations involving microsubroutines require the definitions of several new variables. The definitions require no elaboration and are listed below.

Subroutine Branch; s_B: The subroutine branch s_B gives the number of microinstructions required to initiate a subroutine branch.

Subroutine Return; s_R: The subroutine return s_R gives the number of microinstructions required to initiate a return from a microsubroutine to the calling area.

The other value needed for calculating the amount of storage is the storage vector S^j. The storage vector gives the number of microinstructions required for each implementation method of the jth program type.

Storage Vector; S^j: The storage vector S^j lists the number of microinstructions required for each implementation of the jth program type.

S^j_i = the number of microinstructions required for the ith implementation method of the jth program type.

The last definition is that of the selection vector. The selection vector is used to simplify the equations of cost and performance. It lists the current choice of program type implementation.

Selection Vector; N^i: The selection vector N^i lists the current choice of the implementation method for each program type. N^i_j = current choice of the implementation method for the jth program type. The superscript i is simply used to order any set of selection vectors.

The cost performance equations will be developed for two cases. The difference between the two formulations is the use of special branching hardware. The effect of special branching on cost is a decrease in the length of control storage, and hence cost, and the effect on performance is usually an increase in average execution time. The cost equations will be developed for both subroutine branching and no subroutine branching. Then the execution time equations will be developed. The selection vector is used extensively, since the cost equations and the average execution time equations are being developed for a particular set of implementation methods. However, the equations are valid for all combinations of implementation methods.

Equation 3.2.3 calculates the hardware options (R^k) required by the set of implementations specified by the selection vector (N^i). The cost (C_H(R^k)) of the base hardware and the hardware options is given by Equation 3.2.4. The cost of the hardware options is calculated by summing the cost (Equation 3.2.1) of each hardware option specified by R^k. Remember that R^k is a vector with a 1 in position j if the jth hardware option is a member of the current set of hardware options.

Equations 3.2.5 and 3.2.6 are used to calculate the width (number of bits) of the microinstruction word. If the word form (w) is 1, the number of bits is found by summing the base microbits (m_b) and the microbits (Equation 3.2.2) required by the hardware options specified by R^k. If the word form is greater than 1, the width of each word is computed, and the number of bits of the largest word is the microinstruction word size. The width of the ith word (with w > 1) is found by summing the base microbits and the microbits required by the hardware options specified by R^k and assigned to the word by W^j_i.
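The microword-width computation just described can be sketched in a few lines of Python, reusing the word-encoding example given earlier in this section (w = 2, 10 base microbits, option 1 adding 3 bits, option 2 adding 2 bits). The function names and the list/tuple encoding are illustrative, not from the report.

```python
# W[j][i] = 1 means option j's microbits go into word form i.
def microword_length(m_base, microbits, W, w):
    if w == 1:
        # Single word form: base bits plus every selected option's bits.
        return m_base + sum(microbits)
    # With w > 1 the microword must be wide enough for the widest form.
    return max(m_base + sum(m for m, row in zip(microbits, W) if row[i])
               for i in range(w))

bits = [3, 2]                       # microbits for options 1 and 2
# Option 1 in form 1 only, option 2 in form 2 only: max(13, 12) = 13.
assert microword_length(10, bits, [(1,0), (0,1)], w=2) == 13
# Option 1 in both forms: max(13, 15) = 15, as in the text.
assert microword_length(10, bits, [(1,1), (0,1)], w=2) == 15
# Degenerate single-word-form case: 10 + 3 + 2 = 15.
assert microword_length(10, bits, None, w=1) == 15
```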

The cost of control memory for the two cases is given by Equations 3.2.7 and 3.2.8, respectively. If microsubroutines are not used (C_NS(N^i), Equation 3.2.7), then the number of occurrences of each program type and the number of microinstructions for the implementation method for each program type are used to calculate the length of the program. If microsubroutines are used (C_SS(N^i), Equation 3.2.8), then the number of occurrences of each program type and the number of microinstructions required for each subroutine call, plus the number of microinstructions for each program type implementation, are used to calculate the length of the program.

The execution time (T_NS(N^i)) of the jth program type as a function of N^i is given by Equation 3.2.9 for non-microsubroutine implementations. When subroutine branching is used, the value given by Equation 3.2.9 must be increased by the number of cycles required to initiate a subroutine call plus the number of cycles required to initiate a subroutine return. The number of cycles to branch is assumed equal to the number of microinstructions to branch (s_B), and the number of cycles to return is assumed equal to the number of microinstructions to return (s_R). This is reasonable if both the branch code and the return code are straight line code. That value is given by Equation 3.2.10.
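The in-line versus microsubroutine storage tradeoff discussed in this section can be checked with a back-of-envelope sketch using the quantities defined above (O occurrences, S words per implementation, s_B branch words, s_R return words). All numeric values here are hypothetical.

```python
def storage_words(O, S, sB, sR, as_subroutine):
    if as_subroutine:
        # One shared copy of the routine plus O call sites and a return.
        return O * sB + S + sR
    # One full in-line copy per occurrence.
    return O * S

O, S, sB, sR = 12, 20, 2, 1
inline = storage_words(O, S, sB, sR, as_subroutine=False)
sub = storage_words(O, S, sB, sR, as_subroutine=True)
assert (inline, sub) == (240, 45)    # subroutine saves 195 words here...
# ...at a cost of sB + sR = 3 extra microcycles on every execution.

# Solving O*S > O*sB + S + sR gives the break-even occurrence count.
breakeven = (S + sR) / (S - sB)      # valid when S > sB
assert O > breakeven
```

The break-even formula shows why the report reserves microsubroutines for long, frequently occurring program types: for short routines (S close to s_B) the storage saving never materializes.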

The value T_e (Equation 3.2.11) is the average number of microcycles per program type execution. The estimate is made by using the frequency of occurrence, the frequency of execution, and the estimated execution time of each program type. The sum over all program types of the product of those three quantities is divided by the total number of program type executions to arrive at the average execution time (Equation 3.2.11). The performance measure P_T is the reciprocal of the average execution time.

3.2.3   R^k = OR over j = 1..|P| of B^j_(N^i_j)                 (|P| = length of the program type vector)

3.2.4   C_H(R^k) = c_b + Sum over j = 1..|H| of [ Sum over i of C^j_i p(R^k, U^j_i, X^j_i) ]        (hardware cost)

3.2.5   M(R^k) = m_b + Sum over j = 1..|H| of [ Sum over i of M^j_i p(R^k, U^j_i, X^j_i) ]          (if w = 1)

3.2.6   M(R^k) = max over i of [ m_b + Sum over j = 1..|H| of ( Sum over l of M^j_l p(R^k, U^j_l, X^j_l) ) W^j_i ]    (if w > 1)

3.2.7   C_NS(N^i) = [ Sum over j = 1..|P| of O_j S^j_(N^i_j) (1 - E_j) ] M(R^k) c_s                 (storage, non-subroutine)

3.2.8   C_SS(N^i) = [ Sum over j = 1..|P| of (O_j s_B + S^j_(N^i_j) + s_R) E_j ] M(R^k) c_s          (storage, subroutine)

3.2.9   T_NS(N^i)_j = V^j_(N^i_j)

3.2.10  T_S(N^i)_j = V^j_(N^i_j) + s_B + s_R

3.2.11  T_e(N^i) = [ Sum over j = 1..|P| of F_j O_j ( V^j_(N^i_j) (1 - E_j) + (V^j_(N^i_j) + s_B + s_R) E_j ) ] / [ Sum over j = 1..|P| of F_j O_j ]

3.2.12  P_T = 1 / T_e                                           (performance measure)
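The cost and performance calculations of this section can be rendered end to end as a simplified Python sketch. To keep it short, option costs and microbits are scalars (the inclusion/exclusion machinery and multiple word forms are omitted), and all numbers are hypothetical, not from the report.

```python
opt_cost = [300, 2000]     # cost of each hardware option
opt_bits = [2, 3]          # microbits each option adds
c_b, m_b, c_s = 1000, 10, 0.5
sB, sR = 1, 1

# Per program type: O occurrences, F executions per occurrence,
# E subroutine flag, and per implementation a (B-row, S words, V cycles).
ptypes = [dict(O=20, F=5.0, E=0, impls=[((0,0), 8, 12), ((1,0), 3, 4)]),
          dict(O=5,  F=2.0, E=1, impls=[((0,0), 30, 40), ((0,1), 10, 6)])]

def evaluate(N):                  # N = implementation choice per type
    chosen = [pt["impls"][N[j]] for j, pt in enumerate(ptypes)]
    Rk = [max(b) for b in zip(*(B for B, S, V in chosen))]      # options needed
    C_H = c_b + sum(c for c, r in zip(opt_cost, Rk) if r)       # hardware cost
    M = m_b + sum(m for m, r in zip(opt_bits, Rk) if r)         # microword width
    # Control storage: in-line copies, or one shared copy plus call sites.
    words = sum(pt["O"] * S if pt["E"] == 0 else pt["O"] * sB + S + sR
                for pt, (B, S, V) in zip(ptypes, chosen))
    # Average microcycles per execution, weighted by F*O, with linkage
    # overhead added for subroutine-implemented types.
    T_e = (sum(pt["F"] * pt["O"] * (V + (sB + sR) * pt["E"])
               for pt, (B, S, V) in zip(ptypes, chosen))
           / sum(pt["F"] * pt["O"] for pt in ptypes))
    return C_H + words * M * c_s, 1.0 / T_e    # (total cost, performance)

base_cost, base_perf = evaluate((0, 0))
fast_cost, fast_perf = evaluate((1, 1))
# The improved implementations buy performance at a hardware/storage price.
assert fast_perf > base_perf and fast_cost > base_cost
```

Running `evaluate` over every selection vector traces out exactly the cost-performance points that the optimization of Chapter IV searches.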

3.3 Base Hardware

In the previous section the existence of a suitable base hardware set has been assumed. In this section the base hardware concept will be examined further. In particular, two points will be discussed. First, what would constitute a minimal base set? Second, what procedures would have to be invoked if one or more program types could not be implemented using the base hardware set?

A base hardware set is by definition a set of hardware sufficient to allow the implementation of any program type operation. Therefore, a minimum base hardware set is a hardware set which satisfies the completeness criterion above and, in addition, minimizes some other function. We assume that the other function is cost. If one is strictly interested in the minimum cost set of base hardware, it might be possible to examine and compute the cost of all possible minimum base hardware sets. However, such a path seems very extreme and not terribly revealing. Instead, a near minimum set of base hardware will be presented and examined.

The base hardware set is shown in Figure 3.3.1. The hardware consists of the core memory, three registers, the arithmetic logic unit, the data paths, the microprogram controller, and the

microprogram storage. Three registers are the minimum number of registers required to access core memory and also perform logical operations upon core memory data. The three registers are a memory address register, a memory buffer register, and a result register. The MB and MA registers are required to cycle the core memory. The result register is needed to provide two inputs into the ALU. The MA register cannot be used to provide the second data input into the ALU: once the first data word was moved from the MB into the MA, the second data word could not be fetched into the MB, since the MA would then hold data rather than the address of the second data word.

The use of only three registers implies that the microprograms will use certain fixed locations in core as temporary locations, program counter, etc. These locations will be accessed by loading the MA using preset values supplied by the emit fields. The emit field is a field within the microprogram instructions which allows the programmer to provide fixed (immediate) data for program use.

It should be mentioned that there are hidden registers required by the microprogram control unit. A minimum of two registers would be required; one to hold the control information and one to hold the address of the next control storage word.

The data paths are bus connections into and out of the ALU. The emit data path allows the microprogram to load immediate data from the control storage through the ALU into any register. There would seem to be little room to simplify the data paths. The cost of the base hardware might be reduced by providing for serial transfers instead of parallel transfers of data.

The ALU is assumed to have the ability to perform the logical AND, as well as the complement operation. In addition, the output of any bit of the ALU may be tested and branched upon by the microprogram. It is assumed that only one bit of the result may be tested at a time. With the hardware described above, a shift would be performed by loading the RR register with all ones (from the emit field), loading the word to be shifted into the MB register, testing each bit of the MB (by passing the MB through the ALU), and then, after each test, performing a logical AND between RR and the proper emit word from control storage and storing the result in RR. The final step would be to transfer the RR register to the MB register. An example of a program to do a 1 bit left shift is given in Table 3.3.1.

Programming using the base hardware set of Figure 3.3.1 would be tedious to say the least. The microprogram to implement the shift operation would be lengthy and would include many locations of emit (immediate) data. Thus, the savings in base hardware would be partially absorbed by an increase in control storage.

The testing which is available allows the microprogram to perform a branch conditioned upon the status of any bit of the ALU output. However, only one bit of the ALU output may be tested at a time. If a shift capability were provided with the ALU, only one bit of the ALU would need to be tested, since the other bits could be tested by shifting the desired bit to the test position. Such a test setup would increase the size of the microprograms.

The ALU could be replaced by a table look-up facility. The table location could be specified by a field in the microinstruction and the contents of the MB and RR registers. Such a setup might reduce the base cost, but again the base cost is being reduced by increasing the storage requirements, since added storage would be required to hold the look-up tables. The ALU could also use different sets of basic operations. The ALU might implement the logical OR, as well as the complement. So the number of possibilities for a "minimal" base hardware set is rather extensive.

Now let's consider the question of an insufficient base hardware set. Such a condition could occur in two ways. First, the hardware may not have the capability to allow the implementation of one or more program type operations. That shall be called an insufficient hardware set. Second, the programmer may feel that to implement a particular operation on the base hardware is too tedious and too ridiculous

to attempt. The thought of performing a multiply on the base hardware set of Figure 3.3.1 is rather terrible to imagine.

If the base hardware set is insufficient, a change in the optimization algorithm is required. Two different problems may arise. First, some hardware option may be essential and thus should not be a hardware option but should be made part of the base hardware set. An essential hardware option is a hardware option which is required by each implementation method of a program type. Such a hardware option should be part of the base hardware. Second, if there are no essential hardware options, or if the essential hardware options have been removed and there are still program types which cannot be implemented using the base hardware set, the search algorithm in the optimization program must be modified. This means that there are program types which do not have a base implementation. This would occur if the programmer felt that it was unreasonably difficult (or slow) to implement the program type on the base hardware. The modification will simply require the skipping of any set of hardware options which does not have at least one feasible implementation for each program type operation. This could be effected by adding a dummy base hardware implementation which required 10^6 units of storage and 10^6 microcycles to execute.

[Figure 3.3.1. Minimum Base Hardware: core memory, registers, ALU, emit data path, microprogram control, and control storage.]

Assume registers are 4 bits.

        LOAD RR '1111'
        IF MB1 = 0 BR L5
L1      IF MB2 = 0 BR L6
L2      IF MB3 = 0 BR L7
L3      IF MB4 = 0 BR L8
L4      TRANSFER RR TO MB
        BRANCH TO NEXT PROGRAM
L5      AND RR WITH '1101'      BR TO L1
L6      AND RR WITH '1011'      BR TO L2
L7      AND RR WITH '0111'      BR TO L3
L8      AND RR WITH '1110'      BR TO L4

Left Shift 1, End Around
Table 3.3.1
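The microprogram of Table 3.3.1 can be checked by simulation. The Python sketch below assumes MB1 denotes the least significant bit of the 4-bit word; under that reading (an assumption, not stated in the report) the table's masks perform a 1-bit end-around left shift, using nothing but "test one bit" and "AND with an emit mask" as the base hardware allows.

```python
# Mask applied when MB bit i is 0: it clears that bit's destination
# under a 1-bit end-around left shift (bit i moves to bit i+1, bit 4
# wraps around to bit 1).  Bit 1 is taken as the least significant bit.
MASKS = {1: 0b1101, 2: 0b1011, 3: 0b0111, 4: 0b1110}

def rotate_left_via_masks(mb):
    rr = 0b1111                      # LOAD RR '1111'
    for i in (1, 2, 3, 4):
        if not (mb >> (i - 1)) & 1:  # IF MBi = 0 ...
            rr &= MASKS[i]           # ... AND RR WITH the emit mask
    return rr                        # TRANSFER RR TO MB

for mb in range(16):
    expected = ((mb << 1) | (mb >> 3)) & 0b1111  # true end-around shift
    assert rotate_left_via_masks(mb) == expected
```

Four tests, up to four ANDs, and a transfer for a single 1-bit shift: the simulation makes vivid why the text calls programming on this base hardware tedious.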

Chapter IV

OPTIMIZATION

Introduction

The equations of cost and performance have been developed in the preceding chapter. The goal of this chapter is to develop optimization procedures capable of determining the minimum cost hardware configuration given a desired performance goal. For example, if the average execution time per program type is not to exceed 3.7 usec, the optimization procedure will select the least expensive configuration, including control storage cost, commensurate with the 3.7 usec execution time.

The chapter will proceed as follows. First, the general problem and the magnitude of the optimization space will be discussed. Second, the general approach to the optimization will be discussed. Third, the optimization algorithm will be presented. Fourth, a special optimization technique which is part of the overall optimization will be discussed.

4.1 General Problem

The optimization problem is the selection of the least expensive hardware configuration, including control storage, which achieves a desired performance goal. The execution time (Equation 3.2.11) is the weighted average of the number of microcycles for each program type. Each program type execution time is weighted according to the frequency of execution of each program type. The hardware configuration cost includes the cost of control storage. Thus, each different implementation method for a program type can affect the hardware configuration cost in two ways. First, the hardware options, if any, required by the implementation method affect the total hardware option cost. Second, each implementation method requires a specific amount of control storage and thus affects the storage costs.

The space of possible implementations is very large. If each of 50 program types had only 2 implementation methods (a small estimate), the state space would contain on the order of 10^15 points. Thus, some efficient method of searching this very large space must be developed.

Before continuing the discussion of the optimization problem, the optimization variables will be discussed briefly. Figure 4.1.1 is the input to the optimization program, presented in an easily understood format.
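The 10^15 figure follows directly from counting selection vectors: with 2 implementation methods for each of 50 program types there are 2^50 combinations. A two-line check (the evaluation-rate figure is an illustrative assumption, not from the report):

```python
# Each of 50 program types contributes an independent binary choice.
n_configs = 2 ** 50
assert n_configs > 10 ** 15          # about 1.13 * 10^15 points

# Even at a (hypothetical) million cost-performance evaluations per
# second, exhaustive search would take more than 35 years.
assert n_configs / 1e6 / (3600 * 24 * 365) > 35
```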

The first section lists the hardware options, the cost of each hardware option, and the number of additional microorders required by each hardware option. A separate list would explain the function or functions performed by each hardware option, including an explanation of the required microbits. In the example of Figure 4.1.1 the 6th hardware option, which is a floating point add/subtract unit, costs 2000 units and requires three microbits, which are added to the 1st and 2nd word forms. The cost of a hardware option may depend upon the inclusion of other hardware options. For example (in Figure 4.1.1), hardware option 1 costs 300 units if hardware option 5 is not also chosen and 380 units if hardware option 5 is used with hardware option 1. The microbits required by option 1 (1 without option 5, 2 with option 5) are likewise dependent upon option 5. The value of w specifies how many different word forms (word encoding is discussed in Appendix C and Section 3.2) are used by the base hardware microinstruction word format. With multiple word forms, the microbits required by a hardware option are associated with one or more word forms. In the example (Figure 4.1.1) the microbits required for hardware option 6 are added to word forms 1 and 2.

The next section specifies the storage and time required to use microsubroutines. Using a microsubroutine means replacing the in-line implementation of a program type by a branch to a shared section of microcode. The use of microsubroutines to implement program types results in a reduction in control storage requirements but causes a reduction in

performance. The number of instructions required to initiate a branch to a microsubroutine and the number of instructions required to initiate a return from a microsubroutine are specified. These values are used to calculate the effect of implementing a program type as a subroutine instead of replacing each occurrence of the program type by an in-line section of microcode.

The remaining sections of Figure 4.1.1 contain information relative to each program type implementation. The first row gives the program type number, its frequency of occurrence, frequency of execution, and type. The frequency of occurrence specifies the number of times the program type appears in the program. The frequency of execution specifies the average number of executions for each occurrence. The type specifies whether the program type will be implemented as an in-line section of microcode (type 0) or as a microsubroutine (type 1). In Figure 4.1.1 program type 1 occurs 20 times in the program and each occurrence is executed an average of 5 times, for a total average execution of 100 times. It has type 0 and will be implemented with in-line sections of microcode.

Under each program type heading is a section labeled implementation, time, storage. These sections contain the information pertinent to each implementation method. The time is the average execution time (microcycles) of each implementation method. The storage is the number of control words required by each implementation

method. The strings of 1's and 0's to the right of the time and storage values specify the hardware options required by each implementation method. In Figure 4.1.1 the third implementation method of program type 2 has a time of 7 microcycles, requires 10 words of control storage, and uses hardware options 1, 2, 4, and 7.

The general optimization problem is to select one implementation method for each program type so that certain cost performance criteria are satisfied. The number of program types specified for PTL is approximately 120. As was pointed out earlier, the number of possible combinations is very, very large. An exhaustive search of the total implementation space is simply out of the question.

The cost function is non-linear. The cost of a hardware option may be dependent upon the choice of other hardware options. In addition, if two or more implementations for different program types use the same hardware options, the cost of those hardware options appears only once. Thus, the choice of an implementation method will induce only a control storage cost if the hardware options required by the current implementation method were already required by the implementations of other program types chosen earlier.

The general problem has been outlined above. The method used to efficiently select the optimum hardware configuration will be presented in the next section.
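To make this shared-option cost structure concrete, here is a minimal sketch in Python (the data layout, the option costs, and the per-word storage cost are illustrative assumptions, not values taken from the report). Each hardware option is charged once no matter how many chosen implementations require it, while every implementation contributes its own control storage words:

```python
# Hypothetical cost of a set of chosen implementations.  Each entry is
# (control words, set of required hardware options); option costs are
# charged once across the whole configuration, storage per word.
def configuration_cost(chosen_impls, option_costs, word_cost):
    options = set()
    words = 0
    for storage, opts in chosen_impls:
        options |= opts          # shared options are counted only once
        words += storage         # every implementation adds its own words
    return sum(option_costs[o] for o in options) + words * word_cost

# Three program types; the first two both use option 2, paid for once.
impls = [(3, {1, 2}), (10, {2, 4}), (30, set())]
costs = {1: 300, 2: 400, 4: 500}
print(configuration_cost(impls, costs, 10))  # 1200 + 43*10 = 1630
```

This is why the cost function is non-linear: the marginal cost of an implementation depends on which options earlier choices have already paid for.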

HARDWARE OPTIONS                         COST   MICROBITS   WORD FORM
1  Test for Zero (without option 5)       300       1          S1
1  Test for Zero (with option 5)          380       2          S1
2  Count by One                           400       1          S2
3  Fast Local Storage                     450       0          S3
4  Double Byte Read/Write                 500       2          S1, S3
5  Full Word ALU                         1000       0          S2
6  Floating Add/Subtract                 2000       3          S1, S2
7  Floating Multiply                     4000       1          S3

Word forms: w = 3

SUBROUTINE REQUIREMENTS
Words to Branch  2
Words to Return  1

Prog. Type - 1   Freq. Occ. - 20   Freq. Exec. - 5   Type - 0
                                   7654321
Implementation   Time   Storage
      1           3.5      3       1000100
      2           4.0      4       0100000
      3           5.0      5       0000000

Prog. Type - 2   Freq. Occ. - 10   Freq. Exec. - 6   Type - 0
Implementation   Time   Storage
      1           5.0     15       1110101
      2           6.5     10       1101110
      3           7.0     10       1001011
      4           8.0      9       0000000

Prog. Type - 3   Freq. Occ. - 30   Freq. Exec. - 3   Type - 1
Implementation   Time   Storage
      1          20.0     40       1100000
      2          25.0     35       1000100
      3          30.0     30       0000000

Figure 4.1.1 Input Data For Optimization Program
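The option strings of Figure 4.1.1 can be decoded mechanically. A small sketch (the string-based representation is an assumption): each string is read under the 7654321 header, with the highest-numbered option leftmost:

```python
# Hypothetical decoder for an implementation's hardware-option string,
# read with the highest-numbered option leftmost (header "7654321").
def required_options(bits):
    n = len(bits)
    return {n - i for i, b in enumerate(bits) if b == "1"}

# Implementation 3 of program type 2 in Figure 4.1.1:
print(sorted(required_options("1001011")))  # [1, 2, 4, 7]
```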

4.2 General Approach

The general approach is to solve a different problem. The implementation state space is very large because there are a large number of program types and each program type may have several implementation methods. If the number of hardware options is twenty or less, the number of combinations of hardware options (the hardware option state space) is at most about 10^6, which is at least a factor of 10^9 smaller than the implementation state space. Twenty hardware options for central processing units is a conservative estimate. It would be very nice to search the hardware option state space instead of the program type implementation state space. In the rest of the discussion the term implementation point will be used to describe a point in the implementation state space and hardware point will be used to describe a point in the hardware option state space.

An implementation point consists of one implementation method for each of the program types. An implementation point has a single cost, which is the cost of the hardware options, the hardware base, and the control storage required by the implementation methods. The above statement assumes the existence of frequency of execution values and frequency of occurrence values for each program type.

A hardware point consists of one of the possible combinations of hardware options. Each hardware point has a specific cost which is

the cost of the hardware options specified by the hardware point. Each hardware point may have many implementation points associated with it. It is possible to partition the implementation state space such that each partition contains the implementation points which require the hardware options associated with a particular hardware option state space point. The cost of each implementation point in a partition is the cost of the hardware options associated with the partition plus the cost of control storage for the particular implementation point. Thus, it is possible to compute bounds on large numbers of implementation points (those of a partition) by using the cost of a hardware point (associated with the partition) and a bound on storage cost.

It is also possible to bound the performance of each hardware point. The best performance which can be achieved under a particular hardware point can be easily calculated using the fastest implementation method of Figure 4.2.1. The method is simply one of picking, for each program type, the implementation with the smallest execution time which may be implemented using the hardware options of that hardware point.

Up to now it has been pointed out that the hardware option state space points partition the implementation state space points, and that a bound on the cost and performance of each hardware point may be calculated. The approach which will be taken in the optimization is to try to eliminate one point in the hardware option state space and thus many

[Flowchart: for each program type in turn, the implementations are scanned in order and the first one which can be implemented using the given hardware option set is selected; its execution time is added to the running total ET. Implementations are stored in order, minimum execution time first; one implementation for each program type requires no hardware options.]

Figure 4.2.1 Fastest Implementation Method
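The fastest implementation method of Figure 4.2.1 can be sketched as follows (the data layout is an assumption; as in the figure, implementations are stored fastest first, and every program type has one implementation needing no options, so the inner scan always succeeds):

```python
# Sketch of the fastest-implementation calculation: for each program
# type, take the first (i.e., fastest) implementation whose options
# are a subset of the chosen hardware point, and accumulate its
# execution time weighted by the program type's total executions.
def fastest_execution_time(program_types, hardware_point):
    total = 0.0
    for pt in program_types:
        for time, storage, options in pt["implementations"]:
            if options <= hardware_point:
                total += pt["executions"] * time
                break
    return total

# Program type 1 of Figure 4.1.1: 20 occurrences * 5 executions = 100.
pt1 = {"executions": 100,
       "implementations": [(3.5, 3, {3, 7}), (4.0, 4, {6}), (5.0, 5, set())]}
print(fastest_execution_time([pt1], set()))   # 500.0: only the no-option one fits
```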

points in the implementation state space. To be effective the elimination of a point in the hardware state space should occur without an examination of the implementation state space points associated with the hardware point. The optimization procedure will be presented in the next section. Before presenting the final optimization procedure, an exhaustive search technique will be explored. This is done to give insight into the final optimization method, since that method is a modification of the two exhaustive search procedures which will be presented below. Throughout the discussion it should be remembered that performance (Equation 3.2.12) is the reciprocal of the average execution time (Equation 3.2.11).

The methods to be explained below are exhaustive searches over a state space reduced from a size of 10^15 or greater to a size of 10^6 or less. The first procedure assumes that control storage is free. Thus, the only costs are hardware costs. This procedure is shown in Figure 4.2.2. The second procedure considers control storage costs and is shown in Figure 4.2.3.

The procedure of Figure 4.2.2 assumes that the only cost is hardware cost (control memory is free). Then for any fixed combination of hardware options the cost is fixed. Since the hardware cost is fixed, the selection of each program type implementation is dependent only upon the execution time of the implementation and its requirements

for hardware options. For a fixed set of hardware options, the implementation method for each program type should be the fastest that can be implemented under the given hardware options. Therefore, the fastest implementation methods for a fixed cost (the cost of the hardware options) have to be chosen.

The search method of Figure 4.2.2 examines every possible set of hardware options. It examines each hardware point and calculates the execution time using the method (fastest implementation) outlined above until a hardware point is found which satisfies the performance requirement. Once a hardware point which satisfies the performance requirement is found, that hardware point becomes the current hardware point. The current hardware point is that hardware point examined up to now which has the lowest cost of any hardware point meeting the performance requirement. After finding the first hardware point, the search is continued, but each new hardware point has its cost calculated before its performance. If the cost is greater than or equal to the cost of the current hardware point, the present hardware point is discarded and a new one is selected. If the cost is less than the cost of the current hardware point, the performance of the new hardware point is computed. If the performance is not acceptable the hardware point is discarded and a new one is selected. If the performance is acceptable the new hardware point becomes the current hardware point and a new point to test is selected. This process continues until all the

hardware points have been tested. The cost is tested before the performance after finding an initial solution because the hardware option cost can be calculated much more rapidly than the performance. Since many hardware points can be eliminated by cost, it seems reasonable to test the cost before the performance.

The procedure of Figure 4.2.2 assumes that control memory is free or that only one size of control memory is available. The problem becomes slightly more complicated if the cost of control memory is considered. The goal is still to eliminate hardware points, and thus many implementation points, without examining every member (set of implementations) of each partition. What is done in the method flowcharted in Figure 4.2.3 is to bound the cost of the implementation points in a partition and try to eliminate hardware points on the basis of the bounded costs. The performance test does not have to be modified from Figure 4.2.2, since the fastest implementation method still computes the minimum execution time which may be achieved under a particular hardware point. Thus, if the execution time is not satisfactory no other set of implementations which is feasible under the set of hardware options of the hardware point will be satisfactory.

The procedure of Figure 4.2.3 is similar to that of Figure 4.2.2. Every possible hardware option set is examined. Points from the hardware option space are selected and the fastest implementation is

[Flowchart: hardware option combinations HN are enumerated. Until a first feasible point is found, the performance PN of HN is computed and tested against the required performance before the cost CN is calculated. Thereafter the cost CN of each new HN is calculated first and compared with CCUR; only if CN < CCUR is the performance PN computed, and HN replaces HCUR when PN meets the required performance.]

Figure 4.2.2 Exhaustive Search Technique, Control Memory Free

computed until a set is found which satisfies the performance requirement. The cost of that implementation set (hardware point) is computed. The search is then continued, but the cost of the new point is first calculated and tested against the cost of the current hardware option set. The cost of the new point is the cost of the hardware plus the cost of storage. At this point, the optimal implementation point in the partition of implementation points defined by the hardware option point is not known. Therefore, the exact cost of control storage associated with the hardware option is not known. What is done is to use the lower bound on the number of words of control storage. If the cost of the new hardware point plus the lower bound on storage cost exceeds the cost of the current hardware option set, the hardware option set is discarded. If the lower bound on cost (CMT of Figure 4.2.3) is below the current cost, the fastest performance is calculated. If the performance does not satisfy the performance requirement (reciprocal of execution time), the hardware point is discarded. If the performance requirements are satisfied, the actual cost (CT of Figure 4.2.3) of the fastest implementation is calculated. If the cost is less than the cost of the current hardware point, the new hardware point replaces the current hardware point. However, the previous current hardware point may or may not be discarded. The previous current hardware point is discarded only if its lower bound on cost is greater than the cost of the new current hardware point. If this is not the case the previous

[Flowchart: as in Figure 4.2.2, but for each new hardware point HN the minimum cost CMT (hardware cost plus the cost of SMIN, the minimum length of control storage) is calculated and compared with the cost of the current point before the performance PN and the actual cost CT of the fastest implementation are computed. A superseded current point whose CMT does not exceed the new current cost is placed in QUE; when all hardware option combinations have been checked, FIND MINIMAL ELEMENT OF QUE selects the optimum.]

Figure 4.2.3 Exhaustive Search Technique

current hardware point is placed in a storage queue (called QUE). When a new current hardware point is found, each member of QUE is tested to see if it can be eliminated. A member of QUE is eliminated if its lower bound cost is greater than the cost (of the fastest implementation) of the current hardware point. This procedure is continued until all points in the hardware option space have been examined. Assume that QUE never overflows.

After examining all points in the hardware option state space there may be several hardware points in QUE. Each hardware point in QUE has a CMT cost which is below the CT cost of every element in QUE and a PN (fastest implementation) which is acceptable. The hardware points in QUE define a set of implementation points, and some technique must be developed to select the optimal implementation point. That function is represented by the box FIND MINIMAL ELEMENT OF QUE of Figure 4.2.3. The flowchart for the procedure is given in Figure 4.2.4.

The selection of an optimal implementation point (Figure 4.2.4) from the elements of QUE requires two procedures. One procedure is the slope selection algorithm, which selects from the implementation point partition defined by a hardware point that implementation point which achieves a required performance and has minimum cost. The second procedure eliminates the hardware points in QUE using the slope selection algorithm and conditional tests until only the optimal solution is left. The internal workings of the slope selection procedure

will be explained in the next section. The combined procedure (Figure 4.2.4) works as follows. Using the slope selection algorithm, the cost (CTB) of the optimal implementation set (ITB) of the first hardware point (HTB) in QUE is calculated. Remove the first hardware point in QUE. If QUE is empty the optimal solution has been found. If not, remove any element in QUE which has a CMT cost greater than CTB. If QUE is empty the optimal solution has been found. If not, compute the cost (CTB') of ITB' of HTB' (the new first hardware point in QUE). The better of the two implementations (either ITB or ITB') is selected. This procedure is then repeated until QUE is empty (using a combination of the CMT test and the CTB test). The remaining ITB is the optimal solution.

The optimization procedure of Figure 4.2.3 has been presented to aid the understanding of the optimization problem. The procedure used an exhaustive search technique, and although relatively fast, its performance can be improved using a modification of the Lawler-Bell algorithm. Both the Lawler-Bell algorithm and the slope selection algorithm will be discussed in the next section.
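The QUE elimination test can be sketched as follows (the entry layout is an assumption): an entry whose lower bound CMT already exceeds the fastest-implementation cost of the new current hardware point can never become optimal and is dropped:

```python
# Sketch of pruning QUE against a new current hardware point whose
# fastest-implementation cost is c_new: keep only entries whose
# lower-bound cost CMT could still beat c_new.
def prune_que(que, c_new):
    return [(cmt, point) for cmt, point in que if cmt <= c_new]

que = [(900, "A"), (1200, "B"), (1500, "C")]
print(prune_que(que, 1300))   # C's lower bound 1500 > 1300, so C is dropped
```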

[Flowchart: the slope selection algorithm computes CTB, ITB, and HTB for the first element of QUE, which is then deleted; any element of QUE with CMT > CTB is discarded; the slope selection algorithm is applied to the new first element to obtain CTB', ITB', and HTB', and the cheaper of ITB and ITB' is retained; the process repeats until QUE is empty, leaving the optimal solution ITB.]

Figure 4.2.4 Find Minimal Element of QUE
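The cost-before-performance search loop shared by the two exhaustive procedures of this section can be sketched as follows (the cost and performance functions here are stand-ins for the calculations described in the text, and the toy data are invented):

```python
# Sketch of the exhaustive search: enumerate hardware points, and once
# a feasible current point exists, apply the cheap cost test before
# the expensive fastest-implementation performance test.
def exhaustive_search(points, cost, performance, required_perf):
    current = None                     # (cost, point) of best feasible point
    for h in points:
        c = cost(h)
        if current is not None and c >= current[0]:
            continue                   # cannot beat the current point
        if performance(h) >= required_perf:
            current = (c, h)
    return current

# Toy data: four hardware points with made-up costs and performances.
costs = {0b00: 0, 0b01: 300, 0b10: 400, 0b11: 700}
perfs = {0b00: 0.1, 0b01: 0.2, 0b10: 0.3, 0b11: 0.4}
print(exhaustive_search(costs, costs.get, perfs.get, 0.2))  # (300, 1)
```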

4.3 Optimization Procedure

The optimization techniques (Figures 4.2.1, 4.2.2, and 4.2.3) presented in the last section converted a computationally infeasible problem (10^15 points) to one which can be handled (10^6 points) by present day computing systems. The speed of the optimization procedure of Figure 4.2.3, which is the general optimization procedure, can be further improved by applying a modification of the Lawler-Bell [L4] technique for selecting the elements of QUE. The optimization procedure flow chart is shown in Figure 4.3.1. Before presenting proof of the validity of the technique, the procedure will be described.

The technique is one of partial enumeration. The hardware option state space is assigned a numerical ordering using Equation 4.3.1. The space is searched from no hardware options (00...0) to all hardware options (11...1). The numerical ordering does not, in general, correspond to an ordering by cost. If such were the case, once an acceptable performance were found, the search could be stopped, since all remaining hardware option points would have a higher cost. Although the numerical ordering does not correspond to an ordering by cost, it does happen that many sub-groups of the numerical ordering have a useful cost ordering. In particular, it turns out that for each hardware point x it is possible to compute the next hardware point x'

which could be lower in cost than x. Thus, if x has an unacceptably high cost it is possible to skip all points between x and x'. In addition, if x has an acceptable cost, it is possible to compute the best possible performance of the points from x to but not including x'. If that performance is not acceptable the points between x and x' are skipped. Thus, the major difference between the flow charts of Figures 4.2.3 and 4.3.1 is the use of partial enumeration instead of searching the entire state space. The flow chart of Figure 4.3.1 shows only the hardware point selection. The selection of the optimum implementation point for a hardware point is discussed later.

In the optimization procedure of Figure 4.3.1 the first objective is to find a feasible solution. The no hardware option case is tested first. If the no hardware option performance is not acceptable, then the all hardware option case is taken. The all hardware option case is assumed to have acceptable performance, since no other set of hardware could produce a performance which is better. Thus, we assume that the required performance Pr is achievable (i.e., this is checked at input time). Then the hardware option state space is searched from the point (00...01), with a numerical order of 1, to the point (11...11), with a numerical order of 2^|H| - 1, where |H| is the number of hardware options.

The main loop starts at point A of Figure 4.3.1. The value SMIN (the minimum number of control storage words required by any implementation point) is used to calculate CMT for the current hardware

point. CMT is the minimum possible cost (cost of the hardware options plus the cost of SMIN words of control storage) for any implementation point defined by the current hardware option point. Although SMIN is fixed, the width of the control storage words is a function of the hardware options specified by a hardware point. The value of CT for a hardware point is the cost of the hardware options specified by the hardware point plus the cost of the control storage required by the fastest implementation point defined by that hardware option point.

The value of CMT is compared to the value of C. The value of C is the minimum CT value of any hardware point examined so far which had an acceptable performance. If CMT is greater than or equal to C, a jump in the numerical order to the next point which could be lower in cost is taken. The next point is computed by COUNT(X, XP, END). The routine accepts the value (numerical ordering) of the current point X and then calculates a new value for X which is the next point in the numerical ordering that could have a cost less than the cost of X. The number of points which are skipped can be considerable. If XP is exceeded, COUNT returns to END.

If the CMT cost of the current hardware point is less than C, the routine COUNTS is called. This routine calculates IN, which is the next hardware point (in numerical order) that could have a cost less than that of X (i.e., the same point which would have been calculated by COUNT), and the hardware option point (IT) which contains the hardware

options used by X and by all the hardware points between X and IN in the numerical ordering. Then the optimal performance (P) of IT is calculated. If P is not acceptable the current point is replaced by IN, skipping any points between X and IN. If P is acceptable then the optimal performance P of X is calculated and compared against the required performance. If the optimal performance of X is not acceptable, the next hardware point after X (X+1) is chosen and a branch to the start of the loop (A) is taken. If P of X is acceptable, CT of X is calculated and the values of X, CT, S (the storage required by the fastest implementation point of X), and K (the microbits required by X) are placed in the queue (QUE) of possible optimums. The value of C is set equal to CT if CT is less than the previous value of C. Then the next hardware point after X (X+1) is chosen and a branch to the start of the loop (A) is taken.

The procedure of Figure 4.3.1 uses a modification of the Lawler-Bell algorithm. The modifications of the algorithm are extensive, since two of the three selection rules must be completely changed. Therefore, the proof of the modified algorithm will be given from the start even though part of the discussion parallels that of the Lawler-Bell paper [L4].

Before proceeding some essential preliminaries must be discussed. The value of n(x) given by Equation 4.3.1 introduces a numerical ordering on the hardware option state space. It is also

[Flowchart: X is a pointer into the numerical order of hardware points; |H| is the number of hardware options; XP = 2^|H| - 1; Pr is the required performance; SMIN is the minimum storage required by any implementation point. The no-option and all-option points are tested first via PERF, COST, and INSERT. The main loop (A) computes CMT from COST(X, CX, K) and SMIN; if CMT >= C, COUNT(X, XP, END) skips ahead; otherwise COUNTS(X, IN, IT, XP, END) and PERFS(IT, P) test the span, PERF(X, S, P) tests X itself, and feasible points are entered in QUE by INSERT(X, CX, CT, S, K, C) before X is advanced to X+1.]

Figure 4.3.1 Optimization Procedure

PERF(X, S, P): S and P are computed; P is the best performance for X, and S is the storage required by X to achieve P.

PERFS(IT, P): P is computed; P is the best performance for IT.

COUNT(X, XP, END): The algorithm of Figure 4.3.2 is used to compute a new value of X from the current X. Return to END occurs if X > XP.

COUNTS(X, IN, IT, XP, END): The algorithm of Figure 4.3.2 is used to compute the value of IN from X; IT = IN - 1. Return to END occurs if IN > XP.

COST(X, CX, K): CX and K are calculated; CX is the hardware option cost of X, and K is the number of microbits for hardware option point X.

INSERT(X, CX, CT, S, K, C): The current hardware point X and its hardware option cost CX, total cost CT, storage S, and microbits K are stored in the QUE of possible optimums. The value returned, C, is the best cost of any item in QUE.

Figure 4.3.1 (Continued)

possible to consider a vector partial ordering between elements of the hardware option state space. Each element x of the hardware option state space is represented as a binary vector. Two vectors x and y have a vector partial ordering if the relationship of Equation 4.3.2 is satisfied. For example x < y if x = (100) and y = (101), but if w = (100) and z = (001) there is no vector partial ordering between w and z.

4.3.1  Let x = (xN, xN-1, ..., x1); then n(x) = 2^(N-1) xN + ... + 2^1 x2 + 2^0 x1

4.3.2  x < y if xi <= yi for i = 1, 2, 3, ..., N

A vector partial ordering implies a numerical ordering (Equation 4.3.3), but not vice versa (Equation 4.3.4).

4.3.3  x < y  implies  n(x) < n(y)

4.3.4  n(x) < n(y)  does not imply  x < y

However, there is a very interesting relation between numerical ordering and vector partial ordering. Assume that the binary N-vectors between x0 = (00...0) and x(2^N - 1) = (11...1) are ordered by the numerical ordering of Equation 4.3.1. It is very simple to compute the next binary vector x' (in numerical order) after any vector x

which is not in a partial ordering relation to x; that is, x' is the first vector following x for which the relation x < x' does not hold. The algorithm for computing x' from x is given below (Figure 4.3.2). The algorithm is the one used by COUNTS (Figure 4.3.1), where x is X and x' is IN.

1. Subtract 1 from x.
2. Logically OR x with x - 1 to obtain x' - 1.
3. Add 1 to obtain x'.

Figure 4.3.2 Algorithm for Calculating x' from x

There are three procedures in the optimization which must be justified. First, if the minimum cost of hardware option point x is greater than the current cost, the hardware option points between x and x' (x' is computed from x using the algorithm of Figure 4.3.2) are skipped. Second, if the optimum performance achieved by the union of the hardware options of x and the hardware options of the hardware points between x and x' does not satisfy the performance requirements, the hardware option points between x and x' are skipped. The third procedure is used to select the minimum possible cost for a particular hardware point (i.e., minimize storage costs).

The first two will be proven now, and are related to the Lawler-Bell algorithm. The third is a new optimization method and will be

explained separately. The proof of the first procedure requires the definition of the minimum cost of a hardware point, Cm(x). The value Cm(x) is the one represented as CMT in the flow chart of Figure 4.3.1. The minimum cost Cm(x) is essentially a lower bound: the cost of the hardware options specified by x, plus the base cost, plus the minimum control storage cost. The minimum number of words of control storage is constant for all hardware points and is the sum of the minimum length storage for each set of program type implementations. The width of each control word varies from hardware option point to hardware option point. The width is the sum of the base microbits and the number of microbits required to control each hardware option of the hardware option state space point.

Definition 4.3.1 Minimum Hardware Point Cost, Cm(x): The minimum cost of a hardware point is the cost of the base hardware and hardware options specified by hardware option point x, plus the cost of the minimum possible length of control storage.

Lemma 4.3.1 If x and y are two hardware option points and there is a partial ordering of x and y such that x < y, then Cm(x) < Cm(y).

Proof: Observe that if H(x) is the hardware option cost of x, and if x < y, then H(x) < H(y). Observe next that if CS(x) is the minimum storage cost of x, and if x < y, then CS(x) <= CS(y). Hence

Cm(x) = H(x) + CS(x) + cb < H(y) + CS(y) + cb = Cm(y)

where cb is the base hardware cost.

Definition 4.3.2 Hardware Option Span, h(x, x'): The hardware option span is the representation of the hardware options specified by n(x), n(x+1), n(x+2), ..., up to but not including n(x').

Definition 4.3.3 Best Performance, P(x): The best performance P(x) gives the value of the optimum performance which may be achieved using the hardware option set represented by x.

Lemma 4.3.2: Let n(x') > n(x) and given a desired performance P'; if P(h(x,x')) < P' then for

each x'' such that n(x) < n(x'') < n(x'), P(x'') < P'.

Proof: The hardware option set represented by x'' is a subset of the hardware option set specified by h(x,x'). But the addition of a hardware option can never reduce performance. Therefore P(x'') < P'.

Definition 4.3.4 Next Point, NP(x): The next point function NP(x) takes the representation x of a hardware option set and calculates the representation x' of another hardware option set such that

1) n(x) < n(x');
2) the relation x < x' does not hold;
3) no x'' exists such that a) n(x) < n(x'') < n(x') and b) the relation x < x'' does not hold.

The next point function is calculated by the algorithm specified in Figure 4.3.2. The next point function is used to skip sets of hardware options (those between x and x') if certain conditions are satisfied. The conditions are given below and they are the rules used in the optimization procedure which is presented in Figure 4.3.1.
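Treating each hardware point as an integer bit-vector, the next point function of Definition 4.3.4 and Figure 4.3.2 reduces to one line; the check below confirms that every skipped point dominates x in the vector partial ordering, which is what justifies the skip (a sketch; the variable names are illustrative):

```python
# Next point function of Figure 4.3.2: x' = (x | (x - 1)) + 1.
def next_point(x):
    return (x | (x - 1)) + 1

# x <= y in the vector partial ordering iff every bit of x is in y.
def dominates(y, x):
    return y & x == x

x = 0b100                 # a hardware point using only option 3
xp = next_point(x)        # 0b1000: points 101, 110, 111 are skipped
print(bin(xp))            # 0b1000
print(all(dominates(y, x) for y in range(x + 1, xp)))  # True
```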

The value Pr used in Rule 4.3.2 and Theorem 4.3.2 is the required performance. The value C used in Rule 4.3.1 and Theorem 4.3.1 is the cost of the minimum cost hardware point of those hardware points examined so far that had acceptable performance: the cost of the hardware options plus the cost of control storage for the fastest implementation point specified by the minimum cost hardware point.

Rule 4.3.1 If Cm(x) >= C, skip to hardware option x' = NP(x).

Rule 4.3.2 If P(h(x,x')) < Pr, skip to hardware option x' = NP(x).

Theorem 4.3.1 The use of Rule 4.3.1 will eliminate only nonoptimal hardware option points.

Proof: By Definition 4.3.4 any hardware option point x'' between n(x) and n(x') satisfies the partial ordering relation x < x''. By Lemma 4.3.1, Cm(x'') > Cm(x) >= C. Therefore there is no hardware point between n(x) and n(x') which would cost less than the current lowest cost C.

165 Theorem 4.3.2: The use of Rule 4.3.2 will eliminate only nonoptimal hardware option points. Proof: By Lemma 4.3.2, for each x″ such that n(x) ≤ n(x″) < n(x'), P(x″) < Pr. Therefore any such x″ would have a performance which is less than the minimum acceptable performance. Both Rule 1 and Rule 2 skip sets of hardware option points without examining any of the members of the set. Rule 1 uses a cost criterion and Rule 2 uses a performance criterion. With Rule 1, the set of hardware points between n(x) and n(x') is skipped if the minimum cost Cm(x) of x is greater than the cost C of the best hardware point the optimization has currently found. The jump from x to x' is valid since all hardware points between n(x) and n(x') have a minimum cost at least as great as Cm(x), if x' is calculated using the algorithm of Figure 4.3.2. The algorithm of Figure 4.3.2 identifies in an efficient manner sections of the hardware option point space which are ordered by cost. With Rule 2, the set of hardware points between n(x) and n(x') is skipped if the best performance of any of the hardware option points specified by n(x), n(x+1), ..., n(x'-1) is not an acceptable performance. The test used by Rule 2 calculates a bound on the performance of the

166 hardware points between n(x) and n(x'). If the bounded performance is not acceptable, none of the hardware points between n(x) and n(x') will have an acceptable performance. The second optimization technique is used to obtain the minimal cost set of implementations for a hardware point which is known to have a feasible (in terms of performance) set of implementation methods. The problem can be stated as follows. Given a hardware point and a set of the fastest implementations that are feasible (achieve the desired performance goals), it may be possible to reduce the cost of the solution (reduce storage) and still satisfy feasibility. Figure 4.3.3 is an example of such a problem. For a fixed hardware point (01011) the minimum execution time implementations are checked. The resulting cost is the cost of the hardware plus the cost of the control storage. The hardware cost is fixed since the hardware options are fixed, but the cost of control storage could be reduced. The storage required for a program type implementation can be reduced if there are implementations which can be realized using the current set of hardware options but require less storage than the checked implementation. Such implementations are marked with an x in Figure 4.3.3. Thus, storage can be saved, but at the cost of performance, since implementations which require additional microcycles are being utilized. Since control storage is available in unit sizes, the reduction

167 [Figure 4.3.3: Fastest Implementation and Alternates for Hardware Point X = 01011. The table lists, for each program type P1 through P4, its implementation methods with their execution time, storage requirement, and the hardware options H1 through H5 each requires. For each program type the fastest implementation feasible under X = 01011 is marked with a check, and feasible alternates that require less storage are marked with an x.]

168 in storage should reduce the total storage requirements in integral units. The goal will be to achieve the desired reduction in control storage with a minimum reduction in performance. It may be the case that there is no solution which saves a page of control storage and still achieves the desired performance goal. The problem is shown graphically in Figure 4.3.4. The fastest set of implementations is represented by p1. The horizontal dotted line represents the maximum acceptable execution time. The vertical dotted line through p2 represents the reduction in storage required to reduce the storage cost by one page unit. Each of the line segments from p1 to p2 represents the reduction in storage and the resulting increase in execution time if the fastest implementation for a program type is replaced by another implementation which requires less storage but also increases execution time. The p2 point of Figure 4.3.4 is the cost performance point resulting from the reduction in control storage of one page. The graphical illustration of Figure 4.3.4 suggests an optimization technique. Let Δt^j_lk represent the increase in execution time and Δs^j_lk the reduction in control storage which result if the lth implementation method for the jth program type is replaced by the kth implementation method for the jth program type. It seems reasonable to select among alternatives the program type j with the smallest Δt^j_lk/Δs^j_lk, where l is the fastest feasible implementation for the jth

169 program type and k is allowed to range over all other feasible implementations for the jth program type. The lth implementation for the jth program type would be replaced by the kth implementation for the jth program type. Then the program type (call it m) which had the second smallest Δt/Δs value would be chosen. This procedure would continue until either the desired savings in control storage is achieved or the performance is degraded below an acceptable level. The method is indeed optimal if there is only one replacement candidate for each program type. If there are two or more candidates the method will not guarantee an optimal solution. The counterexample is shown graphically in Figure 4.3.5. What occurs is that for program type 1, implementation 3 ((1, 3) in Figure 4.3.5) has a slightly worse slope than implementation 2 ((1, 2)), but it saves more total storage. Thus, one additional implementation must be replaced if implementation (1, 2) is used, and the resulting average slope is worsened. What is done is to use a slope selection procedure but to look at second, third, etc., differences. For each program type, the best first difference slope Δt^j_ik/Δs^j_ik is calculated between the fastest feasible implementation (call it i) for the jth program type and the other feasible implementations. If implementation k has the best Δt^j_ik/Δs^j_ik, it is the implementation which would cause the least reduction in performance per unit savings in control storage for that program type.

170 [Figure 4.3.4: Graphical Representation of Cost Performance Tradeoff. Execution time (cycles/operation) is plotted against cost (storage); a horizontal dotted line marks the maximum acceptable execution time, and line segments lead from the fastest point p1 down one page of storage to the point p2. Figure 4.3.5: First Difference Selection Example. The same axes show replacement candidates such as (1, 2), (1, 3), and (3, 3) between p1 and p2.]

171 After the best first difference for a program type is found, the best second difference is calculated. The procedure is the same as that for the first difference except that Δt^j_km/Δs^j_km is calculated: the fastest implementation (i) is replaced by implementation k in the Δt and Δs calculations. The third difference is then calculated, etc. The calculations stop when there are no additional implementations for the jth program type which could cause a savings in control storage. The flow chart for the slope selection method is shown in Figure 4.3.6. The first step is to calculate the first, second, etc., differences for each program type. Each difference forms a triple (Δt^j_ik/Δs^j_ik, Δt^j_ik, Δs^j_ik) which is placed in a queue denoted as Q in the flow chart of Figure 4.3.6. The values in Q are ordered with the first element in Q having the smallest Δt/Δs value, the second element in Q having the second smallest Δt/Δs value, etc. This means that the first difference for program type j will occur before the second difference for program type j in Q. However, the second difference for program type j could have a smaller Δt/Δs value than the Δt/Δs value for the first difference of program type m, and would then occur before the first difference of program type m in Q. The rest of the optimization is elegantly simple and is accomplished by taking the top item out of Q and keeping a running sum on dT and dS. The variables dT and dS are the increase in

172 execution time and the reduction in control storage achieved after each iteration of the procedure of Figure 4.3.6. If one or more pages of storage can be saved without increasing the execution time above that required, a cost reduction has been effected. If dT degrades the execution time before dS reaches a value which saves a page of storage, no cost reduction can be achieved and the fastest set of implementations is selected for that particular hardware point. The Δs values generally represent a savings of several words of control storage. Therefore, the value of dS (after several implementation replacements) would be expected to exceed slightly the value required to reduce the control storage by exactly an integral number of pages. The optimization procedure of Figure 4.3.6 avoids that problem by using a fractional part of the storage saved by the last replacement implementation such that dS does represent an integral number of control storage pages. The value of dT is also appropriately modified. Such a procedure is reasonable since a mixture of the fastest implementation and the replacement implementation could be used to very nearly achieve the desired savings in control storage. The use of first, second, third, etc., differences provides an automatic selection process once the differences have been calculated, stored, and ordered in a queue. The first difference for a program type will always exhibit a better Δt/Δs slope than the second difference,

173 [Figure 4.3.6, Slope Selection Method (flow chart). Here a_ij denotes the jth implementation method for the ith program type. The chart proceeds: get the ith program type implementation set, denoted β^i; discard from β^i any a_ij not feasible under hardware point X; select from β^i the a_ij with the fastest performance and call it B'; set β^i = β^i - B'; delete from β^i any a_ij which requires an equal or greater amount of storage than B'; if β^i is null, set i = i + 1 and repeat until i exceeds the number of program types; otherwise compute Δt, Δs, and Δt/Δs between each element of β^i and B'; select the element with the smallest Δt/Δs and call it B″; place Δt/Δs, Δt, Δs for B″ in Q; set β^i = β^i - B″ and B' = B″; and continue. The chart is continued on the next page.]

174 [Figure 4.3.6, Slope Selection Method (continued). The chart orders Q by Δt/Δs, smallest first, and sets dT = dS = 0. Using the top item of Q it forms dT = dT + Δt and dS = dS + Δs, deletes the top item from Q, and tests the sums against T_max and S_req. When dS reaches S_req, the last replacement is applied only fractionally: after backing out the last Δt and Δs, dT is increased by Δt(S_req - dS)/Δs so that dS equals exactly S_req.]
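In modern terms, the final boxes of Figure 4.3.6 (backing out the last replacement and re-applying only the fraction of it needed to reach S_req) can be sketched as follows; the function and variable names are illustrative, not taken from the flow chart:

```python
def page_adjust(dT, dS, dt_last, ds_last, s_req):
    """After the running sums dT, dS cross the page boundary s_req,
    back out the last replacement and re-apply only the fraction of it
    needed to save exactly s_req words.  This models a mixture of the
    fastest implementation and its replacement, as the text describes.
    """
    dT -= dt_last                      # back out the last replacement
    dS -= ds_last
    frac = (s_req - dS) / ds_last      # fraction of the last step needed
    dT += dt_last * frac               # charge only that fraction of time
    dS = s_req                         # savings now exactly one page unit
    return dT, dS

# e.g. sums dT = 10, dS = 120 after a last step of (dt = 4, ds = 50) with
# a page boundary of 100: only 30 of the last 50 words are needed, so
# only 30/50 of the last 4-cycle penalty is charged, giving dT near 8.4.
```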

175 the second difference Δt/Δs will be better than the third difference, etc. Therefore, the best replacement implementation for a program type will always be chosen first. The second best replacement implementation will only be chosen if the second difference for that implementation is at the top of the queue before the selection process is complete. If the second difference is at the top of the queue, the increase in execution time per unit of control storage reduction is less than for any remaining replacement implementation for other program types. The second difference for one program type may have a better Δt/Δs than the first difference for some other program type. Thus, the second best replacement implementation, not the best replacement implementation, should be and is chosen. This procedure continues until the execution time is increased above an acceptable level, or a cost reduction is achieved by eliminating N pages of storage. The proof of the slope selection optimization procedure requires that the problem statement be formalized slightly. Therefore, a series of definitions will be given, followed by the problem statement, followed by the optimization procedure (slope selection algorithm) and the proof that the slope selection algorithm performs as claimed. The slope selection algorithm forms the basis of the optimization procedure of Figure 4.3.6. The procedure works on the fastest implementation set (I^1) for a hardware option point. It determines

176 the implementation point which reduces the control storage S(I^1) by S_req + ΔS_req words while minimizing the increase in execution time. The slope selection algorithm is optimal only for a savings in storage of S_req + ΔS_req. The procedure generates a new implementation point by replacement of the fastest implementation for a program type by an implementation which reduces the control storage requirements. The ΔS_req is the number of control words which are saved in excess of S_req by the last implementation replacement made by the algorithm. Thus, ΔS_req will depend upon the particular hardware options, implementations, etc. The last step of the slope selection method flow chart of Figure 4.3.6 is to use a mixture of the fastest implementation method and its replacement implementation method to more closely equal S_req. The proof that follows ignores that extension and only proves the optimality of the slope selection algorithm. Def. 4.3.1 a_ij: a_ij is the jth implementation for the ith program type. The hardware options required by a_ij are specified by B^i_j, the performance by V^i_j, and the required storage by S^i_j. Note that one a_ij for each i requires no hardware options. Def. 4.3.2 D_i: D_i is the total number of implementations for the ith program type.

177 Def. 4.3.3 a: a is the set of program type implementations. a = {a_ij | i ∈ P, j ≤ D_i}. Def. 4.3.4 β: β is a set of program type implementations such that there is at least one a_ij ∈ β for each program type: β ⊆ a, and for each i ∈ P there is at least one a_ij ∈ β. Def. 4.3.5 β^i: The β^i set is a subset of the β set with each element of β^i an implementation method for the ith program type. β^i = {a_ij | a_ij ∈ β}. Def. 4.3.6 I set: An I set is a subset of β with one and only one element from each β^j set. The I set satisfies the following conditions: |I ∩ β^j| = 1 for all j, and I ⊆ β. Equations 4.3.1 and 4.3.2 are used to compute the difference in execution time and the difference in control storage if one program type implementation (a_ij) is replaced by another program type

178 implementation (a_ik). Equations 4.3.1 and 4.3.2 assume the existence of the frequency of occurrence vector Φ, the frequency of execution vector F, and the subroutine selection vector E. All three vectors are defined in Section 3.2. Equation 4.3.1: Δt^i_jk = F_i * Φ_i * (V^i_k - V^i_j). Equation 4.3.2: Δs^i_jk = [Φ_i * E_i + E_i] * (S^i_j - S^i_k). Def. 4.3.7 Difference Vector (i, j, k, Δt, Δs, Δt/Δs): The difference vector is a set of values computed on β^i, between a_ij (a_ij ∈ β^i) and the other elements of β^i. The algorithm for computing k, Δt, Δs, and Δt/Δs is given below. The algorithm computes that element of β^i (call it a_ik) which has the smallest Δt/Δs slope relative to a_ij, or if none exists k is set to -1. Algorithm: Step 1. Compute Δs^i_jl for each a_il ∈ β^i. Step 2. Eliminate any a_il for which Δs^i_jl ≤ 0.

179 Step 3. Compute Δt^i_jl for the remaining elements. Step 4. Eliminate any a_il for which Δt^i_jl < 0. Step 5. If there are no remaining elements, set k = -1 and go to Step 9. Step 6. Compute Δt^i_jl/Δs^i_jl for the remaining elements. Step 7. Select the element from those remaining which has the smallest Δt/Δs. If more than one element has the same Δt/Δs and it is the smallest value, select the one which has the largest Δs value. Call that element a_il. Step 8. Set k = l, Δt = Δt^i_jl, Δs = Δs^i_jl, and Δt/Δs = Δt^i_jl/Δs^i_jl. Step 9. Stop. Def. 4.3.8 I^1 set: The I^1 set is a subset of β such that: I^1 = {a_ij | V^i_j ≤ V^i_k for all a_ik ∈ β^i, and if V^i_j = V^i_k then S^i_j ≤ S^i_k}.
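The difference vector computation (Def. 4.3.7, Steps 1 through 9) can be sketched directly. The sketch below folds the vector quantities of Equations 4.3.1 and 4.3.2 into assumed scalar weights wt and ws; the dictionary data layout and the weight arguments are this sketch's assumptions, not the author's data structures.

```python
def difference_vector(impls, j, wt=1.0, ws=1.0):
    """Difference vector (Def. 4.3.7) for one program type.

    impls -- dict: implementation index -> (v, s), execution time v and
             control storage s, restricted to the feasible beta^i set
    j     -- index of the currently selected (fastest) implementation
    wt    -- time weight of Equation 4.3.1 (F_i * Phi_i), assumed 1 here
    ws    -- storage weight of Equation 4.3.2, assumed 1 here
    Returns (k, dt, ds, dt/ds), with k = -1 if no element saves storage.
    """
    vj, sj = impls[j]
    best = None
    for k, (vk, sk) in impls.items():
        if k == j:
            continue
        ds = ws * (sj - sk)        # Step 1: storage saved by choosing k
        dt = wt * (vk - vj)        # Step 3: execution time added by k
        if ds <= 0 or dt < 0:      # Steps 2 and 4: keep genuine tradeoffs
            continue
        key = (dt / ds, -ds)       # Step 7: smallest slope, ties broken
        if best is None or key < best[0]:   # by largest storage saving
            best = (key, k, dt, ds)
    if best is None:
        return (-1, 0.0, 0.0, 0.0)          # Step 5: no candidate exists
    (slope, _), k, dt, ds = best
    return (k, dt, ds, slope)               # Step 8
```

Applied to the P2 row of Figure 4.3.3, implementation 2 replaces implementation 1, saving 3 words at a cost of half a microcycle per execution.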

180 Slope selection algorithm: T_max is the maximum increase in execution time which can be accepted. S_req is the number of control storage words which must be saved. Step 1. Set i = 0. Step 2. i = i + 1. Step 3. If i > |P| go to Step 10. Step 4. Select a_ij ∈ I^1. Step 5. Compute the difference vector (i, j, k, Δt, Δs, Δt/Δs). Step 6. If k = -1 go to Step 2. Step 7. Place (i, j, k, Δt, Δs, Δt/Δs) into the queue Q. Step 8. Set j = k. Step 9. Go to Step 5. The preceding steps have calculated the 1st, 2nd, etc., differences for each β^i set. Step 10. Order the elements of Q with Q^1 the element (i, j, k, Δt, Δs, Δt/Δs) having the smallest Δt/Δs, Q^2 the second smallest, etc., using the notation Δt/Δs^1 = Δt/Δs of Q^1, Δt^1 = Δt of Q^1,

181 i^1 = i of Q^1, and Δt/Δs^1 ≤ Δt/Δs^2 ≤ Δt/Δs^3 ≤ ... Step 11. Copy set I^1 into set I^2. Step 12. dT = 0. Step 13. dS = 0. Step 14. l = 1. Step 15. dS = dS + Δs^l. Step 16. If dS ≥ S_req go to Step 23. Step 17. dT = dT + Δt^l. Step 18. If dT > T_max go to Step 27. Step 19. I^2 = I^2 - a_ij + a_ik, using the i, j, k of Q^l. Step 20. l = l + 1. Step 21. If l > |Q| go to Step 27. Step 22. Go to Step 15. Step 23. dT = dT + Δt^l. Step 24. If dT > T_max go to Step 27. Step 25. I^2 = I^2 - a_ij + a_ik, using the i, j, k of Q^l. Step 26. Success; stop. Step 27. Stop; it is not possible to save a unit page of storage. Def. 4.3.10 I^N: I^N is the I set generated after the Nth iteration of Step 19 of the slope selection algorithm.
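Steps 11 through 27 can be flattened into a single loop over the ordered queue. In the sketch below I^2 is represented only by the list of replacements applied, and the explicit gotos of the algorithm become loop control; this is an illustration, not the author's program.

```python
def slope_select(queue, s_req, t_max):
    """Steps 10-27 of the slope selection algorithm, flattened.

    queue -- difference vectors (i, j, k, dt, ds, slope) from Steps 1-9
    s_req -- control storage words that must be saved
    t_max -- maximum acceptable increase in execution time
    Returns the replacements (i, j, k) that turn I1 into I2, or None
    when a page of storage cannot be saved within t_max (Step 27).
    """
    dT, dS, applied = 0.0, 0.0, []
    for i, j, k, dt, ds, slope in sorted(queue, key=lambda q: q[5]):
        dS += ds                     # Step 15: storage saved so far
        dT += dt                     # Steps 17 and 23: time added so far
        if dT > t_max:               # Steps 18 and 24: unacceptable slowdown
            return None
        applied.append((i, j, k))    # Steps 19 and 25: replace a_ij by a_ik
        if dS >= s_req:              # Step 16: enough storage saved
            return applied           # Step 26: success
    return None                      # Step 27: queue exhausted
```

Because the first difference of a program type always has a smaller slope than its second, sorting the queue guarantees the replacements for each program type are applied in order, as the text requires.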

182 Def. 4.3.11 S(I^N): The function S(I^N) specifies the number of words of control storage required by the implementation set I^N. Def. 4.3.12 T(I^N): The function T(I^N) specifies the number of microcycles required by the implementation set I^N to execute a complete sequence of program types (the number of executions is specified by the F vector and the Φ vector). Def. 4.3.13 ΔT(I^N): ΔT(I^N) = T(I^N) - T(I^1). Def. 4.3.14 ΔS(I^N): ΔS(I^N) = S(I^1) - S(I^N). Lemma 4.3.3: If ΔS(I^N) ≤ ΔS(I'), I' ⊆ β, I^N ⊆ β, then ΔT(I^N) ≤ ΔT(I'). Proof: I^N and I' are constructed from I^1 by replacing implementation methods of I^1 by implementation methods which reduce S(I^1). Therefore any of the replacement implementations are represented in the ordered queue Q produced by the slope selection algorithm. Let X be the set of elements of Q (Q^1, Q^2, etc.) which would be used to construct I'.

183 Then ΔT(I') = the sum of Δt over X, and ΔS(I') = the sum of Δs over X. Let X^N be the first N elements of Q (Q^1, Q^2, ..., Q^N). Let X_a = X^N ∩ X, let X_b = X^N - X_a, and let X_c = X - X_a. Let t_a and s_a be the sums of Δt and Δs over X_a; t_b and s_b the sums over X_b; and t_c and s_c the sums over X_c. Then ΔT(I^N) = t_a + t_b and ΔS(I^N) = s_a + s_b.

184 Also ΔT(I') = t_a + t_c and ΔS(I') = s_a + s_c. Now: 1) t_a/s_a ≤ Δt^N/Δs^N since X_a ⊆ X^N; 2) t_b/s_b ≤ Δt^N/Δs^N since X_b ⊆ X^N; 3) t_c/s_c ≥ Δt^N/Δs^N since X_c ∩ X^N is empty. If ΔS(I^N) ≤ ΔS(I'), then s_a + s_b ≤ s_a + s_c, so s_b ≤ s_c. Since t_b/s_b ≤ t_c/s_c by 2) and 3), and s_b ≤ s_c, it follows that t_b ≤ t_c. Therefore ΔT(I^N) = t_a + t_b ≤ t_a + t_c = ΔT(I'), and hence T(I^N) ≤ T(I').

185 Theorem 4.3.3: Given a β set and values S_req and T_max, the slope selection algorithm will generate an I set (I^2), if one exists, such that 1) T(I^2) - T(I^1) ≤ T_max; 2) S(I^1) - S(I^2) = S' ≥ S_req; 3) no I^3 (I^3 ≠ I^2) exists such that S(I^1) - S(I^3) ≥ S' and T(I^3) - T(I^1) < T(I^2) - T(I^1). Proof: 1) Condition 1 is a direct result of Steps 18 and 24 of the algorithm. 2) Condition 2 is a direct result of Step 16 of the algorithm. 3) Condition 3 is a direct result of Lemma 4.3.3.

Chapter V HARDWARE SELECTION EXAMPLE In this chapter a complete design example will be given. The example begins with the presentation of a base architecture, a set of hardware options, and their associated cost functions. Next the implementation methods for a selected set of program types are given. Finally, the optimization program is applied to the data developed in the first part of the chapter. As a demonstration of the flexibility of the program, cost performance curves will be developed for different values of control storage bit cost, control storage page size, and frequency of execution. 186

187 5.1 The Design Method The design procedure is outlined below. Step 1. A base set of hardware is provided. Step 2. A set of hardware options and their associated cost functions is provided. Step 3. A set of implementations for each PTL program type is developed. One implementation requires only the base hardware, while any additional program type implementations use one or more hardware options to either improve the performance or reduce the storage requirement. Step 4. Statistics on a set of PTL programs are gathered. The statistics specify the expected frequency of occurrence and the expected frequency of execution of each program type. Step 5. The optimization program is used to develop the optimal cost performance points. The input to the optimization program is a value for required performance and the information developed in Steps 1 through 4. The output from the optimization program specifies the least costly set of hardware options and the implementations which will achieve the desired performance goals.

188 5.2 Base Hardware and Hardware Options A block diagram of the base hardware is shown in Figure 5.2.1. The main components of the base hardware are the microprogram control unit, the core memory, the memory address register, the memory buffer register, the high speed scratch pad memory, the arithmetic logic unit, and their associated data paths. The features of each of the base hardware units are summarized in Table 5.2.1. The base hardware set requires a 20 bit microinstruction word length. There are four basic word forms, which are named the MEM, ALU, BRANCH, and I/O-EMIT words. The different instruction formats and their associated microorders are given in Table 5.2.2. Examples of some possible microinstructions are given in Table 5.2.3. Some special considerations in using the base microinstructions are listed in Table 5.2.4. A block diagram of the base hardware combined with the hardware options is shown in Figure 5.2.2. The hardware options include features which provide additional testing options, shifting options, and floating point operations. The features of each of the hardware options are summarized in Table 5.2.5. The cost function for each hardware option is given in Table 5.2.6. The costs of the hardware options are interdependent, and this fact is reflected in the cost functions of Table 5.2.6. The additional microbits required for each

189 hardware option are listed in Table 5.2.7. An explanation of the values listed in Tables 5.2.6 and 5.2.7 will be given below on a hardware option by hardware option basis. 1) BR=: The cost of hardware option 1 depends upon the inclusion or the exclusion of options 4, 5, and 7. The effects of including either or both of options 4 and 7 with option 1 can be considered in the cost functions of options 4 or 7. Option 5 specifies the addition of a second input bus. Therefore, if option 5 is added, extra circuitry will be required to select among the two possible input buses and the cost of option 1 increases. The additional microbits required by option 1 are used to specify the addresses of data and to control the option. If option 5 is specified, an additional microbit is required to select among the input buses. 2) ALU TEST: The cost of hardware option 2 depends upon the inclusion or exclusion of hardware option 8, which is the 16 bit ALU. If a 16 bit ALU is chosen instead of the 8 bit ALU, the cost of the test circuitry will increase.

190 There are no additional microorders required if hardware option 8 is chosen, since no additional tests are performed and no new data need be specified. 3) COUNT-1: The cost of hardware option 3 depends upon the inclusion or exclusion of hardware option 5. Option 5 specifies the addition of a second input bus. Therefore, if option 5 is added, extra circuitry will be required to select among the two possible input buses. The added microbits required by option 3 are used to input data to the device and to control the counting. If option 3 and option 5 are both specified, an additional microorder is required to specify which input bus contains the data. However, because of field encoding, no additional microbits are required. 4) BR><: The cost of hardware option 4 depends upon the inclusion or exclusion of options 1, 5, and 7. However, only the costs of options 1 and 5 need be considered, since the effects of including option 7 can be considered in the cost function of option 7. If option 5 is specified, but not option 1, then the cost

191 of option 4 is increased to cover the added circuitry to select among the two possible (option 5 specifies an extra input bus) input buses. If option 1 is specified then the cost of option 5, if specified, will have been covered by the cost function for option 1. However, it is assumed that options 1 and 4 could be combined into a single package. Therefore, if option 1 is specified the cost of option 4 is less since part of the cost has previously been absorbed in the cost of option 1. If option 1 is included with option 4 the microbits required to address data have already been assigned by option 1, otherwise 8 microbits for addressing are required. If option 5 is specified but option 1 is not specified an additional microbit for data bus selection is required. 5) Input Bus: The cost of hardware option 5 depends upon the inclusion or exclusion of option 7. The cost of selecting between the two buses is absorbed by the other hardware options. However, hardware option 7 specifies that the BR=, and BR >< are to be a full word test. Therefore, if option 7 is specified the width of the input bus must be doubled. If option 5 is

192 specified additional microbits are required to specify the addresses of data. Since the additional data bus is used for word formats, ALU (S2) and BRANCH (S3), the microbits must be added to both S2 and S3. 6) Shift: The cost of hardware option 6 depends upon the inclusion or exclusion of option 8. The shift unit will operate upon 16 bit words instead of 8 bit words if option 8 (full word ALU) is used. Thus the cost of option 6 will increase if option 8 is also specified. The additional microbits required by option 6 are used to select the type of shift and to specify the number of places to shift. If option 8 is specified one extra microbit is required to allow the specification of the longer shift. 7) Double Byte: The costs of options 1, 4, 5, and 7 are related. The cost of options 1 and/or 4 increases if option 7 is specified since the number of bits to be tested doubles. If option 5 is specified the cost of selecting the input data bus must be included for the additional 8 bits. The cost resulting from option 5 for a half word (8 bits) has been included in the

193 cost of options 1 and 4. Thus, the cost for option 7 depends upon options 1, 4, and 5, and for each combination it is the difference between the actual cost and the cost specified for options 1 and 4. The inclusion of option 7 does not cause a requirement for any added microorders. 8) ALU: The cost of hardware option 8 is independent of the inclusion or exclusion of any other hardware option. The cost of the ALU is the difference between the cost of the standard 8 bit ALU and the optional 16 bit ALU. Thus the cost of hardware option 8 is the exchange cost. No additional microbits are required to control the larger ALU. 9) STACK: The cost of hardware option 9 is increased if hardware option 5 is specified, since additional circuitry is required to select the data bus. The additional microbits are required to control the input to and the output from the stack. If option 5 is specified, an additional microbit is required to select the proper input buses.

194 10) Float A/S: The cost of hardware option 10 is independent of the inclusion or exclusion of any other hardware option. To simplify the presentation of the data an exclusion relationship has been applied. This relationship prohibits hardware points which contain certain combinations of hardware options. For example, the second input bus (option 5) is not allowed if none of options 1, 3, 4, or 9 is included: if the second bus is not connected to anything, it should not be allowed. Such a situation would automatically be eliminated by the optimization program, but including the exclusion relation allows the optimization to be speeded up. The other exclusion relation specifies that option 7 (double byte) is not allowed if neither option 1 nor option 4 is included. Since option 7 only applies to options 1 and 4, such a restriction is necessary. The two exclusion relations are listed below. 1) (not 1), (not 3), (not 4), 5, (not 9). 2) (not 1), (not 4), 7.
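The two exclusion relations can be checked mechanically. The sketch below encodes each relation as "these trigger options require at least one of these consumer options", which is this reading's assumption about the intent of the relations; the function and constant names are illustrative.

```python
def allowed(x, exclusions):
    """Check a hardware point against the exclusion relations.

    x          -- set of included hardware option numbers
    exclusions -- list of (required_any, trigger): the point is excluded
                  when all 'trigger' options are included but none of
                  'required_any' is.
    """
    for required_any, trigger in exclusions:
        if trigger <= x and not (required_any & x):   # subset / overlap tests
            return False
    return True

# Relation 1: the bus option 5 requires at least one of options 1, 3, 4, 9.
# Relation 2: the double byte option 7 requires at least one of options 1, 4.
EXCLUSIONS = [({1, 3, 4, 9}, {5}),
              ({1, 4}, {7})]
```

For example, a point including only option 5 is excluded, while a point including options 1 and 5 passes both relations.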

195 [Figure 5.2.1: Base Hardware Configuration. Block diagram showing the microcontrol unit, the core memory (MEM), the MA and MB registers, the scratch pad memory S1, the ALU, and their associated data paths.]

196 [Figure 5.2.2: Base Hardware Plus Hardware Options. Block diagram of the base hardware of Figure 5.2.1 with the optional functional units, including the FLOAT A/S unit, attached.]

197 MICROCONTROL UNIT: The microcontrol unit controls the ALU, controls the gating of data within the CPU, controls the gating of data between the CPU and memory, and controls the memory. The microcontrol instructions are executed sequentially, but conditional branch tests are provided which may alter the execution sequence. MEM: The core memory contains up to 32 K by 16 bit words and is addressable by words. The memory can be read or written in 3 microcycle time units. MA: The MA register is 16 bits wide and specifies the location of core memory to be read or written. MB: The MB is 16 bits wide and contains the contents of core (read) or the data to be written. S1: The high speed scratch pad memory contains 64 by 16 bit words. The memory can be read and written in one microcycle time unit. The memory is addressable in a section by halfwords with 8 words per section. The section address is held in a microcontrol register and is set under microcontrol. Base Hardware. Table 5.2.1

198 ALU: The ALU unit operates on 8 bit words (1/2 words) from S1. The unit can perform AND, OR, EOR, ADD, ADD+1, ADD + CARRY, SHIFT LEFT 1, SHIFT LEFT 1 WITH CARRY, SHIFT RIGHT 1, SHIFT RIGHT 1 WITH CARRY, DECIMAL ADD, DECIMAL ADD WITH CARRY, DECIMAL SUBTRACT, and DECIMAL SUBTRACT WITH CARRY. The carry is set by the arithmetic and shift operations and is unchanged until the next arithmetic or shift operation. The contents of the BI bus may be complemented before being passed to the ALU input. The carry and the 8 ALU output bits are latched and can be tested under microcontrol. B1, B2: These data paths are 8 bits wide. B3: This data path is 16 bits wide. Base Hardware. Table 5.2.1 (Cont.)

199 [Table 5.2.2, pages 199-202: Microinstruction Formats and Microorders. The table gives the field layouts of the four 20 bit microinstruction word forms (MEM, ALU, BRANCH, and I/O-EMIT) and the microorders assigned to each field.]

203 Transfer S1 location A to MA and start a memory read. MA ← B(A) MEMR. Transfer MA to MB and start a memory cycle. MB ← MA MEMC. Add S1 location A to S1 location B and store in S1 location C. RB3(C) ← RB2(A) + RB1(B); LB3(C) ← RB2(A) + C + RB1(B). Complement S1 location A and store in S1 location A. RB3(A) ← RB1(A)'; LB3(A) ← LB1(A)'. Branch to L1 if S1 location A has a 1 in bit 2. RB1(A) BRA1 2 L1. Branch to L1 if S1 location A has a 0 in bit 9. LB1(A) BRA0 1 L1. Using the Emit field, zero the odd bit positions of S1 location A. RB3(A) ← RB2(A) ∧ AA16; LB3(A) ← LB2(A) ∧ AA16. Microinstruction Examples. Table 5.2.3

204 i) The address specified in an ALU operation or an emit operation must be in the current S1 section. ii) The B1 and B2 bus specifications for ALU operations must specify the same half of their respective addresses. The B3 bus half may be different from the B1, B2 bus half. INCORRECT: RB3(C) ← RB1(A) + LB2(B). CORRECT: RB3(C) ← RB1(A) + RB2(B). CORRECT: RB3(C) ← LB1(A) + LB2(B). iii) The transfer of a full word from S1 to MA or MB, or from MA or MB to S1, for any S1 address, may be accomplished with a MEM microinstruction as shown below. MA ← B(A). v) The value of the result of the last ALU operation may be tested using the BR microinstructions as shown below. N specifies the bit position to test. BRA1 N L1. Base Hardware Programming Considerations. Table 5.2.4

205 1. BR=: The branch on equal option tests two words of S1 (8 bit or 16 bit depending upon option 7). If the words are equal, a branch in the microprogram occurs. 2. ALU TEST: The result of an ALU operation is tested for negative, zero, positive, or all one's. The test result is stored and may be used in a branch test after the completion of the current microcycle. The test status is stored until the next test microorder is issued. 3. COUNT-1: The option is used for microloops. A number is stored into the device and counted down under microcontrol. A microbranch is caused when the count goes to 0. 4. BR><: The branch >< option tests two words of S1 (8 bits or 16 bits depending upon option 7). A branch in the microprogram occurs if the test is satisfied. 5. INPUT BUS: A second dual bus is provided from S1 to the hardware option functional units. Summary of Hardware Options. Table 5.2.5

206

6. SHIFT: The output from the ALU is shifted before placing the results on the output bus. The shift unit will be either an 8 bit or a 16 bit shift unit depending upon the choice of ALU's (option 8). The unit performs either a left or a right N place shift (end around, carry out, or arithmetic) under microcontrol.

7. Double Byte: This option specifies whether options 1 and 4 will be an 8 bit (half word) or 16 bit (full word) test.

8. ALU: The 8 bit base hardware ALU unit is replaced by a 16 bit ALU. Both ALU's perform the same arithmetic and logic functions.

9. STACK: This is a full word (16 bit) 1024 word stack. An item may be placed into or removed from the stack in 1 microcycle time unit.

10. Float A/S: This unit performs a floating add or subtract operation, in one microcycle time unit, between any two double words (32 bit) of scratch pad memory S2. The scratch pad memory (S2) is part of the option.

Summary of Hardware Options (Cont.)
Table 5.2.5

207

Hardware Option    Cost       Constraint
Number
1                  100.00     5
                   110.00     5
2                  150.00
                   200.00     8
3                  200.00     5
                   210.00     5
4                  210.00     1
                   250.00     1,5
                   260.00     1,5
5                  300.00     7
                   550.00     7
6                  350.00     8
                   600.00     8
7                  200.00     1,4,5
                   100.00     1,4,5
                   230.00     1,4,5
                   210.00     1,4,5
                   110.00     1,4,5
                   250.00     1,4,5
8                  600.00
9                  700.00     5
                   710.00     5
10                 1200.00

Hardware Option Costs
Table 5.2.6

208

Hardware Option    Microorders    Microbits    S word       Constraint
Number
1                  1              9            S3           5
                   2              10           S3           5
2                  3              2            S2
3                  2              2            S2           5
                   3              2            S2           5
4                  2              2            S3           1
                   2              12           S3           1,5
                   3              13           S3           1,5
5                  0              8            S2,S3
6                  5              6            S2           8
                   5              7            S2           8
7                  0              0
8                  0              0
9                  2              2            S2,S3,S4
                   3              3            S2,S3,S4
10                 5              8            S2,S3

Added Microbits
Table 5.2.7

209

5.3 Cost Performance Curves

Using the base hardware and the hardware options described in Section 5.2, a set of implementation methods was developed. The list of program types for which implementations were developed is given in Table 5.3.1. Typically, the full set of implementations is not developed. The designer develops the implementations for the base hardware. By using the set of base hardware implementations, it is possible to specify the changes in storage requirements and execution time over those of the base implementation which will occur if different hardware options are applied to an individual program type. That approach was used to develop the implementation values of Table 5.3.2. The base implementations are listed in Appendix B.

In this section a number of design studies will be performed using the optimization program. The design studies will investigate the effects upon the optimal cost performance curve which occur when various parameters are varied. The parameters which will be studied are control storage bit cost, control storage page size, and program type frequency of execution. The cost performance curves will be developed using the optimization program described in Chapter IV and the values for the

210

hardware option costs and constraints of Table 5.2.6, the added microbits and constraints of Table 5.2.7, the implementation time, storage, and hardware options of Table 5.3.2, the frequency of execution and frequency of occurrence values of Table 5.3.3, and the base hardware information of Table 5.3.4. The performance plotted on each graph is the average execution time (Equation 3.2.11) and the cost is the cost of the base hardware, hardware options, and control storage. The cost performance studies which are presented below are broken up into several parts. The explanation of each set of curves will be given before the curves are presented.
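The two quantities plotted on the curves can be sketched in modern terms. The sketch below is an illustrative paraphrase, assuming Equation 3.2.11 is a frequency-of-execution weighted mean and using the base hardware values of Table 5.3.4 (20-bit control words, 128-word pages, $0.10 per bit); the function names are mine, not taken from the optimization program.

```python
import math

def control_storage_cost(words_used, page_size=128, word_bits=20, bit_cost=0.10):
    # Control storage is purchased in whole pages, so cost steps at page boundaries.
    pages = math.ceil(words_used / page_size)
    return pages * page_size * word_bits * bit_cost

def average_execution_time(times, frequencies):
    # Frequency-of-execution weighted mean over the program types (assumed
    # form of Equation 3.2.11).
    return sum(f * t for f, t in zip(frequencies, times)) / sum(frequencies)
```

With the standard values of Table 5.3.3 every frequency of execution is 1.0, so the average reduces to a simple mean; one 128-word page of 20-bit words costs 128 x 20 x $0.10 = $256.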

211

 1) S3 - S S2
 2) S3 - S S2
 3) S3 - ~S1
 4) S1 - S2 TR ADDR S3 LABEL
 5) S1 - SHIFT S2 CARRY OUT LEFT S3
 6) S1 - S1 + S2 INTEGER
 7) S1 - S1 - S2 INTEGER
 8) S1 INTEGER TO FLOAT S2
 9) S1 - STACK LABEL
10) STACK - S1 LABEL
11) CLEAR STACK
12) INCREMENT POINTER LINK LIST S1 LABEL
13) DECREMENT POINTER LINK LIST S1 LABEL
14) SET LINK LIST S1
15) EMPTY LINK LIST S1
16) REMOVE S1 LINK LIST S2 LABEL
17) INCR S1
18) DECR S1
19) DO LABEL S1
20) BR S1 = S2 LABEL
21) BR S1 < S2 LABEL

Program Types Used in Examples
Table 5.3.1

212

Program    Implementation    Time            Storage    Options
Type       Method            (Microcycles)   (Words)
1          1                 2               2          -
           2                 1               1          8
2          1                 2               2          -
           2                 1               1          8
3          1                 2               2          -
           2                 1               1          8
4          1                 11              9          -
           2                 9               7          8
           3                 9               7          4,7
           4                 8               6          4,7,8
           5                 7               5          4,5,7,8
5          1                 17              26         -
           2                 15              6          3
           3                 13              18         8
           4                 10              5          3,8
           5                 1               1          6
6          1                 4               5          -
           2                 3               4          8
           3                 2               2          -
           4                 1               1          2,8

Time, Storage, and Hardware Option Requirements of Different Implementation Methods for Program Types of Table 5.3.1
Table 5.3.2

213

Program    Implementation    Time            Storage    Options
Type       Method            (Microcycles)   (Words)
7          1                 4               5          -
           2                 3               4          8
           3                 2               2          -
           4                 1               1          2,8
8          1                 6.5             10         -
           2                 5.5             8          8
           3                 1               1          10
9          1                 8               10         -
           2                 6               8          8
           3                 1               1          9
10         1                 7               12         -
           2                 5               8          8
           3                 1               1          9
11         1                 3               3          -
           2                 2               2          8
           3                 1               1          9
12         1                 6               6          -
13         1                 9               9          -
           2                 7               7          8

Time, Storage, and Hardware Option Requirements of Different Implementation Methods for Program Types of Table 5.3.1
Table 5.3.2 (Continued)

214

Program    Implementation    Time            Storage    Options
Type       Method            (Microcycles)   (Words)
14         1                 7               7          -
15         1                 27              11         -
16         1                 20              23         -
           2                 17              20         8
17         1                 4               5          -
           2                 3               4          8
           3                 2               2          -
           4                 1               1          2,8
18         1                 4               5          -
           2                 3               4          8
           3                 2               2          -
           4                 1               1          2,8
19         1                 4               4          -
           2                 3               3          8
           3                 1               2          3
20         1                 4               6          -
           2                 3               3          8
           3                 2.5             3          1
           4                 2               2          2,8
           5                 1               1          1,7

Time, Storage, and Hardware Option Requirements of Different Implementation Methods for Program Types of Table 5.3.1
Table 5.3.2 (Continued)

215

Program    Implementation    Time            Storage    Options
Type       Method            (Microcycles)   (Words)
21         1                 6               11         -
           2                 5               10         8
           3                 2.5             3          4
           4                 1               1          4,7

Time, Storage, and Hardware Option Requirements of Different Implementation Methods for Program Types of Table 5.3.1
Table 5.3.2 (Continued)

216

Program    Frequency of    Frequency of
Type       Execution       Occurrence
1          1.0             3.0
2          1.0             3.0
3          1.0             3.0
4          1.0             3.0
5          1.0             3.0
6          1.0             3.0
7          1.0             3.0
8          1.0             3.0
9          1.0             3.0
10         1.0             3.0
11         1.0             3.0
12         1.0             3.0
13         1.0             3.0
14         1.0             3.0
15         1.0             3.0
16         1.0             3.0
17         1.0             3.0
18         1.0             3.0
19         1.0             3.0
20         1.0             3.0
21         1.0             3.0

Frequency of Occurrence and Frequency of Execution of the Program Types of Table 5.3.1
Table 5.3.3

217

Base Hardware Cost                  $10,000.00
Base Hardware Control Word Size     20 bits
Control Storage Page Size           128 words
Control Storage Bit Cost            $0.10

Base Hardware Values
Table 5.3.4

218

Standard Configuration

The first curve is for the standard configuration. The other cost performance curves are developed by varying one of the parameters used to generate the standard curve. The curve of Figure 5.3.1 represents the optimum cost performance curve for the standard values (Tables 5.2.6, 5.2.7, 5.3.2, 5.3.3, 5.3.4). The exact execution time and cost values and the required hardware options are listed in Table 5.3.5. The first point requires hardware option 2, the second hardware options 2 and 3, the third hardware option 8, and the fourth point requires hardware options 2 and 8. The difference in execution time between the point requiring options 2 and 3 and the point requiring option 8 is 0.6 microcycles, but the cost difference is only 50 dollars. This result occurs because option 8 is only slightly more expensive than options 2 and 3 combined, but option 8 can be used to improve the performance of almost all of the program types. The costs of the point requiring option 8 and that requiring options 2 and 8 are almost identical. This result occurs because the combination of options 2 and 8 saves enough control storage words to almost offset the cost of combining option 2 with option 8.

219

[Figure 5.3.1: Optimum Cost Performance Curve, Standard Configuration (Tables 5.2.6, 5.2.7, 5.3.2, 5.3.3, 5.3.4); performance (average microcycles) plotted against cost (dollars).]

220

PERFORMANCE            COST       HARDWARE OPTIONS
(AVERAGE MICROCYCLES)  (DOLLARS)
7.119                  11276.39   2
6.881                  11578.80   2,3
6.214                  11624.00   8
5.786                  11644.80   2,8
5.548                  11921.60   2,3,8
5.500                  12313.60   1,2,3,7,8
5.309                  12421.60   2,8,9
5.214                  12513.60   2,6,8
5.071                  12698.39   2,3,8,9
4.833                  13020.39   2,3,4,7,8,9
4.786                  13084.80   1,2,3,4,7,8,9
4.619                  13220.39   2,4,6,8,9
4.500                  13420.39   2,4,6,7,8,9
4.452                  13484.80   1,2,4,6,7,8,9
4.405                  13620.39   2,3,4,6,7,8,9
4.357                  13684.80   1,2,3,4,6,7,8,9
4.309                  14540.80   1,2,3,4,5,6,7,8,9
4.286                  14825.20   2,4,6,7,8,9,10
4.238                  14889.60   1,2,4,6,7,8,9,10
4.190                  15025.20   2,3,4,6,7,8,9,10
4.143                  15089.60   1,2,3,4,6,7,8,9,10
4.095                  15945.60   1,2,3,4,5,6,7,8,9,10

Optimum Cost Performance and Required Hardware Options for Standard Values (Tables 5.2.6, 5.2.7, 5.3.2, 5.3.3, 5.3.4)
Table 5.3.5

221

Page Size

The page size parameter was varied to generate the curves below. The curve of Figure 5.3.2 (Table 5.3.6) is for a control storage page size of 64 words and the curve of Figure 5.3.3 (Table 5.3.7) is for a control storage page size of 32 words. The curves of Figures 5.3.1, 5.3.2, and 5.3.3 are plotted together in Figure 5.3.4. The reduction in page size results in the inclusion of the no hardware option point in Figure 5.3.2 (Table 5.3.6) and Figure 5.3.3 (Table 5.3.7). The exclusion of the no hardware option case in Figure 5.3.1 (Table 5.3.5) occurred because the number of words of control storage required by the no hardware option case is just slightly over a page boundary for 128 word pages. The addition of hardware option 2 reduces the cost of control storage by one page (128 words), which more than offsets the cost of hardware option 2. In general, reducing the page size reduces the total cost for any set of hardware options.
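The page boundary effect described above can be illustrated numerically. The word counts 130 and 120 below are hypothetical, chosen only to straddle a 128-word boundary; the 20-bit control word at $0.10 per bit is taken from Table 5.3.4.

```python
import math

WORD_COST = 20 * 0.10  # 20-bit control word at $0.10/bit -> $2.00 per word

def storage_cost(words, page_size):
    # Control storage is bought in whole pages.
    return math.ceil(words / page_size) * page_size * WORD_COST

# Hypothetical case: a hardware option trims the microprogram from 130 to 120 words.
for page_size in (128, 64, 32):
    saving = storage_cost(130, page_size) - storage_cost(120, page_size)
    print(page_size, saving)
```

At 128-word pages the 10-word reduction releases a whole $256 page, more than the cost of option 2 in Table 5.2.6, while at 32-word pages it releases only $64; this is why the no-option point reappears on the smaller-page curves.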

222

[Figure 5.3.2: Optimum Cost Performance Curve, Control Storage Page Size 64 Words; performance (average microcycles) plotted against cost (dollars).]

223

PERFORMANCE            COST       HARDWARE OPTIONS
(AVERAGE MICROCYCLES)  (DOLLARS)
7.500                  11152.00
7.119                  11276.39   2
6.881                  11425.20   2,3
6.214                  11496.00   8
5.786                  11644.80   2,8
5.548                  11921.60   2,3,8
5.429                  12274.00   2,3,4,8
5.214                  12328.00   2,6,8
5.167                  12528.00   1,2,6,7,8
5.071                  12532.00   2,3,8,9
4.976                  12874.00   2,4,6,7,8
4.929                  12932.00   1,2,4,6,7,8
4.833                  13020.39   2,3,4,7,8,9
4.786                  13084.80   1,2,3,4,7,8,9
4.738                  13092.00   2,6,8,9
4.619                  13220.39   2,4,6,8,9
4.500                  13420.39   2,4,6,7,8,9
4.452                  13484.80   1,2,4,6,7,8,9
4.405                  13620.39   2,3,4,6,7,8,9
4.357                  13684.80   1,2,3,4,6,7,8,9
4.309                  14540.80   1,2,3,4,5,6,7,8,9
4.286                  14825.20   2,4,6,7,8,9,10
4.238                  14889.60   1,2,4,6,7,8,9,10
4.190                  15025.20   2,3,4,6,7,8,9,10
4.143                  15089.60   1,2,3,4,6,7,8,9,10
4.095                  15945.60   1,2,3,4,5,6,7,8,9,10

Optimal Cost Performance and Required Hardware Options, Control Storage Page Size 64 Words
Table 5.3.6

224

[Figure 5.3.3: Optimum Cost Performance Curve, Control Storage Page Size 32 Words; performance (average microcycles) plotted against cost (dollars).]

225

PERFORMANCE            COST       HARDWARE OPTIONS
(AVERAGE MICROCYCLES)  (DOLLARS)
7.500                  11088.00
7.262                  11256.00   3
7.119                  11276.39   2
6.881                  11425.20   2,3
6.214                  11432.00   8
5.786                  11644.80   2,8
5.548                  11844.80   2,3,8
5.500                  12220.80   1,2,3,7,8
5.429                  12274.00   2,3,4,8
5.214                  12328.00   2,6,8
5.071                  12448.80   2,3,8,9
4.976                  12771.60   2,4,6,7,8
4.929                  12832.80   1,2,4,6,7,8
4.881                  12971.60   2,3,4,6,7,8
4.738                  12992.80   2,6,8,9
4.690                  13192.80   1,2,6,7,8,9
4.619                  13220.39   2,4,6,8,9
4.500                  13420.39   2,4,6,7,8,9
4.452                  13484.80   1,2,4,6,7,8,9
4.405                  13620.39   2,3,4,6,7,8,9
4.357                  13684.80   1,2,3,4,6,7,8,9
4.309                  14540.80   1,2,3,4,5,6,7,8,9
4.238                  14758.39   1,2,4,6,7,8,9,10
4.190                  14890.80   2,3,4,6,7,8,9,10
4.143                  14958.39   1,2,3,4,6,7,8,9,10
4.095                  15782.39   1,2,3,4,5,6,7,8,9,10

Optimum Cost Performance and Required Hardware Options, Control Storage Page Size 32 Words
Table 5.3.7

226

[Figure 5.3.4: Optimum Cost Performance Curves, Control Storage Page Sizes 128, 64, and 32 Words; one curve per page size, performance (average microcycles) plotted against cost (dollars).]

227

Frequency of Execution

The curves developed in Figure 5.3.5 (Tables 5.3.5, 5.3.8, 5.3.9) and Figure 5.3.6 (Tables 5.3.5, 5.3.10, 5.3.11) explore the effect of varying the frequency of execution of different sets of program types. The curves of Figure 5.3.5 are developed by varying the frequency of execution of program types 1, 2, and 3 of Table 5.3.1. It is interesting to note that the curves for execution frequencies of 1.0, 3.0, and 10.0, although different in performance, are identical (point by point) in terms of cost and required hardware options. This result occurs because the three program types (1, 2, 3) all use the same hardware option. The curves of Figure 5.3.6 are developed by varying the frequency of execution of program types 19, 20, and 21 of Table 5.3.1. The curves of Figure 5.3.6 each contain different points in terms of required hardware options. The curve with the highest frequency of execution values for program types 19, 20, and 21 has significantly fewer optimum points than its two companion curves. Some of the optimum points of the other two curves have more options which can be applied to improve the performance of program types 19, 20, and 21. Therefore, those points tend to dominate as the frequency of execution of program types 19, 20, and 21 is increased.
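The way frequency weighting shifts a curve vertically without changing its cost axis can be seen in a small numeric sketch. The execution times below are hypothetical, not taken from Table 5.3.2; only the weighting behavior is being illustrated.

```python
def average_time(times, freqs):
    # Frequency-weighted mean (assumed form of Equation 3.2.11).
    return sum(f * t for f, t in zip(freqs, times)) / sum(freqs)

# Five hypothetical program types; the first three receive the scaled frequency.
times = [2.0, 2.0, 2.0, 5.0, 6.0]
for k in (1.0, 3.0, 10.0):
    freqs = [k, k, k, 1.0, 1.0]
    print(k, round(average_time(times, freqs), 3))
```

Because the weights do not enter the cost of a hardware option set, reweighting moves only the performance coordinate of each point; when the reweighted types all share one option, every point survives, which is the behavior reported for Figure 5.3.5.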

228

[Figure 5.3.5: Optimum Cost Performance Curves, Varying Frequency of Execution of Program Types 1, 2, 3; one curve each for frequencies of execution 1.0, 3.0, and 10.0.]

229

PERFORMANCE            COST       HARDWARE OPTIONS
(AVERAGE MICROCYCLES)  (DOLLARS)
5.981                  11276.39   2
5.796                  11578.80   2,3
5.055                  11624.00   8
4.722                  11644.80   2,8
4.537                  11921.60   2,3,8
4.500                  12313.60   1,2,3,7,8
4.352                  12421.60   2,8,9
4.278                  12513.60   2,6,8
4.167                  12698.39   2,3,8,9
3.981                  13020.39   2,3,4,7,8,9
3.944                  13084.80   1,2,3,4,7,8,9
3.815                  13220.39   2,4,6,8,9
3.722                  13420.39   2,4,6,7,8,9
3.685                  13484.80   1,2,4,6,7,8,9
3.648                  13620.39   2,3,4,6,7,8,9
3.611                  13684.80   1,2,3,4,6,7,8,9
3.574                  14540.80   1,2,3,4,5,6,7,8,9
3.556                  14825.20   2,4,6,7,8,9,10
3.518                  14889.60   1,2,4,6,7,8,9,10
3.481                  15025.20   2,3,4,6,7,8,9,10
3.444                  15089.60   1,2,3,4,6,7,8,9,10
3.407                  15945.60   1,2,3,4,5,6,7,8,9,10

Optimal Cost Performance and Required Hardware Options, Frequency of Execution of Program Types 1, 2, 3 = 3.0
Table 5.3.8

230

PERFORMANCE            COST       HARDWARE OPTIONS
(AVERAGE MICROCYCLES)  (DOLLARS)
4.240                  11276.39   2
4.135                  11578.80   2,3
3.281                  11624.00   8
3.094                  11644.80   2,8
2.990                  11921.60   2,3,8
2.969                  12313.60   1,2,3,7,8
2.885                  12421.60   2,8,9
2.844                  12513.60   2,6,8
2.781                  12698.39   2,3,8,9
2.677                  13020.39   2,3,4,7,8,9
2.656                  13084.80   1,2,3,4,7,8,9
2.583                  13220.39   2,4,6,8,9
2.531                  13420.39   2,4,6,7,8,9
2.510                  13484.80   1,2,4,6,7,8,9
2.490                  13620.39   2,3,4,6,7,8,9
2.469                  13684.80   1,2,3,4,6,7,8,9
2.448                  14540.80   1,2,3,4,5,6,7,8,9
2.437                  14825.20   2,4,6,7,8,9,10
2.417                  14889.60   1,2,4,6,7,8,9,10
2.396                  15025.20   2,3,4,6,7,8,9,10
2.375                  15089.60   1,2,3,4,6,7,8,9,10
2.354                  15945.60   1,2,3,4,5,6,7,8,9,10

Optimal Cost Performance and Required Hardware Options, Frequency of Execution of Program Types 1, 2, 3 = 10.0
Table 5.3.9

231

[Figure 5.3.6: Optimum Cost Performance Curves, Varying Frequency of Execution of Program Types 19, 20, 21; one curve each for frequencies of execution 1.0, 3.0, and 10.0.]

232

PERFORMANCE            COST       HARDWARE OPTIONS
(AVERAGE MICROCYCLES)  (DOLLARS)
6.574                  11276.39   2
6.463                  11326.39   3
6.167                  11578.80   2,3
5.648                  11624.00   8
5.241                  11644.80   2,8
4.907                  11921.60   2,3,8
4.796                  12313.60   1,2,3,7,8
4.685                  12430.39   1,2,3,4,6,7
4.630                  12478.80   2,3,4,8
4.611                  12530.39   1,3,4,7,8
4.426                  12678.80   2,3,4,7,8
4.315                  12730.39   1,2,3,4,7,8
4.056                  13020.39   2,3,4,7,8,9
3.944                  13084.80   1,2,3,4,7,8,9
3.926                  13420.39   2,3,4,6,8,9
3.833                  13484.80   1,2,4,6,7,8,9
3.722                  13620.39   2,3,4,6,7,8,9
3.611                  13684.80   1,2,3,4,6,7,8,9
3.574                  14540.80   1,2,3,4,5,6,7,8,9
3.556                  15025.20   2,3,4,6,7,8,9,10
3.444                  15089.60   1,2,3,4,6,7,8,9,10
3.407                  15945.60   1,2,3,4,5,6,7,8,9,10

Optimum Cost Performance and Required Hardware Options, Frequency of Execution of Program Types 19, 20, 21 = 3.0
Table 5.3.10

233

PERFORMANCE            COST       HARDWARE OPTIONS
(AVERAGE MICROCYCLES)  (DOLLARS)
5.740                  11276.39   2
5.240                  11326.39   3
5.073                  11578.80   2,3
4.781                  11624.00   8
4.302                  11644.80   2,8
4.031                  11850.39   1,2,3,4
3.927                  11921.60   2,3,8
3.365                  12080.39   1,2,3,4,7
3.073                  12430.39   1,2,3,4,6,7
3.031                  12530.39   1,3,4,7,8
2.865                  12730.39   1,2,3,4,7,8
2.656                  13084.80   1,2,3,4,7,8,9
2.469                  13684.80   1,2,3,4,6,7,8,9
2.448                  14540.80   1,2,3,4,5,6,7,8,9
2.375                  15089.60   1,2,3,4,6,7,8,9,10
2.354                  15945.60   1,2,3,4,5,6,7,8,9,10

Optimum Cost Performance and Required Hardware Options, Frequency of Execution of Program Types 19, 20, 21 = 10.0
Table 5.3.11

234

Control Storage Bit Cost

The curves of Figure 5.3.7 are developed by varying the cost of a bit of control storage. The cost is varied from 5 cents (Table 5.3.12) to 10 cents (Table 5.3.5) to 20 cents (Table 5.3.13) per bit. As would be expected, the curves are affected more drastically at the low cost end. As the cost of a bit of control storage increases, many of the optimum cost performance points drop out. This effect occurs because improved performance and a decrease in the length of control storage generally occur together. Thus, as the cost of control storage increases, marginal cost performance points are dropped.
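The dropping of marginal points can be expressed as a dominance filter over (cost, performance) pairs. The sketch below is a generic Pareto frontier filter, not the actual procedure of Chapter IV; raising the bit cost inflates the cost of storage-hungry configurations, so points that were on the frontier become dominated and disappear.

```python
def optimum_curve(points):
    # Keep only non-dominated (cost, avg_time) points: scanning in order of
    # increasing cost, a point survives only if it is faster than every
    # cheaper surviving point.
    frontier = []
    for cost, avg_time in sorted(points):
        if not frontier or avg_time < frontier[-1][1]:
            frontier.append((cost, avg_time))
    return frontier
```

Applied to candidate configurations priced at two different bit costs, the filter yields curves like Tables 5.3.12 and 5.3.13, with fewer surviving points at the higher bit cost.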

235

[Figure 5.3.7: Optimum Cost Performance Curves, Control Storage Bit Costs $.05, $.10, and $.20; one curve per bit cost, performance (average microcycles) plotted against cost (dollars).]

236

PERFORMANCE            COST       HARDWARE OPTIONS
(AVERAGE MICROCYCLES)  (DOLLARS)
7.500                  10640.00
7.119                  10713.20   2
6.881                  10964.39   2,3
6.738                  11015.60   6
6.214                  11112.00   8
5.786                  11222.39   2,8
5.548                  11460.80   2,3,8
5.500                  11756.80   1,2,3,7,8
5.429                  11864.39   2,3,4,8
5.214                  11956.80   2,6,8
5.167                  12156.80   1,2,6,7,8
5.119                  12195.20   2,3,6,8
5.071                  12199.20   2,3,8,9
4.976                  12464.39   2,4,6,7,8
4.929                  12535.20   1,2,4,6,7,8
4.833                  12585.20   2,3,4,7,8,9
4.786                  12662.39   1,2,3,4,7,8,9
4.738                  12695.20   2,6,8,9
4.619                  12785.20   2,4,6,8,9
4.500                  12985.20   2,4,6,7,8,9
4.452                  13062.39   1,2,4,6,7,8,9
4.405                  13185.20   2,3,4,6,7,8,9
4.357                  13262.39   1,2,3,4,6,7,8,9
4.309                  13990.39   1,2,3,4,5,6,7,8,9
4.286                  14287.60   2,4,6,7,8,9,10
4.238                  14364.80   1,2,4,6,7,8,9,10
4.190                  14487.60   2,3,4,6,7,8,9,10
4.143                  14564.80   1,2,3,4,6,7,8,9,10
4.095                  15292.80   1,2,3,4,5,6,7,8,9,10

Optimum Cost Performance and Required Hardware Options, Control Storage Bit Cost $.05
Table 5.3.12

237

PERFORMANCE            COST       HARDWARE OPTIONS
(AVERAGE MICROCYCLES)  (DOLLARS)
7.119                  12402.80   2
5.786                  12489.60   2,8
5.548                  12843.20   2,3,8
5.309                  13343.20   2,8,9
5.214                  13627.20   2,6,8
5.071                  13696.80   2,3,8,9
4.833                  13890.80   2,3,4,7,8,9
4.786                  13929.60   1,2,3,4,7,8,9
4.619                  14090.80   2,4,6,8,9
4.500                  14290.80   2,4,6,7,8,9
4.452                  14329.60   1,2,4,6,7,8,9
4.405                  14490.80   2,3,4,6,7,8,9
4.357                  14529.60   1,2,3,4,6,7,8,9
4.309                  15641.60   1,2,3,4,5,6,7,8,9
4.286                  15900.39   2,4,6,7,8,9,10
4.238                  15939.20   1,2,4,6,7,8,9,10
4.190                  16100.39   2,3,4,6,7,8,9,10
4.143                  16139.20   1,2,3,4,6,7,8,9,10
4.095                  17251.20   1,2,3,4,5,6,7,8,9,10

Optimum Cost Performance and Required Hardware Options, Control Storage Bit Cost $.20
Table 5.3.13

Chapter VI

CONCLUSION

The contributions of the research are both general and specific. The general contribution is the design method and the specific contributions are the elements which were developed to implement the design method. The design method which has been developed in this thesis has a number of desirable features. First, the design method is directed toward a CPU design environment which is rapidly coming into existence. Second, the design method is user directed. A very determined effort was made to let the user specify what functions are required for his computer applications. A large set of operations (called program types) was selected and combined into a language (PTL). The user writes his programs or translates his programs into PTL. It is then very easy to gather statistics on user requirements. Third, the design method makes optimum use of the human designer. The connection between the program types and the CPU model is the implementation methods. There are one or more implementation methods for each program type. Each implementation method specifies a CPU configuration. The human designer specifies the implementation methods and is concerned with performance-cost tradeoffs at the

238

239

program type level. The computer, guided by the optimization program, performs the performance-cost tradeoffs for the entire set of program types using the individual implementation methods supplied by the designer. The designer specifies several different configurations for each program type. The designer does not have to attempt to evaluate the combined effect of adding an optional piece of hardware on 100 or more program types. He can focus his attention on each program type one by one. The question of automatic or semi-automatic generation of program type implementations which require one or more hardware options using the base hardware implementation is discussed in Appendix D. The CPU model, PTL, the implementation concept, and the optimization techniques are all specific contributions. PTL is a machine independent intermediate language. The language is designed to meet two principal objectives. First, the language allows valid statistics to be gathered on the functional requirements of the computer user. Each program type is selected to provide a basic function which is a major building block of a highly used computer area. Many of the program type operations have the potential of direct implementation in hardware with an execution time of one microcycle. Thus, the language should allow valid statistics to be gathered on those basic operations (in terms of basic hardware)

240 the user wishes to perform. Second, the language is machine independent. The language meets the requirements of an UNCOL language. It is hoped that compilers, interpreters, etc. would be written in PTL and would produce code in PTL. Then a new computer could take advantage of existing compilers, interpreters, user programs, etc. by providing one new compiler which would be a PTL compiler. The need for an UNCOL-like language and the possible advantages of such a language have been discussed for years. However, for one reason or another, an UNCOL language has not been developed. It is time to provide an UNCOL-like language, and PTL is a first attempt at such a language. The CPU model is very flexible and has a number of good features. The CPU is modeled as a base hardware set plus a set of hardware options. The base hardware set defines a general architecture. Therefore, each design is restricted to a general architecture. However, the hardware options allow major variations of the design about the general architecture. The hardware options consist mainly of function units and the use of the hardware option concept in the model seems to be very realistic in view of the rapid acceptance of large and medium scale integration. The cost functions that the model uses are realistic. The cost of a hardware option is dependent upon the choice of the other hardware options as is the number of

241

microbits required to control a hardware option. The specification of the microinstruction word length is very flexible. The use of both field encoding and word encoding is modeled. The implementation method concept allows the hardware-software tradeoffs to be accomplished at the microcontrol level. The implementation methods could also be used as one input into a microprogram compiler. An additional contribution is the definition of vertical and horizontal microprogramming which is provided in Appendix C. There is currently no standard or well defined definition of either. The optimization technique involves three distinct parts. First, the use of bounding techniques allows the search procedure to be applied to the hardware option space (H space < 10^6) instead of the much larger implementation space. The bounding techniques use an upper bound on performance and a lower bound in the hardware option space. Second, a modification of the Lawler-Bell technique is combined with the bounding procedure to eliminate large sections of the hardware option space. Third, a slope selection technique was developed to select the optimal implementation set for any fixed hardware option set. The technique is faster than any with which we are familiar. There are a number of extensions which could be made to the

242 work presented here. First, the Program Type Language should be completed by allowing a users group to experiment with and make appropriate changes in the language. Second, a PTL compiler should be developed. The development of a compiler would help to answer many of the questions relevant to the UNCOL concept. Third, the working set idea relative to variables should be explored. Such a concept could be used to determine the optimal amount of local storage. The optimal amount of local storage cannot be determined using the design techniques developed in this dissertation. The optimal amount of storage does not depend upon the number of occurrences of each program type but depends upon the set of active variables and the change in the set of active variables. Thus if the local storage area is full and a new variable is required by the currently executing program type, a memory write and a memory read cycle will be required to acquire the variable. Fourth, a method to optimally select the form of the microinstruction word should be developed. The model which is used in this dissertation considers the cost of the microinstruction word but the form of the word is specified by the designer. It would be useful to develop a method which analyzes the tradeoffs in speed versus control memory size for different microinstruction word forms.
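One plausible reading of the slope selection idea from Chapter IV can be sketched as a greedy sweep. The assumptions here are mine: each program type offers (time, storage) implementation choices feasible under the fixed hardware option set, an upgrade costs extra control storage words and saves microcycles, and a storage budget bounds the sweep; the actual procedure may differ in detail.

```python
def select_implementations(program_types, storage_budget):
    # program_types: list of implementation lists; each implementation is a
    # (time, storage) pair, ordered so storage rises as time falls.
    # Greedy slope rule: start every type at its smallest-storage choice,
    # then repeatedly apply the upgrade with the largest
    # time-saved-per-extra-word slope that still fits the budget.
    choice = [0] * len(program_types)
    used = sum(impls[0][1] for impls in program_types)
    while True:
        best, best_slope = None, 0.0
        for i, impls in enumerate(program_types):
            j = choice[i]
            if j + 1 < len(impls):
                time_saved = impls[j][0] - impls[j + 1][0]
                extra_words = impls[j + 1][1] - impls[j][1]
                slope = time_saved / extra_words if extra_words > 0 else float("inf")
                if used + extra_words <= storage_budget and slope > best_slope:
                    best, best_slope = i, slope
        if best is None:
            return choice
        used += (program_types[best][choice[best] + 1][1]
                 - program_types[best][choice[best]][1])
        choice[best] += 1
```

Sweeping the budget from the minimum storage upward and recording (average time, storage cost) after each upgrade traces out one cost-performance tradeoff curve for the fixed hardware option set.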

APPENDIX A

Implementation of High Level Language Instructions in PTL and IBM 360 Model 25 Microcode

A FORTRAN DO loop, an APL submatrix selection, and a SNOBOL pattern match will be implemented in PTL and then converted into IBM 360 Model 25 (2025 processor) microcode. The appendix will proceed as follows:

1) The higher level instructions and equivalent PTL code will be given.
2) A list of the PTL program types used in 1) will be compiled.
3) A brief introduction to the IBM 360 Model 25 (2025 processor) will be given.
4) The IBM 360 Model 25 microcode required to implement the program types listed in 2) will be given.
5) The combined microcode required to implement the higher level instructions will be given.

243

244

A1 High Level Language Instructions and Equivalent PTL Instructions

FORTRAN DO LOOPS

INTEGER*2 I, J, S1

a)      J = 0
        DO 1 I = 1, S1
      1 J = J + I

equivalent PTL code

        J - '0'
        DO S1 L1
        J - J + I
L1      I - I + '1'

b)      J = 0
        DO 1 I = 2, S1
      1 J = J + I

equivalent PTL code

        J - '0'
        I - '2'
        S1 - S1 - '1'
        DO S1 L1
        J - J + I
L1      I - I + '1'

245

c)      J = 0
        DO 1 I = 2, S1, 2
      1 J = J + I

equivalent PTL code

        J - '0'
        I - '2'
L1      J - J + I
        I - I + '2'
        BR I < S1 L1

APL Submatrix Selection

a)      C - A/B

equivalent PTL code

        ADO L1 ARRAY B S1, S2 '12'
        BR L1 A(S2) = '0'
        BR L2 S2 ≠ '1'
        I - '0'
L2      I - I + '1'
        C(S1, I) - B(S1, S2)
L1      CONTINUE

246

b)      C - A/[1] B

equivalent PTL code

        ADO L1 ARRAY B S1, S2 '21'
        BR L1 A(S1) = '0'
        BR L2 S1 ≠ '1'
        I - '0'
L2      I - I + '1'
        C(I, S2) - B(S1, S2)
L1      CONTINUE

SNOBOL Pattern Match

a)      HERE S1 S2 :F(L1)

equivalent PTL code

HERE    MATCH S1 IN S2 'DUMMY' L1

b)      HERE S1 S2 :S(L1)

equivalent PTL code

HERE    MATCH S1 IN S2 'DUMMY' L2
        BR L1
L2      CONTINUE

247

A2 List of PTL Operations Used in A1

1) S1 - '0'
2) S1 - 'K'
3) S1(S2, S3) - S4(S5, S6)
4) INCR S1
5) S1 - S2 + '2'
6) S1 - S2 + S3
7) DO S1 S2 L1
8) ADO L1 ARRAY S1 S2, S3 'C'
9) BR S1 < S2 L1
10) BR S1 = '0' L1
11) BR S1 ≠ S2 L1
12) MATCH S1 IN S2 S3 L1

248

A3 IBM 360 2025 Processor (Model 25)

The IBM Model 2025 processor (11) is a microprogrammable central processing unit having limited capabilities. The ALU hardware is byte oriented and can perform 1 arithmetic or 1 logical operation in 1 main memory cycle time. The main memory cycle time is 0.9 usec and words are accessed in half word (16 bit) increments. The microcontrol storage, special storage areas, and program storage are all contained in main memory. Since microcontrol words are held in main memory, any microoperation which requires data from main memory takes two memory cycles. If the I/O portion of the 2025 is ignored, the CPU and its connection to main storage can be represented as shown in Figure A1. The M register, W register, and mod +2 hardware are used to sequence the microprogram. The sequencing of the microprogram can be handled concurrently with ALU operations. The ALU operates on data in the Local Store (LS). The ALU performs logical and arithmetic operations upon bytes, not half words. Thus the addition of two half words in LS requires the execution of two microinstructions. It is possible to route an LS half word (2 bytes) through the mod +2 hardware in one microcycle. The instruction types are explained and listed in Figure A2 and several examples of typical instructions are listed in Figure A3.

249

[Figure A1: The 2025 CPU and its connection to main storage.]

250

Word Type    Major Type
0            Set/Reset
             i) set/reset S0 for the + function
             ii) set mode for local store addressing
1            Arithmetic Constant
             An arithmetic operation between a byte of LS and a 4 bit constant (within the control word) is performed. The result is written back into the same LS byte or placed on the Z bus. The constant can be used as the high, low, or both 8 bits and would be represented as K88, K08, K80, etc. Operations include +, AND, XOR, OR, -, COMP AND.
2            Storage Word
             i) Read (write) a half word or byte into or out of main core. Main core (not AUX STORE or CONTROL STORE) can be addressed indirectly (LS).
             ii) Update indirect address (LS).
             iii) Move half words from LS to LS.
3            Move/Arithmetic
             An arithmetic operation is performed between two bytes of LS and stored back into one of the two; the Z conditions are also set.

Word Types
Figure A2

251

4            Branch Unconditional
             A branch to another part of control store is executed with the return address stored in LS zone 4 (I register).
5            Branch on Mask
             i) Does a 4, 8, or 16 way branch upon a test of bits (mask), either a dynamic test or an LS byte.
             ii) Does a 2 way branch upon a test of an LS byte for all 0's.
6/7          Branch on Condition Word
             Tests a single bit of an LS byte for 1 or 0; branches if the condition is met, and takes the next sequential word if the condition is not met.

Word Types
Figure A2 (Continued)

252

1) U1 = U1 + H0
   Bytes U1 and H0 are added and the result stored in U1.
2) U1C = U1 + H0
   Bytes U1 and H0 are added and the result stored in U1; the carry flip flop is set if a carry occurs.
3) U1 = U1 + H0 + C
   Bytes U1 and H0 and the carry flip flop input are added and the result stored in U1.
4) U = U + 2
   The half word U is incremented by 2.
5) V1 = V1 + KAA
   The constant AA is added to V1 and stored in V1.
6) V1 = V1 OR K04
   The constant 04 is ORed with V1 and stored in V1. Note that constants may take the form KAA, K0A, KA0, but not the form KAB.
7) RDH U V
   The contents of main memory (half word) addressed by V are placed in U.
8) RDH U V + 2
   Same as 7) but V is incremented by 2 after the read.

LS storage locations are designated as U, V, G, D, I, T, P, H; the right half of a word is designated as U0, the left half as U1.

Typical IBM 360 2025 Processor Microcode
Figure A3

253

9) LABEL BR IF H1 = NZ
   A branch to LABEL is executed if H1 is not zero.
10) LABEL BR IF G1 BIT 3 = 0
    A branch to LABEL is executed if bit 3 of G1 is zero.
11) LABEL BR IF C = 0
    A branch to LABEL is executed if the carry flip flop is zero.

Typical IBM 360 2025 Processor Microcode
Figure A3 (Continued)
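Examples 2) and 3) above show the carry flip flop linking two byte-wide additions, which is why a 16-bit add on the 2025 takes two microinstructions. A small behavioral simulation of that pattern (hypothetical Python with illustrative names, not 2025 microcode):

```python
def byte_add(a, b, carry_in=0):
    # One 8-bit ALU microcycle: returns (sum mod 256, carry flip flop state).
    total = a + b + carry_in
    return total & 0xFF, total >> 8

def halfword_add(u, h):
    # First microcycle:  low bytes added, carry flip flop set (the "C" form).
    # Second microcycle: high bytes added with the stored carry (the "+ C" form).
    low, carry = byte_add(u & 0xFF, h & 0xFF)
    high, _ = byte_add(u >> 8, h >> 8, carry)
    return (high << 8) | low
```

The carry out of the high byte is discarded, so the result is modulo 2^16, matching a 16-bit register.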

254

A4 IBM 360 2025 Processor Microcode to Implement Instructions Listed in A2

1) S1 - '0'    (S1 assigned to D)
   D0 = 0 . K00
   D1 = 0 . K00

2) S1 - 'C'    (S1 assigned to D)
   D0 = 0 + KLC1
   D0 = D0 + KHC2
   D1 = 0 + KLC3
   D1 = D1 + KHC4

3) S1(S2, S3) - S4(S5, S6)
   Assume that each array is stored by rows. The location of S1 is held in LS location U. The first location of S1 gives the number of rows, the second location gives the number of elements in each row, the third location points to the start of the first row, the fourth location points to the start of the second row, etc. S4 is held in V, S2 in G, S3 in D, S5 in I, and S6 in T.

   P0C = I0 + I0
   P1 = I1 + I1 + C             P = 2*S5
   P0C = P0 + V0

255
P1 = P1 + V1 + C    P = S4 + 2 * S5
P = P + 2           P = P + 2
RDH P P             GET LOC OF S4(S5,*)
P0C = P0 + T0
P1C = P1 + T1 + C
P0C = P0 + T0
P1C = P1 + T1 + C   GET LOC OF S4(S5,S6)
RDH P P             GET S4(S5,S6)
H0C = G0 + G0
H1 = G1 + G1 + C    H = 2 * S2
H0C = H0 + U0
H1C = H1 + U1 + C   H = S1 + 2 * S2
H = H + 2
RDH H H             GET LOC OF S1(S2,*)
H0C = D0 + H0
H1 = D1 + H1 + C
H0C = D0 + H0
H1 = D1 + H1 + C    GET LOC OF S1(S2,S3)
STH H P             STOR S4(S5,S6) → S1(S2,S3)

256
4) INCR S1  S1 assigned to U
U = U + 1
5) S1 ← S2 + 2  S1 assigned to U, S2 to D
U = D
U = U + 2
6) S1 ← S2 + S3  S1 assigned to U, S2 to D, S3 assigned to G
U = D
U0C = U0 + G0
U1 = U1 + G1 + C
7) DO S1 L1  S1 assigned to H
L1 LOOP CODE
...
LOOP CODE
H = H - 1
L1 BR H NZ
8) ADO L1 ARRAY S1 S2, S3 '12'
Assign S1, S2, S3 to G, U, V, respectively. Assume arrays are stored as specified in Example 3.
U0 = K01
U1 = K00
L2 V0 = K01
V1 = K00

257
START LOOP CODE
...
LOOP CODE
V = V + 1
G = G + 2
RDH H G - 2
H0 = H0 ⊕ V0        TEST IF END
START BR H0 NZ      OF ROW HAS
H1 = H1 ⊕ V1        BEEN REACHED
START BR H1 NZ
RDH H G
U = U + 1
H0 = H0 ⊕ U0        TEST IF LAST
L2 BR H0 NZ         ROW HAS BEEN
H1 = H1 ⊕ U1        PROCESSED
L2 BR H1 NZ
L1 CONTINUE
9) BR S1 < S2 L1  Assign S2 to U, S1 to D
U0C = U0 - D0
U1 = U1 - D1 - C
L1 BR U1 8 Z    L1 IF NO OVERFLOW

258
10) BR S1 = '0' L1  S1 assigned to H
L1 BR H Z
11) BR S1 ≠ S2 L1  Assign S1 to U, S2 to D
H0 = U0 ⊕ D0
L1 BR H0 NZ
H0 = U1 ⊕ D1
L1 BR H0 NZ
12) MATCH S1 IN S2 L2
Assume that U points to S1 and V to S2, and that the first locations of S1 and S2 contain the length of each. In this example string lengths are restricted to 255 words or less, and the location of the match is not required.
RDB G0 U + 1    LENGTH S1
RDB G1 V + 1    LENGTH S2
RDB I0 U        FIRST CHARACTER S1
L3 RDB I1 V + 1
H0 = I0 ⊕ I1
L4 BR H0 Z
G1 = G1 - 1
L2 BR P0 Z    NO MATCH
BR L3

259
L4 D0 = G0 - 1
L6 BR D0 Z    MATCH
D1 = U + 1
P0 = G1 - 1
L2 BR P0 Z    NO MATCH
T1 = V
L5 RDB I0 U + 1
RDB I1 V + 1
P0 = I0 ⊕ I1
L1 BR P0 NZ
D0 = D0 - 1
L6 BR D0 Z    MATCH
P0 = P0 - 1
L2 BR P0 NZ    NO MATCH
L6 CONTINUE
A5 Microcode Required to Implement Higher Level Instructions of Section A1
The instructions of Section A1 will be implemented in IBM 2025 processor microcode. The higher level instructions were translated into PTL code. Each line of PTL code will be replaced by the equivalent microcode which was developed in Section A4. The variable assignment will have to be modified in some cases.

260
FORTRAN DO LOOPS
a) J = 0
DO 1 I = 1, S1
1 J = J + I
equivalent PTL code
I ← '1'
J ← '0'
DO S1 L1
J ← J + I
L1 I ← I + '1'
equivalent microcode: assign I to U, J to V, S1 to H
U0 = K01
U1 = K00            I = 1
V0 = K00
V1 = K00            J = 0
L1 V0C = V0 + U0
V1 = V1 + U1 + C    J = J + I
U = U + 1           I = I + 1
H = H - 1           S1 = S1 - 1
L1 BR H NZ          LOOP TEST

261
b) J = 0
DO 1 I = 2, S1
1 J = J + I
equivalent PTL code
J ← '0'
I ← '2'
S1 ← S1 - '1'
DO S1 L1
J ← J + I
L1 I ← I + '1'
equivalent microcode: assign J to U, I to V, S1 to H
U0 = K00
U1 = K00            J = 0
V0 = K02
V1 = K00            I = 2
H = H - 1           S1 = S1 - 1
L1 U0C = U0 + V0
U1 = U1 + V1 + C    J = J + I
V = V + 1           I = I + 1
H = H - 1
L1 BR H NZ          LOOP TEST

262
c) J = 0
DO 1 I = 2, S1, 2
1 J = J + I
equivalent PTL code
J ← '0'
I ← '2'
L1 J ← J + I
I ← I + '2'
BR I < S1 L1
equivalent microcode: assign J to U, I to V, S1 to G
U0 = K00
U1 = K00            J = 0
V0 = K02
V1 = K00            I = 2
L1 U0C = U0 + V0
U1 = U1 + V1 + C    J = J + I
V = V + 2           I = I + 2
H = G
H0C = H0 - V0
H1C = H1 - V1 + C   S1 - I
L1 BR H1 8 Z        LOOP TEST

263
APL Submatrix Selection
a) C ← A/B
equivalent PTL code
ADO L1 ARRAY B S1, S2 '12'
BR L1 A(S2) = 0
BR L2 S2 ≠ 1
I ← '0'
L2 I ← I + '1'
C(S1, I) ← B(S1, S2)
L1 CONTINUE
equivalent microcode: assign B to G, S1 to U, S2 to V, A to T, I to P, C to D
U0 = K01
U1 = K00            S1 = 1
L2 V0 = K01
V1 = K00            S2 = 1
START H0C = T0 + V0
H1 = T1 + V1 + C
RDH H H             GET A(S2)
L5 BR H0 NZ
L1 BR H1 Z          BR L1 A(S2) = 0

264
L5 H = V
H0 = H0 ⊕ K01
L2 BR H0 NZ
H1 = H1 ⊕ K00
L2 BR H1 NZ         BR L2 S2 ≠ 1
P0 = K00
P1 = K00            I = 0
L2 P = P + 1        I = I + 1
H = U
H0C = H0 + H0
H1 = H1 + H1 + C    H = 2 * S1
H0C = H0 + G0
H1 = H1 + G1 + C    H = B + 2 * S1
H = H + 2           H = H + 2
RDH H H             GET LOC B(S1,*)
H0C = H0 + V0
H1 = H1 + V1 + C
H0C = H0 + V0
H1 = H1 + V1 + C    GET LOC B(S1,S2)
RDH H H             GET B(S1,S2)

265
I = U
I0C = I0 + I0
I1 = I1 + I1 + C    I = 2 * S1
I0C = I0 + D0
I1 = I1 + D1 + C    I = C + 2 * S1
I = I + 2           I = I + 2
RDH I I             GET LOC C(S1,*)
I0C = I0 + P0
I1 = I1 + P1 + C
I0C = I0 + P0
I1 = I1 + P1 + C    GET LOC C(S1,I)
STH H I             C(S1,I) = B(S1,S2)
V = V + 1
G = G + 2
RDH H G - 2
H0 = H0 ⊕ V0        TEST IF END
START BR H0 NZ      OF ROW HAS
H1 = H1 ⊕ V1        BEEN REACHED
START BR H1 NZ
RDH H G

266
U = U + 1
H0 = H0 ⊕ U0        TEST IF LAST
L2 BR H0 NZ         ROW HAS BEEN
H1 = H1 ⊕ U1        PROCESSED
L2 BR H1 NZ
L1 CONTINUE
b) C ← A/[1] B
The microcode is identical to that of a), except that S1 and S2 are interchanged.
SNOBOL Pattern Match
a) HERE S1 S2 :F(L2)
equivalent PTL code
HERE MATCH S1 IN S2 'DUMMY' L2
equivalent microcode
Assume that U points to S1 and V to S2, and that the first locations of S1 and S2 contain the length of each. In this example string lengths are restricted to 255 words or less, and the location of the match is not required.

267
L4 D0 = G0 - 1
L6 BR D0 Z    MATCH
D1 = U + 1
P0 = G1 - 1
L2 BR P0 Z    NO MATCH
T1 = V
L5 RDB I0 U + 1
RDB I1 V + 1
P0 = I0 ⊕ I1
L1 BR P0 NZ
D0 = D0 - 1
L6 BR D0 Z    MATCH
P0 = P0 - 1
L2 BR P0 NZ    NO MATCH
L3 BR L1    NO MATCH
L2 CONTINUE

APPENDIX B
The implementations listed below are for the base hardware set of Figure 5.2.1. The formats for the microinstruction words are given in Figure 5.2.2. Examples of some possible microinstructions are given in Table 5.2.3, and special considerations in using the base microinstructions are listed in Table 5.2.4.
268

269
1) S3 ← S1 ∧ S2
RB3(S3) ← RB1(S1) ∧ RB2(S2)
LB3(S3) ← LB1(S1) ∧ LB2(S2)
2) S3 ← S1 ∨ S2
RB3(S3) ← RB1(S1) ∨ RB2(S2)
LB3(S3) ← LB1(S1) ∨ LB2(S2)
3) S3 ← S1
RB3(S3) ← RB1(S1)
LB3(S3) ← LB1(S1)
4) S1 ← S2 TR ADDR S3 LABEL
← RB2(LENGTH) + 1 + RB1(S2)    IN BOUNDS TEST
← LB2(LENGTH) + C + LB2(S2)
BRA1 8 LABEL
RB3(T) ← RB1(S3) + RB2(S2)
LB3(T) ← LB1(S3) + C + LB2(S2)
MA ← B(T) MEMC
NOP
NOP
B3(S1) ← MB
BR DONE(T + 1)

270
5) S1 ← SHIFT S2 CARRY OUT LEFT S3
RB3(S1) ← RB1(S2)
LB3(S1) ← LB1(S2)
← RB1(S3)
BRA0 4 P1
LB3(S1) ← RB1(S1)
RB3(S1) ← '0'
P1 ← RB1(S3)
BRA0 3 P2
RB3(S1) ← L1 RB1(S1)
LB3(S1) ← L1C LB1(S1)
RB3(S1) ← L1 RB1(S1)
LB3(S1) ← L1C LB1(S1)
RB3(S1) ← L1 RB1(S1)
LB3(S1) ← L1C LB1(S1)
RB3(S1) ← L1 RB1(S1)
LB3(S1) ← L1C LB1(S1)
P2 ← RB1(S3)
BRA0 2 P3
RB3(S1) ← L1 RB1(S1)
LB3(S1) ← L1C LB1(S1)
RB3(S1) ← L1 RB1(S1)
LB3(S1) ← L1C LB1(S1)
P3 ← RB1(S3)
BRA0 2 P4
RB3(S1) ← L1 RB3(S1)
LB3(S1) ← L1C LB3(S1)
P4 BR DONE

271
6) S3 ← S1 + S2
RB3(S3) ← RB1(S1) + RB2(S2)
LB3(S3) ← LB1(S1) + C + LB2(S2)
BROF P1
BR DONE
P1 SET OVF
7) S3 ← S2 - S1
RB3(S3) ← RB2(S2) + 1 + RB1(S1)
LB3(S3) ← LB2(S2) + C + LB1(S1)
BROF P1
BR DONE
P1 SET OVF
8) S1 INTEGER TO FLOAT S2
RB3(S2L) ← '00'16 - LB1(S1)
BRA0 8 L1
LB3(S2L) ← 'CB'16
RB3(S2R) ← RB1(S1) + 1
LB3(S2R) ← LB1(S1) + C
BR DONE
L1 LB3(S2L) ← '46'16
RB3(S2R) ← RB1(S1)
LB3(S2R) ← LB1(S1)

272
Stack Operations
There are two locations in S1 which hold pointing information:
i) LENST, the current length of the stack;
ii) POINST, the location (in main memory) of the top of the stack.
9) S1 ← ST LABEL
MA ← B(POINST) MEMC
RB3(LENST) ← RB3(LENST) + 1 + '1'
LB3(LENST) ← LB3(LENST) + C + '0'    LENST ← LENST - 1
BRA1 8 L1
RB3(POINST) ← RB3(POINST) + 1 + '1'
LB3(POINST) ← LB3(POINST) + C + '0'    POINST ← POINST - 1
B3(S1) ← MB
BR DONE
L1 B3(LENST) ← '0'
BR LABEL

273
10) ST ← S1 LABEL
RB3(POINST) ← RB1(POINST) + 1
LB3(POINST) ← LB1(POINST) + C
MA ← B(POINST)
MB ← B(S1) MEMW
RB3(LENST) ← RB3(LENST) + 1
LB3(LENST) ← LB3(LENST) + C
BRA0 2 DONE
RB3(POINST) ← RB2(POINST) + 1 + '1'
LB3(POINST) ← LB2(POINST) + C + '0'
RB3(LENST) ← RB2(LENST) + 1 + '1'
LB3(LENST) ← LB2(LENST) + C + '0'
BR LABEL
11) CLEAR STACK
RB3(POINST) ← RB2(POINST) + 1 + RB1(LENST)
LB3(POINST) ← LB2(POINST) + C + LB1(LENST)
B3(LENST) ← '0'

274
Linked List Operations
There are two locations in S1 which hold pointing information:
i) HEAD points to the head of the list;
ii) POINTS points to an item of the list (the forward pointer); the backward pointer is at POINTS + 1.
12) INCREMENT POINTER LINK LIST S1 LABEL
MA ← B(POINTS) MEMC
NOP
NOP
BRMB1 16 LABEL    OUTSIDE LIST
B3(POINTS) ← MB
NOP
13) DECREMENT POINTER LINK LIST S1 LABEL
RB3(T) ← RB1(POINTS) + 1
LB3(T) ← LB1(POINTS) + C
MA ← B(T) MEMC
NOP
NOP
BRMB1 16 LABEL    OUTSIDE LIST
B3(POINTS) ← MB
RB3(POINTS) ← RB2(POINTS) + 1 + '1'
LB3(POINTS) ← LB2(POINTS) + C + '0'

275
14) SET LINK LIST S1
MBL ← 'FF'16
MA ← B(HEAD) MEMW
RB3(POINTS) ← RB1(HEAD) + 1
LB3(POINTS) ← LB1(HEAD) + C
MA ← B(POINTS) MEMW
RB3(POINTS) ← RB1(HEAD)
LB3(POINTS) ← LB1(HEAD)
15) EMPTY LINK LIST S1
MA ← B(HEAD) MEMC
RB3(POINTS) ← RB1(HEAD)
LB3(POINTS) ← LB1(HEAD)
L1 BRMB1 16 DONE
B(T) ← MB
MBL ← 'FF'16 MEMW
NOP
NOP
MA ← B(T) MEMR
NOP
BR L1

276
16) REMOVE S1 LINK LIST LABEL
MA ← B(POINTS) MEMR    GET FORWARD POINTER
B(S1) ← MA
RB3(T1) ← RB1(POINTS) + 1
LB3(T1) ← LB1(POINTS) + C
B(T2) ← MB
MB ← 'FF'16 MEMW    RESET LINK ELEMENT
← LB1(T2)
BRA1 8 L1    NO FORWARD POINTER
MA ← B(T1) MEMR    GET BACKWARD POINTER
RB3(POINTS) ← RB1(T2) + 1
LB3(POINTS) ← LB1(T2) + C
BRMB1 16 LABEL    ERROR REMOVING HEAD
MA ← B(POINTS) MEMW
L2 B3(POINTS) ← MB
RB3(POINTS) ← RB2(POINTS) + 1 + '1'
LB3(POINTS) ← LB2(POINTS) + C + '0'
MA ← B(POINTS)
MB ← B1(T2) MEMW
NOP
BR DONE
L1 MA ← B(T1) MEMR    GET BACKWARD POINTER
NOP
BR L2

277
17) INCR S1
RB3(S1) ← RB1(S1) + 1
LB3(S1) ← LB1(S1) + C
BROF P1
BR DONE
P1 SET OVF
18) DECR S1
RB3(S1) ← RB2(S1) + 1 + '1'
LB3(S1) ← LB2(S1) + C + '0'
BROF P1
BR DONE
P1 SET OVF
19) DO LABEL S1 L1
RB3(S1) ← RB2(S1) + 1 + '1'
LB3(S1) ← LB2(S1) + C + '0'
BRA1 8 LABEL
BR L1
LABEL CONTINUE

278
20) BR S1 = S2 LABEL
RB3(T) ← RB2(S1) * EV * RB1(S2)
← RB2(T) + '0'
BRC L1
LB3(T) ← LB2(S1) * EV * LB1(S2)
← LB2(T) + '0'
BRC LABEL
L1 BR DONE
21) BR S1 < S2 LABEL
← LB1(S2)
BRA1 8 L1    S2 < 0
← LB1(S1)
BRA1 8 LABEL    S1 < 0, S2 > 0
L2 ← RB2(S2) + RB1(S1)
← LB2(S2) + C + LB1(S1)    S2 - (S1 + 1)
BRA0 8 LABEL    S2 > S1
BR DONE
L1 ← LB1(S1)
BRA0 8 DONE
BR L2
DONE

APPENDIX C
What is Microprogramming?
M. V. Wilkes [W2, W3] in the early 1950's suggested microprogramming as a more orderly approach to the design of computer control circuitry. He was particularly concerned with the signals needed to control the flow of information among the registers and transformation units that comprise the central processing unit. Wilkes felt that the execution of a machine instruction† by the central processing unit (CPU) could be partitioned into a series of register to register transfers in the CPU. Some of the transfers of data in the CPU would, of course, be made to pass through a transformation unit while going from the source register to the destination register. A typical transformation unit might be capable of shifting, adding, logical disjunction, etc. A representation of Wilkes' scheme is shown in Figure C. 1, with the nonvolatile "control memory" divided into a part A and a part B. Each column of part A represents a microoperation. The excitation of a microoperation causes a predetermined action by one or more elements of the CPU. Some of the actions that a microoperation might cause are: data gated out of a register, data gated into a register, data gated through a transformation unit. Each row in
†Machine instruction: the binary code for an operation that is performed by the central processing unit. The code is normally held in the main memory until fetched into the CPU for instruction decoding and execution.
279

280
Wilkes' Scheme of Microprogram Control
Figure C. 1

281
the control memory represents a microinstruction. Part A of the microinstruction controls gating in the central processing unit (CPU) and part B of the microinstruction contains the address of the next microinstruction to be executed. Thus the microprogram controls the gating within the CPU and hence controls the transfer of data within the CPU. The control of a central processing unit, according to Wilkes, is implemented by the execution of a series of microinstructions. The excitation of the microoperations specified by each microinstruction causes register to register transfers to occur. The series of register to register transfers specified by the microinstructions implements the desired machine instruction. The question of what is a reasonable choice of microoperations was not clearly specified by Wilkes. The most functionally complete set of microoperations is a set for which each microoperation controls a single gate (input gate or output gate) of one bit of a register. The set is functionally complete in the sense that control is exercised over the smallest unit of storage in the CPU, individual flip-flops. For even moderate sized machines the length of the microinstruction word sufficient to contain all of the microoperations of the preceding set is excessively large. Therefore, it is usually necessary to trade some amount of flexibility (i.e., functional completeness) for a reduction in the microinstruction word size. We will see that

282
both of the two major implementations of microprogram control result in reduced microinstruction word size. The terms horizontal microprogramming and vertical microprogramming are used by microprogramming practitioners to form a dichotomy of microinstruction word types. Since there are many different types of microinstructions, the definitions that have been developed and which attempt to span the range of microinstruction types are overly generalized. The definitions which are presented below define the end points of the range of microprogram instruction types. The definitions could be said to describe pure horizontal microinstructions and pure vertical microinstructions respectively. In this way the essential difference in philosophy between the two types of microprogramming will be illuminated without the clutter imposed by the constraints of particular applications. After presenting the definitions some of the modifications which are imposed by considerations of cost will be discussed and illustrated. In the remainder of the paper the term horizontal microprogramming will be used to mean more nearly like pure horizontal than pure vertical microprogramming, and vice versa for the term vertical microprogramming. Before defining horizontal and vertical microprogramming the definitions of a control point and a control point set will be given.
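Wilkes' part A / part B organization described above can be made concrete in a few lines. The following is a minimal sketch only; the dictionary-based control memory and the gate names are illustrative inventions, not taken from the dissertation's figures:

```python
# Sketch of Wilkes' control-memory scheme: each microinstruction row holds
# part A (the set of microoperations to excite this microcycle) and
# part B (the address of the next microinstruction).
def run(control_memory, start, steps):
    """Yield the microoperation set excited on each microcycle."""
    addr = start
    for _ in range(steps):
        part_a, part_b = control_memory[addr]
        yield part_a          # these microoperations gate data in the CPU
        addr = part_b         # part B names the successor row
```

A two-row control memory then cycles between its rows exactly as the part B addresses dictate.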

283
Def. C. 1 Control Point: Each control point for a CPU is a set of control gates with the property that the simultaneous excitation of the gates will cause one of the following actions.
a) The outgating of data from a register (or portion of a register) onto a data path.
b) The ingating of data into a register (or portion of a register) from a data path.
c) The routing of data at a bus junction.
d) The selection of the function to be performed by a transformation unit.
Def. C. 2 Control Point Set: A control point set for a CPU is a set of control points which partitions all the control gates of the CPU.
Each control point is specified by the machine designers and represents a microoperation. The simultaneous excitation of several control points is required to execute a register to register transfer. Horizontal and vertical microprogramming are different methods of encoding the information which specifies the control points

284
to excite during one microcycle.
Def. C. 3 Horizontal microinstruction: A horizontal microinstruction word assigns one bit to each control point of the control set, with the state of the bit controlling the excitation of the corresponding control point.
Def. C. 4 Vertical microinstruction: A vertical microinstruction word assigns one bit to several control points of the control set, with the state of the bit controlling the excitation of the corresponding control points.
Horizontal microprogramming condenses word length by allowing each microoperation to control several gates (a control point) of a register. Both the width of data paths and the register bit position connection points of data paths are fixed. Therefore the ability to modify data widths is lost, but the ability to select register to register and register to transformation unit data paths remains. Thus a register to register transfer requires the execution of several horizontal microoperations. Vertical microprogramming condenses word length by allowing each microoperation to control several control points. A vertical microoperation may control an entire register to register transfer or the major portion of a register to register transfer. There are, of course, a great number of possible register to register transfers.
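The contrast between Defs. C. 3 and C. 4 can be sketched with an invented five-point control set (the names below are illustrative, not the control points of Figure C. 2):

```python
# Pure horizontal (Def. C.3): one bit per control point.
CONTROL_POINTS = ["out_A", "out_B", "alu_add", "shift_none", "in_C"]

def horizontal_word(active):
    """Build a horizontal microinstruction: bit i excites CONTROL_POINTS[i]."""
    return [1 if cp in active else 0 for cp in CONTROL_POINTS]

# Pure vertical (Def. C.4): a short opcode names one of a small set of
# preselected transfers; each opcode excites several control points at once.
VERTICAL_OPS = {
    0: {"out_A", "alu_add", "shift_none", "in_C"},            # C <- A
    1: {"out_A", "out_B", "alu_add", "shift_none", "in_C"},   # C <- A + B
}

def vertical_excites(opcode):
    return VERTICAL_OPS[opcode]
```

The horizontal word is as wide as the control set but can name any combination of points; the vertical word is only as wide as its opcode but can name only the preselected transfers.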

285
The decrease in microinstruction word length is accomplished by implementing a small subset of all the possible register to register transfers and therefore limiting the number of vertical microoperations. Therefore both the ability to modify data path widths and the ability to modify data path connections, other than the selection of the preselected data paths, are lost. As was mentioned above, most microprogram implementations are a modification of either horizontal or vertical microprogramming. The most common modification is the use of field encoding. If it is known that certain sets of microoperations are never executed at the same time, then the microorders may be encoded into a field instead of using one bit per microoperation. Thus if three microoperations are never executed at the same time, they may be encoded into 2 bits. The code 00 would specify execution of none of the operations, 01 the execution of the first microoperation, etc. Horizontal microinstructions almost always use field encoding. For example, if there are three inputs to a register, it is generally true that only one input data path would be chosen at a time. Field encoding is also useful for vertical microcoding. Typically there will be an operation field, which will set up the data paths and transformation unit control, and source and destination fields (some use a single field), which set up the register outgating and ingating. There are two disadvantages in using field encoding. First, the choice of microoperations that are mutually exclusive is not always

286
clear. Secondly, each stage of decoding of the microoperation will decrease the speed of operation by the delay time of one gate. Thus, for large codes either large delays will be suffered or large amounts of circuitry will be required to decode the microoperation. A second modification which is used to reduce the word length is word encoding. In word encoding one field of the microinstruction defines the meanings of the other sections of the control word. For example, a section of a horizontal microinstruction might control memory operations if a certain bit of the word were a zero, but if the bit were a 1 the same section that controlled the memory would now control the I/O data paths. In a vertical microinstruction, the operation code field may change meaning depending upon the state of the word encode field. The disadvantage of the use of word encoding is a reduction in the parallelism of the microinstruction. A number of authors use terms different from horizontal and vertical microprogramming. The more descriptive terms of minimally decoded (horizontal microprogramming) and highly decoded (vertical microprogramming) have been used by Lawson [L3]. The terms function field type (horizontal microinstruction) and machine code type, or mini-instruction (vertical microprogramming), have been used by Ramamoorthy [R1].
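Both encodings just described are easy to sketch. The functions below are illustrative only: the three-microoperation field and the memory/I-O section are the text's hypothetical examples, not a description of a real machine:

```python
import math

def field_bits(n_mutually_exclusive):
    """Field encoding: bits needed for n mutually exclusive microoperations
    plus the all-zeros 'execute none' code (3 microoperations -> 2 bits)."""
    return math.ceil(math.log2(n_mutually_exclusive + 1))

def decode_field(value, microops):
    """Code 0 selects no microoperation; code k selects the k-th one."""
    return None if value == 0 else microops[value - 1]

def decode_section(word_encode_bit, section_bits):
    """Word encoding: one bit redefines what a shared section controls."""
    unit = "memory" if word_encode_bit == 0 else "io_paths"
    return unit, section_bits
```

Note how `decode_field` is the extra decoding stage the text warns about: each such stage costs one gate delay in hardware.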

287
The terms monophase, polyphase, encode, and little encode have been used by Redfield [R2]. The terms encode and little encode correspond to the terms minimally decoded and highly decoded. The terms monophase and polyphase refer to the duration and sequencing of control signals. A monophase microinstruction produces a simultaneous set of control signals. A polyphase microinstruction produces a set of control signals in which each member of the set may be active at different times during a microcycle time span. In discussions at conferences the terms long word (horizontal microprogramming) and short word (vertical microprogramming) are occasionally used. In this paper we will use the traditional terms of horizontal and vertical microprogramming. A further discussion of horizontal and vertical microprogramming will help clarify the difference between the two methods. Each horizontal microoperation controls gates (either input or output gates) within the CPU, thereby causing data to flow into or out of registers and transformation units. A register to register transfer is executed by the proper selection of the horizontal microoperations, each executed at the same time. It is the freedom of selecting the data paths, and therefore the register

288
to register transfers, which gives the horizontal microprogram control method great flexibility. The flexibility offered by the use of horizontal microprogramming has its price. The length of microinstruction words for a horizontal microprogram controlled machine is longer than for a vertical microprogram controlled machine. Also, the programming of a horizontal microprogram machine is tedious. It is much easier to specify a register to register transfer than to specify all the microoperations required to implement the register to register transfer. Horizontal microprogramming was used in some models of IBM's System 360 series of computers. It is interesting to note that the horizontal microinstruction word length used in the System 360 series computers varied from 56 bits (model 30) to 100 bits (model 65). A section of a CPU is shown in Figure C. 2. The possible microoperations are listed in Figure C. 3 and shown (dotted lines) in Figure C. 2. Several register to register transfers, together with the microoperations required to implement them, are listed in Figure C. 4. The examples of Figure C. 4 are very straightforward and need no explanation. Vertical microprogramming employs the individual bits of the microinstruction word to specify register to

289
register transfers. The register to register transfer must be executable in one microinstruction cycle time. For most CPU's the number of different register to register transfers that may be implemented is very large. Normally, a small subset of all possible transfers is chosen as the set of microoperations. Since a restricted set of transfers is normally available in a vertical microprogrammed machine, it is not as flexible as a horizontal microprogrammed machine. However, the microinstruction word length for vertical microprogramming is normally smaller than the microinstruction word length for horizontal microprogramming. In addition, a vertical microprogram machine is easier to microprogram than a horizontal microprogram machine. Vertical microprogramming is very similar to machine language or assembly language programming of a conventional computer. An example of a vertical microprogrammed machine is the Interdata 3 computer. The simplified CPU organization and the microoperations for the Interdata 3 [I3] are shown in Figures C. 5 and C. 6, respectively. There are only 16 basic microoperations provided by the Interdata 3. However, the flexibility of the operations is increased by allowing the programmer to specify the destination and source registers independently. The Interdata Model 3 microinstruction word length is 16 bits.

290
The smaller word size results in a reduction in the total number of control storage bits required, but also results in a loss in flexibility. For example, it would not be possible to add two registers together and store the result, for further use, in two registers instead of a single register. The data paths are available to accomplish such an operation, but because of the restricted control word length, the control bits to specify two result registers instead of a single result register are not available. With horizontal microprogramming such an operation would be possible (see Figure C. 4 for an example). The two types of microprogramming (vertical and horizontal) are very different approaches to CPU control. The use of microoperations as gate controls (horizontal microprogramming) is essentially the replacement of hardwired control logic by a control memory. The control signals normally provided by a conventional logic control unit are provided by the microoperations of the horizontal microinstruction word. For example, the microoperations of Figure C. 2 and Figure C. 3 are identical to the operations that would be listed on a flow diagram used to document hardwired control logic. The use of microoperations as register to register transfers is essentially the simulation of one computer by another computer. The microoperations of the vertical microprogrammed computer are very basic machine instructions. For example, the IBM 360 Model 25 [I1] microinstructions actually use four basic clock cycles, each of

291
which can cause a register to register transfer. The Interdata Model 3 can accomplish only one register to register transfer per microinstruction. However, in both cases the microinstructions perform the standard machine operations of ADD, SUBTRACT, and TRANSFER. This can be observed from the microoperations of the Interdata 3 (Figure C. 6). Because the microoperations are held in a control memory with very fast access times, the microinstructions can be executed very rapidly. To execute an Interdata 3 machine instruction, many Interdata 3 microinstructions have to be performed. Therefore, the Interdata 3 computer can be viewed as being simulated by an elementary computer which is the microcontrol computer. The term "elementary computer" refers to the very basic nature of the microoperations which form the instruction set of the control (microprogrammed) computer. Before concluding this section one use of the term microprogramming should be mentioned. Microprogramming is sometimes used to describe a form of instruction coding. A good example of this is the PDP-8 microinstruction code which is shown in Figure C. 7. The instruction, which is a machine instruction held in main memory, not in a control memory, allows the programmer to specify different combinations of CPU operations to be performed during one main memory cycle. A programmer, by placing a 1 in bit position 11,

292 also increments the accumulator (IAC) after it has been cleared. By choosing combinations of the bits of the code 7 instruction word, the programmer forms his own machine instructions. However, this is not microprogramming in the normal sense. The 7 op code is recognized and the instruction implemented with hardware logic control. In fact, the combination of clear the accumulator (CLA) and complement the accumulator (CMA) occurring together must be recognized and treated specially to achieve the desired results.
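The ordering just described (clear first, then complement, then increment) can be sketched as follows. Only the accumulator effects of CLA, CMA, and IAC are modeled, and the sequencing shown is the conventional PDP-8 group-1 ordering, assumed here rather than stated anywhere in this appendix:

```python
# Sketch of PDP-8 group-1 operate execution order over a 12-bit accumulator.
def group1(ac, cla=0, cma=0, iac=0, width=12):
    mask = (1 << width) - 1
    if cla:                  # clear the accumulator first
        ac = 0
    if cma:                  # then complement it
        ac = (~ac) & mask
    if iac:                  # increment last (bit 11 of the format)
        ac = (ac + 1) & mask
    return ac
```

Setting CLA and IAC together therefore always leaves 1 in the accumulator, which is the combination the text describes.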

293
[Registers A, B, and C feed an adder through gates controlled from the microinstruction decoder; the adder output passes through a shifter (left 4, right 4, no shift) and back to the registers.]
Section of a CPU
Figure C. 2

294
Symbol  Function
A       Gate Reg. A to Adder Left Side
AL1     Shift Reg. A Left 1 to Adder Left Side
AR1     Shift Reg. A Right 1 to Adder Left Side
B       Gate Reg. B to Adder Right Side
BC      Gate Reg. B Complemented to Adder Right Side
C       Gate Reg. C to Adder Right Side
CC      Gate Reg. C Complemented to Adder Right Side
NOS     Pass Adder Output through Shifter with No Shift
SL4     Pass Adder Output through Shifter with Left 4 Shift
SR4     Pass Adder Output through Shifter with Right 4 Shift
SA      Set Shifter Output into Reg. A
SB      Set Shifter Output into Reg. B
SC      Set Shifter Output into Reg. C
Microoperations for Figure C. 2
Figure C. 3

295
Operation                                                   Microorders
Transfer the contents of register A shifted right           A, SR4, SB
4 places into register B:  A (shift right 4) → B
Transfer the contents of register A added to the            A, B, NOS, SC
contents of register B into register C:  A + B → C
Transfer the contents of register A added to the            A, B, NOS, SC, SA
contents of register B into register C and
register A:  A + B → C; A + B → A
Some Possible Register to Register Transfers for the CPU of Figure C. 2 and the Microoperations of Figure C. 3
Figure C. 4
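A toy interpreter makes the simultaneity of these microoperations explicit. This is a sketch only: it uses the microoperation names of Figure C. 3 over plain Python integers, models just the subset needed for Figure C. 4's examples, and enforces no word width:

```python
# Toy interpreter for a subset of Figure C.3's microoperations, applied
# to the data path of Figure C.2 (registers A, B, C; adder; shifter).
def execute(regs, microops):
    left = right = 0
    if "A" in microops:   left = regs["A"]        # gate A to adder left
    if "AL1" in microops: left = regs["A"] << 1
    if "AR1" in microops: left = regs["A"] >> 1
    if "B" in microops:   right = regs["B"]       # gate B to adder right
    if "C" in microops:   right = regs["C"]
    s = left + right                              # adder output
    if "SR4" in microops: s >>= 4                 # shifter
    if "SL4" in microops: s <<= 4                 # ("NOS" leaves s as is)
    out = dict(regs)
    if "SA" in microops: out["A"] = s             # set shifter output
    if "SB" in microops: out["B"] = s
    if "SC" in microops: out["C"] = s
    return out
```

The third row of Figure C. 4 corresponds to naming both SA and SC in one call, which is exactly the dual-destination transfer the text later notes a restricted vertical word cannot express.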

296
[Read-only control memory with its address register, core memory, registers R0-R7, and the ALU, interconnected by the S bus, A bus, and B bus (each 8 bits, parallel).]
Simplified Block Diagram of Interdata Model 3
Figure C. 5

297
Op Code  Mnemonic  Instruction             Format  Execution time (mc)
C        A         Add                     A       760
D        A         Add Immediate           B       760
E        S         Subtract                A       760
F        S         Subtract Immediate      B       760
A        X         Exclusive OR            A       380
B        X         Exclusive OR Immediate  B       380
8        N         AND                     A       380
9        N         AND Immediate           B       380
6        O         Inclusive OR            A       380
7        O         Inclusive OR Immediate  B       380
4        L         Load                    A       380
5        L         Load Immediate          B       380
3        C         Command                 C       380
2        T         Test                    C       380
1        B         Branch on Condition     D       760
0        D         Do
Format A -- Part: Op D S E; Size (bits): 4 4 4 4
Format B -- Part: Op D Data; Size (bits): 4 4 8
Format C -- Part: Op TC; Size (bits): 4 12
Format D -- Part: Op Mask Address; Size (bits): 4 4 8
Microoperations Interdata Model 3
Figure C. 6

298
Op - specifies the operation to be performed.
D - destination register number.
S - source register number.
E - extended operation field, which specifies options within the same operation.
Data - immediate operand (literal).
TC - test command, which specifies the signal to be tested or the operation to be performed.
Mask - condition code mask (C, V, G, L).
Address - destination address.
Destination/Source Registers
Code  Register    Code  Register    Code  Register
0000  R0          0101  R5          1010  MDL
0001  R1          0110  R6          1011  MAH
0010  R2          0111  R7          1100  MAL
0011  R3          1000  AR          1101  RAH/SDR
0100  R4          1001  MDH         1110  RAL/SCR
                                    1111  DFR
Microoperations Interdata Model 3 (Cont.)
Figure C. 6
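With four fixed 4-bit fields, decoding a Format A word is a matter of shifts and masks. A sketch, assuming the Op field occupies the high four bits (a bit ordering consistent with, but not stated by, the format tables above):

```python
def decode_format_a(word):
    """Split a 16-bit Format A microinstruction into (Op, D, S, E)."""
    assert 0 <= word < 1 << 16
    op = (word >> 12) & 0xF   # operation code
    d  = (word >> 8) & 0xF    # destination register number
    s  = (word >> 4) & 0xF    # source register number
    e  = word & 0xF           # extended operation field
    return op, d, s, e
```

For example, under this assumed ordering the word C123 (hexadecimal) would decode as an Add (op code C) from register R2 to register R1 with extended field 3.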

299
Bits 0-2: 1 1 1 (operation code 7); bit 3: 0 (zero specifies Group 1); bit 4: CLA; bit 5: CLL; bit 6: CMA; bit 7: CML; bit 8: RAR; bit 9: RAL; bit 10: 0 = rotate one place, 1 = rotate two places; bit 11: IAC
GROUP 1
CLA - clear accumulator
CLL - clear link
CMA - complement accumulator
CML - complement link
RAR - rotate right
RAL - rotate left
IAC - increment accumulator
PDP-8 Microinstruction
Figure C. 7

APPENDIX D
Automatic Generation of B Tables
The design method developed in the body of this report requires the generation of one or more implementation methods for each program type. There is one implementation for the base hardware and one or more improved implementations. Each of the improved implementations uses one or more hardware options. The required hardware options for the implementations of a program type are represented by the Bi matrix, and the execution time and storage requirements are represented by the Vi and Si vectors, respectively. The Bi matrix, the Vi vector, and the Si vector can be generated from the base implementation method of a program type. This is the method which would normally be used by the designer. The designer would develop the base implementation method and then note the change in execution time and required storage if the hardware options (individually or in combinations) were added to the base hardware. The procedure of generating the Bi matrix and the Vi and Si vectors from the base implementation method could be partially automated. A discussion of the
300

301
semi-automated approach is given below. The approach would be to identify sequences of microcode in the base implementations which could be improved using the hardware options. A program type and the base microcode for the base hardware of Figure 5.2.1 are shown below (Figure D1). The sequence of microcode of Steps 1, 2 and Steps 5, 6 could be replaced by one microinstruction if the double width ALU option were chosen. Both sequences are an ALU operation performed on both the left and right half of a word of control storage. Such a sequence could be noted by the hardware designer and used in an algorithm for detecting such occurrences in the base hardware implementations. An interesting problem is illustrated in Figure D1. The microinstructions (1, 2) could be replaced by one microinstruction if hardware option 8 (Figure 5.2.2) were chosen, but microinstructions (1, 2, 3) could be replaced by one microinstruction if hardware options 4 and 7 (Figure 5.2.2) were chosen. This occurs because microinstructions (1, 2, 3) of Figure D1 implement a test to determine if S2 is less than LENGTH. This test can be performed in one microinstruction if options 4 and 7 (Figure 5.2.2) are added to the base hardware (Figure 5.2.1). Thus the procedure would have to detect both occurrences and form the three different implementations which result if option 8, options 4 and 7, or options 4, 7, and 8 were chosen. This implies that the

302 designer would probably scan the base hardware program type implementations for occurrences of difference sequences, and then let the automated procedure find the sequences and form the different implementations some of which would result from combining sequences. There are some sequences which would be very hard to identify For example, the identification of a sequence which was performing a stack operation would be very difficult to identify since it would probably be a complicated procedure and might vary slightly from implementation to implementation. Thus there are some sequences and hence some implementations which would still have to be provided by the designer. The question of the correctness of the automatically generated implementations would be a problem. That whole question of program correctness is a major research area and will not be discussed here.
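Assembling the Bi matrix and the Vi and Si vectors is mechanical once each implementation's option set, execution time, and storage requirement are known. The sketch below is illustrative only; the names (Implementation, build_tables) and the sample figures are assumptions, not taken from the report.

```python
from dataclasses import dataclass

@dataclass
class Implementation:
    options: set       # hardware options this implementation requires
    exec_time: float   # execution time (entry of the Vi vector)
    storage: int       # control-storage requirement (entry of the Si vector)

def build_tables(impls, n_options):
    """Return (B, V, S): B[j][k] = 1 iff implementation j uses option k."""
    B = [[1 if k in imp.options else 0 for k in range(n_options)]
         for imp in impls]
    V = [imp.exec_time for imp in impls]
    S = [imp.storage for imp in impls]
    return B, V, S

# Hypothetical program type: a base implementation plus two improved ones.
impls = [Implementation(set(), 10.0, 10),       # base hardware, no options
         Implementation({8}, 9.0, 9),           # with the double-width ALU
         Implementation({4, 7}, 8.0, 8)]        # with options 4 and 7
B, V, S = build_tables(impls, n_options=9)
```

The base implementation always appears as an all-zero row of B, so the optimization procedure can fall back to it when no options are purchased.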

Program Type:  S1 - S2 TR ADDR S3 LABEL

Implementation:
 1  -RB2(LENGTH) + 1 + RB1(S2)      (in-bounds test)
 2  -LB2(LENGTH) + C + LB2(S2)
 3  BRA1 8 LABEL
 4  RB3(T) - RB1(S3) + RB2(S2)
 5  LB3(T) - LB1(S3) + C + LB2(S2)
 6  MA - B(T)   MEMC
 7  NOP
 8  NOP
 9  B3(S1) - MB
10  BR DONE(T + 1)

Figure D1. Program Type and Implementation
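The detection step can be sketched as a peephole-style scan over a listing such as the one above. The fragment below is a minimal illustration, not the report's procedure: the instruction tags, the pattern table, and the option numbering (option 8 collapsing a half-word ALU pair, options 4 and 7 collapsing the three-step in-bounds test) are assumptions that loosely mirror the Figure D1 discussion.

```python
from itertools import combinations

# Each pattern: (tag sequence, hardware options required, replacement length).
PATTERNS = [
    (("ALU_R", "ALU_L"), frozenset({8}), 1),               # double-width ALU
    (("ALU_R", "ALU_L", "BRANCH"), frozenset({4, 7}), 1),  # one-step bounds test
]

def find_matches(code):
    """Yield (start, matched_len, options, new_len) for every pattern hit."""
    for seq, opts, new_len in PATTERNS:
        for i in range(len(code) - len(seq) + 1):
            if tuple(code[i:i + len(seq)]) == seq:
                yield (i, len(seq), opts, new_len)

def implementations(code):
    """Map each usable option combination to its resulting code length."""
    matches = list(find_matches(code))
    result = {frozenset(): len(code)}            # base implementation
    for r in range(1, len(matches) + 1):
        for chosen in combinations(matches, r):
            spans = [(i, i + n) for i, n, _, _ in chosen]
            if any(a < d and c < b               # two chosen spans overlap
                   for (a, b), (c, d) in combinations(spans, 2)):
                continue
            opts = frozenset().union(*(o for _, _, o, _ in chosen))
            saved = sum(n - m for _, n, _, m in chosen)
            result[opts] = min(result.get(opts, len(code)),
                               len(code) - saved)
    return result

# An assumed tag encoding of a ten-step listing in the style of Figure D1:
figure_d1 = ["ALU_R", "ALU_L", "BRANCH", "ALU_R", "ALU_L",
             "MEM", "NOP", "NOP", "MEM", "BRANCH"]
impls = implementations(figure_d1)
```

On this listing, implementations yields the base implementation plus the three improved ones the text describes: option 8 alone, options 4 and 7, and options 4, 7, and 8 combined.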

BIBLIOGRAPHY

[B1] Barton, N., R. McGuire, "System 360 Model 85 Microdiagnostics," Spring Joint Computer Conference, AFIPS Conference Proceedings, vol. 36, 1972.
[B2] Bell, C. G., and A. Newell, Computer Structures: Readings and Examples, McGraw-Hill Co., New York, 1970, Chapter 2.
[B3] Bell, C. G., and A. Newell, "PMS and ISP Descriptive Systems for Computer Structures," AFIPS Conference Proceedings, vol. 36, May 1970, pp. 351-374.
[B4] Bowman, Sally, T. J. DeSchon, "Case Study on Emulation of the 160A in a Complex Environment," preprints, 4th Annual Workshop on Microprogramming, University of California, Santa Cruz, California, Sept. 1971.
[B5] Breuer, M. A., "General Survey of Design Automation of Digital Computers," IEEE Proceedings, vol. 54, No. 12, Dec. 1966, pp. 1708-1721.
[B6] Breuer, M. A., "Recent Developments in Automated Design and Analysis of Digital Systems," IEEE Proceedings, vol. 60, No. 1, Jan. 1972, pp. 12-27.
[C1] Calhoun, R. C., "Diagnostics at the Microprogramming Level," Modern Data, May 1969, p. 8.
[C2] Conway, Melvin E., "Proposal for an UNCOL," Comm. of ACM, vol. 1, No. 10, Oct. 1958, pp. 5-8.
[C3] Cook, R. W., M. J. Flynn, "System Design of a Dynamic Microprocessor," Department of Electrical Engineering, Northwestern University, Evanston, Illinois, 1969.
[D1] Darringer, J. A., The Description, Simulation and Automated Implementation of Digital Computer Processors, Ph.D. Dissertation, Carnegie-Mellon University, Pittsburgh, Pa., May 1969.

[D2] Darringer, J. A., "A Language for the Description of Digital Computer Processors," report, Carnegie-Mellon University, Pittsburgh, Pa., May 1969.
[D3] Davies, P. M., "Readings in Microprogramming," IBM Systems Journal, vol. 11, No. 1, 1971, pp. 16-39.
[D4] Day, W. H. E., "Compiler Assignment of Data Items to Registers," IBM Systems Journal, vol. 9, No. 4, 1970, pp. 281-317.
[D5] Duley, J. R., and D. L. Dietmeyer, "A Digital System Design Language," IEEE Transactions on Computers, vol. EC-17, No. 9, Sept. 1968, pp. 850-861.
[D6] Duley, J. R., and D. L. Dietmeyer, "Translation of a DDL Digital System Specification to Boolean Equations," IEEE Transactions on Computers, vol. C-18, No. 4, April 1969, pp. 305-318.
[F1] Feldman, J., and D. Gries, "Translator Writing Systems," Communications of the ACM, vol. 11, Feb. 1968, pp. 77-113.
[F2] Flynn, M. J., M. D. MacLaren, "Microprogramming Revisited," Proc. 22nd Natl. Conf. of the ACM, 1967, pp. 457-464.
[F3] Flynn, M. J., R. F. Rosin, "Microprogramming - An Introduction and a Viewpoint," IEEE Transactions on Computers, vol. C-20, No. 7, July 1971, pp. 727-731.
[F4] Friedman, T. D., "ALERT - A Program to Compile Logic Designs of New Computers," IEEE Digest of the 1st Annual IEEE Computer Conference, No. 16C51, Aug. 1967, pp. 128-130.
[F5] Friedman, T. D., "Methods Used in an Automatic Logic Design Generator (ALERT)," IEEE Transactions on Computers, vol. C-18, No. 7, July 1969, pp. 593-613.
[G1] Gerace, G. B., "Microprogram Control for Computing Systems," IEEE Transactions on Electronic Computers, vol. EC-12, No. 5, December 1963, pp. 733-747.

[G2] Gorman, D. F., and J. P. Anderson, "A Logic Design Translator," AFIPS Conference Proceedings, Fall 1962, pp. 251-261.
[G3] Gorman, D. F., "Systems Level Design Automation: A Progress Report on the Systems Descriptive Language (SDL II)," IEEE Digest of the 1st Annual IEEE Computer Conference, No. 16C51, Aug. 1967, pp. 131-134.
[G4] Grasselli, A., "The Design of Program-Modifiable Micro-programmed Control Units," IRE Trans. on Electronic Computers, vol. EC-11, No. 3, June 1962, pp. 336-339.
[G5] Green, J., "Microprogramming, Emulators, and Programming Languages," Comm. ACM, vol. 9, No. 3, March 1966, pp. 230-232.
[G6] Guffin, R. M., "Microdiagnostics for the Standard Computer MLP-900 Processor," IEEE Trans. on Computers, vol. C-20, No. 7, July 1971, pp. 803-808.
[H1] Hassitt, A., "The Design of a High Level Language Machine," IBM Scientific Center, Palo Alto, California, 1971, p. 32.
[H2] Hassitt, A., J. W. Lageschulte, L. E. Lyon, "Implementation of a High Level Language Machine," IBM Scientific Center, Palo Alto, California, 1971, p. 45.
[H3] Hawryszkiewycz, Igor T., "Microprogrammed Control in Problem-Oriented Languages," IEEE Trans. on Electronic Computers, vol. EC-16, No. 5, October 1967, pp. 652-658.
[H4] Hoovind, Robert C., "The Many Faces of Microprogramming," Computer Decisions, vol. 3, No. 9, Sept. 1971, pp. 6-10.
[H5] Husson, Samir S., Microprogramming: Principles and Practices, Prentice-Hall, 1970, p. 613.

[I1] IBM Field Engineering Theory of Operation, 2025 Processing Unit, Y24-3527-0, Oct. 1968.
[I2] Interdata Model 3 Microinstruction Reference Manual, Publication Number 29-017R01, April 1968.
[I3] Iverson, Kenneth E., A Programming Language, John Wiley and Sons, Inc., 1962, p. 286.
[J1] Johnson, A. M., "The Microdiagnostics for IBM 360 Model 30," IEEE Trans. on Computers, vol. C-20, No. 7, July 1971, pp. 798-803.
[K1] Kleir, R. L., C. V. Ramamoorthy, "Optimization Strategies for Microprograms," IEEE Trans. on Computers, vol. C-20, No. 7, July 1971, pp. 783-794.
[K2] Knuth, Donald E., "An Empirical Study of Fortran Programs," Computer Science Department Report No. CS-186, Stanford Artificial Intelligence Project Memo AIM-137, Computer Science Dept., Stanford University, 1970, p. 42.
[L1] Lawler, E. L., M. D. Bell, "A Method for Solving Discrete Optimization Problems," Operations Research, vol. 14, No. 6, Nov.-Dec. 1966, pp. 1098-1112.
[L2] Lawson, H. W., "Programming-Language-Oriented Instruction Streams," IEEE Trans. on Computers, vol. C-17, 1968, p. 476.
[L3] Lawson, Jr., H. W., B. K. Smith, "Functional Characteristics of a Multilingual Processor," IEEE Trans. on Computers, vol. C-20, No. 7, July 1971, pp. 732-742.
[L4] Levy, Leon S., "State of the Art of Microprogramming," Aerospace Corporation, El Segundo, California, April 1965, Contract No. AF 04(695)-469.
[L5] Lowry, Edward S., C. W. Medlock, "Object Code Optimization," Comm. of ACM, vol. 12, No. 1, January 1969, pp. 13-22.

[M1] McBee, W. C., "The TRW-133 Computer," Datamation, vol. 10, No. 2, February 1964, pp. 27-29.
[M2] McCormick, M. A., T. T. Schansman, K. K. Womack, "1401 Compatibility Feature on the IBM System/360 Model 30," Comm. ACM, vol. 8, No. 12, December 1965, pp. 773-776.
[M3] Metze, G., and S. Seshu, "Proposal for a Computer Compiler," AFIPS Conference Proceedings, Spring 1966.
[P1] Proctor, R., "Logic Design Translator Experiment Demonstrating Relationship of Language to Systems and Logic Design," IEEE Transactions on Computers, vol. EC-13, August 1964, pp. 422-430.
[R1] Ramamoorthy, C. V., M. Tsuchiya, "A Study of User-Microprogrammable Computers," Proc. AFIPS, vol. 35, 1970, pp. 165-181.
[R2] Redfield, S. R., "A Study in Microprogrammed Processors: A Medium Sized Microprogrammed Processor," IEEE Trans. on Computers, vol. C-20, No. 7, July 1971, pp. 743-750.
[R3] Rosin, Robert F., "Contemporary Concepts of Microprogramming and Emulation," Computing Surveys, vol. 1, No. 4, December 1969, pp. 197-212.
[R4] Rosin, Robert F., Gideon Frieder, Richard Eckhouse, Jr., "An Environment for Research in Microprogramming and Emulation," preprints (ACM), 4th Annual Workshop on Microprogramming, University of California, Santa Cruz, California, Sept. 1971.
[R5] Rosin, Robert F., "Notes on Emulation and Microprogramming," Advanced Topics in Systems Programming, Engineering Summer Conference, University of Michigan, 1970, p. 62.
[S1] Sammet, Jean E., Programming Languages: History and Fundamentals, Prentice-Hall, 1969, p. 785.

[S2] Schlaeppi, H. P., "A Formal Language for Describing Machine Logic, Timing, and Sequencing (LOTIS)," IEEE Transactions on Computers, vol. EC-13, August 1964, pp. 439-448.
[S3] Schiller, W. L., R. L. Abraham, R. M. Fox, A. van Dam, "A Microprogrammed Intelligent Graphics Terminal," IEEE Trans. on Computers, vol. C-20, No. 7, July 1971, pp. 775-782.
[S4] Schoen, T. A., M. R. Belsole, Jr., "A Burroughs 220 Emulation for the IBM 360/25," IEEE Trans. on Computers, vol. C-20, No. 7, July 1971, pp. 795-798.
[S5] Semarne, H. M., and R. E. Porter, "A Stored Logic Computer," Datamation, vol. 7, No. 5, May 1961, pp. 33-36.
[S6] Schorr, H., "Computer Aided Digital System Design and Analysis Using a Register Transfer Language," IEEE Transactions on Computers, vol. EC-13, Dec. 1964, pp. 730-737.
[S7] Shriver, D. B., Sr., "Microprogramming and Numerical Analysis," IEEE Trans. on Computers, vol. C-20, No. 7, July 1971, pp. 808-811.
[S8] Slobodynonyck, T. F., "A Method of Minimizing Microprograms," AN UkrSSR, Kiberneticheskaya Tekhnika (Academy of Sciences of the Ukrainian SSR, Cybernetics Technology), Kiev, 1965, pp. 72-80. Available as AD 662-821, Foreign Technology Division, Wright-Patterson AFB, Ohio.
[S9] Steel, T. B., "A First Version of UNCOL," Proceedings of the Western Joint Computer Conference, 1961, pp. 371-378.
[S10] Strong, J., et al., "The Problem of Programming Communications with Changing Machines, A Proposed Solution, Part I," Comm. of ACM, vol. 1, No. 8, August 1958, pp. 8-12.

[S11] Strong, J., et al., "The Problem of Programming Communications with Changing Machines, A Proposed Solution, Part II," Comm. of ACM, vol. 1, No. 9, Sept. 1958, pp. 9-15.
[T1] Tucker, S. G., "Emulation of Large Systems," Comm. ACM, vol. 8, No. 12, December 1965, pp. 753-761.
[T2] Tucker, S. G., "Micro-program Control for System/360," IBM Systems Journal, vol. 6, 1967, p. 222.
[T3] Tucker, Allen B., Michael J. Flynn, "Dynamic Microprogramming: Processor Organization and Programming," Comm. of ACM, vol. 14, No. 4, 1971, pp. 240-250.
[W1] Weber, H. A., "Microprogrammed Implementation of EULER on IBM System/360 Model 30," Comm. of ACM, vol. 10, Sept. 1967, pp. 549-558.
[W2] Wilkes, M. V., "Microprogramming," Proc. AFIPS, vol. 12, EJCC, 1958, pp. 18-19.
[W3] Wilkes, M. V., "The Growth of Interest in Microprogramming: A Literature Survey," Computing Surveys, vol. 1, No. 3, September 1969, pp. 139-145.
