THE UNIVERSITY OF MICHIGAN
COMPUTING RESEARCH LABORATORY

MINIMIZATION OF INTERPROCESSOR COMMUNICATION FOR PARALLEL COMPUTATIONS ON AN SIMD MULTICOMPUTER

William Sai-fong Wu

CRL-TR-24-84

MARCH 1984

Room 1079, East Engineering Building, Ann Arbor, Michigan 48109 USA
Tel: (313) 763-8000

MINIMIZATION OF INTERPROCESSOR COMMUNICATION FOR PARALLEL COMPUTATIONS ON AN SIMD MULTICOMPUTER

by
William Sai-fong Wu

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Electrical Engineering) in The University of Michigan, 1984

Doctoral Committee:
    Professor Keki B. Irani, Chairman
    Professor Donald A. Calahan
    Assistant Professor Janice M. Jenkins
    Professor Norman R. Scott
    Professor Alfred C-T Wu

ABSTRACT

MINIMIZATION OF INTERPROCESSOR COMMUNICATION FOR PARALLEL COMPUTATIONS ON AN SIMD MULTICOMPUTER

by William Sai-fong Wu

Chairman: Keki B. Irani

Interprocessor communication is an important aspect of parallel processing. Studies have shown that data communication can be a major cause of performance degradation of a parallel algorithm. This thesis is concerned with minimization of the delay due to interprocessor data communication during the execution of a parallel algorithm on an SIMD multicomputer interconnected with any multistage interconnection network from a class of functionally equivalent networks. The interprocessor communication cost minimization problem and its solutions for the omega network, a member of the equivalent class, are presented. For the class of important permutations with which this thesis is concerned, a representation scheme is developed. This class of permutations includes the class of bit-permute-complement permutations often studied in the literature. Based on the representation scheme and the omega network, algorithms are developed that determine, for a given parallel procedure, the mapping and remapping of data into physical memories so that the communication cost is minimized. Algorithms are also developed that determine, for every arithmetic expression of a given procedure, the alignment of operands for every binary operation so that the communication cost is minimized. As examples, the communication costs of a lower triangular matrix inversion program and a fast Fourier transform program are minimized using both approaches. For each of the approaches, a saving of one-third of the original communication cost is achieved for the matrix inversion program while a saving of three-fifths

is achieved for the fast Fourier transform program. The algorithms developed are proved to be applicable for any SIMD multicomputer interconnected with any network from the equivalent class.

TABLE OF CONTENTS

DEDICATION .......................................................... ii
ACKNOWLEDGEMENTS ................................................... iii
LIST OF FIGURES ..................................................... vi
LIST OF TABLES ..................................................... vii

CHAPTER

1. INTRODUCTION ...................................................... 1
   1.1 Motivation and Objectives ..................................... 2
   1.2 The Communication Cost Minimization Problem ................... 6
   1.3 Review of Previous Works ..................................... 14
   1.4 Outline of Thesis ............................................ 17
2. THE OMEGA NETWORK AND THE ALGORITHM ENVIRONMENT ................. 18
   2.1 The Omega Network ............................................ 19
   2.2 A Method for Finding Data Mapping Functions .................. 23
       2.2.1 Data Mapping for PSP and BRP ........................... 25
       2.2.2 A Data Mapping for BRP and PSP on Omega Network ........ 28
       2.2.3 Applications for Other Networks ........................ 35
   2.3 A Larger Class of Permutations ............................... 36
3. NON-STATIC DATA MAPPING ......................................... 42
   3.1 Logical Transfers and Data Mapping Functions ................. 45
   3.2 Characterization of Optimum Sequences ........................ 49
   3.3 Non-Static Data Mapping Algorithms ........................... 56
   3.4 Examples ...................................................... 6
       3.4.1 Lower Triangular Matrix Inversion ...................... 65
       3.4.2 Fast Fourier Transform ................................. 76
4. DATA ALIGNMENT .................................................. 84
   4.1 Logical Transfers and Alignment Functions .................... 85
   4.2 Characterization of Optimum Sequences ........................ 91
   4.3 Data Alignment Algorithms ................................... 106
   4.4 Examples .................................................... 116
5. APPLICATIONS FOR OTHER NETWORKS ................................ 117
   5.1 Data Mapping Algorithms ..................................... 120
   5.2 Data Alignment Algorithms ................................... 124
6. SUMMARY AND CONCLUSIONS ........................................ 129
APPENDIX .......................................................... 132
BIBLIOGRAPHY ...................................................... 136

LIST OF FIGURES

FIGURE
1.1  Model of an SIMD multicomputer ................................... 3
1.2  Logical and physical data transfers ............................. 10
1.3  An expression tree for B(i+1) - C(i-3)*D(i+1) ................... 11
2.1  A 2^3 X 2^3 omega network ....................................... 19
2.2  A single stage for a 2^3 X 2^3 omega network .................... 20
3.1  Example for recursive transformation ............................ 43
3.2  Expression trees of a parallel program .......................... 63
3.3  Relationship between V, A, D and L .............................. 66
3.4  Relationship between V and its submatrices ...................... 66
3.5  Relationship between A^-1 and C ................................. 67
3.6  Forming the off-diagonal terms of the A^-1's .................... 68
3.7  Forming all the product terms of A^-1 D5 and A^-1 D0 ............ 69
3.8  Adding product terms to form elements of A^-1 D0 and A^-1 D5 .... 69
3.9  Forming all the product terms of A^-1 D5 A^-1 and A^-1 D0 A^-1 .. 70
3.10 Adding product terms to form elements of A^-1 D0 A^-1
     and A^-1 D5 A^-1 ................................................ 70
3.11 Forming the product terms of V ................................... 71
3.12 Forming intermediate sums of product terms ....................... 72
3.13 Forming the elements of V2^-1 V3 ................................. 72
3.14 Forming all the product terms of V2^-1 V3 V1^-1 .................. 73
3.15 Forming intermediate sums of product terms ....................... 74
3.16 Forming the elements of H = V2^-1 V3 V1^-1 ....................... 75
3.17 Signal flow graph of a 16-point FFT .............................. 77
3.18 Expression trees for computing A1 (stage 1) ...................... 80
3.19 Expression trees for computing A2 and A3 (stages 2 & 3) .......... 81
3.20 Expression trees for computing A4 (stage 4) ...................... 82
4.1  Logical transfers and alignment function ......................... 86
4.2  Logical transfers and alignment functions for temporary variables  89
4.3  Subtree of a full expression tree ................................ 90
4.4  Relationship between k-tree, j-stree and kj-stree ................ 93
4.5  A full expression tree .......................................... 110
4.6  Expression trees of a parallel program for data alignment ....... 115
5.1  A 2^3 X 2^3 indirect binary n-cube .............................. 118

LIST OF TABLES

TABLE
I    Example Permutations in BPC ...................................... 40
II   Logical Transfers and Data Mapping Functions for 16-Point FFT .... 83
III  Experiments on the Heuristic Algorithm .......................... 114

CHAPTER 1

INTRODUCTION

Continuing advances in VLSI technology have resulted in low cost computing capabilities that make multicomputer systems more attractive as a practical way to increase effective computing power without an inordinate increase in cost. Multicomputer systems consisting of more than 500 processors have been made feasible and proposed [LuBa80, Schw80]. However, unless parallel algorithms that use these multicomputer systems effectively are found, the systems will not be cost effective [Ager77]. Until a few years ago, computer scientists were primarily concerned with the computing aspects of parallel processing. Communication aspects of the processing usually received little attention. The communication aspects are becoming at least as important as the computing aspects, if not more so, as more and more proposed or built interconnection networks for the processing elements are not capable of realizing at all, or realizing efficiently, some of the communication patterns. Gentleman [Gent78] shows that communication time can be greater than the computation time that is based on data dependencies alone. He shows that, considering data dependencies alone, the execution time for computing the matrix product of two N X N matrices is O(log N), but the communication delays limit the total execution time to O(N) on a mesh connected multicomputer. Possible tradeoffs between computation and communication are demonstrated through parallel algorithms developed for Boolean matrix multiplication [AgLi78]. The parallel algorithms developed are found to execute in O(N^2/log N) and O(log N) time using N and N^3/(log N log log N) processors, respectively. The tradeoffs show that computation

and communication should be taken into consideration in mapping algorithms onto multicomputer systems. Indeed, by properly arranging the data onto a circularly connected network of N processors, [IrCh80] shows that the complexity of the Jacobi algorithm for an N X N matrix can be reduced from O(N^2) to O(N).

This thesis investigates the problem of mapping parallel algorithms efficiently onto a multicomputer system interconnected with any multistage interconnection network from a class of functionally equivalent networks. The structure of the multicomputer system on which this thesis is based is an SIMD (Single Instruction Stream, Multiple Data Stream [Flyn72]) computer, as shown in figure 1.1. It consists of a central processor with its own private memory, N = 2^n processing elements each with its own private memory, and an interconnection network that provides data paths among the processing elements (PEs). Data communication between PEs is performed by executing routing instructions. Given a data communication pattern, depending on the interconnection network, one or more routing instructions for the network may be needed to realize the communication pattern. The total execution time of a parallel algorithm thus consists of the computation time and the communication time. This thesis is concerned with efficient execution of an arbitrarily given parallel algorithm on a multicomputer system through minimization of the communication delays of the algorithm.

1.1. Motivation and Objectives

The execution of a parallel algorithm can be considered to proceed in a sequence of alternating computation and communication stages. During a computation stage, active processors perform transformations on available data, generating new results. During a communication stage, one or more data are transferred between processors in preparation for the next computation stage.
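The SIMD model just described can be made concrete with a small sketch: N processing elements, each holding one datum in its private memory, and a routing step that applies an admissible permutation to the PE indices. All names, the machine size, and the choice of permutation below are illustrative, not from the thesis.

```python
# A minimal sketch of the SIMD model described above: N = 2**n processing
# elements, each with a private memory, plus an interconnection network
# modeled as a set of admissible permutations of PE indices.

N = 8  # N = 2**n processing elements

def route(memories, perm):
    """One routing step: PE i sends its datum to PE perm(i)."""
    out = [None] * N
    for i, value in enumerate(memories):
        out[perm(i)] = value
    return out

# Example: a uniform shift by +2, one of the communication patterns
# used throughout the chapter.
data = list(range(N))                        # PE i initially holds i
shifted = route(data, lambda i: (i + 2) % N)
# PE (i+2) mod N now holds the datum that started at PE i.
```

Realizing one such shift may take one or several routing instructions depending on the network, which is exactly the cost this chapter sets out to measure.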

Figure 1.1 Model of an SIMD multicomputer.

As Chen [Chen81] has pointed out, while the processing speed of a computation stage can readily be determined from the capability of the processing elements, the processing speed of a communication stage depends on the organization of the entire system, and thus is much more complicated. The communication process is dependent on many factors, including the interconnection network, the storage scheme for the data and the data transfer functions performed. For instance, consider a parallel assignment statement (PAS),

    A(i) <- D(i+2) + A(i)*(B(i) - C(i)),  0 <= i <= N-1.

It consists of the following cycles:

1. Computation stage: T(i) <- A(i)*(B(i) - C(i))  (T is a temporary vector)
   Communication stage: align T(i) and D(i+2)
2. Computation stage: A(i) <- D(i+2) + T(i)

Clearly, alignment of vectors D and T can be achieved in different ways, each causing a different communication delay (number of routing steps). For example, we can move D to T or move T to D. The way the vectors are stored also affects the data transfer function. Communication delay of a data transfer function may again vary, depending on the type of network used. Thus, for a given algorithm and a given machine, a careful study of the relationship among the computational structure, the communication requirements and the machine structure is necessary for minimizing the total communication cost [Ager77]. At the hardware level, a cross-point array interconnection network is obviously the best choice for minimizing the communication costs. However, as N increases, its cost-effectiveness decreases rapidly and it becomes impractical [Kuck78]. It is also possible to design

an interconnection network that matches the communication requirement of a given algorithm. However, it is often the case that it is too costly or its usefulness is limited to only specific problems. The requirements of cost-effectiveness along with high performance have led to proposals of many blocking multistage interconnection networks [Park80, WuFe78, WuFe79, Prad79]. Such a network of size N (N inputs X N outputs) consists of log N stages, each comprising N/2 elementary 2-input X 2-output, two-state switches. Each stage is preceded or followed by a fixed wiring pattern that connects it to the adjacent stage or the outside. The number of admissible permutations on such a network is N^(N/2), which is a small fraction of the N! possible permutations for N > 8. However, researchers [Lawr75, Peas77] have demonstrated that many important permutations of parallel algorithms can be realized in one pass through some of these networks. In this thesis, we will be concerned with a class of networks that are functionally equivalent to the omega network. We will denote this class of networks by Γ and will define it in chapter 5.

Once a good interconnection network is chosen, other means must be sought to improve the communication delays of parallel algorithms. Now, the need for data communication arises when the operands of some binary operations are not aligned. It is thus clear that communication cost is affected by the manner in which data are stored in the memories. This has led to the development of software techniques [IrCh82, KuSt77, Kuhn79, Kuhn80] to properly distribute the data over memories to reduce communication costs. Kung & Stevenson [KuSt77], Kuhn [Kuhn80], Bokhari [Bokh81] and Irani & Chen [IrCh82] have demonstrated that to improve the performance of parallel algorithms, storing the data properly is sometimes a more flexible tool for providing better solutions than hardware, whose capability is often limited by cost and complexity.
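Returning to the parallel assignment statement A(i) <- D(i+2) + A(i)*(B(i) - C(i)) of section 1.1, the two computation stages and the intervening alignment step can be simulated directly. The vector values, the modular index arithmetic, and the circular-network cost remark below are illustrative assumptions, not the thesis's machine.

```python
# A sketch of the two-stage execution of the parallel assignment statement
# A(i) <- D(i+2) + A(i)*(B(i) - C(i)), 0 <= i <= N-1, with indices taken
# mod N purely for illustration.

N = 8
A = [float(i) for i in range(N)]
B = [2.0] * N
C = [1.0] * N
D = [float(i * i) for i in range(N)]

# Computation stage 1: T(i) <- A(i)*(B(i) - C(i))   (T is a temporary vector)
T = [A[i] * (B[i] - C[i]) for i in range(N)]

# Communication stage: align T and D.  One choice is to move D by a
# uniform shift of -2, so that D(i+2) arrives at logical memory i.
D_aligned = [D[(i + 2) % N] for i in range(N)]

# Computation stage 2: A(i) <- D(i+2) + T(i)
A = [D_aligned[i] + T[i] for i in range(N)]

# On a circularly connected network with unit shifts of +-1, moving D
# costs two routing steps (shift -1 twice); moving T instead (shift +2)
# would also cost two.  On richer networks the two choices can differ.
```

The point of the example is that the numerical result is the same whichever operand moves, while the number of routing steps need not be.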
However, most of their works are limited to multicomputer systems with single stage interconnection networks of limited capabilities, for example, the mesh connected [KuSt77], the perfect shuffle connected [Kuhn80] and the circularly connected

multicomputer systems [IrCh82].

The global objective of this thesis is thus to develop methodologies for minimizing interprocessor communication delays in parallel computations on an SIMD multicomputer interconnected with a blocking multistage interconnection network in Γ. It embodies the following general objectives:

1. Characterization of a network in Γ.
2. Characterization of the class of communication patterns with which we will be concerned.
3. Development of methodologies that will analyze an arbitrarily given algorithm and obtain an optimal or near optimal solution to the interprocessor communication minimization problem for an SIMD multicomputer interconnected with the network.
4. Demonstration of how the methods developed can be used for any multistage interconnection network in Γ.

The communication problem with which this thesis is concerned was first formulated in [IrCh82] and is defined formally in the following section.

1.2. The Communication Cost Minimization Problem

The communication cost of a parallel algorithm is dependent on the algorithm and the multicomputer system onto which it is to be mapped. Characterization of the communication cost minimization problem hence involves specifications of parallel algorithms, interconnection networks, and the relationship between the two. The notion of data transfer at the logical and the physical level is illustrated.

Specification of Parallel Algorithms

Let m denote the logical memory and assume that the parallel algorithm uses 2^n = N logical memories, m0, m1, m2, ..., m(N-1). Also assume that the vectors in the algorithm are of size N and their elements are indexed from 0 to N-1, and that the ith element of a vector, for example B, i.e., B(i), is stored in logical memory mi. The communication requirement of a parallel algorithm is specified by a sequence of k logical data transfers, Pj, 0 <= j <= k-1. Let {0,1,...,N-1}m denote the index set of the logical memory; then a logical transfer (or data transfer at the algorithm level), Pj, for a particular data is a partial function mapping the index set {0,1,...,N-1}m into itself. That is, Pj sends data from memory mi to m(Pj(i)), for all i to which Pj assigns values.

For instance, consider the expression u(i-1) X v(i+1). Since u(i-1) and v(i+1) are not stored in the same logical memory, logical transfers are needed to align the two vector operands before the multiplication can be executed. We can move either u to v or v to u. Suppose v is to be moved to u for alignment. That is, the transfer operation is applied to variable v only, so that after the operation, the desired computation can be carried out at u. The logical transfer Pj for vector v may then be written as a permutation,

    Pj = ( 0    1    2   ...  N-1 )
         ( N-2  N-1  0   ...  N-3 )

This transfer Pj is, at the logical level, a uniform shift of distance -2, i.e., Pj(i) = (i-2) mod N. However, Pj is not unique. A logical transfer for vector u with shift distance +2, or a logical transfer for v with shift distance -1 together with one for u with shift distance +1, can also be used to align the operands.

The notion of where the operands shall be aligned can be formalized as follows. Consider the expression u(P1(i)) X v(P2(i)); instead of aligning u with v or v with u, we can align them according to a bijection mapping {0,1,...,N-1}m onto {0,1,...,N-1}m. By this, we

mean that if w is the result vector of the binary operation on u and v, and Aw is the bijection, then the binary operation for u and v is performed in processor Aw(i), and the result is stored in memory m(Aw(i)), 0 <= i <= N-1. The logical transfers for vectors u and v, in this case, are Puw = Aw Pu and Pvw = Aw Pv, respectively. For instance, consider the binary operation u(i-1) X v(i+1) again. If Aw = identity function, then Puw(i) = (i+1) mod N = Pu and Pvw(i) = (i-1) mod N = Pv. If Aw(i) = (h+i) mod N, then Puw(i) = (i+1+h) mod N and Pvw(i) = (i-1+h) mod N. Henceforth, such bijections Aw will be called alignment functions. If a temporary vector variable wi is to be aligned to wj, the logical transfer is given by Awj Awi^-1.

Specification of Interconnection Networks

Let M denote the physical memory with N memory modules, M0, M1, ..., M(N-1), which are connected by an interconnection network. The network is defined by a set of admissible network functions, Q = {Q0, Q1, ...}, each element of the set being a bijection on the index set {0,1,2,...,N-1}M. When a routing instruction associated with an admissible network function Qj is executed, data is sent from data memory modules Mi to M(Qj(i)) for all the "active" memories Mi through direct data paths of the network. In other words, each admissible network function designates a transfer that can be physically realized in one routing step (one pass through the network). Let P be an arbitrary partial function on the index set {0,1,2,...,N-1}M that specifies a transfer of some data from Mi to M(P(i)). Zero or more routing instructions will be needed to realize P, i.e., zero or more passes through the network before the data can reach the destination memories. Hence, we can associate with each interconnection network a distance function D on a set of partial functions such that

    D(P) = minimum number of routing steps to realize P, if P can be realized;
           infinity, if P cannot be realized.
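For very small networks, the distance function D just defined can be evaluated directly by breadth-first search over compositions of the admissible network functions. The sketch below is an illustrative model of the definition, not an algorithm from the thesis; permutations are tuples p with p[i] the destination of index i, and the single-shift network Q is a hypothetical example.

```python
# D(P): the minimum number of compositions of admissible network functions
# (members of Q) that realize P, or infinity if P is unreachable.

from collections import deque

def distance(P, Q, N):
    identity = tuple(range(N))
    if P == identity:
        return 0
    seen = {identity}
    frontier = deque([(identity, 0)])
    while frontier:
        current, steps = frontier.popleft()
        for q in Q:
            # compose: first apply current, then q
            nxt = tuple(q[current[i]] for i in range(N))
            if nxt == P:
                return steps + 1
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, steps + 1))
    return float("inf")  # P cannot be realized

N = 4
shift1 = tuple((i + 1) % N for i in range(N))  # circular shift of +1
Q = [shift1]
# A shift of +3 needs three applications of shift +1; a transposition of
# two elements is never a power of a circular shift, so its distance is
# infinite on this network.
```

This brute-force search is exponential in general; the thesis instead exploits the structure of the omega-equivalent networks to compute such costs analytically.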

Clearly, if P is an identity function, no transfer is necessary; therefore, D(P) = 0. If P is an element of the set of admissible network functions Q of a network, then D(P) = 1 with respect to the network. If D(P) = 1 for a network, then we define P as an admissible permutation of the network.

Mapping Data onto Memories

Consider a parallel algorithm using N logical memories that is to be executed on a machine with N PEs and an interconnection network with a distance function D. To execute the algorithm, we map the "logical" memories, m, onto the "physical" memories, M. A data mapping function for a data variable v is a bijection Fv mapping {0,1,...,N-1}m onto {0,1,...,N-1}M, i.e., the ith component of v in mi is stored in memory module M(Fv(i)), 0 <= i <= N-1. Consider two vectors u and v, and let Fu and Fv be their corresponding data mapping functions. Suppose u and v have to be aligned for a certain operation, for example, u(Pu(i)) X v(Pv(i)). One way to align the two vector operands is to move u to v so that after the operation the desired computation can be carried out at v. Let Puv be the function describing the logical transfer (permutation) of moving u to v. At the logical level, the element of u in mi is moved to m(Puv(i)). At the physical level, however, the data movement is from M(Fu(i)) to M(Fv(Puv(i))). The actual permutation to be realized by the network is then given by Fv Puv Fu^-1, and the communication cost is thus given by D(Fv Puv Fu^-1). Figure 1.2 illustrates the notions of logical transfer, physical transfer, and data mapping function. Identity functions are usually the first choice for data mapping functions. However, if the data mapping functions Fu and Fv are both identity functions, the permutation to be realized by the network becomes Puv, the transfer at the algorithm or logical level, which may not be realizable in one routing step. Hence, alternate data mapping functions for u or v or both may have to be used to make Fv Puv Fu^-1 admissible.
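The composition Fv Puv Fu^-1 that the network must realize can be sketched with explicit permutation arithmetic. The concrete mapping and transfer functions below are hypothetical choices for illustration only.

```python
# Logical vs. physical transfer: with data mapping functions Fu and Fv and
# a logical transfer Puv moving u to v, the network must realize the
# permutation Fv . Puv . Fu^-1.  Permutations are tuples (p[i] = image of i).

N = 8

def compose(f, g):
    """Return f after g: i -> f(g(i))."""
    return tuple(f[g[i]] for i in range(N))

def inverse(f):
    inv = [0] * N
    for i, j in enumerate(f):
        inv[j] = i
    return tuple(inv)

Fu = tuple(range(N))                         # identity mapping for u
Fv = tuple((i + 1) % N for i in range(N))    # v stored with a skew of +1
Puv = tuple((i - 2) % N for i in range(N))   # logical uniform shift of -2

physical = compose(Fv, compose(Puv, inverse(Fu)))
# With Fu the identity, the network must realize Fv . Puv, which here
# collapses to the uniform shift of -1.
```

Changing Fu or Fv changes the physical permutation without touching the algorithm, which is precisely the degree of freedom the data mapping problem exploits.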

Figure 1.2 Logical and physical data transfers.

Having specified the algorithm, the network and the relationship between data transfer and data mapping functions, we shall now formulate the interprocessor communication minimization problem.

The Minimization Problem

For any binary operation, the way operands are aligned determines the logical transfer functions for the two operands. Instead of aligning variable u with v or v with u for a binary operation involving u and v (as was the case in the previous section), we can perform the binary operation in processor Aw(i) and store the result in memory m(Aw(i)), 0 <= i <= N-1. There is no loss of generality if the resulting vector is called w(Aw(i)). In other words, the data mapping function Fw is an identity function. The total communication cost for the binary operation is then D(Puw) + D(Pvw), where the logical transfers Puw and Pvw are Aw Pu and Aw Pv, respectively.
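The choice of alignment function can be illustrated on the running example u(i-1) X v(i+1): with Aw(i) = (i+h) mod N, the two logical transfers become uniform shifts of (1+h) and (h-1), and one can search over h for the cheapest total cost. The circular-shift cost min(|d|, N-|d|) used below is an illustrative stand-in for D, not the thesis's network model.

```python
# Searching for an alignment function Aw(i) = (i + h) mod N that minimizes
# D(Puw) + D(Pvw) for the operation u(i-1) X v(i+1), under a circular
# network whose shift of distance d costs min(|d|, N - |d|) routing steps.

N = 8

def shift_cost(d):
    d %= N
    return min(d, N - d)

def total_cost(h):
    # Puw is a uniform shift of (1 + h); Pvw is a uniform shift of (h - 1).
    return shift_cost(1 + h) + shift_cost(h - 1)

best_h = min(range(N), key=total_cost)
# Any h in {-1, 0, 1} (mod N) achieves the minimum total cost of 2 here:
# the two operands must close a relative distance of 2 in any case.
```

Under this cost model the identity alignment (h = 0) is already optimal, but on networks where shift costs are not symmetric the optimal h can differ.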

The above result can be generalized to a parallel expression, S, with variables v1, v2, ..., vk. Let E be an expression tree for S. For example, for B(i+1) - C(i-3)*D(i+1), E is as shown in figure 1.3.

Figure 1.3 An expression tree for B(i+1) - C(i-3)*D(i+1), where w1 and w2 are the internal nodes of the tree, representing the partial results.

Now, let w0, w1, ..., wt be the internal nodes of E, where each wi represents the partial result of some binary operation on variables and/or other internal nodes. For a given expression tree E, which specifies the order of computation of S, the alignment of operands for every operation, from which the logical transfer functions can be specified, must be determined. Thus, for a statement S, the communication cost is

    sum over vi: variable (leaf node),      D(P(vi,wj))
             wj = parent(vi)
    +
    sum over wi, wj: internal nodes,        D(P(wi,wj))
             wj = parent(wi)

where parent(z) denotes the parent node of z in E.
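The per-statement cost just stated can be traced on the tree of figure 1.3. The cost model and the convention that every operation is computed at position i (identity alignment) are illustrative assumptions, not the thesis's choices.

```python
# Tracing the communication cost of B(i+1) - C(i-3)*D(i+1) on its
# expression tree: every edge to a parent contributes D(P) for the
# transfer aligning that operand.  The circular-shift cost model
# min(|d|, N - |d|) stands in for D.

N = 16

def shift_cost(d):
    d %= N
    return min(d, N - d)

# Leaves carry index offsets: B(i+1), C(i-3), D(i+1).
# w1 = C(i-3) * D(i+1), computed at position i (identity alignment),
#      so each operand shifts by the negative of its offset.
# w2 = B(i+1) - w1, with w1 already at position i.
cost_w1 = shift_cost(-(-3)) + shift_cost(-(+1))  # align C (+3) and D (-1)
cost_w2 = shift_cost(-(+1)) + shift_cost(0)      # align B (-1); w1 in place
statement_cost = cost_w1 + cost_w2
```

Choosing non-identity alignments for w1 and w2 can redistribute these shifts, which is exactly the freedom the data alignment problem of chapter 4 optimizes over.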

The above indicates that the total communication cost depends on the data mapping functions Fvi, and the transfer functions P(vi,wj) and P(wi,wj). The transfer functions in turn depend on the expression tree E, as well as the alignment of operands. Moreover, if the operations are commutative, the tree may not be unique. The complete communication cost minimization problem for a parallel algorithm can hence be stated as follows:

Given a parallel algorithm involving variables with logical index set {0,1,...,N-1}m, an interconnection network for physical memories with index set {0,1,...,N-1}M and the distance function D, determine the following:

1. an expression tree for every parallel assignment statement in the given algorithm;
2. alignments of operands for binary operations in every expression tree in 1; and
3. a data mapping function Fv for every variable v involved in the parallel statements in 1,

such that the value of the expression

    sum over statements of    [ sum over vi: variable (leaf node),   D(P(vi,wj))
    the given algorithm                  wj = parent(vi)
                                +
                                sum over wi, wj: internal nodes,     D(P(wi,wj)) ]    ..... (1.1)
                                         wj = parent(wi)

is minimized.

A data mapping is defined to be static if the data mapping function Fv for each variable v in the algorithm does not change throughout the algorithm. [Kuhn80] and [IrCh82] have shown that by determining when and how to perform a remapping during the execution of a parallel algorithm, the new cost, i.e., the communication cost plus the cost of the remapping itself, can be less than the cost of using only a static data mapping. For nonstatic mapping, we will determine a data mapping function for each logical transfer of every vector variable in the algorithm. That is, given a sequence of k logical transfers for v, we will determine a sequence of data mapping functions, Sv = Fv,1, Fv,2, ..., Fv,k, where Fv,i is the data mapping function for the ith logical transfer of vector v. The remapping cost for v is then given by

    sum_{i=1}^{k-1} D(Fv,i+1 Fv,i^-1).

For nonstatic mapping, the sum of the remapping costs for all the vectors is to be included in the above communication cost formulation (equation 1.1). Only step 3 of the above cost minimization problem has to be changed. That is, instead of determining a data mapping function Fv for every variable v, we determine a sequence of data mapping functions Fv,1, Fv,2, ..., Fv,k for v.

Clearly, the total communication cost is dependent on the alignment functions, the mapping and remapping functions, and the logical transfer functions. Furthermore, since the expression tree for a statement may not be unique, the choice of expression trees may affect the communication cost. The general communication cost minimization problem is complex. In fact, even the static data mapping problem, where alignment functions and expression trees are fixed, can be shown to be computationally equivalent to the bandwidth reduction problem [KuSt77, Bokh81], which is NP-complete. Hence, in this thesis, we shall restrict ourselves to a specific algorithm environment. We shall assume that the expression trees are fixed.
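The remapping cost has a direct sketch when each mapping function is a uniform shift, since then each consecutive pair composes to another shift. The shift sequence and the circular cost model below are hypothetical choices for illustration.

```python
# Remapping cost for nonstatic data mapping: given a sequence of data
# mapping functions F_1, ..., F_k for one vector, the remapping cost is
# the sum over i of D(F_{i+1} . F_i^-1).  Uniform shifts on a circular
# network stand in for both the mappings and the distance function D.

N = 8

def shift_cost(d):
    d %= N
    return min(d, N - d)

# Represent each mapping F_i as a uniform shift by s_i, so that
# F_{i+1} . F_i^-1 is a uniform shift by s_{i+1} - s_i.
shifts = [0, 2, 3, 7]  # hypothetical mapping sequence for one vector

remap_cost = sum(shift_cost(b - a) for a, b in zip(shifts, shifts[1:]))
# Shifts 0 -> 2 -> 3 -> 7 cost 2 + 1 + 4 = 7 remapping steps here.
```

A nonstatic mapping pays such remapping steps in exchange for making the logical transfers between remappings cheaper, and is worthwhile exactly when the exchange lowers the total of equation (1.1) plus remapping cost.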
This is often desirable, for example, in problems where computation reordering may cause loss of significant digits, overflow or underflow, etc. We shall also assume that the amount of computation in a parallel program is fixed. For example, if there is a loop in the program, the number of times the loop will be repeated is fixed and known at compile time. Assuming all alignment functions are identity functions, we will develop algorithms for optimum or near-optimum nonstatic data mapping functions. Assuming all data mapping functions are known or fixed, we will also develop algorithms for finding optimum or near-optimum alignment functions.

1.3. Review of Previous Works

Data mapping is concerned with the manner in which data are stored. Its objective is to reduce communication costs between processors, or between processors and memories, during the execution of parallel algorithms. Even though a restricted version of the data mapping problem, called the mapping problem, was first formulated by Kung and Stevenson [KuSt77] and the general communication cost minimization problem was then formulated by Irani and Chen [IrCh82], the notion of data mapping functions for special cases can actually be traced back to earlier literature. For example, in [Lawr75], [Budn71] and [Swan74], skewed array storage schemes for conflict-free access to rows, columns, forward and backward diagonals and square blocks are described. In [Peas77], data mapping functions for performing FFT and matrix multiplication efficiently on an indirect binary n-cube microprocessor array are described. However, no general technique had been reported.

In formulating the mapping problem, Kung and Stevenson [KuSt77] model the communication requirement of a parallel program as a sequence of k logical transfers, Pj, 0 <= j <= k-1. The minimization problem is then to find a single mapping function, F, for all the vector variables in the parallel program such that

    sum_{j=0}^{k-1} D(F Pj F^-1)

is minimized. For P0 = P1 = ... = Pk-1 = P, they have shown that this type of algorithm (type I) can always be executed optimally, even on the simplest network, namely, the circularly connected network. A type I algorithm can be executed in at most 4k routing steps on a circularly connected network. Matrix transposition and polynomial evaluation as described in [Ston71] are examples of such algorithms. It is also pointed out in [KuSt77] that the mapping problem for this type of algorithm corresponds to a graph numbering problem. For algorithms using different transfers (type II), Kung and Stevenson [KuSt77] pointed out that the minimization problems become the well known bandwidth or profile reduction problems encountered in sparse matrix computations. Since no exact solution for the above general problems can be easily obtained, in [ThKu77] and [KuSt77] an approximate technique for the bitonic sort problem (a type II algorithm) on a mesh connected network is described. The technique determines an optimum data mapping function for the most expensive transfer. The approach, however, is not always desirable: the optimum data mapping function for a particular transfer may not be optimum for the entire algorithm, and may make the total cost higher, as that data mapping function may be costly for other transfers. Nevertheless, the routing time is reduced from O(sqrt(n) log n) to O(sqrt(n)).

In [Bokh81] the mapping problem is shown to be computationally equivalent to the graph isomorphism problem, and similar to the quadratic assignment problem. Since these are usually solved approximately by heuristic algorithms, Bokhari [Bokh81] describes a heuristic algorithm for solving the mapping problem that arises when solving structural problems on the finite element machine (FEM), an array of processors under development at NASA Langley Research Center.
In [Kuhn79, Kuhn80] efficient mapping of several numerical algorithms (type II algorithms) onto the shuffle-exchange network is described. Efficient mapping, in his sense, means making

the communication complexity of an algorithm equal to or less than its computational complexity. The notion of nonstatic mapping is introduced to map parallel algorithms that cannot be efficiently mapped by static mappings. A mapping is defined to be static if the mapping of data to memory does not change throughout the algorithm. Hence, instead of minimizing

    Σ_{j=0}^{k-1} D(F P_j F^{-1})

as in [KuSt77], Kuhn [Kuhn80] tries to minimize

    Σ_{j=0}^{k-1} D(F_{j+1} P_j F_j^{-1}).

Non-static mapping is shown to be useful for some numerical algorithms considered by Kuhn. However, no systematic method for finding the mapping functions, F_j, 0 <= j <= k-1, is described.

In [IrCh82], the general problem of communication cost minimization (as described in section 1.2) is formulated. Based on the cycle structures of an interconnection network and two arbitrarily given permutations, a mapping methodology for the permutations is described. A heuristic algorithm to obtain an optimal or near-optimal solution to the problem of mapping a particular class of algorithms onto a circularly connected network is also described in [IrCh82]. The only restriction on the algorithms considered is that they consist of uniform-shift type logical transfers only. Their approach consists of the following steps. First, at the logical level, an optimal sequence L of logical transfers is obtained (by determining alignment functions and a computation ordering for each expression tree), assuming that the identity mapping function is used. Then, based on L, they determine for every vector variable v a data mapping function F_v. Since F_v is optimal only with respect to L and may not be optimal in a global sense, a heuristic that makes appropriate adjustments so that the resulting data mapping functions become optimal or near-optimal is then used. The notion of nonstatic data mapping is also considered and a method for finding an optimal remapping schedule is developed. However, it is optimum with respect to a particular L only.

Previous efforts on interprocessor communication problems have primarily been concerned with solutions to special cases or with fast heuristics. The most general cases studied so far are reported in [IrCh82]. However, that study is mainly for a circularly connected network, which is considerably different from, and less powerful than, the class of multistage interconnection networks to be studied here.

1.4. Outline of Thesis

In chapter 2, a network in Γ, the omega network, on which chapters 3 and 4 are based, is characterized. Based on the characterization, the class of important permutations that this thesis assumes is described. The characterization also leads to a method for solving for a data mapping function for a single vector variable. The method is demonstrated by finding a data mapping function that makes the bit reversal permutation as well as the perfect shuffle permutation admissible on an omega network.

In chapter 3, permutations and data mapping functions are characterized. Identity alignment functions are assumed and algorithms are developed that determine, for a given parallel procedure, the mapping and remapping of data into physical memories so that the total communication cost is minimized.

In chapter 4, permutations and alignment functions are characterized. Identity data mapping functions are assumed and algorithms are developed that determine the alignment of operands for every binary operation of expression trees so that the total communication cost is minimized.

In chapter 5, Γ, the class of multistage interconnection networks that are functionally equivalent to the omega network, is defined. The algorithms developed in chapters 3 and 4 are then proved applicable for any network in Γ.

CHAPTER 2

THE OMEGA NETWORK AND THE ALGORITHM ENVIRONMENT

In this chapter, Ω, the set of admissible network functions defining the omega network in Γ, is described. A representation scheme for a class of communication patterns is developed. Based on the representation scheme, a general method for finding data mapping functions for a class of simple problems is described. The method is demonstrated by finding a data mapping function for the bit reversal permutation (BRP) and the perfect shuffle permutation (PSP) so that both can be realized in one pass through an omega network. The data mapping function is proved to be good for an omega network of any size, thus proving Chen's [Chen82] conjecture that for BRP and PSP, such a data mapping function can always be obtained for an omega network of any size.

The class of permutations to which the method can apply is limited by the representation scheme. It includes only one of the five families of "Frequently Used Bijections" collected in [Lenf78]. However, it will be shown that the representation scheme can be extended to represent a larger class of permutations, which we will designate by A [IrWu82]. This larger class of permutations is assumed throughout this thesis. By this, we mean that all logical transfers for the vector variables of parallel algorithms are assumed to be in A. If a parallel algorithm uses other permutations besides logical transfers in A, we shall consider the minimization problem for the vector variables involved in logical transfers in A only.

2.1. The Omega Network

Since the omega network proposed by Lawrie [Lawr75] has been extensively studied in the past [Sieg78, WuFe79], we will give a brief description of the network and then characterize Ω, the set of admissible network functions defining the omega network. The topology of a 2^n × 2^n omega network consists of N = 2^n input lines, N output lines and n stages. The network input lines and network output lines are both numbered in binary notation from 0 to N-1 as shown in figure 2.1. The stages are also numbered from 0 to n-1, from right to left, as shown in the figure. Each of these stages consists of a perfect shuffle interconnection of N lines and N/2 switches. The switches of each stage are numbered in binary notation from 0 to N/2 - 1 as in figure 2.1.

Figure 2.1 A 2^3 × 2^3 omega network (stages 2, 1 and 0, from left to right).

For our purpose, a switch is not capable of broadcasting. It may either send its inputs straight through or interchange them, i.e., a switch may operate in one of two modes, namely "direct" or "crossed". The input lines of a stage are called switch input lines and the output lines of a stage are called switch output lines (figure 2.2). The indexing of all these lines is carried through the network, assuming all switches operate in the direct mode. That is, if lines (network input lines or switch input lines) i and j are paired together on a switch on the left, the lines (network output lines or switch output lines) incident on the right are also indexed as i and j, with i connected to i and j to j when the switch operates in the direct mode. The switch output line k of stage i is connected to switch input line k of stage i-1, for 1 <= i <= n-1 and 0 <= k <= N-1. The network input lines are the same as the switch input lines at stage n-1. The network output lines are the same as the switch output lines at stage 0.

Figure 2.2 A single stage of a 2^3 × 2^3 omega network.

Let l be the index of a switch input line and let it be identified by an n-dimensional vector (l_{n-1}, l_{n-2}, ..., l_1, l_0), where

    l = 2^{n-1} l_{n-1} + 2^{n-2} l_{n-2} + ... + 2 l_1 + l_0.

At stage i, the switch input lines that differ only in their component l_i are paired together as inputs to a switch, for 0 <= i <= n-1. Two switch input lines (l_{n-1}, ..., l_{i+1}, 1, l_{i-1}, ..., l_0) and (l_{n-1}, ..., l_{i+1}, 0, l_{i-1}, ..., l_0) connected to a switch at stage i are connected to switch output lines (l_{n-1}, ..., l_{i+1}, 0, l_{i-1}, ..., l_0) and (l_{n-1}, ..., l_{i+1}, 1, l_{i-1}, ..., l_0), respectively, if the switch operates in the crossed mode. They are connected to switch output lines (l_{n-1}, ..., l_{i+1}, 1, l_{i-1}, ..., l_0) and (l_{n-1}, ..., l_{i+1}, 0, l_{i-1}, ..., l_0), respectively, if the switch is set to the direct mode. Since there are n stages, each component l_i of a network input line, 0 <= i <= n-1, is affected at some point and can be changed by setting the appropriate switch to the crossed mode. Hence, the mapping of a network input line l1 to a network output line l2 can be regarded as an n-step process in which the ith component of the network input line l1 is changed to the ith component of the network output line l2 at step (or stage) i, for 0 <= i <= n-1.

Let x and y be the indices of a network input and a network output, respectively. A permutation, P, is then a one-to-one function mapping x onto y for all integers in the range 0 <= x <= 2^n - 1 and 0 <= y <= 2^n - 1, i.e., y = P(x). Let x_{n-1} x_{n-2} ... x_0 and y_{n-1} y_{n-2} ... y_0 denote the binary representations of x and y in {0, 1, ..., 2^n - 1}, respectively. Then the function representing the permutation, P, can be described by a set of functions p_i, 0 <= i <= n-1, where

    y_i = p_i(x_{n-1} x_{n-2} ... x_0). .......... (2.1)
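The n-step view of routing can be sketched in a few lines of code. The following is a minimal model of our own (names and conventions are ours, not the thesis's): it uses the indexing convention above, under which a crossed switch at stage i complements exactly bit i of the line index.

```python
def omega_path(x, y, n):
    """Line indices visited when routing network input x to network
    output y through an n-stage omega network.  Stages are visited in
    the order n-1, n-2, ..., 0; the switch at stage i can only change
    bit i of the current line index, so it is set to the crossed mode
    exactly when bit i of the current index differs from bit i of y."""
    path = [x]
    cur = x
    for i in range(n - 1, -1, -1):
        if ((cur ^ y) >> i) & 1:     # bits differ: crossed mode
            cur ^= 1 << i
        path.append(cur)             # index after stage i
    return path

# Input 5 = 101 reaches output 3 = 011 by correcting bit 2, then bit 1.
```

Because each stage fixes one bit, any single source-destination pair can always be connected; conflicts only arise when several pairs must be routed simultaneously.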

If P is a mapping that is linear in the components of the index, then we can write

    y = Px,

where x^T = (x_{n-1}, x_{n-2}, x_{n-3}, ..., x_0), y^T = (y_{n-1}, y_{n-2}, y_{n-3}, ..., y_0), P is an n × n nonsingular binary matrix, and modulo 2 arithmetic is used.

In what follows, a permutation will be denoted by a capital letter. If a permutation that is denoted by a capital letter has a nonsingular binary matrix representation, then the matrix will be denoted by the corresponding capital letter in boldface. Unless otherwise stated, modulo 2 arithmetic will be used for all matrix operations. We will use L to denote an arbitrary lower unit triangular matrix and U to denote an arbitrary upper unit triangular matrix. By a unit triangular matrix, we mean a triangular matrix in which all the coefficients on the main diagonal are ones. The following theorem is due to Parker [Park80] and is based on Pease's earlier work [Peas77].

Theorem 2.1

A permutation P is an element of Ω iff for every i, 0 <= i <= n-1, y_i can be expressed as

    y_i = x_i ⊕ f_i(y_{n-1}, y_{n-2}, ..., y_{i+1}, x_{i-1}, x_{i-2}, ..., x_0),

where f_i is a boolean function and ⊕ stands for the exclusive-or operation (equivalent to modulo 2 arithmetic).

Proof:

Consider a switch node at stage i of the omega network. This node switches a pair of lines whose indices differ only in the ith components of their binary representations. The less

significant components of the index have not been modified by any switch node at stage k, k < i. Hence f_i maps the less significant components of x, i.e., x_{i-1}, ..., x_0, and the more significant components of y, i.e., y_{n-1}, y_{n-2}, ..., y_{i+1}, into {0, 1}. The fact that y_i must be equal to x_i ⊕ f_i, instead of a more general function of x_i and f_i, reflects the fact that if one input line to a switch node is switched, the other is also switched. □

Corollary 2.1

A permutation P that is linear in the components of the indices is in Ω iff it can be expressed in the form Ly = Ux.

Since the inverse of a lower unit triangular matrix is also a lower unit triangular matrix, for the linear case, y can be written as

    y = LUx.

Hence, a linear mapping P with permutation matrix P is admissible through a 2^n × 2^n omega network iff it can be decomposed into an L and a U such that P = LU.

2.2. A Method for Finding Data Mapping Functions

In what follows, we will limit ourselves to data mapping functions and permutations that are linear in the components of the indices. That is, we will consider only permutations and data mapping functions that can be represented in n × n nonsingular binary matrix form. Consider a data mapping function F_v for a vector variable v. The relationship between logical and physical memory is as follows:

    x = F_v x_L,

where x_L is an n-dimensional vector that is the binary representation of the index of the logical memories of v, and x is an n-dimensional vector that is the binary representation of the index of the physical memories where the components of v are respectively stored. A logical transfer P_j for v can be expressed by a matrix P_j such that

    y_L = P_j x_L, .......... (2.2)

where y_L is an n-dimensional vector that is the binary representation of the index of the logical memories of u, where v will be stored after the permutation. Assuming an identity data mapping function for u,

    y = y_L,

where y is an n-dimensional vector that is the binary representation of the indices of the physical memories where the components of u are respectively stored. The physical transfer relating the physical memories where v and u are stored is given by

    y = P_j [F_v]^{-1} x.

If P_j [F_v]^{-1} is LU decomposable, then with data mapping function F_v, the logical transfer P_j is admissible through an omega network. Given k permutations, P_j, 0 <= j < k, for a vector v, we will consider, in the next section, the problem of how to find a data mapping function F_v, if one exists, such that P_j [F_v]^{-1}, 0 <= j < k, are all LU decomposable. The method will be described by finding a data mapping function F_v that makes PSP and BRP both admissible to an omega network.
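Admissibility of a linear transfer thus reduces to testing whether a binary matrix factors as LU over GF(2). A small sketch (the helper below is our own, not from the thesis): such a factorization into unit triangular factors exists iff Gaussian elimination succeeds without any row exchange, i.e., iff every leading principal submatrix is nonsingular modulo 2.

```python
def lu_decomposable_gf2(M):
    """True iff binary matrix M (list of 0/1 rows) factors as M = L*U
    (mod 2) with L lower unit triangular and U upper unit triangular.
    Elimination must find a 1 pivot on the diagonal at every step."""
    A = [row[:] for row in M]          # work on a copy
    for k in range(len(A)):
        if A[k][k] == 0:
            return False               # zero pivot: no LU factorization
        for i in range(k + 1, len(A)):
            if A[i][k]:                # clear column k below the pivot
                A[i] = [a ^ b for a, b in zip(A[i], A[k])]
    return True

# The identity is trivially admissible; a plain exchange of two lines,
# matrix [[0,1],[1,0]], has a zero leading entry and is not.
```

To test a transfer P under a mapping F_v, one would apply this check to the product P_j [F_v]^{-1} computed modulo 2.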

2.2.1. Data Mapping for PSP and BRP

The perfect shuffle permutation (PSP) and the bit reversal permutation (BRP) are two important data communication patterns in many numerical problems, e.g., the Fast Fourier Transform. However, although a perfect shuffle interconnection is used in every stage of an omega network, neither PSP nor BRP is omega network admissible. In what follows, it will be shown that a systematic method can be used to find a data mapping function such that PSP (P_s) and BRP (P_b) for a vector variable v are simultaneously admissible in one pass through an omega network. In fact, a general solution that is good for an omega network of any size is found, thus proving the conjecture [Chen81] that for the permutations BRP and PSP, a data mapping function can always be obtained for an omega network of any size such that BRP and PSP will be admissible in one pass.

Definition 2.1:

Let P_s be a perfect shuffle permutation on the set {0, 1, ..., N-1}, where N = 2^n and n is a positive integer. Let (x_{n-1}, x_{n-2}, ..., x_0) be the binary expansion of an element x of the set. Then the PSP of x is given by

    P_s(x) = (x_{n-2}, x_{n-3}, ..., x_0, x_{n-1}).

In other words, P_s(x) is obtained by cyclically shifting the binary expansion of x one place to the left. The corresponding n × n binary matrix P_s is

    | 0 1 0 0 ... 0 0 |
    | 0 0 1 0 ... 0 0 |
    | 0 0 0 1 ... 0 0 |
    |       ...       |
    | 0 0 0 0 ... 1 0 |
    | 0 0 0 0 ... 0 1 |
    | 1 0 0 0 ... 0 0 |

Definition 2.2:

Let P_b be a bit reversal permutation on the set {0, 1, ..., N-1}, where N = 2^n and n is a positive integer. Let (x_{n-1}, x_{n-2}, ..., x_1, x_0) be the binary expansion of an element x of the set. Then the BRP of x is given by

    P_b(x) = (x_0, x_1, ..., x_{n-1}).

In other words, P_b(x) is obtained by reversing the order of the bits of the binary expansion of x. The corresponding n × n binary matrix P_b is

    | 0 0 ... 0 1 |
    | 0 0 ... 1 0 |
    |     ...     |
    | 0 1 ... 0 0 |
    | 1 0 ... 0 0 |

Let F be a data mapping function with which both PSP and BRP can be realized in one pass. Then P_s[F]^{-1} and P_b[F]^{-1} in y = P_s[F]^{-1} x and y = P_b[F]^{-1} x, respectively, are decomposable into LU's. Let the n × n binary matrix [F]^{-1} be as shown below:

    | a_{0,0}   a_{0,1}   a_{0,2}   ... a_{0,n-2}   a_{0,n-1}   |
    | a_{1,0}   a_{1,1}   a_{1,2}   ... a_{1,n-2}   a_{1,n-1}   |
    | a_{2,0}   a_{2,1}   a_{2,2}   ... a_{2,n-2}   a_{2,n-1}   |
    |                        ...                                |
    | a_{n-2,0} a_{n-2,1} a_{n-2,2} ... a_{n-2,n-2} a_{n-2,n-1} |
    | a_{n-1,0} a_{n-1,1} a_{n-1,2} ... a_{n-1,n-2} a_{n-1,n-1} |
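The two permutation matrices of definitions 2.1 and 2.2 are mechanical to generate, and their zero top-left entries show at a glance why neither permutation is admissible without remapping: the 1×1 leading principal minor is already singular, so no LU factorization exists. A sketch (our own code, with our row/column convention):

```python
def perfect_shuffle_matrix(n):
    """P_s for y = (x_{n-2}, ..., x_0, x_{n-1}); row 0 and column 0
    correspond to the most significant index bit."""
    P = [[0] * n for _ in range(n)]
    for i in range(n - 1):
        P[i][i + 1] = 1          # y_{n-1-i} = x_{n-2-i}
    P[n - 1][0] = 1              # y_0 = x_{n-1}
    return P

def bit_reversal_matrix(n):
    """P_b for y = (x_0, x_1, ..., x_{n-1}): the anti-diagonal matrix."""
    return [[int(j == n - 1 - i) for j in range(n)] for i in range(n)]
```

Both matrices have a 0 in position (0, 0) for every n >= 2, so the search for a common remapping F in the text that follows is genuinely necessary.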

The matrix product P_s[F]^{-1} is then

    | a_{1,0}   a_{1,1}   a_{1,2}   ... a_{1,n-1}   |
    | a_{2,0}   a_{2,1}   a_{2,2}   ... a_{2,n-1}   |
    | a_{3,0}   a_{3,1}   a_{3,2}   ... a_{3,n-1}   |
    |                     ...                       |
    | a_{n-1,0} a_{n-1,1} a_{n-1,2} ... a_{n-1,n-1} |
    | a_{0,0}   a_{0,1}   a_{0,2}   ... a_{0,n-1}   |

and the matrix product P_b[F]^{-1} is

    | a_{n-1,0} a_{n-1,1} a_{n-1,2} ... a_{n-1,n-1} |
    | a_{n-2,0} a_{n-2,1} a_{n-2,2} ... a_{n-2,n-1} |
    |                     ...                       |
    | a_{1,0}   a_{1,1}   a_{1,2}   ... a_{1,n-1}   |
    | a_{0,0}   a_{0,1}   a_{0,2}   ... a_{0,n-1}   |

Since P_s[F]^{-1} and P_b[F]^{-1} are LU decomposable, we will let P_s[F]^{-1} and P_b[F]^{-1} be decomposed into L_s U_s and L_b U_b, respectively. Let l_{s,i,j}, u_{s,i,j}, l_{b,i,j} and u_{b,i,j}, for 0 <= i, j < n, be the elements of the matrices L_s, U_s, L_b and U_b, respectively. Then a system of 2n^2 equations with 2(n^2 - n) + n^2 variables can be set up. For instance, consider a_{1,0}:

    a_{1,0} = ( Σ_{j=0}^{n-1} l_{s,0,j} u_{s,j,0} ) mod 2,

    a_{1,0} = ( Σ_{j=0}^{n-1} l_{b,n-2,j} u_{b,j,0} ) mod 2.

In this example, we can eliminate a_{1,0} immediately by equating the two equations. Hence,

    ( Σ_{j=0}^{n-1} l_{s,0,j} u_{s,j,0} + Σ_{j=0}^{n-1} l_{b,n-2,j} u_{b,j,0} ) mod 2 = 0.

Solving this set of equations will produce F. Obviously, F does not exist if there is any contradiction among the equations. In general, for m different communication patterns, we will have mn^2 equations with m(n^2 - n) + n^2 variables. There is a limit to the number of different communication patterns that can be made simultaneously admissible. Clearly, the limit depends on the communication patterns required.

2.2.2. A Data Mapping for BRP and PSP on the Omega Network

The system of nonlinear equations derived in the last section has been solved for different values of n. Observing the characteristics of the data mapping functions thus obtained, a general solution that is good for an omega network of any size is deduced. For a 2^n × 2^n omega network, the data mapping function F is determined to be

    f_{i,j} = 1   if j = n-1-i
              1   if j = i+1, for n/2 <= i <= n-2 (n even) or (n-1)/2 <= i <= n-2 (n odd)
              0   else,

where 0 <= i, j <= n-1. As an example, for n = 5, F is

    | 0 0 0 0 1 |
    | 0 0 0 1 0 |
    | 0 0 1 1 0 |
    | 0 1 0 0 1 |
    | 1 0 0 0 0 |

The corresponding mapping expressed in the convention used in [Chen81] is:

(0,18,12,30,4,22,8,26,2,16,14,28,6,20,10,24,1,19,13,31,5,23,9,27,3,17,15,29,7,21,11,25).
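The closed form can be checked directly. The sketch below (helper names are ours) builds F for n = 5 from the formula, applies it to every logical index, and reproduces the mapping listed above in the [Chen81] convention:

```python
def mapping_matrix(n):
    """F of section 2.2.2: f[i][j] = 1 if j = n-1-i, and additionally
    f[i][j] = 1 if j = i+1 for n//2 <= i <= n-2 (the single lower
    bound n//2 covers both the n even and n odd cases).  Row 0 and
    column 0 correspond to the most significant index bit."""
    F = [[0] * n for _ in range(n)]
    for i in range(n):
        F[i][n - 1 - i] = 1
        if n // 2 <= i <= n - 2:
            F[i][i + 1] = 1
    return F

def apply_gf2(M, x, n):
    """Physical index M * x (mod 2) for logical index x."""
    y = 0
    for i in range(n):
        bit = 0
        for j in range(n):
            bit ^= M[i][j] & (x >> (n - 1 - j))
        y |= (bit & 1) << (n - 1 - i)
    return y

chen_map = tuple(apply_gf2(mapping_matrix(5), x, 5) for x in range(32))
```

Since F is linear, the image of any index is the exclusive-or of the images of its set bits, which is why a single matrix suffices to describe the full table of 32 entries.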

The corresponding P_s[F]^{-1} = | pf_{i,j} | is

    pf_{i,j} = 1   if j = n-2-i (mod n)
               1   if j = i, for 0 <= i <= n/2 - 2 (n even) or 0 <= i <= (n-3)/2 (n odd)
               0   else.

For n = 5, P_s[F]^{-1} is

    | 1 0 0 1 0 |
    | 0 1 1 0 0 |
    | 0 1 0 0 0 |
    | 1 0 0 0 0 |
    | 0 0 0 0 1 |

The corresponding decomposition L_s U_s is given by

    l_{s,i,j} = 1   if j = i
                1   if i = n-2-j, for j <= n/2 - 2 (n even) or j <= (n-3)/2 (n odd)
                0   else,

    u_{s,i,j} = 1   if j = i
                1   if j = n-2-i, for i <= n/2 - 2 (n even) or i <= (n-3)/2 (n odd)
                0   else.

For n = 5,

          | 1 0 0 0 0 |          | 1 0 0 1 0 |
          | 0 1 0 0 0 |          | 0 1 1 0 0 |
    L_s = | 0 1 1 0 0 |,   U_s = | 0 0 1 0 0 |
          | 1 0 0 1 0 |          | 0 0 0 1 0 |
          | 0 0 0 0 1 |          | 0 0 0 0 1 |

P_b[F]^{-1} is simply equal to L_s, and hence L_b = L_s and U_b = I, where I is the n × n identity matrix.

In what follows, we will prove that F is indeed a general solution: with data mapping function F, BRP and PSP will both be admissible in one pass through an omega network.

Lemma 2.1

Let B be an n × n nonsingular binary matrix and let b_{p,q} be the element in the pth row and qth column of B such that

    b_{p,q} = 1   if q = n-1-p
              1   if q = p-1, for 1 <= p <= n/2 - 1 (n even) or 1 <= p <= (n-1)/2 (n odd)
              0   else.

Then B is the inverse of F.

Proof:

We are going to prove that FB = I. We will first define two matrices, C and D, such that C + D = F. We will then prove that CB + DB = FB = I. Let c_{i,j} and d_{i,j} be the elements in the ith row and jth column of C and D, respectively, defined as below:

    c_{i,j} = 1   if j = n-1-i
              0   else,

    d_{i,j} = 1   if j = i+1, for n/2 <= i <= n-2 (n even) or (n-1)/2 <= i <= n-2 (n odd)
              0   else.

Consider CB. The matrix multiplication is equivalent to replacing p in the definition of b_{p,q} with n-1-i. For q = n-1-p, we will have q = i, and for q = p-1, we will have q = n-2-i. The limits 1, n/2 - 1 and (n-1)/2 become n-2, n/2 and (n-1)/2. Let cb_{i,q} be the element in the ith row and qth column of CB. Then

    cb_{i,q} = 1   if q = i
               1   if q = n-2-i, for n/2 <= i <= n-2 (n even) or (n-1)/2 <= i <= n-2 (n odd)
               0   else.

The matrix multiplication of D and B is in turn equivalent to

1. replacing p in q = n-1-p (in the definition of B) with i+1, for n/2 <= i <= n-2 (n even) or (n-1)/2 <= i <= n-2 (n odd);

2. replacing p in q = p-1 with i+1, for i that satisfies the following two requirements simultaneously:

   a. n/2 <= i <= n-2 (n even) or (n-1)/2 <= i <= n-2 (n odd),

   b. 1 <= i+1 <= n/2 - 1 (n even) or 1 <= i+1 <= (n-1)/2 (n odd).

Obviously, the two requirements in 2 cannot be satisfied simultaneously. The element db_{i,q} in the ith row and qth column of DB is therefore

    db_{i,q} = 1   if q = n-2-i, for n/2 <= i <= n-2 (n even) or (n-1)/2 <= i <= n-2 (n odd)
               0   else.

Since the entries of CB and DB at q = n-2-i coincide, they cancel modulo 2, leaving only the diagonal entries:

    CB + DB = FB = I.

Therefore, B is the inverse of F. □

Lemma 2.2

P_b F^{-1} is LU decomposable.

Proof:

The matrix C in lemma 2.1 is actually the bit reversal permutation (BRP), i.e., C = P_b. Since B = F^{-1}, the matrix product P_b F^{-1} = CB is given in the previous lemma. If we examine CB carefully, it can be seen that cb_{i,j} = 0 for all i < j. Furthermore, cb_{i,i} = 1 for 0 <= i <= n-1. Therefore, CB is a lower triangular unit matrix. Clearly, P_b F^{-1} is LU decomposable, and D(P_b F^{-1}) <= 1. □

Let p_{i,j} be the element in the ith row and jth column of the matrix P_s. Then

    p_{i,j} = 1   if j = i+1, for 0 <= i <= n-2
              1   if j = 0, for i = n-1
              0   else.

The matrix multiplication of P_s and F^{-1} is in turn equivalent to

1. replacing p in q = n-1-p (in the definition of b_{p,q}) with i+1, for 0 <= i <= n-2;

2. replacing p in q = n-1-p with n-1-i = 0, for i = n-1;

3. replacing p in q = p-1 with i+1, for i that satisfies the following requirements simultaneously:

   a. 1 <= i+1 <= n/2 - 1 (n even) or 1 <= i+1 <= (n-1)/2 (n odd),

   b. 0 <= i <= n-2.

The limits 1, n/2 - 1 and (n-1)/2 now become 0, n/2 - 2 and (n-3)/2. Let pf_{i,q} be the element in the ith row and qth column of P_s F^{-1}. Then

    pf_{i,q} = 1   if q = n-2-i, for 0 <= i <= n-2
               1   if q = i, for 0 <= i <= n/2 - 2 (n even) or 0 <= i <= (n-3)/2 (n odd)
               1   if q = n-1, for i = n-1
               0   else.

Lemma 2.3

P_s F^{-1} is LU decomposable.

Proof:

Let E be a matrix with its element in the ath row and bth column defined as follows:

    e_{a,b} = 1   if b = n-2-a, for n/2 <= a <= n-2 (n even) or (n-1)/2 <= a <= n-2 (n odd)
              0   else.

Now, multiply E by P_s F^{-1}. This corresponds to

1. replacing i in q = n-2-i (in the definition of pf_{i,q}) with n-2-a, for a that satisfies the following two requirements simultaneously:

   a. n/2 <= a <= n-2 (n even) or (n-1)/2 <= a <= n-2 (n odd),

   b. 0 <= n-2-a <= n-2;

2. replacing i in the equation q = i with n-2-a, for a that satisfies the following requirements simultaneously:

   a. n/2 <= a <= n-2 (n even) or (n-1)/2 <= a <= n-2 (n odd),

   b. 0 <= n-2-a <= n/2 - 2 (n even) or 0 <= n-2-a <= (n-3)/2 (n odd).

Since e_{a,n-1} = 0, we do not have to consider pf_{n-1,q} at all. Let ep_{a,q} be the element of E P_s F^{-1} in the ath row and qth column. Then

    ep_{a,q} = 1   if q = a, for n/2 <= a <= n-2 (n even) or (n-1)/2 <= a <= n-2 (n odd)
               1   if q = n-2-a, for n/2 <= a <= n-2 (n even) or (n-1)/2 <= a <= n-2 (n odd)
               0   else.

Now, let I + E = L_t. This forms a lower triangular unit matrix with its element in the ath row and bth column defined as follows:

    l_{a,b} = 1   if b = a
              1   if b = n-2-a, for n/2 <= a <= n-2 (n even) or (n-1)/2 <= a <= n-2 (n odd)
              0   else.

Clearly, I P_s F^{-1} = P_s F^{-1}. Now, L_t P_s F^{-1} = E P_s F^{-1} + P_s F^{-1}. Let U_t = E P_s F^{-1} + P_s F^{-1}. Then its elements u_{a,b} are defined as follows:

    u_{a,b} = 1   if b = a
              1   if b = n-2-a, for 0 <= a <= n/2 - 2 (n even) or 0 <= a <= (n-3)/2 (n odd)
              0   else.

Now U_t is an upper triangular unit matrix. Since L_t P_s F^{-1} = U_t, and the inverse of the lower triangular unit matrix L_t is itself lower triangular unit, P_s F^{-1} = [L_t]^{-1} U_t. Therefore, P_s F^{-1} is LU decomposable. □
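The three lemmas can be cross-checked numerically for a range of sizes. Below is a sketch of our own (all helper names are ours) that builds F and the claimed inverse B from their closed forms and verifies that FB = I, that P_b F^{-1} is lower unit triangular, and that P_s F^{-1} is LU decomposable:

```python
def matmul2(A, B):
    n = len(A)
    return [[sum(A[i][k] & B[k][j] for k in range(n)) & 1
             for j in range(n)] for i in range(n)]

def mapping_matrix(n):                       # F of theorem 2.2
    F = [[0] * n for _ in range(n)]
    for i in range(n):
        F[i][n - 1 - i] = 1
        if n // 2 <= i <= n - 2:
            F[i][i + 1] = 1
    return F

def inverse_matrix(n):                       # B of lemma 2.1
    B = [[0] * n for _ in range(n)]
    hi = n // 2 - 1 if n % 2 == 0 else (n - 1) // 2
    for p in range(n):
        B[p][n - 1 - p] = 1
        if 1 <= p <= hi:
            B[p][p - 1] = 1
    return B

def lu_decomposable(M):                      # elimination without row swaps
    A = [row[:] for row in M]
    for k in range(len(A)):
        if A[k][k] == 0:
            return False
        for i in range(k + 1, len(A)):
            if A[i][k]:
                A[i] = [a ^ b for a, b in zip(A[i], A[k])]
    return True

def check(n):
    F, B = mapping_matrix(n), inverse_matrix(n)
    identity = [[int(i == j) for j in range(n)] for i in range(n)]
    Pb = [[int(j == n - 1 - i) for j in range(n)] for i in range(n)]
    Ps = [[int(j == (i + 1) % n) for j in range(n)] for i in range(n)]
    PbB, PsB = matmul2(Pb, B), matmul2(Ps, B)
    lower_unit = (all(PbB[i][i] == 1 for i in range(n)) and
                  all(PbB[i][j] == 0
                      for i in range(n) for j in range(i + 1, n)))
    return matmul2(F, B) == identity and lower_unit and lu_decomposable(PsB)
```

Such a check is of course no substitute for the proof, but it guards the even/odd index bounds, which are easy to get wrong.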

Theorem 2.2

There always exists a data mapping function for BRP and PSP such that both will be admissible to an omega network of any size. One such data mapping function is F, whose element in the ith row and jth column is given by

    f_{i,j} = 1   if j = n-1-i
              1   if j = i+1, for n/2 <= i <= n-2 (n even) or (n-1)/2 <= i <= n-2 (n odd)
              0   else.

Proof:

Based on lemmas 2.2 and 2.3, with data mapping function F, BRP and PSP are admissible in one pass through an omega network of any size. □

2.2.3. Applications for Other Networks

A network N1 is functionally equivalent [WuFe79, Prad79] to the omega network iff Ω1 = R1 Ω R2, where Ω1 is the set of admissible network permutations for N1, and R1 and R2 are permutations. Any permutation in Ω1 can be decomposed into three functions: a permutation R1, an admissible network permutation in Ω, and a permutation R2. If R1 and R2 are linear in the components of the indices, then a permutation P that is linear in the components of the index is in Ω1 if and only if P is decomposable into an R1 L and a U R2. Given some permutations that are linear in the components of the indices, a data mapping function for these

permutations on N1, if one exists, can be determined by solving a system of nonlinear equations set up similarly to that in section 2.2.1.

2.3. A Larger Class of Permutations

Once the desired permutations are put into matrix form, F, if it exists, can be found systematically with the algorithm described. The applicability of the algorithm is clearly limited to those cases where a permutation can be expressed as a linear combination of the components of the index. The class of communication patterns that can be expressed in nonsingular binary matrix form is a small percentage of the N! possible input-to-output permutations. In fact, the percentage decreases rapidly as N increases. Fortunately, as Lenfant [Lenf78] pointed out, the set of needed permutations is usually small compared to the set of N! possible permutations. He collected Frequently Used Bijections (FUBs) into five families. By extending our approach of representing a permutation in matrix form, the method can be used for three families of Lenfant's FUBs. In fact, this larger class of permutations, which we designate by A, also contains the class of bit-permute-complement permutations studied in [NaSa77, NaSa81A, NaSa81B]. We now describe A [IrWu82].

Let P: (x_{n-1}, x_{n-2}, ..., x_0) -> (y_{n-1}, y_{n-2}, ..., y_0) be any mapping that is linear in the components of x. Then it can be expressed in binary matrix form as shown below:

        | a_{0,0}   a_{0,1}   a_{0,2}   a_{0,3}   ... a_{0,n-2}   a_{0,n-1}   |
    P = | a_{1,0}   a_{1,1}   a_{1,2}   a_{1,3}   ... a_{1,n-2}   a_{1,n-1}   |
        |                              ...                                    |
        | a_{n-1,0} a_{n-1,1} a_{n-1,2} a_{n-1,3} ... a_{n-1,n-2} a_{n-1,n-1} |

where a_{i,j} ∈ {0,1}.

Let k_{n-1} k_{n-2} ... k_0 denote the binary representation of k, 0 <= k <= 2^n - 1. Let θ_k be a bijection defined by

    θ_k : (x_{n-1}, x_{n-2}, ..., x_0) -> (x_{n-1}, x_{n-2}, ..., x_0) ⊕ k,

where (x_{n-1}, x_{n-2}, ..., x_0) ⊕ k denotes the bit-per-bit exclusive-or of (k_{n-1}, k_{n-2}, ..., k_0) with (x_{n-1}, x_{n-2}, ..., x_0).

Definition 2.3:

Let Pθ_k = θ_k ∘ P be the left composition of θ_k with P. Then A, the set of permutations with which we will be concerned, is defined by

    A = { Pθ_k | 0 <= k <= 2^n - 1 }.

Clearly, Pθ_k cannot be expressed in the matrix form described earlier except when k = 0. However, a 2^{n+1} × 2^{n+1} omega network can be partitioned into two 2^n × 2^n omega networks by not allowing any exchange at the first stage of the network [Smit78, Sieg78]. This is equivalent to requiring the first row vector of an (n+1) × (n+1) permutation matrix for the network to be [1 0 0 ... 0]. The (n+1) × (n+1) matrix below now specifies two permutations, P and Pθ_k, each of which passes through one of the two subnetworks of order 2^n × 2^n:

    | 1       0         0         ... 0           |
    | k_{n-1} a_{0,0}   a_{0,1}   ... a_{0,n-1}   |
    | k_{n-2} a_{1,0}   a_{1,1}   ... a_{1,n-1}   |
    |                   ...                       |
    | k_0     a_{n-1,0} a_{n-1,1} ... a_{n-1,n-1} |

In other words, P is admissible through a 2^{n+1} × 2^{n+1} omega network iff Pθ_k and P are admissible through a 2^n × 2^n omega network. Therefore, we can use an (n+1) × (n+1) nonsingular binary matrix to represent any Pθ_k (including k = 0). In this form, theorem 2.1 states that a permutation Pθ_k is admissible through a 2^n × 2^n omega network iff its corresponding (n+1) × (n+1) matrix is decomposable into LU, where the first row of U is [1 0 0 ... 0].

Let F be an (n+1) × (n+1) nonsingular binary matrix. For P[F]^{-1} to specify two communication patterns that correspond to P and Pθ_k with alternate mappings, so that each communication pattern passes through one of the two 2^n × 2^n subnetworks, the first row of P[F]^{-1} shall be equal to the row vector [1 0 0 ... 0]. This, in turn, requires that the first row of F be [1 0 0 ... 0]. If F is the data mapping matrix for the data mapping function

    F : (x_{n-1}, x_{n-2}, ..., x_0) -> (y_{n-1}, y_{n-2}, ..., y_0),

then the mapping functions for Pθ_k and P are

    F_1 : (1, x_{n-1}, x_{n-2}, ..., x_0) -> (1, y_{n-1}, y_{n-2}, ..., y_0)

and

    F_0 : (0, x_{n-1}, x_{n-2}, ..., x_0) -> (0, y_{n-1}, y_{n-2}, ..., y_0),

respectively. Henceforth, in this report, a permutation in A will be represented by an (n+1) × (n+1) binary matrix whose first row is [1 0 0 ... 0], even when k = 0. Furthermore, when we talk about an upper triangular unit matrix, we mean an (n+1) × (n+1) upper triangular unit matrix whose first row is [1 0 0 ... 0]. Since every permutation in A is LUL-decomposable [Peas77], at most two passes through a 2^n × 2^n omega network are needed to realize any one of them. Furthermore, if the permutation is the identity function, no routing through the network is necessary. Therefore, if D is the distance function associated with the 2^n × 2^n omega network, then 0 <= D(P) <= 2 for all P in A. Since we are assuming a 2^n × 2^n omega network for the rest of this report, unless otherwise stated, the distance function D will be associated with the 2^n × 2^n omega network.
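The bordered representation is mechanical to construct. A sketch of our own (helper names are ours) that embeds an n × n matrix P and a complement vector k into the (n+1) × (n+1) form, illustrated with vector reversal (P = I, k = 2^n - 1), whose bordered matrix is already lower unit triangular and therefore needs only one pass:

```python
def bordered_matrix(P, k, n):
    """(n+1) x (n+1) representation of P composed with theta_k: first
    row [1 0 ... 0], first column carries the bits of k (k_{n-1} in
    row 1, k_0 in row n), and P occupies the lower-right n x n block."""
    M = [[0] * (n + 1) for _ in range(n + 1)]
    M[0][0] = 1
    for i in range(n):
        M[i + 1][0] = (k >> (n - 1 - i)) & 1
        for j in range(n):
            M[i + 1][j + 1] = P[i][j]
    return M

identity3 = [[int(i == j) for j in range(3)] for i in range(3)]
vector_reversal = bordered_matrix(identity3, 7, 3)   # theta_7 with P = I
```

Because the first row of the bordered matrix is always [1 0 ... 0], an LU factorization of it automatically has a U whose first row is [1 0 ... 0], which is exactly the admissibility condition stated above.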

The cardinality of A can be computed in many ways. Based on the following lemma, we can determine the cardinality of A.

Lemma 2.4:

There are ∏_{i=0}^{n-1} (2^n - 2^i) nonsingular n × n binary matrices.

Proof:

The lemma is proved by mathematical induction. A binary matrix is nonsingular iff every row of it is linearly independent of all the other rows in the matrix. There are 2^n different combinations of 0's and 1's. Excluding the row of all zeros, there are (2^n - 2^0) different rows. In choosing the first row, there are (2^n - 2^0) choices. There are 2^1 - 1 nonzero rows that are linearly dependent on the first row. Hence, for the second row, we have (2^n - 2^1) choices (again excluding the row of all zeros). There are (2^n - 2^0)(2^n - 2^1) possible combinations of the first two rows.

Now, assume there are ∏_{i=0}^{j-2} (2^n - 2^i) possible combinations of the first j-1 rows. There are 2^{j-1} - 1 nonzero rows that are linearly dependent on any one of such combinations. Hence, for the jth row, we have (2^n - 2^{j-1}) choices. There are ∏_{i=0}^{j-1} (2^n - 2^i) possible combinations of the first j rows. Therefore, there are ∏_{i=0}^{n-1} (2^n - 2^i) nonsingular n × n binary matrices. □

Lemma 2.5:

The cardinality of A is ∏_{i=1}^{n} (2^{n+1} - 2^i).

Proof:

There are ∏_{i=0}^{n} (2^{n+1} - 2^i) nonsingular (n+1) × (n+1) binary matrices. But the first row of a matrix in A has to be [1, 0, ..., 0]. The cardinality of A is then ∏_{i=0}^{n} (2^{n+1} - 2^i) divided by (2^{n+1} - 2^0), the number of possible first rows. That is, ∏_{i=1}^{n} (2^{n+1} - 2^i). □

For large N, the cardinality of A is only a small percentage of the N! possible permutations. Nevertheless, many of the permutations encountered in parallel algorithms are in this class. Table I lists several of the more popular permutations in the class of bit-permute-complement (BPC) permutations.

Table I Example Permutations in BPC, where x = (x_{n-1}, x_{n-2}, ..., x_0)

    Permutation           y = (y_{n-1}, y_{n-2}, ..., y_0) = P(x)
    --------------------  ------------------------------------------------------------
    Matrix Transpose      ( x_{n/2-1}, ..., x_0, x_{n-1}, ..., x_{n/2} )
    Bit Reversal          ( x_0, x_1, ..., x_{n-1} )
    Vector Reversal       ( x̄_{n-1}, x̄_{n-2}, ..., x̄_0 ), i.e., y = (2^n - 1) - x
    Perfect Shuffle       ( x_{n-2}, x_{n-3}, ..., x_0, x_{n-1} )
    Unshuffle             ( x_0, x_{n-1}, x_{n-2}, ..., x_1 )
    Shuffle Row Major     ( x_{n/2-1}, x_{n-1}, x_{n/2-2}, x_{n-2}, ..., x_0, x_{n/2} )
    Bit Shuffle           ( x_{n-1}, x_{n-3}, ..., x_1, x_{n-2}, x_{n-4}, ..., x_0 )

Further characterization of A in later chapters leads to efficient algorithms for networks in Γ, a subset of the set of all networks that are functionally equivalent to the omega network. Nevertheless, the method developed in this chapter demonstrates the general concept

of data mapping and is indicative of how complex the problem can possibly become if the multicomputer system has a network that is not in Γ.
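Lemmas 2.4 and 2.5 are easy to sanity-check by brute force for small n. The sketch below (code and names are ours) compares the closed-form counts against a direct enumeration of all binary matrices:

```python
from itertools import product

def nonsingular_count(n):
    """Lemma 2.4: product over i = 0..n-1 of (2^n - 2^i)."""
    c = 1
    for i in range(n):
        c *= 2 ** n - 2 ** i
    return c

def cardinality_A(n):
    """Lemma 2.5: product over i = 1..n of (2^{n+1} - 2^i)."""
    c = 1
    for i in range(1, n + 1):
        c *= 2 ** (n + 1) - 2 ** i
    return c

def brute_force_count(n):
    """Count nonsingular n x n binary matrices directly, representing
    each row as an n-bit integer and eliminating column by column."""
    count = 0
    for rows in product(range(2 ** n), repeat=n):
        rows = list(rows)
        ok = True
        for col in range(n - 1, -1, -1):
            piv = next((r for r in rows if (r >> col) & 1), None)
            if piv is None:            # no pivot: rank deficient
                ok = False
                break
            rows.remove(piv)           # use piv to clear column col
            rows = [r ^ piv if (r >> col) & 1 else r for r in rows]
        count += ok
    return count
```

For n = 3 the formula gives 7 * 6 * 4 = 168, and the cardinality of A for the same subnetwork size is the (n+1) = 4 count divided by 2^4 - 1, as in the proof of lemma 2.5.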

CHAPTER 3

NON-STATIC DATA MAPPING

We shall now develop techniques for determining non-static data mappings for the vector variables of a given parallel program such that its total communication cost is minimized. We will assume that the alignment functions of any expression tree are identity functions. Furthermore, we will assume that the logical transfers for binary operations on temporary variables are identity functions. If the logical transfer for a temporary variable is not the identity, then it can be transformed to the identity through simple recursive transformations. For instance, in figure 3.1-A, the logical transfers for aligning vectors a and b to w1 are P_a and P_b, respectively. Similarly, the logical transfers for c, d, w1, w2 and w3 are P_c, P_d, P_1, P_2 and P_3, respectively. Through transformations, the logical transfers for w1, w2 and w3 become P_3P_1, P_3P_2 and I, respectively¹ (figure 3.1-B). Figure 3.1-C can be obtained by applying the transformation again.

To begin, we will also work under the restriction that there is no precedence relationship between expression trees, i.e., leaf variables are not the results of other expression trees. The earlier assumptions plus this restriction allow us to develop an algorithm that guarantees minimum cost. The algorithm also sets the stage for the development of a heuristic algorithm for a given parallel program where precedence relationships between its expression trees exist.

¹ I stands for the identity function.

Figure 3.1 Example for recursive transformation ( panels A, B and C ).

Under the above constraints, the communication cost function of a parallel program is given by

Σ_{statements of the given algorithm} Σ_{variables v of the statement} [ D( P_{v,h} F_{v,h}^{-1} ) + D( F_{v,h+1} F_{v,h}^{-1} ) ],

where F_{v,h} is the data mapping function for the hth logical transfer of vector variable v. To minimize the cost function, one therefore has to determine for each vector v an ordered sequence of data mapping functions. Since there is no precedence relationship among expression trees, determination of an optimum sequence of data mapping functions, S_v, for any variable, v, does not change any logical transfer for any other variable. Hence, the minimization problem for a program with j vector variables can be decomposed into j independent subproblems, each specifying the communication requirements of a single variable. For a vector v, all its communication requirements are specified by a sequence of logical transfers, P_{v,1}, P_{v,2}, ..., P_{v,k}. The method for solving for an optimum sequence of data mapping functions, S_v = F_{v,1}, F_{v,2}, ..., F_{v,k}, for v so that the total communication cost,

Σ_{h=1}^{k} D( P_{v,h} F_{v,h}^{-1} ) + Σ_{h=2}^{k} D( F_{v,h} F_{v,h-1}^{-1} ),

is minimized, can be applied to each of the j subproblems. The set of j sequences of data mapping functions thus obtained is a solution to the minimization problem. Since the algorithm that we will be describing deals with the variable v only, the subscript v will be dropped from F_{v,h} and P_{v,h} from now on.
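The decomposed per-variable problem, choose F_1, ..., F_k minimizing Σ_h D( P_h F_h^{-1} ) + Σ_h D( F_h F_{h-1}^{-1} ), is a shortest-path recurrence over the candidate mapping set. The sketch below makes that structure explicit with an abstract cost oracle D; it is a brute-force stand-in for illustration only, not the pruned algorithm developed later in this chapter:

```python
def min_cost_sequence(transfers, candidates, D):
    """Dynamic program over mapping choices for one vector v.
    transfers:  the logical transfers P_1, ..., P_k
    candidates: a finite set of admissible data mapping functions
    D(P, F):    abstract cost of realizing transfer P under mapping F;
                D(F, G) doubles as the remapping cost from G to F."""
    best = {F: D(transfers[0], F) for F in candidates}
    for P in transfers[1:]:
        # best cost of any length-h prefix ending in mapping F
        best = {F: D(P, F) + min(best[G] + D(F, G) for G in candidates)
                for F in candidates}
    return min(best.values())

# Toy cost model: a transfer (or remapping) is free iff the arguments match.
D = lambda P, F: 0 if P == F else 1
print(min_cost_sequence(['a', 'b', 'a'], {'a', 'b'}, D))   # -> 1
```

Keeping the mapping fixed at 'a' costs 0 + 1 + 0 = 1, which beats any remapping; the chapter's algorithm reaches the same kind of answer without enumerating the full candidate set at every step.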

3.1. Logical Transfers and Data Mapping Functions

Characterization of the logical transfers and data mapping functions is needed for describing the algorithm. The following two basic lemmas can be obtained easily from linear algebra [Hers64]. Hence, we merely state them here.

Lemma 3.1
The LU decomposition of a nonsingular binary matrix is unique.

Lemma 3.2
The set of all lower triangular unit matrices and the set of all upper triangular unit matrices are groups under matrix multiplication ( modulo 2 ).

We will designate the set of all lower triangular unit matrices and the set of all upper triangular unit matrices by L and U, respectively. Two matrices, P1 and P2, are in the same coset of R, R = L or R = U, iff P1 and P2 are in R P1 = { R_s P1 | R_s ∈ R }.

Theorem 3.1
Let D( P_i F_1^{-1} ) ≤ 1. If P_j ∈ L P_i, then D( P_j F_1^{-1} ) ≤ 1.

Proof:
Let P_i F_1^{-1} = L U. Then L_s P_i F_1^{-1} = ( L_s L ) U ∈ LU. Therefore, if P_j ∈ L P_i, then P_j F_1^{-1} is LU decomposable and D( P_j F_1^{-1} ) ≤ 1.
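The proofs above hinge on whether P F^{-1} has an LU decomposition over GF(2). For a nonsingular binary matrix this is equivalent to all leading principal minors being nonzero modulo 2, which gives a direct test; the following sketch (illustrative only, not part of the thesis's machinery) implements it:

```python
def gf2_det(M):
    """Determinant over GF(2) by Gaussian elimination; returns 0 or 1."""
    M = [row[:] for row in M]
    m = len(M)
    for col in range(m):
        pivot = next((r for r in range(col, m) if M[r][col]), None)
        if pivot is None:
            return 0
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, m):
            if M[r][col]:
                M[r] = [a ^ b for a, b in zip(M[r], M[col])]
    return 1

def lu_decomposable(P):
    """P = L U with unit triangular factors exists (uniquely, lemma 3.1)
    iff every leading principal minor of P is nonzero over GF(2)."""
    return all(gf2_det([row[:k] for row in P[:k]])
               for k in range(1, len(P) + 1))

print(lu_decomposable([[1, 1], [1, 0]]))   # True:  [[1,0],[1,1]] * [[1,1],[0,1]]
print(lu_decomposable([[0, 1], [1, 0]]))   # False: leading 1x1 minor is 0
```

The exchange matrix, like the bit reversal it generalizes, fails the test; such permutations are exactly the ones that cost a second pass unless the data mapping absorbs them.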

Theorem 3.2
Given P1, if D( P1 F_1^{-1} ) ≤ 1 and F_2 ∈ U F_1, then D( P1 F_2^{-1} ) ≤ 1.

Proof:
If P1 F_1^{-1} = L U, then P1 F_1^{-1} U_s^{-1} = L U U_s^{-1} ∈ LU. Therefore, if F_2 = U_s F_1, then P1 F_2^{-1} = P1 F_1^{-1} U_s^{-1} ∈ LU, and D( P1 F_2^{-1} ) ≤ 1.

Theorem 3.3
The set of all data mapping functions, F_all,h, that make logical transfer P_h in the sequence P_1, ..., P_k realizable in one or zero passes is given by F_all,h = { U_i L_j P_h | L_j ∈ L and U_i ∈ U }. The cardinality of F_all,h is 2^{n²}.

Proof:
Let F_a = U L P_h. Then P_h F_a^{-1} = P_h ( U L P_h )^{-1} = L^{-1} U^{-1}, which is LU decomposable. Therefore, if F_a = U L P_h, then F_a ∈ F_all,h. Conversely, let F_a ∈ F_all,h. Then P_h F_a^{-1} = L_a U_b and F_a = U_b^{-1} L_a^{-1} P_h. Therefore, F_all,h = { U_r L_s P_h | U_r ∈ U and L_s ∈ L }. Since there are 2^{(n²−n)/2} U's and 2^{(n²+n)/2} L's, |F_all,h| = 2^{n²}.
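Theorem 3.3's count can be checked by brute force for a small network. Treating the mappings as (n+1)×(n+1) matrices over GF(2) whose first row is [1, 0, ..., 0] (the extra coordinate carries the complement bits), the sketch below enumerates the unit triangular factors for n = 2 and confirms that the products U L are pairwise distinct, as the uniqueness of lemma 3.1 requires. All names are illustrative:

```python
from itertools import product

def gf2_matmul(A, B):
    m = len(A)
    return tuple(tuple(sum(A[i][k] & B[k][j] for k in range(m)) & 1
                       for j in range(m)) for i in range(m))

def unit_lowers(n):
    """(n+1)x(n+1) unit lower triangular matrices, first row [1,0,...,0];
    (n^2+n)/2 free entries below the diagonal."""
    m = n + 1
    free = [(i, j) for i in range(1, m) for j in range(i)]
    for bits in product((0, 1), repeat=len(free)):
        M = [[1 if i == j else 0 for j in range(m)] for i in range(m)]
        for (i, j), b in zip(free, bits):
            M[i][j] = b
        yield tuple(map(tuple, M))

def unit_uppers(n):
    """Unit upper triangular, first row [1,0,...,0]; (n^2-n)/2 free entries."""
    m = n + 1
    free = [(i, j) for i in range(1, m) for j in range(i + 1, m)]
    for bits in product((0, 1), repeat=len(free)):
        M = [[1 if i == j else 0 for j in range(m)] for i in range(m)]
        for (i, j), b in zip(free, bits):
            M[i][j] = b
        yield tuple(map(tuple, M))

n = 2
products = {gf2_matmul(U, L) for U in unit_uppers(n) for L in unit_lowers(n)}
print(len(products))   # 16 = 2**(n*n), with P_h taken as the identity here
```

With P_h fixed, right-multiplying every product by P_h is a bijection, so |F_all,h| = 2^{n²} for any P_h.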

The following two lemmas and theorem form the basis for theorem 3.5, which establishes the worst case communication cost for non-static data mapping.

Lemma 3.3
Given two logical transfers, P_h and P_{h+1}, for vector variable v, there exists a data mapping function that makes both logical transfers simultaneously realizable in one pass.

Proof:
Every nonsingular binary matrix is LUL decomposable [Peas77]. Let P_h P_{h+1}^{-1} = L_r U_s L_t and let F_a = L_t P_{h+1}. Then D( P_h F_a^{-1} ) ≤ 1 and D( P_{h+1} F_a^{-1} ) ≤ 1.

Lemma 3.4
Every nonsingular binary matrix is ULU decomposable.

Proof:
Let ρ be the bit reversal permutation matrix. Then ρρ = I, and ρ = ρ^{-1}. It can be easily shown that ρLρ = U, and ρUρ = L. Now, every nonsingular binary matrix is LUL decomposable [Peas77]. Consider an arbitrary matrix, P ∈ A. Let ρPρ = L_r U_s L_t. Then P = ρL_rρ ρU_sρ ρL_tρ. But ρL_rρ, ρL_tρ ∈ U, and ρU_sρ ∈ L. Therefore, every nonsingular binary matrix is ULU decomposable.
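Lemma 3.4 can be seen concretely in the 2×2 case. The sketch below verifies both the conjugation identity ρLρ = U used in the proof and, by exhaustion, that each of the six nonsingular 2×2 binary matrices admits a ULU factorization (names are illustrative):

```python
from itertools import product

def mul(A, B):
    """2x2 matrix product over GF(2)."""
    return tuple(tuple(sum(A[i][k] & B[k][j] for k in range(2)) & 1
                       for j in range(2)) for i in range(2))

rho = ((0, 1), (1, 0))                        # 2x2 bit reversal matrix
uppers = [((1, u), (0, 1)) for u in (0, 1)]   # unit upper triangular
lowers = [((1, 0), (l, 1)) for l in (0, 1)]   # unit lower triangular

# conjugation by rho maps lower triangular to upper triangular
assert all(mul(mul(rho, L), rho) in uppers for L in lowers)

# every nonsingular 2x2 binary matrix is ULU decomposable
mats = [((a, b), (c, d)) for a, b, c, d in product((0, 1), repeat=4)]
nonsingular = [M for M in mats
               if (M[0][0] & M[1][1]) ^ (M[0][1] & M[1][0])]
assert len(nonsingular) == 6
assert all(any(mul(mul(Ur, Ls), Ut) == P
               for Ur in uppers for Ls in lowers for Ut in uppers)
           for P in nonsingular)
print("ULU decomposition exists for all", len(nonsingular), "nonsingular matrices")
```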

Theorem 3.4
Let S = F_1, F_2, ..., F_k be a sequence of data mapping functions for v. Let F_i = F_{i+1} = ... = F_{i+h}, 2 ≤ i, i+h ≤ k, F_{i−1} ≠ F_i, and D( P_g F_g^{-1} ) ≤ 1, i ≤ g ≤ i+h. If the remapping cost D( F_i F_{i−1}^{-1} ) = 2, then we can always find an F_new to replace all of F_i, F_{i+1}, ..., F_{i+h} such that D( P_g F_new^{-1} ) ≤ 1, i ≤ g ≤ i+h, and the remapping cost is D( F_new F_{i−1}^{-1} ) ≤ 1.

Proof:
Let the remapping function, F_i F_{i−1}^{-1}, be decomposed into U_r L_s U_t. Let F_new = U_r^{-1} F_i. Then F_new F_{i−1}^{-1} = U_r^{-1} U_r L_s U_t = L_s U_t, so clearly D( F_new F_{i−1}^{-1} ) ≤ 1. Furthermore, since F_i ∈ U F_new, based on theorem 3.2, D( P_g F_new^{-1} ) ≤ 1, i ≤ g ≤ i+h.

Using the results of the above two lemmas and theorem 3.4, the following theorem can be proved.

Theorem 3.5
Given k different logical transfers for variable v, the resulting communication cost after data mapping is given by

k−1 ≤ cost ≤ (3/2)k − 1  ( k even ),    k−1 ≤ cost ≤ (3k−1)/2  ( k odd ).

Proof:
Case: the lower bound.
Since there are k different logical transfers, the total communication cost clearly cannot be less than k−1.

Case: the upper bound.

Based on lemma 3.3, for every two permutations, we can always find a data mapping function that makes both simultaneously realizable in one pass. If k is even, then we will have at most k/2 different data mapping functions, or k/2 − 1 remappings. Based on theorem 3.4, the remapping cost between any two different consecutive data mapping functions can be made less than or equal to one. Therefore, if k is even, the resulting communication cost is less than or equal to k + k/2 − 1 = (3/2)k − 1.

If k is odd, then we will have at most (k−1)/2 + 1 different data mapping functions, or (k−1)/2 remappings. Therefore, if k is odd, the resulting communication cost is less than or equal to k + (k−1)/2 = (3k−1)/2.

3.2. Characterization of Optimum Sequences

The algorithm that we will be describing shortly involves partial enumeration of possible partial solutions. Rather than enumerating all sequences to find an optimum sequence, the algorithm generates possible subsequences of optimum sequences. Conceptually, during its hth step, subsequences of length h are formed by appending each subsequence from the (h−1)th step with data mapping functions that belong to F_h, a set of data mapping functions that will be defined later. The subsequences thus formed are called intermediate subsequences. Intermediate subsequences that can lead to an optimum sequence are saved for the next step, while the rest are eliminated. The subsequences that are saved for the (h+1)th step are called candidate subsequences. A set of candidate subsequences of length h is denoted by SEQ_h. In what follows, we will describe rules for eliminating some of the intermediate subsequences from consideration. We will also describe F_h, the set of data mapping functions for step h of the algorithm.

Some basic notations and definitions for the algorithm are needed.

Notation 3.1:
A sequence of data mapping functions for variable v is designated by a capital letter S with a subscript. For example, S_i = F_{1,i}, F_{2,i}, ..., F_{k,i} is a sequence of data mapping functions for variable v. A subsequence, S_{i(a,b)}, of S_i is given by F_{a,i}, F_{a+1,i}, ..., F_{b,i}. If a = 1, then S_{i(1,b)} is a prefix of S_i and is designated by S_{i,b}. Note: S_i = S_{i,k}.

Notation 3.2:
The cost of a subsequence, S_{i(a,b)}, of S_i is denoted by D_{a,b}(S_i), and is given by

D_{a,b}(S_i) = Σ_{h=a}^{b} D( P_h F_{h,i}^{-1} ) + Σ_{h=a+1}^{b} D( F_{h,i} F_{h−1,i}^{-1} ).

Notation 3.3:
Let S^h be a set of subsequences of data mapping functions of length h. Then Min(S^h) is the cost of the minimum cost subsequence in S^h.

Definition 3.1:
A sequence S_i is optimum if there does not exist another sequence S_j such that D_{1,k}(S_i) > D_{1,k}(S_j).

The following theorems determine the conditions under which an intermediate subsequence of length h is to be eliminated. We will consider two intermediate subsequences, S1_h and S2_h, where S1_h ≠ S2_h. Given a sequence S1 having S1_h as its prefix, we will

examine conditions under which we can always find an S2 that contains S2_h such that D_{1,k}(S1) ≥ D_{1,k}(S2). Since we are seeking an optimum solution only, these are conditions under which S1_h can be eliminated.

Theorem 3.6
Let S1_h be the prefix of S1. Given a subsequence S2_h such that D_{1,h}(S1_h) − D_{1,h}(S2_h) ≥ 2, we can always find an S2 containing S2_h such that D_{1,k}(S1) − D_{1,k}(S2) ≥ 0.

Proof:
The communication cost associated with S1 is given by

D_{1,k}(S1) = D_{1,h}(S1_h) + D_{h+1,k}(S1) + D( F_{h+1,1} F_{h,1}^{-1} ).

Let S2_h be the prefix of S2 and S2_{(h+1,k)} = S1_{(h+1,k)}. The communication cost associated with S2 is then given by

D_{1,k}(S2) = D_{1,h}(S2_h) + D_{h+1,k}(S2) + D( F_{h+1,2} F_{h,2}^{-1} )
            = D_{1,h}(S2_h) + D_{h+1,k}(S1) + D( F_{h+1,1} F_{h,2}^{-1} ).

Therefore,

D_{1,k}(S1) − D_{1,k}(S2) = D_{1,h}(S1_h) − D_{1,h}(S2_h) + D( F_{h+1,1} F_{h,1}^{-1} ) − D( F_{h+1,1} F_{h,2}^{-1} )
                          ≥ 2 + D( F_{h+1,1} F_{h,1}^{-1} ) − D( F_{h+1,1} F_{h,2}^{-1} ).

But 0 ≤ D( F_{h+1,1} F_{h,1}^{-1} ), D( F_{h+1,1} F_{h,2}^{-1} ) ≤ 2. Therefore, D( F_{h+1,1} F_{h,1}^{-1} ) − D( F_{h+1,1} F_{h,2}^{-1} ) ≥ −2, and D_{1,k}(S1) − D_{1,k}(S2) ≥ 0.

Corollary 3.6.1
The candidate set, SEQ_h, can therefore be partitioned into two disjoint sets, SEQ_h^0 and SEQ_h^1, where S_{i,h} ∈ SEQ_h^0 iff D_{1,h}(S_{i,h}) = Min(SEQ_h) and S_{i,h} ∈ SEQ_h^1 iff D_{1,h}(S_{i,h}) = Min(SEQ_h) + 1.

Theorem 3.7
Let S1_h be the prefix of S1. Given a subsequence S2_h such that F_{h,1} = F_{h,2} and D_{1,h}(S1_h) − D_{1,h}(S2_h) ≥ 0, we can always find an S2 containing S2_h such that D_{1,k}(S1) − D_{1,k}(S2) ≥ 0.

Proof:
The communication cost associated with S1 is given by

D_{1,k}(S1) = D_{1,h}(S1_h) + D_{h+1,k}(S1) + D( F_{h+1,1} F_{h,1}^{-1} ).

Let S2_h be the prefix of S2 and S2_{(h+1,k)} = S1_{(h+1,k)}. Then

D_{1,k}(S2) = D_{1,h}(S2_h) + D_{h+1,k}(S2) + D( F_{h+1,2} F_{h,2}^{-1} )
            = D_{1,h}(S2_h) + D_{h+1,k}(S1) + D( F_{h+1,1} F_{h,1}^{-1} ).

Therefore, D_{1,k}(S1) − D_{1,k}(S2) ≥ 0.

Theorem 3.8
Let S1_h be the prefix of S1. Given a subsequence S2_h such that F_{h,1} F_{h,2}^{-1} = L_r U_s, F_{h,1} ≠ F_{h,2}, and D_{1,h}(S1_h) − D_{1,h}(S2_h) ≥ 1, we can always find an S2 containing S2_h such that D_{1,k}(S1) − D_{1,k}(S2) ≥ 0.

Proof:
The communication cost associated with S1 is given by

D_{1,k}(S1) = D_{1,h}(S1_h) + D_{h+1,k}(S1) + D( F_{h+1,1} F_{h,1}^{-1} ).

Let S2_h be the prefix of S2 and S2_{(h+1,k)} = S1_{(h+1,k)}. Then

D_{1,k}(S2) = D_{1,h}(S2_h) + D_{h+1,k}(S2) + D( F_{h+1,2} F_{h,2}^{-1} )
            = D_{1,h}(S2_h) + D_{h+1,k}(S1) + D( F_{h+1,1} F_{h,2}^{-1} ).

Therefore, D_{1,k}(S1) − D_{1,k}(S2) ≥ 1 + D( F_{h+1,1} F_{h,1}^{-1} ) − D( F_{h+1,1} F_{h,2}^{-1} ). If D( F_{h+1,1} F_{h,1}^{-1} ) = 0, then F_{h+1,1} = F_{h,1}, so that F_{h+1,1} F_{h,2}^{-1} = L_r U_s and D( F_{h+1,1} F_{h,2}^{-1} ) ≤ 1; hence D_{1,k}(S1) − D_{1,k}(S2) ≥ 0. If D( F_{h+1,1} F_{h,1}^{-1} ) > 0, then since D( F_{h+1,1} F_{h,2}^{-1} ) ≤ 2, D_{1,k}(S1) − D_{1,k}(S2) ≥ 0. Therefore, if D_{1,h}(S1_h) − D_{1,h}(S2_h) ≥ 1 and F_{h,1} F_{h,2}^{-1} is LU decomposable, then D_{1,k}(S1) − D_{1,k}(S2) ≥ 0.

It would be inefficient to generate intermediate subsequences of length h from candidate subsequences of length h−1 by appending every one of the candidate subsequences with every permutation in A. Based on the following two theorems, we determine the set of data mapping functions, F_h, that have to be appended to candidate subsequences in SEQ_{h−1} to guarantee an optimum solution.

Let last_h, 1 ≤ h ≤ k−1, be the set of data mapping functions such that F_a is in last_h iff F_a is the last mapping function of a candidate subsequence in SEQ_h. Note: last_0 = ∅. Let Map_h = last_{h−1} ∩ ( F_all,h−1 ∪ F_all,h−2 ∪ F_all,h−3 ), where F_all,b = ∅ if b is less than or equal to zero. The first of the following two theorems shows that it is sufficient for F_h to be equal to last_{h−1} ∪ F_all,h. The second theorem shows that it is sufficient for F_h to be equal to Map_h ∪ F_all,h ⊆ last_{h−1} ∪ F_all,h.

Theorem 3.9
Let S_{i,h−1} be a candidate subsequence in SEQ_{h−1} and F_{h−1,i} be the last data mapping function in S_{i,h−1}. Then S_{i,h−1}, F_a ∉ SEQ_h if F_a ≠ F_{h−1,i} and F_a ∉ F_all,h.

Proof:

Let S_{j,h} be formed by appending F_a to S_{i,h−1}. Then

D_{1,h}(S_{j,h}) = D_{1,h−1}(S_{i,h−1}) + D( P_h F_a^{-1} ) + D( F_a F_{h−1,i}^{-1} ).

Let S_{i,h} be formed by appending F_{h−1,i} to S_{i,h−1}. Then

D_{1,h}(S_{i,h}) = D_{1,h−1}(S_{i,h−1}) + D( P_h F_{h−1,i}^{-1} ).

Therefore,

D_{1,h}(S_{j,h}) − D_{1,h}(S_{i,h}) = D( P_h F_a^{-1} ) + D( F_a F_{h−1,i}^{-1} ) − D( P_h F_{h−1,i}^{-1} ).

Since F_a ∉ F_all,h, D( P_h F_a^{-1} ) = 2, and

D_{1,h}(S_{j,h}) − D_{1,h}(S_{i,h}) = 2 + D( F_a F_{h−1,i}^{-1} ) − D( P_h F_{h−1,i}^{-1} ).

If D( F_a F_{h−1,i}^{-1} ) = 2, then D_{1,h}(S_{j,h}) − D_{1,h}(S_{i,h}) ≥ 2. By theorem 3.6, S_{j,h} cannot be a candidate subsequence. If D( F_a F_{h−1,i}^{-1} ) = 1, then D_{1,h}(S_{j,h}) − D_{1,h}(S_{i,h}) ≥ 1. By theorem 3.8, S_{j,h} cannot be a candidate subsequence. Therefore, S_{j,h} ∉ SEQ_h.

Corollary 3.9.1
If F_a ∉ last_{h−1} ∪ F_all,h, then for all S_{i,h−1} ∈ SEQ_{h−1}, S_{i,h−1}, F_a ∉ SEQ_h. Therefore, it is sufficient for F_h = last_{h−1} ∪ F_all,h.

Corollary 3.9.2
If S_{i,h} ∈ SEQ_h and F_{h,i} ∉ F_all,h, then F_{h,i} = F_{h−1,i}.

F_h according to theorem 3.9 may still grow larger and larger with increasing h, until it equals A. The following theorem shows that that is not the case.

Theorem 3.10
Let S_{i,h−1} = S_{i,h−4}, F_{h−3,i}, F_{h−2,i}, F_{h−1,i} be a candidate subsequence in SEQ_{h−1}. If F_{h−1,i} ∉ F_all,h ∪ F_all,h−1 ∪ F_all,h−2 ∪ F_all,h−3, then S_{i,h−1}, F_{h−1,i} ∉ SEQ_h.

Proof:
If S_{i,h−1} is a candidate subsequence, then S_{i,h−2}, which is a prefix of S_{i,h−1}, is a candidate subsequence. Now, F_{h−1,i} ∉ F_all,h−1. By corollary 3.9.2, F_{h−2,i} = F_{h−1,i}. Similarly, S_{i,h−3} and S_{i,h−4} are candidate subsequences and their last mapping functions, F_{h−3,i} and F_{h−4,i}, are equal to F_{h−1,i}. Now, since F_{h−1,i} ∉ F_all,h ∪ F_all,h−1 ∪ F_all,h−2 ∪ F_all,h−3, D( P_r F_{r,i}^{-1} ) = 2, h−3 ≤ r ≤ h. Let S_{i,h} = S_{i,h−1}, F_{h−1,i}. Then

D_{1,h}(S_{i,h}) = D_{1,h−4}(S_{i,h−4}) + Σ_{r=h−3}^{h} D( P_r F_{r,i}^{-1} ) = D_{1,h−4}(S_{i,h−4}) + 8.

Based on lemma 3.3 and theorem 3.4, we can always find another intermediate subsequence S_{j,h} = S_{i,h−4}, F_{h−3,j}, F_{h−2,j}, F_{h−1,j}, F_{h,j} such that F_{h−3,j} = F_{h−2,j}, F_{h−1,j} = F_{h,j}, D( P_r F_{r,j}^{-1} ) ≤ 1, h−3 ≤ r ≤ h, and the remapping costs D( F_{h−3,j} F_{h−4,i}^{-1} ) and D( F_{h−1,j} F_{h−2,j}^{-1} ) are less than or equal to one. Therefore, D_{1,h}(S_{j,h}) ≤ D_{1,h−4}(S_{i,h−4}) + 6. Since D_{1,h}(S_{i,h}) − D_{1,h}(S_{j,h}) ≥ 2, by theorem 3.6, S_{i,h} = S_{i,h−1}, F_{h−1,i} ∉ SEQ_h.

Corollary 3.10.1
For an optimum solution it is sufficient that F_h = Map_h ∪ F_all,h.

Proof:

Since, if S_{i,h−1}, F_{h−1,i} ∈ SEQ_h, then F_{h−1,i} ∈ F_all,h ∪ F_all,h−1 ∪ F_all,h−2 ∪ F_all,h−3 ( theorem 3.10 ) and F_{h−1,i} ∈ F_all,h ∪ last_{h−1} ( corollary 3.9.1 ), it is sufficient for

F_h = ( F_all,h ∪ last_{h−1} ) ∩ ( F_all,h ∪ F_all,h−1 ∪ F_all,h−2 ∪ F_all,h−3 )
    = F_all,h ∪ ( last_{h−1} ∩ ( F_all,h−1 ∪ F_all,h−2 ∪ F_all,h−3 ) )
    = F_all,h ∪ Map_h.

Since F_h ⊆ F_all,h ∪ F_all,h−1 ∪ F_all,h−2 ∪ F_all,h−3, the set of data mapping functions that the algorithm will be examining will not grow larger and larger with increasing h until it equals A. In fact, it is bounded by 4 × |F_all,h|, where |F_all,h| = 2^{n²}.

3.3. Non-Static Data Mapping Algorithms

From theorem 3.9, we can also conclude that if F_a is not in F_all,h, but is in Map_h, then F_a has to be appended only to the candidate subsequence whose last mapping function is equal to F_a. Therefore, instead of adding each data mapping function in F_all,h ∪ Map_h to every subsequence in SEQ_{h−1}, we only have to add each data mapping function in F_all,h to every subsequence in SEQ_{h−1} and add the last function F_{h−1,i} of each S_{i,h−1} to that subsequence S_{i,h−1} in SEQ_{h−1}. Hence, instead of producing at most |SEQ_{h−1}| × |F_all,h ∪ Map_h| intermediate subsequences at step h of the algorithm, we have to produce only at most |SEQ_{h−1}| × |F_all,h| + |Map_h| intermediate subsequences.

Actually, not all of the |SEQ_{h−1}| × |F_all,h| + |Map_h| intermediate subsequences have to be generated and evaluated. We will first define the following notation and then describe the observations behind this claim.

Notation 3.4:
The upper triangular unit matrices are U_i, 1 ≤ i ≤ 2^{(n²−n)/2}, with U_1 = I. Similarly, the lower triangular unit matrices are L_i, 1 ≤ i ≤ 2^{(n²+n)/2}, with L_1 = I.

The following observations can be used in reducing the actual number of intermediate subsequences that have to be formed.

1. Let C be the cost of the lowest cost intermediate subsequence of length h formed thus far in the partial enumeration algorithm. Let F_a ∈ F_h. If Min(SEQ_{h−1}^i) + D( P_h F_a^{-1} ) > C + 1, i = 0 or 1, then no subsequence of SEQ_{h−1}^i appended with F_a can be in SEQ_h (Theorem 3.6).

2. The lower bounds of the costs of intermediate subsequences formed by appending a mapping function, F_a ∈ F_h, to subsequences in SEQ_{h−1}^i, i = 0, 1, are l_0 = D( P_h F_a^{-1} ) + Min(SEQ_{h−1}^0) and l_1 = D( P_h F_a^{-1} ) + Min(SEQ_{h−1}^1), respectively. Once an intermediate subsequence with cost l_1 is obtained by appending F_a to a subsequence in SEQ_{h−1}^1, we do not have to append F_a to any other subsequence in SEQ_{h−1}^1 (Theorem 3.7). Furthermore, if such an intermediate subsequence has a cost of l_0, we do not have to append F_a to any subsequence in SEQ_{h−1}^0 ∪ SEQ_{h−1}^1 (Theorem 3.7).

3. Let F_{j,h−1} be the last data mapping function of S_{j,h−1}. If F_i F_{j,h−1}^{-1} = U_s, where F_i ∈ F_all,h, then we have to append U_1 F_i = F_i to S_{j,h−1} only. Appending any other F_r ∈ U F_i to S_{j,h−1} is not necessary, for they will be eliminated (Theorem 3.8).

4. Let F_{j,h−1} be the last data mapping function of S_{j,h−1}. Let S_ULU be the set of all ULU decompositions of F_i F_{j,h−1}^{-1}, where F_i ∈ F_all,h but F_i F_{j,h−1}^{-1} ∉ U. We have to form the intermediate subsequence S_{j,h−1}, U_r F_i iff there is a ULU decomposition U_r L_s U_t ∈ S_ULU. Appending any other U_c F_i to S_{j,h−1} is not necessary, for they will be eliminated (Theorem 3.8).

The algorithm described below incorporates the observations made and will produce an optimum sequence of data mapping functions.

Algorithm I:

MAIN PROGRAM
BEGIN
Let SEQ_1^0 = { P_1 }. Let SEQ_1^1 = F_all,1 − L·P_1. Reduce SEQ_1^1 according to theorem 3.8.
FOR h = 2 TO k DO
BEGIN
    Form F_all,h. Partition it into 2^{(n²+n)/2} equivalence classes; they are the cosets U·L_i·P_h, 1 ≤ i ≤ 2^{(n²+n)/2}.
    C = large_number; /* cost of minimum subsequence of length h */
    SEQ_h^0 = SEQ_h^1 = Temp = ∅;
    /* Form intermediate subsequences; keep them in Cand[ 1 : 2^{(n²−n)/2} ] and their
       corresponding costs in Lmin[ 1 : 2^{(n²−n)/2} ] */
    Form_Subsequence ( SEQ_{h−1}^0, U·L_1·P_h, Lmin[], Cand[], C, h, 0 );
    Include Cand[] into Temp;
    Form_Subsequence ( SEQ_{h−1}^1, U·L_1·P_h, Lmin[], Cand[], C, h, 1 );
    Include Cand[] into Temp;
    FOR j = 2 TO 2^{(n²+n)/2} DO
    BEGIN
        Form_Subsequence ( SEQ_{h−1}^0, U·L_j·P_h, Lmin[], Cand[], C, h, j );
        Include Cand[] into Temp;
    END;
    j = 2; DONE = FALSE;
    WHILE NOT ( ( Min(SEQ_{h−1}^1) + 1 ) > C + 1 ) AND NOT DONE DO ( observation 1 )
    BEGIN

        Form_Subsequence ( SEQ_{h−1}^1, U·L_j·P_h, Lmin[], Cand[], C, h, j );
        Include Cand[] into Temp;
        j = j + 1;
        IF ( j > 2^{(n²+n)/2} ) THEN DONE = TRUE;
    END;
    /* Remove some intermediate subsequences and partition the set into two sets */
    REDUCE ( Temp, SEQ_h^0, SEQ_h^1, h );
    /* Append each subsequence with its last data mapping function if needed */
    IF ( F_h − F_all,h ) ≠ ∅ AND NOT ( Min(SEQ_{h−1}^0) + 2 > C + 1 ) THEN ( observation 1 )
    BEGIN
        FOR EVERY S_{i,h−1} ∈ SEQ_{h−1}^0 with F_{h−1,i} ∈ F_h − F_all,h DO
        BEGIN
            APPEND F_{h−1,i} to S_{i,h−1} to form S_{i,h};
            IF D_{1,h}(S_{i,h}) = C THEN include it in SEQ_h^0
            ELSE IF ( D_{1,h}(S_{i,h}) = C + 1 ) AND ( it cannot be eliminated by theorem 3.8 )
                THEN include it in SEQ_h^1;
        END;
    END;
    /* Append each subsequence with its last data mapping function if needed */
    IF ( F_h − F_all,h ) ≠ ∅ AND NOT ( Min(SEQ_{h−1}^1) + 2 > C + 1 ) THEN ( observation 1 )
    BEGIN
        FOR EVERY S_{i,h−1} ∈ SEQ_{h−1}^1 with F_{h−1,i} ∈ F_h − F_all,h DO
        BEGIN
            APPEND F_{h−1,i} to S_{i,h−1} to form S_{i,h};
            IF D_{1,h}(S_{i,h}) = C THEN include it in SEQ_h^0
            ELSE IF ( D_{1,h}(S_{i,h}) = C + 1 ) AND ( it cannot be eliminated by theorem 3.8 )
                THEN include it in SEQ_h^1;
        END;
    END;
    ELIMINATE subsequences in SEQ_h^1 according to theorem 3.8;
END;
END.

/* This procedure forms intermediate subsequences if necessary */
PROCEDURE Form_Subsequence ( SET, MPS, Lmin[], Cand[], C, h, t )

BEGIN
Lmin[ 1 : 2^{(n²−n)/2} ] = large_number;
WHILE ( NOT every S_{h−1} ∈ SET has been used ) AND
      ( Lmin[j] > Min(SET) + D( P_h F_o^{-1} U_j^{-1} ), for some 1 ≤ j ≤ 2^{(n²−n)/2} ) DO ( observation 2 )
BEGIN
    Let S_{h−1} be a sequence in SET that has not been used.
    Let F be the last data mapping function of S_{h−1}. Let F_o = L_t·P_h ( with L_0 taken to be L_1 = I ).
    IF t = 0 THEN
    BEGIN
        IF Lmin[0] > Min(SET) + D( F_o F^{-1} ) THEN
        BEGIN
            Lmin[0] = Min(SET) + D( F_o F^{-1} );
            Cand[0] = S_{h−1} appended with F_o;
            C = minimum of C and Lmin[0];
        END;
    END;
    IF F_o F^{-1} = U_s ∈ U THEN ( observation 3 )
    BEGIN
        IF Lmin[j] > Min(SET) + D( P_h F_o^{-1} U_j^{-1} ) THEN ( observation 2 )
        BEGIN
            Lmin[j] = Min(SET) + D( P_h F_o^{-1} U_j^{-1} );
            Cand[j] = S_{h−1} appended with U_j F_o;
            C = minimum of C and Lmin[j];
        END;
    END
    ELSE
    BEGIN
        IF NOT ( Min(SET) + D( P_h F_o^{-1} ) + 1 > C + 1 ) THEN ( observation 1 )
        BEGIN
            FORM S_ULU, the set of all ULU decompositions of F_o F^{-1}; ( observation 4 )
            FOR j = 1 TO 2^{(n²−n)/2} DO ( observation 3 )
            BEGIN
                FOR every U_j L_r U_t ∈ S_ULU DO
                BEGIN
                    IF Lmin[j] > Min(SET) + D( P_h F_o^{-1} U_j^{-1} ) + 1 THEN

                    BEGIN
                        Lmin[j] = Min(SET) + D( P_h F_o^{-1} U_j^{-1} ) + 1;
                        Cand[j] = S_{h−1} appended with U_j F_o;
                        C = minimum of C and Lmin[j];
                    END;
                END;
            END;
        END;
    END;
    Let S_{h−1} be another sequence in SET that has not been worked on yet.
END;
END.

/* This procedure includes intermediate subsequences into the set of candidate subsequences */
PROCEDURE REDUCE ( T, SET0, SET1, h )
BEGIN
Let C = Min(T);
FOR every S ∈ T DO
BEGIN
    IF D_{1,h}(S) = C THEN put S into SET0
    ELSE IF D_{1,h}(S) = C + 1 THEN put S into SET1 if it cannot be eliminated by theorem 3.8;
END;
END.

Based on the previous theorems, it is clear that the algorithm produces an optimum sequence of data mapping functions. Clearly, the complexity of the algorithm is dependent on the number of intermediate subsequences formed. For each F_a ∈ F_all,h, at most |SEQ_{h−1}| intermediate subsequences with F_a as their last mapping function will be formed. But, based on theorem 3.7, there can only be one candidate subsequence with F_a as its last data mapping function. Therefore, the maximum number of candidate subsequences formed at step h is |Map_h| + |F_all,h|. But

|Map_h| = |last_{h−1} ∩ ( F_all,h−1 ∪ F_all,h−2 ∪ F_all,h−3 )| ≤ 3C, where C = 2^{n²}. Therefore, |Map_h| + |F_all,h| ≤ 4C. At step h+1, there are at most ( |Map_h| + |F_all,h| ) × |F_all,h+1| + |Map_{h+1}| ≤ 4C² + 3C intermediate subsequences. Given k transfers for vector v, we will have at most kC( 4C + 3 ) intermediate subsequences. Since n remains constant for a fixed network, the complexity of the algorithm is O(k).

A Heuristic Algorithm

The problem of determining optimum data mapping functions for a parallel program where precedence relationships between its vector variables exist is complex. For such parallel programs, a heuristic algorithm that is based on the optimum algorithm developed in this chapter has been developed and is described below.

Consider a parallel program consisting of two expression trees only ( figure 3.2 ). Clearly, expression tree #2 in the figure is dependent on results from expression tree #1. To determine optimum sequences of data mapping functions for C or D before determining data mapping functions for A, an initial data mapping function for A has to be assumed. However, the data mapping function assumed for A may not be the initial data mapping function of an optimum sequence for A. Hence, an optimum sequence of data mapping functions for A shall be determined before optimum sequences for C and D are determined. More often than not, a set of optimum sequences for A will be produced by algorithm I. Thus, for the parallel program, the problem of determining which one of them will lead to an optimal solution, if any ever will, is complex. Hence, a heuristic algorithm for finding a near-optimal solution has been developed.
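The heuristic presented next first orders the vector variables by dependence. A minimal sketch of that ordering step, a depth-first topological sort, is given below; the dict representation of the dependence relation is an assumption of this sketch, and cyclic dependences are assumed to have been broken already by renaming result vectors:

```python
def dependence_order(deps):
    """Order variables so that each appears after everything its logical
    transfers depend on. deps[v] lists the variables v depends on."""
    order, seen = [], set()

    def visit(v):
        if v in seen:
            return
        seen.add(v)
        for u in deps.get(v, ()):
            visit(u)          # emit dependencies first
        order.append(v)

    for v in deps:
        visit(v)
    return order

# Expression tree #2 depends on the result A of tree #1 (figure 3.2).
print(dependence_order({'A': [], 'C': ['A'], 'D': ['A', 'C']}))   # -> ['A', 'C', 'D']
```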

Figure 3.2 Expression trees of a parallel program ( trees #1 and #2, with leaf variables A, C and D ).

Algorithm II

1. If the result vector, v, of an expression tree is dependent on its previous value, assign a new vector variable to replace the result vector. Note: for each iteration of a program loop where such a dependency exists, we have to assign a new vector variable to replace each such result vector.

2. Determine the precedence relationships between the vector variables and put them in a sequence, v_1, v_2, ..., v_z, such that the logical transfers for v_j are not dependent on any variable v_i, i > j.

3. For j = 1 to z DO
   BEGIN
   Determine optimum sequences for v_j.

   Pick limit number of optimum sequences for v_j and for each of them determine optimum sequences for the other variables as follows:
   For i = j+1 TO z DO
   BEGIN
   Determine logical transfers for v_i.
   Determine optimum sequences for v_i.
   Arbitrarily pick an optimum sequence for v_i.
   END;
   Keep the optimum sequence for v_j that leads to the lowest total cost for all variables. If j ≠ z, based on the data mapping functions for v_j, determine the logical transfers for v_{j+1}.
   END;

The above heuristic has been tested on a number of example problems, and its performance has been compared with that of a total enumeration algorithm. It is found that in most cases, the heuristic generates optimal or near-optimal solutions.

3.4. Examples

In this section, we will consider two parallel algorithms, the inversion of a lower triangular matrix and the radix-2 fast Fourier transform ( FFT ). These two examples are representative of two diverse classes of important computational requirements, namely, matrix operations and discrete spectrum analysis and related processes.

In section 3.4.1, we will present a parallel algorithm for lower triangular matrix inversion and a corresponding optimum solution. In section 3.4.2, we will present the parallel FFT algorithm and a corresponding optimum solution.

3.4.1. Lower Triangular Matrix Inversion

In this section, we will consider the inversion of an 8×8 lower triangular matrix on an SIMD multicomputer system with 16 processing elements. We will assume that in any operation, one or more processing elements can be disabled or set to produce zeros. We will designate a vector variable of 16 elements by a capital letter. We will also designate the ith element of a vector by its corresponding small letter and a subscript i.

The inversion algorithm uses partitioning [Boro75, Hell78] to invert an r×r lower triangular matrix, V. It is based on the following relationship between V and its inverse, V^{-1}. If

V = | V1  0  |
    | V3  V2 |,

where V1 and V2 are (r/2)×(r/2) lower triangular matrices and V3 is an (r/2)×(r/2) matrix, then

V^{-1} = | V1^{-1}               0       |
         | −V2^{-1} V3 V1^{-1}   V2^{-1} |.

The inversion algorithm hence proceeds in stages. It will first determine V1^{-1} and V2^{-1}, and then V2^{-1}V3 and V2^{-1}V3V1^{-1}. In our example, the elements of V are stored in three vectors, A, D, and L, and they are to be operated on in that order. Their relationship with V is as depicted in figure 3.3.
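Before turning to the data layout, the block-inversion identity just stated can be sanity-checked numerically. The sketch below (plain Python with illustrative names, not the SIMD formulation of the thesis) inverts the triangular blocks by forward substitution and assembles V^{-1} per the formula:

```python
def tri_inv(L):
    """Invert a lower triangular matrix by forward substitution."""
    m = len(L)
    X = [[0.0] * m for _ in range(m)]
    for j in range(m):
        X[j][j] = 1.0 / L[j][j]
        for i in range(j + 1, m):
            s = sum(L[i][k] * X[k][j] for k in range(j, i))
            X[i][j] = -s / L[i][i]
    return X

def mat_mul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def block_tri_inv(V1, V3, V2):
    """V = [[V1, 0], [V3, V2]]  =>
       V^{-1} = [[V1^{-1}, 0], [-V2^{-1} V3 V1^{-1}, V2^{-1}]]."""
    V1i, V2i = tri_inv(V1), tri_inv(V2)
    corner = [[-x for x in row] for row in mat_mul(mat_mul(V2i, V3), V1i)]
    top = [r1 + [0.0] * len(V2) for r1 in V1i]
    bottom = [rc + r2 for rc, r2 in zip(corner, V2i)]
    return top + bottom

V1 = [[2.0, 0.0], [1.0, 1.0]]
V3 = [[3.0, 4.0], [5.0, 6.0]]
V2 = [[1.0, 0.0], [2.0, 4.0]]
V = [[2.0, 0.0, 0.0, 0.0],
     [1.0, 1.0, 0.0, 0.0],
     [3.0, 4.0, 1.0, 0.0],
     [5.0, 6.0, 2.0, 4.0]]
Vi = block_tri_inv(V1, V3, V2)
prod = mat_mul(V, Vi)      # should be the 4x4 identity
```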

V = | a0   0    0    0    0    0    0    0  |
    | a2   a3   0    0    0    0    0    0  |
    | d8   d9   a12  0    0    0    0    0  |
    | d10  d11  a14  a15  0    0    0    0  |
    | l0   l1   l4   l5   a8   0    0    0  |
    | l2   l3   l6   l7   a10  a11  0    0  |
    | l8   l9   l12  l13  d0   d1   a4   0  |
    | l10  l11  l14  l15  d2   d3   a6   a7 |

Vector elements that are not shown are zeroes, e.g., d12 = 0.

Figure 3.3 Relationship between V, A, D and L.

For convenience, we name the submatrices of the matrix V as in figure 3.4.

V = | A0  0   0   0  |
    | D2  A1  0   0  |
    | L0  L1  A2  0  |
    | L2  L3  D0  A3 |

The A_i's, D_i's and L_i's are 2×2 matrices.

Figure 3.4 Relationship between V and its submatrices.

Some more definitions are needed.

Definition 3.2:
Let # denote an exchange operator. Let Y and Z be two vector variables. Then Y # Z means that if the ith processing element is enabled, then the ith elements of vectors Y and Z are swapped.

Definition 3.3:
P_a is the permutation defined by P_a: ( z3, z2, z1, z0 ) → ( z3, z2, z0, z1 ).

Definition 3.4:
P_b is the permutation defined by P_b: ( z3, z2, z1, z0 ) → ( z2, z3, z0, z1 ).

We shall first describe how the inverses of A0, A1, A2 and A3 are formed and how they are stored in a new vector C. The relationship between the elements of C, c_i, 0 ≤ i ≤ 15, and the submatrices is as shown below.

A0^{-1} = | c0   0   |     A1^{-1} = | c12  0   |
          | c2   c3  |               | c14  c15 |

A2^{-1} = | c8   0   |     A3^{-1} = | c4   0   |
          | c10  c11 |               | c6   c7  |

Figure 3.5 Relationship between the A_i^{-1}'s and C.

The following is a description of the procedure for forming the inverses of the 2×2 matrices.

(1) The reciprocals of the diagonal terms of the matrices are formed and stored in their original locations in A. ( They are equal to c0, c3, c12, c15, c8, c11, c4, c7. )

(2) The off-diagonal terms, a2, a14, a10 and a6, of the matrices are multiplied with the diagonal terms of their inverses ( figure 3.6 ); for example, c2 = −c3 a2 c0. The products are then combined with the diagonal terms of their inverses, A0^{-1}, A1^{-1}, A2^{-1} and A3^{-1}, to give a new vector C.

Only processing elements holding off-diagonal terms multiply; the diagonal and off-diagonal terms of the inverses are combined to form C.

Figure 3.6 Forming the off-diagonal terms of the A_i^{-1}'s.

Having the inverses of A0, A1, A2 and A3 formed, we can now proceed to compute A1^{-1}D2 and A3^{-1}D0 to give a new vector E, where

A3^{-1}D0 = | e0  e1 |     and     A1^{-1}D2 = | e8   e9  |
            | e2  e3 |                         | e10  e11 |,

as in figure 3.7.

θ4·Pa forms all the product terms of the diagonal elements. θ5·Pa forms all the product terms of the off-diagonal elements. # denotes the interchange of the vector elements in processors 1, 2, 9 and 10.

Figure 3.7 Forming all the product terms of A1^{-1}D2 and A3^{-1}D0.

The elements of A3^{-1}D0 and A1^{-1}D2 are then formed by adding w2 and w3 together, as in the expression tree in figure 3.8, to give E.

Figure 3.8 Adding product terms to form the elements of A3^{-1}D0 and A1^{-1}D2.

A1^{-1}D2A0^{-1} and A3^{-1}D0A2^{-1} can similarly be formed with the expression trees in figures 3.9 and 3.10.

θ8·Pa forms all the product terms of the diagonal elements. θ9·Pa forms all the product terms of the off-diagonal elements. # denotes the interchange of the vector elements in processors 1, 2, 9 and 10.

Figure 3.9 Forming all the product terms of A1^{-1}D2A0^{-1} and A3^{-1}D0A2^{-1}.

The elements of A3^{-1}D0A2^{-1} and A1^{-1}D2A0^{-1} are formed by adding the product terms w4 and w5 together, as in figure 3.10, to give F.

Figure 3.10 Adding product terms to form the elements of A3^{-1}D0A2^{-1} and A1^{-1}D2A0^{-1}.

Having V1^{-1} and V2^{-1} computed, we can now form V2^{-1}V3V1^{-1}. The product terms for all the elements in U = V2^{-1}V3 are computed according to the two expression trees in figure 3.11.

θ4·Pb forms all the product terms for the diagonal elements of U. θ5·Pb forms all the product terms for u1,0, u0,1, u2,3 and u3,2. Pb forms all the product terms for u2,0, u0,2, u1,3 and u3,1. θ10·Pb forms all the product terms for u3,0, u0,3, u2,1 and u1,2. #1 & #2 denote the interchange of vector elements in processors 1, 2, 5, 6, 9, 10, 13 and 14.

Figure 3.11 Forming the product terms of V2^{-1}V3.

We now have at most four product terms for each element in U. The expression tree in figure 3.12 adds w6 to w7 and w8 to w9 to form two sums of product terms for each element in U.

#1 denotes the interchange of vector elements in processors 0, 1, 2, 3, 12, 13, 14 and 15.

Figure 3.12 Forming intermediate sums of product terms.

Each element in U is finally formed by adding its own corresponding two intermediate sums of product terms according to the expression tree in figure 3.13.

Figure 3.13 Forming the elements of V2^{-1}V3.

Having U = V2^{-1}V3 formed, we can now proceed to compute all the product terms of the elements in H = V2^{-1}V3V1^{-1} according to the two expression trees in figure 3.14.

Pb forms all the product terms for the diagonal elements of H. θ2·Pb forms all the product terms for h0,1, h1,0, h2,3 and h3,2. θ8·Pb forms all the product terms for h0,2, h2,0, h1,3 and h3,1. θ10·Pb forms all the product terms for h0,3, h3,0, h1,2 and h2,1. #1 & #2 denote the interchange of the vector elements in processors 1, 2, 5, 6, 9, 10, 13 and 14.

Figure 3.14 Forming all the product terms of V2^{-1}V3V1^{-1}.

We now have at most four product terms for each element in H. The expression tree in figure 3.15 adds w12 to w13 and w14 to w15 to form two sums of product terms for each element in H.

#1 denotes the interchange of vector elements in processors 4, 5, 6, 7, 8, 9, 10 and 11.

Figure 3.15 Forming intermediate sums of product terms.

Each element in H is finally formed by adding its own corresponding two intermediate sums of product terms according to the expression tree in figure 3.16.

The result vector S is formed from the intermediate sums w16 and w17, and its elements are arranged as

V2^{-1}V3V1^{-1} = | s0   s1   s4   s5  |
                   | s2   s3   s6   s7  |
                   | s8   s9   s12  s13 |
                   | s10  s11  s14  s15 |

Figure 3.16 Forming the elements of H = V2^{-1}V3V1^{-1}.

A total enumeration algorithm and the heuristic algorithm have been used to determine data mapping functions for the above matrix inversion algorithm. It is found that the heuristic algorithm produced an optimum solution in this case. In the heuristic algorithm, the ordering of the vector variables into a sequence is not unique. The following is one such sequence: S, w17, w16, w15, w14, w13, w12, w11, w10, w9, w8, F, L, w4, w5, E, w3, w2, C, D, A.

The total communication cost before applying the heuristic algorithm is 34 routing steps. Applying the heuristic algorithm, the mapping functions for all the variables, except C and F, are determined to be identity. The sequence of communication requirements for F is determined to be θ4·Pb, θ5·Pb, θ10·Pb, Pb. The corresponding sequence of optimum data mapping functions for F is found to be Pb, Pb, Pb, Pb. Due to the initial data mapping function, Pb, for F, the sequence of communication patterns for C is θ4·Pa, θ5·Pa, Pb·θ8·Pa, Pb·θ9·Pa, Pb, θ2·Pb, θ8·Pb, θ10·Pb. The corresponding sequence of optimum data mapping

functions for C is determined to be Pb·θ4·Pa, Pb·θ5·Pa, Pb·θ8·Pa, Pb·θ9·Pa, Pb, Pb, Pb, Pb. The resulting total communication cost of the inversion algorithm is 21 routing steps. For this particular case, it is also the minimum communication cost possible.

3.4.2. Fast Fourier Transform

The radix-2 version of the fast Fourier transform ( FFT ) that we are going to describe is based on Pease's work [Peas68]. It has a signal flow diagram as shown in figure 3.17 and is dependent on the perfect shuffle and the bit reversal permutation. We will first describe the discrete Fourier transform. Let A0[ k ], k = 0, 1, ..., R−1, be R = 2^r samples of a time function sampled at instants that are spaced equally apart. The discrete Fourier transform of A0[ k ] is then a discrete function X[ j ], j = 0, 1, ..., R−1, where

X[ j ] = Σ_{k=0}^{R−1} A0[ k ] W^{jk},    j = 0, 1, ..., R−1,

and W = e^{2πi/R}. Let ( k_{r−1}, k_{r−2}, ..., k_0 ) and ( j_{r−1}, j_{r−2}, ..., j_0 ) be the binary representations of k, j ∈ { 0, 1, 2, ..., R−1 }. Then

X[ j_{r−1}, j_{r−2}, ..., j_0 ] = Σ_{k_0} Σ_{k_1} ... Σ_{k_{r−1}} A0[ k_{r−1}, k_{r−2}, ..., k_0 ] W^{jk},

where each of the indices k_i is summed over the binary values 0 and 1. Based on the above relationship, the FFT can be computed in r stages. At stage t, 1 ≤ t ≤ r, a vector A_t will be computed from A_{t−1}. The vectors are defined as follows.

A_1[ k_{r−2}, ..., k_0, j_0 ] = Σ_{k_{r−1}} A0[ k_{r−1}, k_{r−2}, ..., k_0 ] W^{j_0 2^{r−1} k_{r−1}}

Figure 3.17 Signal flow graph of a 16-point FFT (inputs A_0(0:15), outputs X(0:15); each computation stage is preceded by a perfect shuffle, and data are combined in pairs at each stage).

A_t[k_{r-t-1}, ..., k_0, j_0, ..., j_{t-1}] = Σ_{k_{r-t}} A_{t-1}[k_{r-t}, ..., k_0, j_0, ..., j_{t-2}] W^{(j_{t-1} 2^{t-1} + ... + j_0) 2^{r-t} k_{r-t}},

where t = 2, 3, ..., r. The roles of the perfect shuffle and the bit reversal permutation in the signal flow graph (figure 3.17) become evident when we consider the elements of A_i. Here, each of the two elements A_i[j_{r-1}, j_{r-2}, ..., 0] and A_i[j_{r-1}, j_{r-2}, ..., 1] of A_i is the weighted sum of the two elements A_{i-1}[0, j_{r-2}, ..., j_1] and A_{i-1}[1, j_{r-2}, ..., j_1] of A_{i-1}. A perfect shuffle will pair these two elements together for the computation of A_i[j_{r-1}, j_{r-2}, ..., 0] and A_i[j_{r-1}, j_{r-2}, ..., 1]. The last vector, A_r, has elements that are the values of X[j], but the elements are scrambled in bit reversal order. That is, A_r[j_0, j_1, ..., j_{r-2}, j_{r-1}] = X[j_{r-1}, j_{r-2}, ..., j_0]. A bit reversal permutation will bring the elements of X back into normal order. Since the required computations are executed on pairs of data, the elements of the vectors can be stored in pairs and a 16-point FFT can be executed on a system of eight processing elements. Each vector A_i, 0 ≤ i ≤ 4, can be divided into two vectors, Ā_i and Ã_i, such that Ā_i[0:7] = A_i[0,2,4,6,9,11,13,15] and Ã_i[0:7] = A_i[1,3,5,7,8,10,12,14]. For the ith computation stage, each element in Ā_i has to be paired with an element in Ã_i. The logical transfers for Ā_i and Ã_i, 0 ≤ i ≤ 3, are determined to be P: (z_2, z_1, z_0) → (z_1, z_0, z_2) and P′: (z_2, z_1, z_0) → (z_1, z_0, z̄_2), respectively.
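The perfect shuffle used here, the transfer P: (z_2, z_1, z_0) → (z_1, z_0, z_2), is a left rotation of the three address bits. A small Python sketch of our own (the function names are illustrative, not the thesis's) shows how it interleaves the two halves of an eight-element vector, pairing elements whose indices differ in the leading bit:

```python
def rotate_left_bits(i, n=3):
    """Left-rotate the n address bits of index i: (z2, z1, z0) -> (z1, z0, z2) for n = 3."""
    mask = (1 << n) - 1
    return ((i << 1) | (i >> (n - 1))) & mask

def perfect_shuffle(v):
    """Route the element at index i to index rotate_left_bits(i): an 8-way perfect shuffle."""
    out = [None] * len(v)
    for i, x in enumerate(v):
        out[rotate_left_bits(i)] = x
    return out
```

Shuffling range(8) yields [0, 4, 1, 5, 2, 6, 3, 7], so element j of the first half sits next to element j of the second half; three successive shuffles restore the original order, since rotating three bits three times is the identity.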

Finally, since Ā_4[0:7] = X[0,4,2,6,9,13,11,15] and Ã_4[0:7] = X[8,12,10,14,1,5,3,7], to bring X into normal order such that X̄[0:7] = X[0,2,4,6,9,11,13,15] and X̃[0:7] = X[1,3,5,7,8,10,12,14], the logical transfers for X̄ and X̃ are determined to be P_r: (z_2, z_1, z_0) → (z_2, z_0, z_1) and P_r′: (z_2, z_1, z_0) → (z̄_2, z_0, z_1), respectively. Let R_i, S_i, T_i and U_i, 0 ≤ i ≤ 3, be vectors containing powers of W. For example, S_0 = [1, 1, 1, 1, 1, 1, 1, 1]. Then the expression trees for stages 1, 2, 3 and 4 are shown in figures 3.18, 3.19 and 3.20.
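The bit reversal permutation that restores normal order can likewise be sketched; the fragment below is an illustrative Python sketch of ours, not the thesis's.

```python
def bit_reverse(i, n):
    """Reverse the n-bit binary representation of index i."""
    r = 0
    for _ in range(n):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def unscramble(v):
    """Undo the bit-reversed ordering of the final FFT vector (length a power of two)."""
    n = len(v).bit_length() - 1
    return [v[bit_reverse(j, n)] for j in range(len(v))]
```

For three address bits the reversal sends 0..7 to [0, 4, 2, 6, 1, 5, 3, 7]; since bit reversal is an involution, applying it twice restores the original order.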

Figure 3.18 Expression trees for computing A_1 (stage 1); the operands B̄_0 and B̃_0 are combined with the constant vectors R_0, S_0, T_0 and U_0.

Figure 3.19 Expression trees for computing A_2 and A_3 (stages 2 and 3), i = 1, 2.

Figure 3.20 Expression trees for computing A_4 (stage 4).

Since the constant vectors R_i, S_i, T_i and U_i, 0 ≤ i ≤ 3, are used only once, if P is a logical transfer for one of these constant vectors, then the optimum data mapping function for that vector is P. We therefore do not have to consider data mapping functions for these constant vectors in the heuristic algorithm. The ordering of the vector variables into a sequence for the heuristic algorithm is clearly not unique. The following is one such sequence: B̄_3, B̃_3, B̄_2, B̃_2, B̄_1, B̃_1, B̄_0, B̃_0, Ā_4, Ã_4. The total communication cost before applying the heuristic algorithm is 20 routing steps. For one of the solutions obtained, the sequences of logical transfers specifying the communication requirements of B̄_i and B̃_i, 0 ≤ i ≤ 3, are the same. Furthermore, B̄_i and B̃_i have the same sequence of data mapping functions. Table II shows the sequences of logical transfers and data mapping functions determined for the vector variables. The resulting total

communication cost of the FFT algorithm is 8 routing steps. For this particular case, it is also the minimum communication cost possible.

TABLE II
Logical Transfers and Data Mapping Functions for the 16-Point FFT

Vector variables    Sequence of logical transfers    Sequence of data mapping functions
B̄_3, B̃_3           P_r, P                           P_r, P_r
B̄_2, B̃_2           P_r P, P_r P                     P_r P, P_r P
B̄_1, B̃_1           P_r P^2, P_r P                   P_r P^2, P_r P
B̄_0, B̃_0           P_r, P_r P^2 P                   P_r, P_r
Ā_4, Ã_4            P_r P, P_r P                     P_r P, P_r P

CHAPTER 4

DATA ALIGNMENT

We shall now develop techniques for determining an alignment function for each internal node of the expression trees of a given parallel program. In this section, we will assume that the logical transfers for a binary operation on any two vector variables (internal or external node variables), u and v, of a given expression tree in a parallel program are specified as v(P_v^{-1}(i)) * u(P_u^{-1}(i)), where * stands for any binary operator. If the alignment function for w, the temporary result vector of the binary operation, is an identity function, then the logical transfers for aligning u and v to w are P_{vw} = P_v and P_{uw} = P_u. We will assume that the data mapping functions for the leaf variables of an expression tree are identity functions. If the data mapping functions are not identity functions, then we will replace each logical transfer and each data mapping function with a new logical transfer and an identity function, respectively. For instance, if the logical transfer and data mapping function for a vector variable v_i are P_{v_i} and F_{v_i}, respectively, then the new logical transfer will be P_{v_i}′ = P_{v_i} F_{v_i}^{-1} and the new data mapping function will be an identity function. Though the logical transfers for a binary operation on any two temporary variables may be changed during the process of determining where the two variables should be aligned, we will assume that the logical transfers for all temporary variables are initially identity functions, or else

they will be transformed recursively to identity functions as in chapter 3. We will also assume that there is no precedence relationship between the expression trees of a given parallel program. This assumption enables us to partition a parallel program with i independent expression trees into i independent subproblems. Applying the optimum data alignment algorithm developed for a single expression tree to each of the i independent subproblems will produce an optimum solution for the parallel program. Based on the optimum algorithm, a heuristic algorithm will also be developed for parallel programs where precedence relationships between expression trees exist. The data alignment algorithm is based closely on the relationship between logical transfers and alignment functions. Hence, we will study this relationship in the next two sections.

4.1. Logical Transfers and Alignment Functions

We now characterize the relationship between logical transfers and alignment functions. Consider the following binary operation on two leaf variables, u and v: v(P_v^{-1}(i)) * u(P_u^{-1}(i)). Let w be the temporary result vector of the above expression. If the alignment function for w is I, then u(P_u^{-1}(i)) is to be multiplied by v(P_v^{-1}(i)) and stored in w(i). The corresponding logical transfers, P_{vw} and P_{uw}, for the binary operation are P_v and P_u, respectively (figure 4.1), and the total communication cost is given by D(P_u) + D(P_v). However, more often than not, the identity alignment function will not give the minimum communication cost. We may want to have the variables aligned according to another alignment function, A_w, so that u(P_u^{-1}(i)) will be multiplied by v(P_v^{-1}(i)) and stored in w(A_w(i)), and the corresponding total communication cost

D(P_{uw}) + D(P_{vw}), where P_{vw} = A_w P_v and P_{uw} = A_w P_u, is minimized (figure 4.1).

Figure 4.1 Logical transfers and alignment function.

Suppose the vector variable v is a temporary result vector of another binary operation. Since the logical transfers for temporary result vectors are initially identity functions, if the alignment function for v is A_v, then u(P_u^{-1}(i)) is to be multiplied by v(A_v^{-1}(i)) and stored in w(A_w(i)). The corresponding new logical transfer for aligning v to w is then P_{vw} = A_w A_v^{-1}, and the total communication cost for the multiplication is D(P_{vw}) + D(P_{uw}). Thus, the data alignment algorithm determines, for each internal node w of an expression tree, a corresponding alignment function, A_w, so that the communication cost of a parallel program, given by

Σ_{statements} [ Σ_{v: leaf variable, w = parent of v} D(P_{vw}) + Σ_{w_i, w_j: internal nodes, w_j = parent of w_i} D(P_{w_i w_j}) ],

is minimized. In this chapter, we will also limit ourselves to alignment functions that are in A. If A_w is an alignment function in A, then there is a corresponding (n+1)×(n+1) nonsingular binary matrix A_w whose first row is [1 0 0 ... 0]. The following are some basic theorems.

Theorem 4.1
The set of all alignment functions, A_{all,v}, for a logical transfer P_v, such that A_a ∈ A_{all,v} iff D(A_a P_v) ≤ 1, is given by A_{all,v} = {L_j U_i P_v^{-1} | L_j ∈ L and U_i ∈ U}. The cardinality of A_{all,v} is 2^{n^2}.

Proof:
Let A_h = L_j U_i P_v^{-1}. Then A_h P_v = (L_j U_i P_v^{-1}) P_v = L_j U_i, which is LU decomposable. Therefore, if A_h = L_j U_i P_v^{-1}, then A_h ∈ A_{all,v}. Conversely, let A_h ∈ A_{all,v}. Then A_h P_v = L_a U_b and A_h = L_a U_b P_v^{-1}. Therefore, A_{all,v} = {L_j U_i P_v^{-1} | L_j ∈ L and U_i ∈ U}. Since there are 2^{(n^2-n)/2} U's and 2^{(n^2+n)/2} L's, |A_{all,v}| = 2^{n^2}.

Theorem 4.2
Let D(A_w P_v) ≤ 1. If A_b ∈ L A_w, then D(A_b P_v) ≤ 1.

Proof:

Let A_w P_v = L_a U_b. Then, for any L_i ∈ L, (L_i A_w) P_v = (L_i L_a) U_b, which is LU decomposable since L_i L_a ∈ L. Therefore, if A_b ∈ L A_w, then A_b P_v is LU decomposable and D(A_b P_v) ≤ 1.

Theorem 4.3
Given P_v and P_u, there exists an alignment function A_w such that D(A_w P_v) ≤ 1 and D(A_w P_u) ≤ 1.

Proof:
Every nonsingular binary matrix can be decomposed into a product of the form U_r L_s U_t [Peas77]. Let P_v^{-1} P_u = U_r L_s U_t and let A_w = U_r^{-1} P_v^{-1}. The logical transfers are then P_{vw} = A_w P_v = U_r^{-1} and P_{uw} = A_w P_u = L_s U_t, where D(P_{vw}) ≤ 1 and D(P_{uw}) ≤ 1.

Theorem 4.4
Let A_{w0}, A_{w1} and A_{w2} be alignment functions for the three temporary vector variables, w0, w1 and w2, shown in figure 4.2. Let A_{ll}(A_{w0}, A_{w1}) be a set of alignment functions such that A_{w2} ∈ A_{ll}(A_{w0}, A_{w1}) iff D(P_{w0w2}) ≤ 1 and D(P_{w1w2}) ≤ 1, where P_{w0w2} = A_{w2} A_{w0}^{-1} and P_{w1w2} = A_{w2} A_{w1}^{-1}. Then A_{ll}(A_{w0}, A_{w1}) can be partitioned into cosets of L.

Proof:
Consider A_{w2} ∈ A_{ll}(A_{w0}, A_{w1}). If D(A_{w2} A_{w0}^{-1}) ≤ 1 and D(A_{w2} A_{w1}^{-1}) ≤ 1, then D(L_i A_{w2} A_{w0}^{-1}) ≤ 1 and D(L_i A_{w2} A_{w1}^{-1}) ≤ 1 for any L_i ∈ L (theorem 4.2). Therefore, L A_{w2} ⊆ A_{ll}(A_{w0}, A_{w1}), and A_{ll}(A_{w0}, A_{w1}) can be partitioned into cosets of L.
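The one-pass criterion used throughout these proofs, that a transfer matrix is realizable in a single pass iff it is LU decomposable over GF(2), can be checked mechanically. The sketch below is our own illustration, not part of the thesis: for a nonsingular binary matrix, an LU factorization without pivoting exists iff every leading principal minor is nonsingular mod 2.

```python
def det_gf2(M):
    """Determinant over GF(2) by Gaussian elimination (returns 0 or 1)."""
    M = [row[:] for row in M]
    n = len(M)
    for c in range(n):
        pivot = next((r for r in range(c, n) if M[r][c]), None)
        if pivot is None:
            return 0
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            if M[r][c]:
                M[r] = [a ^ b for a, b in zip(M[r], M[c])]
    return 1

def is_lu_decomposable_gf2(M):
    """A nonsingular binary matrix M equals LU (no pivoting) iff every
    leading principal minor is nonsingular over GF(2)."""
    n = len(M)
    return all(det_gf2([row[:k] for row in M[:k]]) == 1 for k in range(1, n + 1))
```

For instance, the 2×2 exchange matrix fails the test (its top-left minor is 0), matching the fact that a bare exchange needs more than one pass in this model.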

Figure 4.2 Logical transfers and alignment functions for temporary variables.

Theorem 4.5
Let A_{w0}, A_{w1} and A_{w2} be alignment functions for the three temporary vector variables, w0, w1 and w2, shown in figure 4.2. The logical transfer, P_{w1w2}, for w1 is then A_{w2} A_{w1}^{-1}. If D(P_{w1w2}) = 2, then there exists another A_b ∈ L A_{w1} such that, for the new P_{w1w2} = A_{w2} A_b^{-1}, D(P_{w1w2}) ≤ 1.

Proof:
Let A_{w2} A_{w1}^{-1} = L_r U_s L_t. Then A_{w2} A_{w1}^{-1} L_t^{-1} = L_r U_s. Let A_b = L_t A_{w1}. Then D(P_{w1w2}) ≤ 1, where P_{w1w2} = A_{w2} A_b^{-1} = L_r U_s.

While it is difficult to compute the lower and upper limits of the resulting communication cost of an arbitrary expression tree after applying the data alignment algorithm, the lower and upper limits of the resulting communication cost of a full expression tree can be computed and are given by the following theorem.

Theorem 4.6
Consider a full expression tree with r = 2^n, n ≥ 2, leaf variables. If the logical transfers for the leaf variables are all different, then the resulting communication cost after determining the alignment functions is bounded by

r - 1 ≤ cost ≤ (3/2) r.

Proof:
Clearly, r - 1 ≤ cost. We have to prove only that cost ≤ (3/2) r. Consider the alignment function for v(P_v^{-1}(i)) * u(P_u^{-1}(i)) in the earlier example; by theorem 4.3, we can always find an alignment function A_w such that D(P_{vw}) ≤ 1 and D(P_{uw}) ≤ 1. Let v and u be leaf variables of a full expression tree. With w and w_G, they form a subtree (figure 4.3) of the full expression tree.

Figure 4.3 Subtree of a full expression tree.

Let the alignment function for w_G be A_{wG} = I. Then, by theorem 4.5, we can always find A_w′ ∈ L A_w such that the logical transfer P_{wwG} = A_{wG} A_w′^{-1} can be realized in one pass. Furthermore, since A_w′ ∈ L A_w, by theorem 4.2 the new logical transfers for v and u, P_{vw} = A_w′ P_v and P_{uw} = A_w′ P_u, can both be realized in one pass. If we let the alignment functions for the ancestors of the parents of the leaf variables be identity and determine similarly the alignment functions for all of the leaf variables and all of the parent nodes of the leaf variables, then each of the r leaves contributes at most one routing pass and each of the r/2 parent nodes contributes at most one routing pass, so the total communication cost associated with the expression tree is less than or equal to (3/2) r.

4.2. Characterization of Optimum Sequences

The data alignment algorithm that we will be describing is similar to the data mapping algorithm of the previous chapter. We assume an expression tree with k nodes. The nodes are numbered in postorder fashion, and they form a sequence V = v_1, ..., v_k. Through partial enumeration of possible partial solutions, the algorithm determines a sequence of alignment functions, E = A_1, A_2, ..., A_k, where A_i is the alignment function for node v_i, so that the communication cost of executing the corresponding expression tree is minimized. For reasons that will be detailed later, if v_i is an external node, then we let A_i equal P_{v_i}^{-1}. During step h of the algorithm, A_h, a set of alignment functions, is determined for node v_h. Subsequences of alignment functions for the nodes of the subtree with root node v_h are formed by concatenating every A_y ∈ A_h with two subsequences of alignment functions, to give E_L, E_R, A_y,

where E_L is a subsequence of alignment functions formed earlier for the nodes of the left subtree of v_h, and E_R is a subsequence of alignment functions formed earlier for the nodes of the right subtree of v_h. As in data mapping, the subsequences formed will be called intermediate subsequences. Intermediate subsequences that can lead to an optimum sequence for the expression tree are saved for a future step, while the rest are eliminated. The intermediate subsequences that are saved are called candidate subsequences. The set of candidate subsequences for the subtree with root node v_h is denoted by SEQ_h. In the following, we will describe rules for eliminating some of the intermediate subsequences from consideration. We will also describe A_h, the set of alignment functions for v_h, that must be determined during step h of the algorithm. Some basic notations and definitions for the algorithm are needed.

Notation 4.1: An expression tree with k nodes is designated a k-tree. Since the nodes are numbered in postorder fashion, the root node of a k-tree is v_k.

Notation 4.2: For j < k, the j-stree is a subtree of the k-tree. Its root node is v_j and it includes all the descendants of node v_j in the k-tree. The tree kj-stree is a subtree of the k-tree with the j-stree removed (figure 4.4). Similarly, for h < j, if the h-stree is a subtree of the j-stree, the jh-stree is defined as a subtree of the j-stree with the h-stree removed.

Figure 4.4 Relationship between k-tree, j-stree and kj-stree.

Notation 4.3: A sequence E_{ij} = A_{a,i}, ..., A_{j,i} is a subsequence of alignment functions for subtree j-stree iff the nodes v_r, a ≤ r ≤ j, are nodes in the j-stree. Here, A_{r,i} in E_{ij} is an alignment function for node v_r. Note: since the nodes are numbered in postorder fashion, the last alignment function in a subsequence of alignment functions, E_{ij}, for subtree j-stree is always the alignment function for node v_j.

Notation 4.4: E_{ijh} is a subsequence of alignment functions for subtree jh-stree iff every alignment function in the subsequence is for a node in the jh-stree.
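The postorder numbering behind these notations has the property that the nodes of a j-stree occupy a contiguous block of numbers ending at j. A minimal Python sketch of our own (the Node class is illustrative, not the thesis's):

```python
class Node:
    def __init__(self, left=None, right=None):
        self.left, self.right = left, right
        self.num = None  # postorder number, assigned below

def number_postorder(root, counter=None):
    """Assign postorder numbers 1..k; every subtree's numbers end at its root's number."""
    if counter is None:
        counter = [0]
    if root.left:
        number_postorder(root.left, counter)
    if root.right:
        number_postorder(root.right, counter)
    counter[0] += 1
    root.num = counter[0]
    return root.num
```

For a full tree with four leaves, the left subtree receives numbers 1-3, the right subtree 4-6, and the root 7, so the right son of the root is always node v_{k-1}.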

Definition 4.1: The cost of a subsequence for a subtree is the communication cost of executing the subtree when the alignment functions for all of its internal nodes are specified by the subsequence. For example, the costs of E_{ij} and E_{ijh} are d(E_{ij}) and d(E_{ijh}), respectively. d(E_{ij}) can be defined recursively as follows:

d(E_{ij}) = d(E_{ih}) + d(E_{ijh}) + D(P_{v_h v_t}),

where v_h ≠ v_j is an internal node of the j-stree, v_t is the parent node of v_h, and E_{ijh} is a subsequence of alignment functions for subtree jh-stree. The cost of a sequence E_i for an expression tree is similarly defined as d(E_i).

Definition 4.2: A sequence E_i is optimum if there does not exist another sequence E_j such that d(E_i) > d(E_j).

Notation 4.5: Let S_h be a set of subsequences of alignment functions for the h-stree. Then Min(S_h) is the cost of a minimum cost element of S_h.

Consider an internal node, v_b. Since the nodes are numbered in postorder fashion, the right son node of v_b is v_{b-1}. Let v_l be the left son node of v_b. Let E_{il} = A_a, ..., A_l and E_{j(b-1)} = A_{l+1}, A_{l+2}, ..., A_{b-1} be subsequences of alignment functions for the l-stree and the (b-1)-stree, respectively. The only way to form a subsequence of alignment functions for

the b-stree from these two subsequences is to concatenate the two subsequences with an alignment function, A_b, for node v_b, so that the new subsequence is

E_{rb} = E_{il}, E_{j(b-1)}, A_b = A_a, ..., A_l, A_{l+1}, A_{l+2}, ..., A_{b-1}, A_b.

The communication cost of executing the subtree b-stree is then

d(E_{rb}) = d(E_{il}) + d(E_{j(b-1)}) + D(A_b A_l^{-1}) + D(A_b A_{b-1}^{-1}).

The following theorems determine the conditions under which an intermediate subsequence for the h-stree is to be eliminated. We will consider two intermediate subsequences, E1h and E2h, where E1h ≠ E2h. Given a sequence E1 having E1h as its subsequence, we will examine conditions under which we can always find E2 that has E2h as its subsequence such that d(E1) ≥ d(E2). Since we are seeking an optimum solution only, these are conditions under which E1h can be eliminated.

Theorem 4.7
Let E1h be a subsequence of E1. Given a subsequence E2h such that d(E1h) - d(E2h) ≥ 2, we can always find E2 containing E2h such that d(E1) ≥ d(E2).

Proof:
The communication cost associated with E1 is

d(E1) = d(E1h) + d(E1kh) + D(A_{t,1} A_{h,1}^{-1}),

where v_t is the parent node of v_h, and A_{t,1}, A_{h,1} ∈ E1 are the alignment functions for v_t and v_h, respectively. Form E2 by replacing the subsequence E1h in E1 with E2h. Then,

d(E2) = d(E2h) + d(E1kh) + D(A_{t,1} A_{h,2}^{-1}),

where A_{h,2} ∈ E2 is the alignment function for node v_h.

Now d(E1) - d(E2) ≥ 2 + D(A_{t,1}A_{h,1}^{-1}) - D(A_{t,1}A_{h,2}^{-1}). Since 0 ≤ D(A_{t,1}A_{h,1}^{-1}), D(A_{t,1}A_{h,2}^{-1}) ≤ 2, we have -2 ≤ D(A_{t,1}A_{h,1}^{-1}) - D(A_{t,1}A_{h,2}^{-1}) ≤ 2. Therefore, d(E1) - d(E2) ≥ 0.

Corollary 4.7.1
The candidate set, SEQ_h, can be partitioned into two disjoint sets, SEQ_h^0 and SEQ_h^1, where S_{ih} ∈ SEQ_h^0 iff d(S_{ih}) = Min(SEQ_h), and S_{ih} ∈ SEQ_h^1 iff d(S_{ih}) = Min(SEQ_h) + 1.

Theorem 4.8
Let E1h be a subsequence of E1. Given a subsequence E2h such that A_{h,1} = A_{h,2} and d(E1h) ≥ d(E2h), we can always find E2 containing E2h such that d(E1) ≥ d(E2).

Proof:
The communication cost associated with E1 is

d(E1) = d(E1h) + d(E1kh) + D(A_{t,1}A_{h,1}^{-1}),

where v_t is the parent node of v_h, and A_{t,1}, A_{h,1} ∈ E1 are alignment functions for v_t and v_h, respectively. Form E2 by replacing the subsequence E1h in E1 with E2h. Then,

d(E2) = d(E2h) + d(E1kh) + D(A_{t,1}A_{h,2}^{-1}),

where A_{h,2} ∈ E2 is the alignment function for node v_h. Since A_{h,1} = A_{h,2}, d(E1) ≥ d(E2).

Corollary 4.8.1
Let a_h be the set of last alignment functions in all of the subsequences in SEQ_h. Then a_h can be partitioned into two sets, a_h^z, z = 0 or 1, where a_h^z is the set of last alignment functions in all of the subsequences in SEQ_h^z.

Theorem 4.9
Let E1h be a subsequence of E1. Given a subsequence E2h such that A_{h,1}A_{h,2}^{-1} is LU decomposable, A_{h,1} ≠ A_{h,2} and d(E1h) - d(E2h) ≥ 1, we can always find E2 containing E2h as its subsequence such that d(E1) ≥ d(E2).

Proof:
Consider E1; its cost is

d(E1) = d(E1h) + d(E1kh) + D(A_{t,1}A_{h,1}^{-1}),

where v_t is the parent node of v_h, and A_{t,1}, A_{h,1} ∈ E1 are alignment functions for v_t and v_h, respectively. Form E2 by replacing E1h in E1 with E2h. Then,

d(E2) = d(E2h) + d(E2kh) + D(A_{t,1}A_{h,2}^{-1}),

where A_{h,2} ∈ E2 is the alignment function for node v_h.

Therefore,

d(E1) - d(E2) = d(E1h) + D(A_{t,1}A_{h,1}^{-1}) - d(E2h) - D(A_{t,1}A_{h,2}^{-1}) ≥ 1 + D(A_{t,1}A_{h,1}^{-1}) - D(A_{t,1}A_{h,2}^{-1}).

If A_{t,1} = A_{h,1}, then D(A_{t,1}A_{h,1}^{-1}) = 0 and, since A_{h,1}A_{h,2}^{-1} is LU decomposable and A_{h,1} ≠ A_{h,2}, D(A_{t,1}A_{h,2}^{-1}) = 1; hence d(E1) - d(E2) ≥ 0. If A_{t,1} ≠ A_{h,1}, then 1 ≤ D(A_{t,1}A_{h,1}^{-1}) ≤ 2 and 0 ≤ D(A_{t,1}A_{h,2}^{-1}) ≤ 2, so that -1 ≤ D(A_{t,1}A_{h,1}^{-1}) - D(A_{t,1}A_{h,2}^{-1}). Therefore, d(E1) - d(E2) ≥ 0.

The partial enumeration algorithm will be inefficient if we let A_h = A for all internal nodes v_h in the k-tree. Based on the following theorems, we will determine the set of alignment functions, A_h, that is needed to form the intermediate subsequences, E_{ih}'s, to guarantee an optimum solution. For convenience, we will limit ourselves to a node v_h, its left son v_l, its right son v_r and their corresponding sets of candidate subsequences SEQ_h, SEQ_l and SEQ_r, respectively. Their corresponding sets of alignment functions are A_h, A_l and A_r, respectively. If v_l is a leaf node, we will let A_l = {P_l^{-1}}, where P_l is the logical transfer specified for v_l in the given parallel program. We will also let SEQ_l = {E_l}, where E_l = P_l^{-1}. This can be done because, with an alignment function A_h for v_h, the resulting logical transfer for v_h, P_h = A_h A_l^{-1}, is mathematically equal to the actual logical transfer A_h P_l. If v_r is a leaf node, then A_r and SEQ_r are similarly defined. We now characterize the set of alignment functions, A_h, for an internal node v_h. Let E_{jr} ∈ SEQ_r and E_{il} ∈ SEQ_l. Then A_{j,r} and A_{i,l} are the alignment functions for v_r and v_l in the subsequences E_{jr} and E_{il}, respectively. Let the subset of A_h that is used with the alignment functions A_{j,r} and A_{i,l} be A_h^{ij}. The following two theorems characterize A_h^{ij}.
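The bottom-up partial enumeration described above can be sketched in miniature. The following Python fragment is our own toy model, not the thesis's Algorithm III: alignment functions are modeled as elements of Z_2^3 under XOR (so A B^{-1} becomes a ^ b), D is an assumed pass-count function with values in {0, 1, 2}, and pruning keeps, for each root alignment, only the cheapest cost and discards candidates costing two or more above the minimum (the elimination rule of theorem 4.7).

```python
from itertools import product

ONE_PASS = {1, 2, 4}  # hypothetical set of one-pass-realizable transfers

def D(t):
    """Assumed routing distance: 0 for identity, 1 for 'LU-decomposable', else 2."""
    return 0 if t == 0 else (1 if t in ONE_PASS else 2)

def best_alignments(tree):
    """Return {root alignment: minimal subtree cost}.
    A leaf is ('leaf', transfer); an internal node is ('op', left, right)."""
    if tree[0] == 'leaf':
        return {tree[1]: 0}  # a leaf's alignment is fixed by its transfer
    left = best_alignments(tree[1])
    right = best_alignments(tree[2])
    cand = {}
    for (al, cl), (ar, cr), ah in product(left.items(), right.items(), range(8)):
        cost = cl + cr + D(ah ^ al) + D(ah ^ ar)
        if cost < cand.get(ah, 10 ** 9):
            cand[ah] = cost
    m = min(cand.values())
    # keep SEQ0 (cost = Min) and SEQ1 (cost = Min + 1); drop the rest
    return {a: c for a, c in cand.items() if c <= m + 1}
```

For the two-leaf tree ('op', ('leaf', 1), ('leaf', 2)), the cheapest root alignments cost two passes in total, and every retained candidate is within one pass of that minimum.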

Theorem 4.10
If D(A_{j,r}A_{i,l}^{-1}) ≤ 1 or D(A_{i,l}A_{j,r}^{-1}) ≤ 1, then the set A_h^{ij} contains at most the elements A_{i,l} and A_{j,r}.

Proof:
Suppose D(A_{j,r}A_{i,l}^{-1}) ≤ 1. Form E_{xh} = E_{il}, E_{jr}, A_{x,h}, where A_{x,h} ∉ {A_{i,l}, A_{j,r}}. The cost of E_{xh} is then

d(E_{xh}) = d(E_{il}) + d(E_{jr}) + D(A_{x,h}A_{i,l}^{-1}) + D(A_{x,h}A_{j,r}^{-1}).

We can always form E_{sh} = E_{il}, E_{jr}, A_{j,r}, such that

d(E_{sh}) = d(E_{il}) + d(E_{jr}) + D(A_{j,r}A_{i,l}^{-1}).

The difference, d(E_{xh}) - d(E_{sh}), is then D(A_{x,h}A_{i,l}^{-1}) + D(A_{x,h}A_{j,r}^{-1}) - D(A_{j,r}A_{i,l}^{-1}). Now, 1 ≤ D(A_{x,h}A_{i,l}^{-1}), D(A_{x,h}A_{j,r}^{-1}) ≤ 2. Therefore, 1 ≤ d(E_{xh}) - d(E_{sh}) ≤ 4. If the difference is greater than or equal to two, then E_{xh} ∉ SEQ_h (theorem 4.7). The difference is equal to one iff D(A_{x,h}A_{i,l}^{-1}) = D(A_{x,h}A_{j,r}^{-1}) = 1; but then, by theorem 4.9, E_{xh} ∉ SEQ_h.

Corollary 4.10.1
If D(A_{j,r}A_{i,l}^{-1}) = 0, then A_{j,r} = A_{i,l} and A_h^{ij} = {A_{i,l}}.

Theorem 4.11
If D(A_{j,r}A_{i,l}^{-1}) = 2 and D(A_{i,l}A_{j,r}^{-1}) = 2, then it is sufficient that A_h^{ij} = {A_{i,l}, A_{j,r}} ∪ A_{ll}(A_{j,r}, A_{i,l}), where A_y ∈ A_{ll}(A_{j,r}, A_{i,l}) iff D(A_y A_{j,r}^{-1}) ≤ 1 and D(A_y A_{i,l}^{-1}) ≤ 1.

Proof:
Consider any alignment function A_{x,h} ∉ {A_{i,l}, A_{j,r}} ∪ A_{ll}(A_{j,r}, A_{i,l}). Let E_{xh} = E_{il}, E_{jr}, A_{x,h}; then

d(E_{xh}) = d(E_{jr}) + d(E_{il}) + D(A_{x,h}A_{i,l}^{-1}) + D(A_{x,h}A_{j,r}^{-1}).

Suppose D(A_{x,h}A_{j,r}^{-1}) = 2. Form E_{jh} = E_{il}, E_{jr}, A_{j,r}, such that

d(E_{jh}) = d(E_{jr}) + d(E_{il}) + D(A_{j,r}A_{i,l}^{-1}).

Therefore, d(E_{xh}) - d(E_{jh}) = D(A_{x,h}A_{j,r}^{-1}) + D(A_{x,h}A_{i,l}^{-1}) - D(A_{j,r}A_{i,l}^{-1}) = D(A_{x,h}A_{i,l}^{-1}). If D(A_{x,h}A_{i,l}^{-1}) = 2, then d(E_{xh}) - d(E_{jh}) = 2 and E_{xh} ∉ SEQ_h (theorem 4.7). If D(A_{x,h}A_{i,l}^{-1}) = 1, then d(E_{xh}) - d(E_{jh}) = 1 and, by theorem 4.9, E_{xh} ∉ SEQ_h. Similarly, if D(A_{x,h}A_{i,l}^{-1}) = 2, it can be shown that E_{xh} can be eliminated by forming E_{ih} = E_{il}, E_{jr}, A_{i,l}.

The following theorems determine additional conditions under which A_{ll}(A_{i,l}, A_{j,r}) does not need to be included in A_h^{ij}.

Theorem 4.12
Let E_{il} ∈ SEQ_l and E_{jr} ∈ SEQ_r^0. A_{i,l} and A_{j,r} are the last alignment functions in the subsequences E_{il} and E_{jr}. If D(A_{i,l}A_{j,r}^{-1}) = 1, then for an optimum solution it is sufficient that A_h^{is} = {A_{i,l}, A_{s,r}} for A_{s,r} in any E_{sr} ∈ SEQ_r^0.

Proof:
Consider A_{x,h}, where A_{x,h} ≠ A_{i,l} and A_{x,h} ≠ A_{s,r}, for any E_{sr} ∈ SEQ_r^0. Let E_{xh} = E_{il}, E_{sr}, A_{x,h}, E_{sr} ∈ SEQ_r^0. The communication cost associated with E_{xh} is then

d(E_{xh}) = d(E_{il}) + d(E_{sr}) + D(A_{x,h}A_{i,l}^{-1}) + D(A_{x,h}A_{s,r}^{-1}),

where A_{s,r} is the last alignment function in the subsequence E_{sr}. We can also form E_{ih} = E_{il}, E_{jr}, A_{i,l}. The communication cost associated with E_{ih} is

d(E_{ih}) = d(E_{il}) + d(E_{jr}) + D(A_{i,l}A_{j,r}^{-1}) = d(E_{il}) + d(E_{jr}) + 1.

Therefore, d(E_{xh}) - d(E_{ih}) = D(A_{x,h}A_{i,l}^{-1}) + D(A_{x,h}A_{s,r}^{-1}) - 1. Since A_{x,h} ≠ A_{s,r} and A_{x,h} ≠ A_{i,l}, 1 ≤ D(A_{x,h}A_{i,l}^{-1}), D(A_{x,h}A_{s,r}^{-1}) ≤ 2. Therefore, 2 ≤ D(A_{x,h}A_{i,l}^{-1}) + D(A_{x,h}A_{s,r}^{-1}) ≤ 4 and 1 ≤ d(E_{xh}) - d(E_{ih}) ≤ 3. If d(E_{xh}) - d(E_{ih}) ≥ 2, then E_{xh} ∉ SEQ_h (theorem 4.7). d(E_{xh}) - d(E_{ih}) is equal to one iff D(A_{x,h}A_{i,l}^{-1}) = D(A_{x,h}A_{s,r}^{-1}) = 1; since D(A_{x,h}A_{i,l}^{-1}) = 1, by theorem 4.9, E_{xh} ∉ SEQ_h. Therefore, for an optimum solution, it is sufficient that A_h^{is} = {A_{i,l}, A_{s,r}} for A_{s,r} in any E_{sr} ∈ SEQ_r^0.

Theorem 4.13
Let E_{il} ∈ SEQ_l and E_{jr} ∈ SEQ_r^0. A_{i,l} and A_{j,r} are the last alignment functions in the subsequences E_{il} and E_{jr}. If D(A_{i,l}A_{j,r}^{-1}) = 1, then for an optimum solution it is sufficient that A_h^{is} = {A_{i,l}, A_{s,r}} for A_{s,r} in any E_{sr} ∈ SEQ_r^0 ∪ SEQ_r^1 = SEQ_r.

Proof:
Consider the subsequence E_{ih} = E_{il}, E_{jr}, A_{i,l}. The cost of the subsequence E_{ih} is given by d(E_{il}) + Min(SEQ_r^0) + 1. Consider a subsequence E_{uh} = E_{il}, E_{sr}, A_x, where E_{sr} ∈ SEQ_r^1 and A_x ∉ {A_{s,r}, A_{i,l}}. The cost of the subsequence E_{uh} is

d(E_{uh}) ≥ d(E_{il}) + Min(SEQ_r^1) + 2 ≥ d(E_{il}) + Min(SEQ_r^0) + 3 ≥ d(E_{ih}) + 2.

By theorem 4.7, E_{uh} ∉ SEQ_h. Therefore, it is sufficient that A_h^{is} = {A_{i,l}, A_{s,r}} for any A_{s,r} in E_{sr} ∈ SEQ_r^1. By theorem 4.12, it is sufficient that A_h^{is} = {A_{i,l}, A_{s,r}} for any A_{s,r} in E_{sr} ∈ SEQ_r^0, too.

Now, a_r^z, z = 0 or 1, is the set of last alignment functions in all of the subsequences in SEQ_r^z (theorem 4.8). From theorem 4.5, we know that if a_r^z contains a coset of L, then for any A_{i,l} in E_{il} ∈ SEQ_l there exists an A_{s,r} in E_{sr} ∈ SEQ_r^z such that D(A_{i,l}A_{s,r}^{-1}) ≤ 1. The determination of whether there exists A_{s,r} in E_{sr} ∈ SEQ_r^z such that, for a given A_{i,l} of E_{il} in SEQ_l, D(A_{i,l}A_{s,r}^{-1}) = 1 can be simplified if we can identify the existence of a coset.

The following two theorems characterize the cosets in a_h.

Theorem 4.14
If L A_{x,h} ⊆ a_h, then L A_{x,h} ⊆ a_h^0 or L A_{x,h} ⊆ a_h^1.

Proof:
If the above is not true, then there exist A_{y,h} in E_{yh} with A_{y,h} ∈ a_h^0 and A_{t,h} = L_i A_{y,h} in E_{th} with A_{t,h} ∈ a_h^1. But d(E_{th}) - d(E_{yh}) = 1 and D(A_{t,h}A_{y,h}^{-1}) = D(L_i) ≤ 1. This contradicts theorem 4.9, which excludes E_{th} from SEQ_h.

Let a_h again be partitioned into two sets, a_{h,s} and a_{h,b}, where A_x ∈ a_{h,b} iff L A_x ⊆ a_h. The following theorem shows that if an alignment function, A_x, is in a_{h,s}, then A_x is in a_{l,s} or a_{r,s}.

Theorem 4.15
If A_{x,h} ∈ a_{h,s}, then A_{x,h} ∈ a_{l,s} ∪ a_{r,s}.

Proof:
Let E_{xh} ∈ SEQ_h be the candidate subsequence containing A_{x,h}. Its cost can be broken down as follows:

d(E_{xh}) = d(E_{xl}) + d(E_{xr}) + D(A_{x,h}A_{x,l}^{-1}) + D(A_{x,h}A_{x,r}^{-1}).

Suppose the theorem is not true. Then, since A_h = ∪_{A_{i,l} ∈ a_l, A_{j,r} ∈ a_r} A_h^{ij},

where A_h^{ij} = {A_{i,l}, A_{j,r}} or A_h^{ij} = {A_{i,l}, A_{j,r}} ∪ A_{ll}(A_{i,l}, A_{j,r}), we have to consider the following two cases only:

1. A_{x,h} ∈ A_{ll}(A_{x,l}, A_{x,r}) for some A_{x,l} ∈ a_l and A_{x,r} ∈ a_r, where A_{x,h} ≠ A_{x,l} and A_{x,h} ≠ A_{x,r}, or
2. A_{x,h} ∈ a_{l,b} or A_{x,h} ∈ a_{r,b}.

Case 1. If A_{x,h} ∈ A_{ll}(A_{x,l}, A_{x,r}) for some A_{x,l} ∈ a_l and A_{x,r} ∈ a_r, then D(A_{x,h}A_{x,l}^{-1}) = 1 and D(A_{x,h}A_{x,r}^{-1}) = 1. Since L A_{x,h} ⊆ A_{ll}(A_{x,l}, A_{x,r}), the intermediate subsequence E_{2h} = E_{xl}, E_{xr}, L_1 A_{x,h}, where L_1 is any lower triangular matrix in L, with

d(E_{2h}) = d(E_{xl}) + d(E_{xr}) + D(L_1 A_{x,h}A_{x,l}^{-1}) + D(L_1 A_{x,h}A_{x,r}^{-1}),

will be formed by the data alignment algorithm. Since D(A_{x,h}A_{x,l}^{-1}) = D(L_1 A_{x,h}A_{x,l}^{-1}) and D(A_{x,h}A_{x,r}^{-1}) = D(L_1 A_{x,h}A_{x,r}^{-1}), d(E_{xh}) = d(E_{2h}). Now, if E_{2h} is in SEQ_h, then L A_{x,h} would be a subset of a_{h,b}; but this contradicts the fact that A_{x,h} ∈ a_{h,s}. Therefore, E_{2h} must be eliminated by theorem 4.7, 4.8 or 4.9. If E_{2h} were eliminated by theorem 4.7 or theorem 4.9, then E_{xh} would be eliminated too, contradicting E_{xh} ∈ SEQ_h. Therefore, E_{2h} can only be eliminated by some subsequence E_{yh} with A_{h,y} = A_{h,2} = L_1 A_{x,h} and d(E_{2h}) = d(E_{yh}) (theorem 4.8). In that case, L_1 A_{x,h} is still in a_h; furthermore, L A_{x,h} ⊆ a_{h,b}, which contradicts A_{x,h} ∈ a_{h,s}.

Case 2. Now, if A_{x,h} = A_{x,l} ∈ a_{l,b} or A_{x,h} = A_{x,r} ∈ a_{r,b}, then D(A_{x,h}A_{x,l}^{-1}) = 0 or D(A_{x,h}A_{x,r}^{-1}) = 0. Suppose A_{x,h} = A_{x,l} ∈ a_{l,b} and A_{x,h} ≠ A_{x,r} ∈ a_{r,b}. The data alignment algorithm will form

the following intermediate subsequence, E_{yh} = E_{yl}, E_{xr}, L_1 A_{x,l}, where A_{y,l} = L_1 A_{x,l} ∈ a_{l,b}, such that

d(E_{yh}) = d(E_{yl}) + d(E_{xr}) + D(L_1 A_{x,l}A_{y,l}^{-1}) + D(L_1 A_{x,l}A_{x,r}^{-1}).

Since d(E_{yl}) = d(E_{xl}), using the same arguments as in case 1, it can be shown that L A_{x,h} ⊆ a_{h,b}, and this contradicts A_{x,h} ∈ a_{h,s}. If A_{x,l} = A_{x,r} = A_{x,h}, it is obvious that A_{x,h} ∈ a_{h,s} iff A_{x,l} ∈ a_{l,s} or A_{x,r} ∈ a_{r,s}; otherwise, A_{x,h} ∈ a_{h,b} and L A_{x,h} ⊆ a_{h,b}. Therefore, if A_{x,h} ∈ a_{h,s}, then A_{x,h} ∈ a_{l,s} or A_{x,h} ∈ a_{r,s}.

Corollary 4.15.1
Let A_{x,h} ∈ a_h and ∪A_{ll} = ∪_{A_{i,l} ∈ a_l, A_{j,r} ∈ a_r, A_{x,h} ∉ {A_{i,l}, A_{j,r}}} A_{ll}(A_{i,l}, A_{j,r}). If A_{x,h} ∈ (a_{l,b} ∪ a_{r,b} ∪ ∪A_{ll}) \ (a_{l,s} ∪ a_{r,s}), then A_{x,h} ∈ a_{h,b}.

Proof:
Since A_{x,h} ∈ a_h, if A_{x,h} ∉ a_{h,s}, then A_{x,h} ∈ a_{h,b}. The contrapositive of theorem 4.15 is: if A_{x,h} ∉ a_{l,s} ∪ a_{r,s}, then A_{x,h} ∉ a_{h,s}. Now,

a_h ⊆ ∪ A_h^{ij} = a_{l,s} ∪ a_{l,b} ∪ a_{r,s} ∪ a_{r,b} ∪ ∪A_{ll}.

Therefore, if A_{x,h} ∈ (a_{l,b} ∪ a_{r,b} ∪ ∪A_{ll}) \ (a_{l,s} ∪ a_{r,s}), then A_{x,h} ∈ a_{h,b}.

Corollary 4.15.2

If A_{x,h} ∈ a_{h,s}, then A_{x,h} = P_c^{-1}, where P_c is the logical transfer for a leaf node, v_c, in subtree h-stree.

Proof:
Applying theorem 4.15 recursively, it can be shown that if A_{x,h} ∈ a_{h,s}, then A_{x,h} ∈ a_{c,s}, where v_c is a leaf node of the h-stree. But a_c = A_c = {P_c^{-1}}.

Based on theorem 4.14 and corollary 4.15.1, it can be concluded that we can treat each coset as a unit and need not consider every alignment function in the coset. For the case where (a_{l,b} ∪ a_{r,b} ∪ ∪A_{ll}) \ (a_{l,s} ∪ a_{r,s}) ≠ ∅, if A_{x,h} ∈ (a_{l,b} ∪ a_{r,b} ∪ ∪A_{ll}) \ (a_{l,s} ∪ a_{r,s}) and A_{x,h} ∈ a_h, then A_{x,h} ∈ a_{h,b} and L A_{x,h} ⊆ a_{h,b}. In fact, except for the rare case L A_x ⊆ a_{r,s} ∪ a_{l,s} for some A_x, if L A_x ⊆ a_{h,b}, then L A_x ⊆ a_{r,b} or L A_x ⊆ a_{l,b} or L A_x ⊆ A_{ll}(A_{i,l}, A_{j,r}) for some A_{i,l} ∈ a_l and A_{j,r} ∈ a_r. Instead of forming intermediate subsequences with every alignment function in A_{ll}(A_{i,l}, A_{j,r}), only one intermediate subsequence has to be formed for each coset in A_{ll}(A_{i,l}, A_{j,r}).

4.3. Data Alignment Algorithms

Based on the theorems in sections 4.1 and 4.2, the following algorithm can be proved to produce optimum solutions.

Algorithm III

MAIN PROGRAM.
BEGIN
  FIND_OPTIMUM_SEQ ( root node, SEQ0, SEQ1 );
  /** Any sequence in SEQ0 is an optimum sequence **/
END.

SUBROUTINE FIND_OPTIMUM_SEQ ( vh, SEQ0, SEQ1 )
BEGIN
  /** Subroutine to determine the candidate subsequences for vh **/
  IF ( vh has no children ) THEN
  BEGIN /** vh is a leaf node **/
    SEQ0 = { Ph^{-1} };
    SEQ1 = empty;
  END
  ELSE
  BEGIN
    best = large number;
    SEQ = SEQ0 = SEQ1 = empty;
    Let vl be the left son of vh;
    /** Form SEQl and SEQr **/
    FIND_OPTIMUM_SEQ ( vl, SEQl0, SEQl1 );
    Let vr be the right son of vh;
    FIND_OPTIMUM_SEQ ( vr, SEQr0, SEQr1 );
    /** Form SEQ, the set of intermediate subsequences **/
    FORM_CANDIDATE ( SEQ, SEQl0, SEQr0, best );
    FORM_CANDIDATE ( SEQ, SEQr0, SEQl0, best );
    FORM_CANDIDATE ( SEQ, SEQl0, SEQr1, best );
    FORM_CANDIDATE ( SEQ, SEQr0, SEQl1, best );
    /** Check if more intermediate subsequences are needed **/
    IF ( Min(SEQl1) + Min(SEQr1) - best < 2 ) THEN     ( theorem 4.7 )
    BEGIN
      FORM_CANDIDATE ( SEQ, SEQl1, SEQr1, best );
      FORM_CANDIDATE ( SEQ, SEQr1, SEQl1, best );
    END;
    IF ( Min(SEQl0) + Min(SEQr0) + 2 - best < 2 ) THEN     ( theorem 4.7 )
      FORM_ALL ( 0, 0, best, SEQ );
    IF ( Min(SEQl1) + Min(SEQr0) + 2 - best < 2 ) THEN     ( theorem 4.7 )
    BEGIN

      FORM_ALL ( 0, 1, best, SEQ );
      FORM_ALL ( 1, 0, best, SEQ );
    END;
    /** Form the set of candidate subsequences **/
    REDUCE SEQ according to theorems 4.5 and 4.8;
    PARTITION SEQ into SEQ0 and SEQ1 SUCH THAT
      Ekh ∈ SEQ0 IFF d(Ekh) = Min(SEQ);
  END;
END.

SUBROUTINE FORM_CANDIDATE ( SEQ, SEQl, SEQr, best )
BEGIN
  /** Form candidate subsequences to be included in SEQ **/
  FOR EVERY Exl ∈ SEQl DO
  BEGIN
    locbest = large number;
    done = false;
    WHILE ( Min(SEQl) + Min(SEQr) < locbest ) AND NOT done DO     ( theorem 4.7 )
    BEGIN
      IF Ayr ∈ Eyr ∈ SEQr represents a coset in ar THEN
      BEGIN
        Let Ayr Axl^{-1} = La Ub Lc;
        Form Ezr from Eyr by replacing every Ayr in Eyr with L1 Ayr;
        Form Eyh = Exl, Ezr, L1 Ayr;
      END
      ELSE
      BEGIN
        FORM Eyh = Exl, Eyr, Axl, where Axl is the alignment function for vl in Exl and Eyr ∈ SEQr;
      END;
      /** Is it a possible candidate subsequence? **/
      IF d(Eyh) < locbest THEN
      BEGIN
        candidate subsequence = Eyh;
        locbest = d(Eyh);
      END;

      /** Done? **/
      IF EVERY Eyr ∈ SEQr has been used THEN done = true;
    END;
    INCLUDE the candidate subsequence into SEQ;
  END;
END;

SUBROUTINE FORM_ALL ( i, j, best, SEQ )
BEGIN
  /** Form All( Axr, Ayl ) if necessary, and form all of the subsequences due to All( Axr, Ayl ) **/
  /** All the subsequences formed are of the same cost **/
  FOR EVERY Axr ∈ Exr ∈ SEQri AND Ayl ∈ Eyl ∈ SEQlj DO
  BEGIN
    IF NECESSARY TO FORM All( Axr, Ayl ) THEN
    BEGIN
      /** Form candidate subsequences to be included in SEQ **/
      FOR EVERY coset in All( Axr, Ayl ) DO
      BEGIN
        Let an arbitrary alignment function Azh in the coset represent the coset;
        FORM Ezh = Eyl, Exr, Azh;
        SEQ = SEQ ∪ { Ezh };
      END;
    END;
  END;
  IF ( best > d(Ezh) ) THEN best = d(Ezh);
END;

The complexity of the algorithm is clearly dependent on the number of intermediate subsequences formed throughout the algorithm. For an estimate, we will assume that reduction according to theorems 4.7 and 4.9 is not made. This is necessary because it is difficult to assess the number of intermediate subsequences eliminated by theorems 4.7 and 4.9. Furthermore, we will assume that the expression tree is a full binary tree with 2^n leaf nodes. The

tree has a height of n, and the levels are labeled from 0,...,n-1 as shown in figure 4.5. From theorems 4.10, 4.11, 4.12 and 4.13, it can be concluded that for the maximum number of intermediate subsequences, there should not be any cosets in a_{l,b} and a_{r,b}. Furthermore, |a_{l,s}| = |a_{r,s}| = 2^i, where v_h is at level i of the tree, for each node. The maximum number of intermediate subsequences formed by concatenating every subsequence in SEQl and every subsequence in SEQr with one of the last alignment functions in the two subsequences is 2 |a_r| x |a_l|. The maximum number of cosets that we have to include in A_h is |a_{l,s}| x |a_{r,s}| t, where t is the maximum number of cosets in ALL(A_{x,l}, A_{y,r}), A_{x,l} ∈ a_{l,s} and A_{y,r} ∈ a_{r,s}. Therefore, the maximum number of intermediate subsequences formed with an alignment function from one of the cosets is |a_{l,s}| x |a_{r,s}| t. Hence, for the worst case, the number of intermediate subsequences = 2 |a_r| x |a_l| + |a_{l,s}| x |a_{r,s}| t, where t is the maximum number of cosets in ALL(A_{x,l}, A_{y,r}), A_{x,l} ∈ a_{l,s} and A_{y,r} ∈ a_{r,s}.

Figure 4.5 A full expression tree ( levels labeled 3, 2, 1, 0 from the root down ).

With no subsequence elimination according to theorems 4.7 and 4.9, the number of subsequences in A_h is |A_h|. Since only |a_r| + |a_l| candidate subsequences will be produced from the 2 |a_r| x |a_l| intermediate subsequences ( theorem 4.8 ),

|A_h| = |a_r| + |a_l| + |a_{l,s}| x |a_{r,s}| t.

The following theorem finds the maximum |A_h|.

Theorem 4.16

The maximum |A_h| formed for a node, v_h, which is at level i, is 2^i (2^{i+1} - 1) t + 2^{i+1}, where t = the maximum of the number of cosets in ALL(A_x, A_y) for all A_x, A_y with A_x ≠ A_y.

Proof:

Mathematical induction will be used to prove this theorem. Consider a node at level 0. |a_r| = |a_l| = |a_{r,s}| = |a_{l,s}| = 1. Therefore,

|A_0| = |a_r| + |a_l| + |a_{r,s}| x |a_{l,s}| t = 2^0 (2^1 - 1) t + 2^1 = t + 2.

Now, for a node v_h at level i, the maximum |a| = 2^i ( theorem 4.15 ). Suppose the theorem is true for nodes up to level i-1. Consider a node v_h at level i. Then

|a_r| = |a_l| = 2^{i-1} (1 + 2 + ... + 2^{i-1}) t + 2^i

and |a_{r,s}| = |a_{l,s}| = 2^i. Therefore,

|A_h| = |a_r| + |a_l| + |a_{l,s}| x |a_{r,s}| t
= 2 ( 2^{i-1} (1 + 2 + ... + 2^{i-1}) t + 2^i ) + 2^i x 2^i t

= 2^i (1 + 2 + ... + 2^{i-1} + 2^i) t + 2^{i+1}
= 2^i (2^{i+1} - 1) t + 2^{i+1}.  □

Theorem 4.17:

The complexity of the algorithm is O(k^4), where k is the total number of internal and external nodes.

Proof:

The number of intermediate subsequences formed for node v_h at level i is

2 |a_r| x |a_l| + t 2^{2i} = 2 ( 2^{i-1} (2^i - 1) t + 2^i )^2 + 2^{2i} t.

Since we are concerned with complexity only, we need only be concerned with the dominant factor 2^{4i-2} t^2. The number of nodes at level i is 2^{n-1-i}. The total number of intermediate subsequences at level i is, hence, of the order of 2^{n+3i} t^2. The total number of intermediate subsequences generated throughout the algorithm is then of the order of

sum over i = 0,...,n-1 of 2^{n+3i} t^2 = 2^n (2^{3n} - 1) t^2 / 7.

Since there are ( k + 1 ) / 2 = 2^n leaf nodes in the k-tree ( a full expression tree ), the complexity of the algorithm is O(k^4).  □
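The counting argument above can be checked numerically. The sketch below treats t as a free parameter and reads the recurrence A(i) = 2 A(i-1) + 4^i t off the proof of theorem 4.16; both the closed form and the O(k^4) bound of theorem 4.17 are exercised for small trees.

```python
# Numeric check of the worst-case counts in theorems 4.16 and 4.17.
# A(i) is the maximum |A_h| for a node at level i; the recurrence
# A(i) = 2 A(i-1) + (4**i) t with A(0) = t + 2 is read off the proof.

def max_subsequences(level, t):
    a = t + 2                       # |A_0| = |a_r| + |a_l| + t = t + 2
    for i in range(1, level + 1):
        a = 2 * a + (4 ** i) * t    # kept candidates plus coset representatives
    return a

def closed_form(i, t):
    return 2 ** i * (2 ** (i + 1) - 1) * t + 2 ** (i + 1)

# The closed form of theorem 4.16 matches the recurrence.
for t in (1, 2, 5):
    for i in range(8):
        assert max_subsequences(i, t) == closed_form(i, t)

# Total intermediate subsequences over a full tree with 2**n leaves:
# level i has 2**(n-1-i) internal nodes, each forming about
# 2 A(i-1)**2 + (4**i) t intermediates.  With k = 2**(n+1) - 1 total
# nodes, the sum stays below k**4, consistent with theorem 4.17.
def total_intermediates(n, t):
    total = 0
    for i in range(n):
        per_node = t + 2 if i == 0 else 2 * max_subsequences(i - 1, t) ** 2 + (4 ** i) * t
        total += 2 ** (n - 1 - i) * per_node
    return total

for n in range(3, 9):
    k = 2 ** (n + 1) - 1
    assert 0 < total_intermediates(n, 1) < k ** 4
```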

For a large k-tree, A_h in algorithm III may still grow larger and larger, with v_h getting closer and closer to the root node, until it equals A. Thus, given a parallel program with large expression trees, we may have to partition each large expression tree into smaller expression trees between which precedence relationships exist, and use the following heuristic algorithm to solve for a near-optimal solution.

A Heuristic Algorithm

Consider the parallel algorithm shown in figure 3.2. Here, expression tree #2 is dependent on expression tree #1. Since the optimum alignment function obtained by algorithm III for v_A in expression tree #1 may not be the identity, we must appropriately determine the logical transfers for A in expression tree #2 before solving for an optimum sequence of alignment functions for expression tree #2. More often than not, algorithm III will produce a set of optimum alignment functions for v_A. Thus, the problem of determining which of them will lead to an optimum solution, if any of them ever will, is complex. Hence, we have developed a heuristic algorithm for finding a near-optimum solution when precedence relationships between expression trees exist.

Algorithm IV

1. FOR j = 1 TO z = "no. of expression trees" DO
BEGIN
Determine optimum sequences of alignment functions for expression tree j.
Pick a number of optimum alignment functions for the result vector of expression tree j. For each of them, determine optimum sequences for the other expression trees as follows:
FOR i = j+1 TO z DO
BEGIN
Determine all the logical transfers for expression tree i.
Determine optimum sequences of alignment functions for expression tree i.
Arbitrarily pick an optimum sequence of alignment functions for expression tree i.

END;
Keep the optimum sequence of alignment functions ( for expression tree j ) that gives the lowest total cost for the complete problem.
If j < z, then determine all the logical transfers for expression tree j+1.
END;

The above heuristic algorithm has been tested on a number of example problems, its performance has been compared with that of a total enumeration algorithm, and it has been found that in most cases the heuristic algorithm generates optimal or near-optimal solutions. The results of five experiments on the heuristic algorithm are shown in table III. In each experiment, the parallel program contains three full binary trees, each with four leaf nodes ( figure 4.6 ). Each tree has the same vector variable for all its leaf nodes, and the logical transfers for the leaf nodes are different. v_z, 2 ≤ z ≤ 3, is the result vector of tree #(z-1). Thus, there is precedence among the three trees. The five parallel programs differ in their communication requirements for their vector variables.

Table III  Experiments on the Heuristic Algorithm

Parallel    Optimum               Resulting communication cost
program     communication cost    ( heuristic algorithm )
1           15                    15
2           14                    14
3           14                    14
4           13                    13
5           13                    14
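The control structure of algorithm IV can be sketched as follows. Everything here is a hypothetical stand-in (the cost model, the string-valued sequences, and solve_tree, which is assumed to return the optimum alignment sequences for one tree given the choices already fixed upstream); the sketch shows only the shape of the look-ahead search, not the alignment machinery itself.

```python
# Skeleton of the precedence heuristic (algorithm IV), assumptions as above.
def heuristic(trees, solve_tree, cost, limit=3):
    chosen = {}                                  # tree index -> chosen sequence
    for j in range(len(trees)):
        best_total, best_seq = None, None
        for seq in solve_tree(j, chosen)[:limit]:
            trial = {**chosen, j: seq}
            total = cost(seq)
            for i in range(j + 1, len(trees)):   # fix downstream trees greedily
                trial[i] = solve_tree(i, trial)[0]   # arbitrary optimum
                total += cost(trial[i])
            if best_total is None or total < best_total:
                best_total, best_seq = total, seq
        chosen[j] = best_seq                     # keep the best choice for tree j
    return chosen

# Toy instance: tree 0 has two equal-cost optima, but only "b" lets
# tree 1 be aligned cheaply -- the look-ahead picks it.
costs = {"a": 1, "b": 1, "x10": 10, "x1": 1}
def solve_tree(tree, fixed):
    if tree == 0:
        return ["a", "b"]
    return ["x10"] if fixed.get(0) == "a" else ["x1"]

result = heuristic([0, 1], solve_tree, costs.get, limit=2)
assert result == {0: "b", 1: "x1"}
```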

Figure 4.6 Expression trees of a parallel program for data alignment.

4.4. Examples

The heuristic algorithm developed in this chapter was used to find alignment functions for the lower triangular matrix inversion algorithm and the FFT algorithm described in chapter 3. For the lower triangular matrix inversion algorithm, the alignment functions for w1, W1, w3, W3, E, w4, W4, w5, W5 and F are determined to be Pa^{-1}. The alignment functions for w6, W6, w7, W7, w8, W8, w9, W9, w10, W10, w11, W11, w12, W12, w13, W13, w14, W14, w15, W15, w16, W16, w17, W17 and S are determined to be Pa. The alignment functions for the other variables, e.g., C, are determined to be identity functions. The resulting communication cost of executing the inversion algorithm is 21 routing steps. As in the data mapping case, the above solution is determined to be optimum.

For the FFT algorithm described in chapter 3, the alignment functions for B0 and B0' are determined to be P1 and P1^{-1}, respectively. For Ai, Bi, wi,1, wi,2, 1 ≤ i ≤ 3, the alignment functions are (P1)^i. For Ai', Bi', wi,1', wi,2', 1 ≤ i ≤ 3, the alignment functions are (P1)^{i+1}. The alignment functions for A4, X, w3,1 and w3,2 are P1 P^{-1}. The alignment functions for A4', w3,3 and w3,3' are P^{-1} P1. The resulting communication cost is 8 routing steps, which is the minimum communication cost possible.

CHAPTER 5

APPLICATIONS FOR OTHER NETWORKS

In this chapter, we will show how the algorithms developed in chapters 3 and 4 are applicable to Γ, a class of multistage interconnection networks that is functionally equivalent to the omega network.

Two networks, the N1 network and the N2 network, are defined to be functionally equivalent [WuFe79] iff Q1 = R1 Q2 R2, where Q1 is the set of admissible network permutations for the N1 network, Q2 is the set of admissible network permutations for the N2 network, and R1 and R2 are permutations. Any admissible network permutation of the N1 network can be decomposed into three functions: a permutation R1, an admissible network permutation for N2, and a permutation R2. For Γ, the class of multistage interconnection networks with which we are concerned, R2 = R1 = R1^{-1} and R1 is in A. As an example, if Q2 is equal to Ω, the set of admissible network permutations for the omega network, and Q1 is the set of admissible network permutations for the indirect binary n-cube [Peas77] shown in figure 5.1, then R1 ( = R2 = R1^{-1} ) is a bit reversal permutation.

In the following two sections, we will be concerned with a network in Γ, the N1 network. It is functionally related to the omega network as follows.

Figure 5.1 A 2^3 x 2^3 indirect binary n-cube ( stages 0, 1 and 2 ).

Q1 = R Ω R, where Q1 is the set of admissible network permutations for the N1 network, R is a permutation in A and R = R^{-1}. In the following, we will let S1 denote the SIMD multicomputer system with an N1 network. We will let SΩ denote the SIMD multicomputer system with an omega network. We will also let DΩ and DN be the distance functions associated with the omega network and the N1 network, respectively. The following lemma relates the two distance functions, DΩ and DN.

Lemma 5.1

DΩ(P) = DN(RPR), for P ∈ A.

Proof:

Since the omega network and the N1 network are functionally equivalent, P ∈ Ω iff RPR ∈ Q1. Therefore, DΩ(P) = 1 iff DN(RPR) = 1.

Now, DΩ(P) = 0 iff P = I. Similarly, DN(RPR) = 0 iff RPR = I. Clearly, P = I iff RPR = I, since R = R^{-1}. Therefore, DΩ(P) = 0 iff DN(RPR) = 0.

If DΩ(P) = 2, then P is LUL decomposable. Let P = La Ub Lc. P, therefore, can be realized by the following two admissible physical transfers through the omega network: Lc and then La Ub. Now, R Lc R and R La Ub R both can be realized in one pass through the N1 network. Therefore, R La Ub R R Lc R = R La Ub Lc R = RPR can be realized in two passes through the N1 network.

Now, let RPR be a permutation that cannot be realized in one pass through the N1 network. Then P cannot be realized in one pass through the omega network, and P can be decomposed into Ld Ue Lf, where Ld ≠ I, Ue ≠ I and Lf ≠ I. Then RPR = R Ld Ue Lf R = R Ld Ue R R Lf R. RPR, therefore, can be realized by the following two admissible physical transfers through the N1 network: R Lf R and then R Ld Ue R.

Therefore, every permutation in A can be realized in less than or equal to two passes through the N1 network, and DΩ(P) = 2 iff DN(RPR) = 2. Clearly, it follows that DΩ(P) = DN(RPR), for P ∈ A.  □
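The role played by R = R^{-1} in this proof can be illustrated over GF(2). Restricting attention to the linear parts of permutations in A (an assumption; complement offsets are ignored), conjugation by the bit-reversal matrix J turns unit lower triangular factors into unit upper triangular ones, which is the structural reason a one-pass omega transfer maps to a one-pass indirect binary n-cube transfer.

```python
import random

n = 3  # number of address bits; linear permutations act on n-bit indices

def matmul(a, b):
    """Matrix product over GF(2)."""
    size = len(a)
    return [[sum(a[i][k] & b[k][j] for k in range(size)) % 2
             for j in range(size)] for i in range(size)]

I = [[int(i == j) for j in range(n)] for i in range(n)]
# Bit reversal corresponds to the exchange matrix J (ones on the anti-diagonal).
J = [[int(i + j == n - 1) for j in range(n)] for i in range(n)]

# R is its own inverse, which is what lets R La Ub R R Lc R collapse.
assert matmul(J, J) == I

def random_unit_lower():
    m = [[int(i == j) for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(i):
            m[i][j] = random.randint(0, 1)
    return m

def is_unit_upper(m):
    return all(m[i][j] == int(i == j)
               for i in range(n) for j in range(i + 1))

# Conjugating a unit lower triangular factor by J yields a unit upper
# triangular factor: (J L J)[i][j] = L[n-1-i][n-1-j].
for _ in range(25):
    L = random_unit_lower()
    assert is_unit_upper(matmul(matmul(J, L), J))
```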

In section 5.1, we will show how the data mapping algorithms developed in chapter 3 can be used to map a parallel program onto S1. In section 5.2, we will show how the data alignment algorithms developed in chapter 4 can be used for S1.

5.1. Data Mapping Algorithms

In this section, we will show how the data mapping algorithms developed in chapter 3 can be used to map parallel programs optimally or near-optimally onto S1. We will first describe a data mapping algorithm for parallel programs whose expression trees are independent. Algorithm V below will optimally map such a parallel program onto S1.

Algorithm V

1. Transform the given parallel program by multiplying every logical transfer, Pi, in the parallel program by R to give a corresponding new logical transfer R Pi.

2. Apply algorithm I to map optimally each subproblem of the transformed parallel program onto SΩ.

3. Multiply each data mapping function, Fi, thus obtained for R Pi by R to give a new data mapping function R Fi. This data mapping function R Fi is used in the system S1 for the corresponding logical transfer, Pi, in the original parallel program.  □
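Algorithm V is just a wrapper around the omega-network solver. A minimal sketch, with solve_on_omega a hypothetical placeholder for algorithm I and permutations represented as n x n bit matrices mod 2 (linear parts only, an assumption):

```python
def matmul(a, b):
    """Matrix product over GF(2)."""
    size = len(a)
    return [[sum(a[i][k] & b[k][j] for k in range(size)) % 2
             for j in range(size)] for i in range(size)]

def map_onto_n1(transfers, R, solve_on_omega):
    """Algorithm V: transform the transfers, solve on the omega system,
    then transform the resulting data mapping functions back."""
    transformed = [matmul(R, P) for P in transfers]   # step 1: P_i -> R P_i
    omega_maps = solve_on_omega(transformed)          # step 2: optimum F_i on S_omega
    return [matmul(R, F) for F in omega_maps]         # step 3: F_i -> R F_i

# Toy check with R = bit reversal (the exchange matrix) and a placeholder
# solver that simply returns identity mappings.
n = 3
I = [[int(i == j) for j in range(n)] for i in range(n)]
J = [[int(i + j == n - 1) for j in range(n)] for i in range(n)]
maps = map_onto_n1([I, J], J, lambda ts: [I for _ in ts])
assert maps == [J, J]
```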

The set of data mapping functions for S1 thus obtained can be shown to map the given parallel program optimally onto S1. In the following, we will first show that the total communication cost of executing the transformed parallel program mapped onto SΩ is equal to the communication cost of executing the original parallel program correspondingly mapped onto S1 with the N1 network. Then, we will show that the data mapping algorithm does indeed produce an optimum solution.

Theorem 5.1

The total communication cost of executing the transformed parallel program mapped onto SΩ is equal to the total communication cost of executing the original parallel program correspondingly mapped onto S1.

Proof:

The communication cost of executing a parallel program mapped onto a multicomputer system is composed of the communication cost of realizing all of the logical transfers in the parallel program and the communication cost of remapping. Based on algorithm V, if Pi is a logical transfer in the original parallel program, then R Pi is the corresponding logical transfer in the transformed parallel program. If Fi is the data mapping function for R Pi in the transformed parallel program, then R Fi is the data mapping function for Pi in the original parallel program for S1.

Now, the communication cost of realizing the logical transfer, Pi, through the N1 network is DN(Pi Fi^{-1} R). The communication cost of realizing the corresponding logical transfer, R Pi, through the omega network is DΩ(R Pi Fi^{-1}). Based on lemma 5.1, DN(Pi Fi^{-1} R) = DΩ(R Pi Fi^{-1}).

Let Fi and Fi+1 be the data mapping functions for the ith and (i+1)th logical transfers for a vector variable, v, in the transformed parallel program, respectively. Then, R Fi and R Fi+1 are the corresponding data mapping functions for the ith and (i+1)th logical transfers for v in the original parallel program, respectively. Now, the communication cost of realizing the remapping, R Fi+1 Fi^{-1} R, through the N1 network is DN(R Fi+1 Fi^{-1} R). The communication cost of realizing the corresponding remapping, Fi+1 Fi^{-1}, through the omega network is DΩ(Fi+1 Fi^{-1}). Based on lemma 5.1, DN(R Fi+1 Fi^{-1} R) = DΩ(Fi+1 Fi^{-1}).

Since the remapping costs and the costs of realizing the logical transfers are equal in both cases, the total communication cost of executing the transformed parallel program mapped onto SΩ is equal to the total communication cost of executing the original parallel program correspondingly mapped onto S1.  □

Theorem 5.2

Algorithm V will optimally map a given parallel program onto an SIMD multicomputer system with an N1 network.

Proof:

Based on theorem 5.1, the communication cost of executing the transformed parallel program mapped onto SΩ is equal to the communication cost of executing the original parallel program correspondingly mapped onto S1. If the data mapping solution for the parallel program on the N1 network is not optimum, then there exists another data mapping solution with lower total communication

cost. This means that there exists a better data mapping solution for the execution of the transformed parallel program on SΩ. This contradicts the fact that algorithm I in chapter 3 is optimum. Therefore, algorithm V is an optimum algorithm.  □

For parallel programs where precedence relationships between expression trees exist, the heuristic algorithm in chapter 3 can be modified similarly. Since the logical transfers for a vector variable, v, may depend on the data mapping functions of other vector variables, the logical transfers for v must not be transformed until all the data mapping functions for those vector variables and the logical transfers for v are determined. The following heuristic algorithm is a modified version of algorithm II in chapter 3.

Algorithm VI

1. Determine the precedence relationships between the vector variables and put them in a sequence, v1, v2, ..., vz, such that the logical transfers for vj are not dependent on any variable vi, i > j.

2. FOR j = 1 TO z DO
BEGIN
Transform all the logical transfers, Pj,l, for vj to R Pj,l.
Apply algorithm I to determine optimum sequences of data mapping functions for vj.
Pick limit number of optimum sequences for vj and for each of them determine optimum sequences for the other variables as follows:
FOR i = j+1 TO z DO
BEGIN
Determine the logical transfers for vi.
Transform all the logical transfers, Pi,l, for vi to R Pi,l.

Apply algorithm I to determine optimum sequences of data mapping functions for vi.
Arbitrarily pick an optimum sequence for vi.
END;
Keep the optimum sequence ( for vj ) that leads to the lowest total cost for all the variables.
If j ≠ z, determine the logical transfers for vj+1.
END;

5.2. Data Alignment Algorithms

A similar approach can be taken in adapting the data alignment algorithms in chapter 4 for S1. For parallel programs where no precedence relationship between expression trees exists, algorithm VII below provides an optimal solution.

Algorithm VII

1. Transform the given parallel program by multiplying by R the logical transfers that are for the leaf variables of the expression trees in the program. If P is a logical transfer for a leaf variable, then P R is the corresponding new logical transfer.

2. Apply algorithm III to each expression tree of the transformed parallel program to obtain an optimum sequence of alignment functions.

3. Each alignment function, Ai, thus obtained for an internal node ( variable vi ) is then to be multiplied by R to give a new data alignment function R Ai for the

corresponding node vi in the original parallel program.

The set of alignment functions thus obtained can be shown to be optimum for the execution of the given parallel program on S1.

Notation 5.1:

Let O be the set of optimum sequences of alignment functions obtained in step 3 of algorithm VII, each sequence being for a unique expression tree in the given parallel program. Let T be the corresponding set of optimum sequences of alignment functions obtained in step 2 of algorithm VII, each sequence being for a unique expression tree in the transformed parallel program.  □

As we did for data mapping, we will show that with O and T, the total communication cost of executing the transformed parallel program on SΩ is equal to the total communication cost of executing the corresponding parallel program on S1. We will then show that algorithm VII does indeed produce an optimum solution.

Theorem 5.3

With O and T determined for a given parallel program, the total communication cost of executing the transformed parallel program on SΩ is equal to the total communication cost of executing the original parallel program on S1.

Proof:

Based on algorithm VII, if Pi is a logical transfer for a leaf variable in the original parallel program, then Pi R is the corresponding logical transfer in the transformed parallel program. If Ai is the data alignment function for an internal node ( variable vi ) in the transformed parallel program, then R Ai is the corresponding data alignment function for vi in the original parallel program.

Now, the communication cost of realizing the logical transfer, Pi, through the N1 network is DN(R Ai Pi), where node i ( variable vi ) is the parent node of the leaf node that has Pi as its logical transfer. The communication cost of realizing the corresponding logical transfer, Pi R, through the omega network is DΩ(Ai Pi R). Based on lemma 5.1, DN(R Ai Pi) = DΩ(Ai Pi R).

Let node j ( variable vj ) be the parent node of node i. Let Aj be the data alignment function for vj in the transformed parallel program. Then R Ai and R Aj are the corresponding data alignment functions for vi and vj in the original parallel program, respectively. The logical transfers for aligning vi to vj in the transformed parallel program and the original parallel program are then Aj Ai^{-1} and R Aj Ai^{-1} R, respectively. Now, the communication cost of realizing the logical transfer R Aj Ai^{-1} R through the N1 network is DN(R Aj Ai^{-1} R). The communication cost of realizing the corresponding logical transfer, Aj Ai^{-1}, through the omega network is DΩ(Aj Ai^{-1}). Based on lemma 5.1, DN(R Aj Ai^{-1} R) = DΩ(Aj Ai^{-1}).

Since the communication costs of realizing the logical transfers are equal in both cases, given O and T, the total communication cost of executing the transformed parallel program on SΩ is equal to the total communication cost of executing the corresponding original parallel program on S1.  □
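The cost identities used in this proof can be exercised numerically. The sketch below models DΩ crudely as 0 for the identity, 1 for LU-decomposable bit matrices, and 2 otherwise (linear parts only, an assumption), and defines DN through the functional equivalence DN(Q) = DΩ(R Q R); the leaf-transfer identity then holds for random nonsingular bit matrices because R is an involution.

```python
import random

n = 3

def matmul(a, b):
    size = len(a)
    return [[sum(a[i][k] & b[k][j] for k in range(size)) % 2
             for j in range(size)] for i in range(size)]

def det2(m):
    """Determinant over GF(2) by Gaussian elimination."""
    m = [row[:] for row in m]
    size = len(m)
    for c in range(size):
        piv = next((r for r in range(c, size) if m[r][c]), None)
        if piv is None:
            return 0
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, size):
            if m[r][c]:
                m[r] = [x ^ y for x, y in zip(m[r], m[c])]
    return 1

def lu_decomposable(m):
    """An LU factorization without pivoting exists iff every leading
    principal minor is nonzero over GF(2)."""
    return all(det2([row[:k] for row in m[:k]]) == 1
               for k in range(1, len(m) + 1))

I = [[int(i == j) for j in range(n)] for i in range(n)]
R = [[int(i + j == n - 1) for j in range(n)] for i in range(n)]  # bit reversal

def d_omega(m):          # crude pass-count model for the omega network
    if m == I:
        return 0
    return 1 if lu_decomposable(m) else 2

def d_n1(m):             # functional equivalence: D_N(Q) = D_omega(R Q R)
    return d_omega(matmul(matmul(R, m), R))

def random_nonsingular():
    while True:
        m = [[random.randint(0, 1) for _ in range(n)] for _ in range(n)]
        if det2(m):
            return m

for _ in range(50):
    A, P = random_nonsingular(), random_nonsingular()
    AP = matmul(A, P)
    # leaf-transfer identity from the proof: D_N(R A P) = D_omega(A P R)
    assert d_n1(matmul(R, AP)) == d_omega(matmul(AP, R))
```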

Theorem 5.4

The data alignment algorithm described for S1 is an optimum algorithm that will minimize the total communication cost of a given parallel program.

Proof:

Based on theorem 5.3, with O and T, the communication cost of executing the transformed parallel program on SΩ is equal to the communication cost of executing the corresponding original parallel program on S1. If the set of data alignment functions for the parallel program on the N1 network is not optimum, then there exists another set of data alignment functions that gives a lower total communication cost. This means that there exists a better set of data alignment functions for the execution of the transformed parallel program on SΩ. This contradicts the fact that algorithm III in chapter 4 is optimum. Therefore, algorithm VII is optimum.  □

For parallel programs where precedence relationships exist, the heuristic algorithm in chapter 4 can be modified. Since the logical transfers for the leaf variables of expression tree j may depend on expression tree i, where i < j, they are not to be transformed until the alignment functions for every expression tree i, i < j, are determined. The following heuristic algorithm is a modified version of algorithm IV in chapter 4.

Algorithm VIII

1. FOR j = 1 TO z = no. of expression trees DO
BEGIN
Transform all the logical transfers for the leaf variables of expression tree j.

Apply algorithm III to determine optimum sequences of alignment functions for expression tree j.
Pick limit number of optimum sequences for expression tree j and for each of them determine optimum sequences for the other expression trees as follows:
FOR i = j+1 TO z DO
BEGIN
Determine the logical transfers for the leaf variables of expression tree i.
Transform all the logical transfers of the leaf variables.
Apply algorithm III to determine optimum sequences of alignment functions for expression tree i.
Arbitrarily pick an optimum sequence for expression tree i.
END;
Keep the optimum sequence ( for expression tree j ) that leads to the lowest total cost for the complete program.
If j ≠ z, determine the logical transfers for the leaf variables of expression tree j+1.
END;  □

CHAPTER 6

SUMMARY AND CONCLUSIONS

The total execution time of a parallel algorithm on a multicomputer system can be broken down into the actual computation time and the time of interprocessor communication. On an SIMD multicomputer system, the computation time usually depends only on the capability of the processing elements, but the communication time depends on a number of factors, such as the interconnection network and the data storage scheme. In this thesis, the problem of minimizing the total communication time in parallel computations has been studied for a class of parallel algorithms and a class of interconnection networks.

In chapter 1, the entire SIMD system, including the relevant components from both hardware and software, is characterized. Based on the characterization, the communication cost minimization problem is formulated.

In chapter 2, a representation scheme for A, the class of important permutations with which this thesis is concerned, is developed. Based on the representation scheme, a criterion for a permutation in A to be realizable in one pass through the omega network is determined. A method for static mapping of a single vector variable onto an SIMD multicomputer system with an omega network is developed. The method is useful for a large class of networks that are functionally equivalent to the omega network. However, it requires solving a system of nonlinear equations.

In chapter 3, an SIMD multicomputer system with an omega network is assumed. Given a parallel program where precedence relationships between its expression trees do not exist, the program is partitioned into independent subproblems, each specifying all the communication requirements for a different vector variable in the parallel program. An optimum nonstatic data mapping algorithm for such subproblems is developed. Applying the algorithm to each subproblem yields data mapping functions that map the program optimally onto the SIMD system. For parallel programs where precedence relationships between expression trees exist, a heuristic algorithm that is based on the optimum data mapping algorithm is developed.

In chapter 4, an SIMD multicomputer system with an omega network is also assumed. Given a parallel program where precedence relationships between its expression trees do not exist, the program is partitioned into independent subproblems, each containing a different expression tree in the parallel program. A data alignment algorithm that minimizes the communication cost associated with an expression tree is developed. Applying the algorithm to each subproblem again yields an optimum solution to the communication cost minimization problem. For parallel programs where precedence relationships between expression trees exist, a heuristic algorithm that is based on the optimum alignment algorithm is also developed.

The two heuristic algorithms developed in chapters 3 and 4 are found to generate optimal or near-optimal solutions for most of the example problems. In fact, optimal solutions are obtained for the matrix inversion program and the FFT program presented in chapter 3.

In chapter 5, it is proved that the algorithms developed in chapters 3 and 4 are also adaptable to any SIMD multicomputer system interconnected with a network from Γ.

In conclusion, the problem of minimizing interprocessor data communication in parallel computation for a class of parallel algorithms and a class of interconnection networks has been fully investigated. Useful and practical algorithms have been designed for solving the problem. The methodologies developed are widely applicable to many parallel algorithms and many useful interconnection networks. Instead of using the two approaches simultaneously for a parallel program and then choosing the one that gives the lower total communication cost, one may choose to use only one of the approaches.

Suggestions for Further Research

The next logical step in this research is to develop a representation scheme for a larger class of permutations. With the new representation scheme, one may then develop techniques to minimize the communication costs of a more general class of parallel algorithms and possibly a more general class of interconnection networks. Based on the current representation scheme, we can also develop techniques for other networks that are functionally equivalent to the omega network.

The question of whether, given a set of permutations, there exists a data mapping function such that each of the permutations in the set can be realized in one or zero passes is still open. Some necessary conditions have been established and are detailed in the appendix. However, sufficient conditions have not yet been established.

APPENDIX

APPENDIX

DATA MAPPING FUNCTION FOR A GIVEN SET OF PERMUTATIONS

The question of whether, given a set of permutations, P, for a vector variable v, there exists a data mapping function for v such that each of the permutations in P can be realized in one or zero passes through an omega network is still open. However, in the following theorems, some necessary conditions are established.

Theorem A.1

With the mapping function F, the set of all permutations, P_F, that can be realized in one or zero passes through an omega network is given by P_F = { La Ub F | La ∈ L, Ub ∈ U }. The cardinality of P_F is 2^{n^2}.

Proof:

Clearly, Pi is in P_F iff Pi F^{-1} is LU decomposable. If Pi = La Ub F, then Pi F^{-1} is LU decomposable. Conversely, if Pi F^{-1} is LU decomposable, let Pi F^{-1} = Lc Ud. Then Pi = Lc Ud F. Therefore, P_F = { La Ub F | La ∈ L, Ub ∈ U }. Since there are 2^{(n^2-n)/2} U's and 2^{(n^2+n)/2} L's, | P_F | = 2^{n^2}.  □
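The cardinality claim can be checked by enumeration for small n. The sketch below counts only the linear LU factors with F = I (the complement bits that enter |L| = 2^{(n^2+n)/2} in the theorem are left out, an assumption for brevity); the 2^{n^2-n} distinct linear products, times the 2^n complement offsets, give the stated 2^{n^2}.

```python
from itertools import product

n = 3

def matmul(a, b):
    size = len(a)
    return [[sum(a[i][k] & b[k][j] for k in range(size)) % 2
             for j in range(size)] for i in range(size)]

def unit_triangulars(lower):
    """All n x n unit triangular 0/1 matrices: 2**(n*(n-1)/2) of them."""
    slots = [(i, j) for i in range(n) for j in range(n)
             if (j < i if lower else j > i)]
    for bits in product((0, 1), repeat=len(slots)):
        m = [[int(i == j) for j in range(n)] for i in range(n)]
        for (i, j), b in zip(slots, bits):
            m[i][j] = b
        yield m

Ls = list(unit_triangulars(True))
Us = list(unit_triangulars(False))

# Distinct (L, U) pairs give distinct products: L2^-1 L1 = U2 U1^-1 forces
# both sides to be the identity.  So the linear part of P_F has
# 2**(n*n - n) members; the 2**n complement offsets then give 2**(n*n).
products_of_pairs = {tuple(map(tuple, matmul(L, U))) for L in Ls for U in Us}
assert len(products_of_pairs) == 2 ** (n * n - n)   # 64 for n = 3
```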

Corollary A.1.1

Given P, a set of permutations for v, if | P | is greater than 2^{n^2}, then there does not exist a data mapping function for v such that every permutation in P can be realized in one or zero passes through an omega network.  □

The following two basic lemmas can be obtained easily from linear algebra. Hence, we merely state them here.

Lemma A.1

The set of all lower triangular unit matrices, L, is a subgroup of A.  □

Lemma A.2

The left coset relation, RL, defined by RL = { < P1, P2 > | P1 ∈ L P2 }, is an equivalence relation on A.(1)  □

RL partitions A into equivalence classes. It is also clear that if [Pi] denotes the equivalence class containing Pi ∈ A, then [Pi] in A equals the coset L Pi.

(1) Based on lemma 3.2, A is a group under matrix multiplication ( modulo 2 ).

Theorem A.2

The equivalence relation RL partitions P_F into 2^{(n^2-n)/2} equivalence classes of 2^{(n^2+n)/2} permutations each.

Proof:

Consider Pi ∈ P_F. Let Pi F^{-1} = La Ub. Then, for every L' ∈ L, L' Pi F^{-1} = L' La Ub is LU decomposable, and hence L Pi = [Pi] ⊆ P_F. Since | L | = 2^{(n^2+n)/2}, L Pi = [Pi] is a class of P_F containing 2^{(n^2+n)/2} permutations. Every permutation in P_F belongs to a class of size 2^{(n^2+n)/2} in P_F. Therefore, RL partitions P_F into 2^{(n^2-n)/2} equivalence classes of size 2^{(n^2+n)/2}.  □

Corollary A.2.1

Given P, a set of permutations for v, if RL partitions P into more than 2^{(n^2-n)/2} equivalence classes, then there does not exist a data mapping function for v such that every permutation in P can be realized in one or zero passes through an omega network.  □
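The coset partition can likewise be enumerated for small n, again over the linear parts with F = I (an assumption, as before): the classes are in one-to-one correspondence with the U factors, so there are 2^{(n^2-n)/2} of them, each of linear size 2^{(n^2-n)/2} (enlarged to 2^{(n^2+n)/2} once the complement bits are counted into L).

```python
from itertools import product

n = 3

def matmul(a, b):
    size = len(a)
    return [[sum(a[i][k] & b[k][j] for k in range(size)) % 2
             for j in range(size)] for i in range(size)]

def unit_triangulars(lower):
    slots = [(i, j) for i in range(n) for j in range(n)
             if (j < i if lower else j > i)]
    for bits in product((0, 1), repeat=len(slots)):
        m = [[int(i == j) for j in range(n)] for i in range(n)]
        for (i, j), b in zip(slots, bits):
            m[i][j] = b
        yield m

Ls = list(unit_triangulars(True))
Us = list(unit_triangulars(False))

# The left coset L (La Ub) equals { L' Ub : L' in L }, since L is a group,
# so each class is indexed by its U factor alone.
classes = {}
for U in Us:
    key = tuple(map(tuple, U))
    classes[key] = {tuple(map(tuple, matmul(L, U))) for L in Ls}

assert len(classes) == 2 ** ((n * n - n) // 2)               # 8 classes
assert all(len(c) == 2 ** ((n * n - n) // 2) for c in classes.values())
# The classes are pairwise disjoint and cover all the linear products.
union = set().union(*classes.values())
assert len(union) == sum(len(c) for c in classes.values())
```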

BIBLIOGRAPHY

BIBLIOGRAPHY

[AgLi78] T. Agerwala & B. Lint, "Communication in parallel algorithms for boolean matrix multiplication," IEEE Parallel Processing Conference 78, 1978, pp. 146-153.

[Ager77] T. Agerwala, "Communication, computation, and computer architecture," 1977 Int'l Communication Conference Rec., Chicago, June 1977.

[Bokh81] S. H. Bokhari, "On the mapping problem," IEEE Trans. Computers, C-30, pp. 207-214, March 1981.

[BoMu75] A. Borodin & I. Munro, "The computational complexity of algebraic and numeric problems," American Elsevier, New York, 1975.

[Budn71] P. Budnik & D. J. Kuck, "The organization and use of parallel memories," IEEE Trans. on Computers, December 1971, pp. 1566-1569.

[Chen81] K.-W. Chen, "Minimization of interprocessor communication in parallel computation," Ph.D. dissertation, U. of Michigan, Ann Arbor, 1981.

[Cool65] J. W. Cooley & J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Mathematics of Computation, Vol. 19, 1965.

[Flyn72] M. Flynn, "Some computer organizations and their effectiveness," IEEE Trans. Computers, C-21, pp. 948-960, Sept. 1972.

[Gent78] W. M. Gentleman, "Some complexity results for matrix computations on parallel processors," Journal of the ACM, pp. 112-115, January 1978.

[Hers64] I. N. Herstein, "Topics in algebra," Lexington, MA, Xerox College, 1964.

[Hell78] D. Heller, "A survey of parallel algorithms in numerical linear algebra," SIAM Review, Vol. 20, No. 4, October 1978, pp. 740-777.

[IrCh82] K. B. Irani & K.-W. Chen, "Minimization of interprocessor communication for parallel computation," IEEE Trans. Comp., Vol. C-31, #11, November 1982.

[IrCh80] K. B. Irani & K.-W. Chen, "A Jacobi algorithm and its implementation on parallel computers," Proc. of 18th Annual Allerton Conference on Comm., Control, Computing, Oct. 1980.

[IrWu82] K. B. Irani & W. S-F. Wu, "A data mapping methodology for enhancing the capability of a class of multistage interconnection networks," Proceedings of the 1982 Real Time Systems Symposium, Los Angeles, CA, pp. 101-109.

[Kuck78] D. J. Kuck, "The structure of computers and computations," Vol. 1, John Wiley & Sons, 1978.

[KuSt77] H. T. Kung & D. Stevenson, "A software technique for reducing the routing time on a parallel computer with a fixed interconnection network," High Speed Computer and Algorithm Organization, (D. J. Kuck et al., editors), Academic Press, N.Y., pp. 423-433, 1977.

[Kuhn79] R. H. Kuhn, "Efficient mapping of algorithms to single-stage interconnections," 7th Int'l Symp. on Computer Architecture, 1979.

[Kuhn80] R. H. Kuhn, "Transforming algorithms for single-stage and VLSI architectures," 1980 Interconnection Network Workshop, pp. 11-17.

[Lawr75] D. H. Lawrie, "Access and alignment of data in an array processor," IEEE Trans. on Comp., Vol. C-24, #12, December 1975, pp. 1145-1155.

[Lenf78] J. Lenfant, "Parallel permutations of data: A Benes network control algorithm for frequently used permutations," IEEE Trans. Comp., Vol. C-27, #7, July 1978, pp. 637-647.

[LuBa80] S. F. Lundstrom & G. H. Barnes, "A controllable MIMD architecture," IEEE Parallel Processing 1980, pp. 19-27.

[NaSa80] D. Nassimi & Sartaj Sahni, "Parallel algorithms to set-up the Benes permutation network," 1980 Interconnection Network Workshop.

[NaSa81A] D. Nassimi & Sartaj Sahni, "Data broadcasting in SIMD computers," IEEE Trans. Computers, Vol. 2, 1981.

[NaSa81B] D. Nassimi & Sartaj Sahni, "Benes network and parallel permutation algorithms," IEEE Trans. Computers, Vol. 5, 1981.

[Park80] D. S. Parker, "Notes on shuffle/exchange type networks," IEEE Trans. Comp., Vol. C-29, #3, March 1980, pp. 213-222.

[Peas68] M. C. Pease, "An adaptation of the fast Fourier transform for parallel processing," Journal of the ACM, Vol. 15, pp. 252-264, Apr. 1968.

[Peas77] M. C. Pease, "The indirect binary n-cube microprocessor array," IEEE Trans. Comp., C-26, #5, May 1977, pp. 458.

[Sieg78] H. J. Siegel & S. D. Smith, "Study of multistage SIMD interconnection networks," 5th Annual Symp. Computer Architecture, Apr. 1978, pp. 223-229.

[Smit78] S. D. Smith & H. J. Siegel, "Recirculating, pipelined, and multistage SIMD interconnection networks," 1978 Parallel Processing Conference.

[Sieg79] H. J. Siegel, "Partitioning permutation networks: the underlying theory," IEEE Parallel Processing, 1979, pp. 175-184.

[Schw80] J. T. Schwartz, "Ultracomputers," ACM Transactions on Programming Language and Systems, Vol. 2, No. 4, October 1980.

[Ston71] H. S. Stone, "Parallel processing with the perfect shuffle," IEEE Trans. Comp., C-20, #2, Feb. 1971, pp. 153-161.

[Swan74] R. C. Swanson, "Interconnections for parallel memories to unscramble p-ordered vectors," IEEE Trans. on Comp., Nov. 1974.

[ThKu77] C. D. Thompson & H. T. Kung, "Sorting on a mesh-connected parallel computer," Communications of the ACM, April 1977, pp. 263-271.

[WuFe78] C. L. Wu & T. Y. Feng, "Routing techniques for a class of multistage interconnection networks," 1978 Parallel Processing Conference.

[WuFe79] C. L. Wu & T. Y. Feng, "Routing techniques for a class of multistage interconnection networks," 1979 Parallel Processing Conference, pp. 197-205.
