MULTIPLE INSTRUCTION ASSOCIATIVE PROCESSING by Keki B. Irani and Jamshed D. Mulla Technical Report 78-13 THE UNIVERSITY OF MICHIGAN Computer, Information and Control Engineering Program

MULTIPLE INSTRUCTION ASSOCIATIVE PROCESSING Keki B. Irani and Jamshed D. Mulla Systems Engineering Laboratory The University of Michigan Ann Arbor, Michigan 48109 Abstract -- This paper introduces the concept In this paper, we investigate a more general of multiple instruction associative processing as system consisting of programmable PE's. We call a departure from conventional SIMD associative the corresponding concept of associative proprocessing. cessing, multiple instruction associative processing. We will first outline a general repreA formal definition of the conventional asso- sentation scheme for an associative process. ciative processor (CAP) is provided. The CAP rep- Then we will define the conventional associative resents current SIMD associative processing archi- processor (CAP) that represents the current AP's. tectures. Next, the multiple instruction associa- Next, the concept of multiple instruction assotive processor (MIAP) is introduced and defined. ciative processing will be introduced and a forThe MIAP eliminates the drawbacks of the CAP mal definition of the MIAP will be provided. A caused by the undue synchronization required for multiprocessor computer architecture to implement SIMD execution of associative procedures. A the MIAP will be described. Finally, the performultiprocessor computer architecture for imple- mance of the CAP and MIAP systems will be anamenting the MIAP is also described. lyzed and compared and some results for certain classes of associative procedures will be given. The performances of the CAP and MIAP systems are analyzed. The performance of each system is The APL language [2] is used frequently in measured in terms of the length of the instruction the paper to represent certain functions and sequence required to execute a given associative operations of associative processors. procedure. Algorithms for computing the measure for each system for general procedures are given. Definition 1: An associative process is a Results are provided for some special families and two tuple (A, Q) where, specific examples of associative procedures. (1) A is a set of n data items. Let Introduction A = {A,, AA,..., A } The term associative processing is used here to mean the manipulation of a set of data items and A. 9 i < 1 < N by a common process in such a way that operations 1 - - performed on a particular datum are dependent oni the set of all possible values of a data item. its value or some function thereof. 2o) Q is a common procedure applied to each Early associative processors (AP's) were sim- Conetioal flo ple extensions of associative memories. These im structurte data set A. Coentional f chart structures will be used to represent the extensions were achieved by incorporating elemen- rceure r ll tary processing functions into each word of the memory. Present AP's, such as PEPE [1], have pro- flowchart is defined recursively as folA flowchart is defined recursively as folcessing elements (PE's) that are capable of exe- flowcharts, then so are cuting a broad repetoire of arithmetic and logical the structures (i) through (iv) shown below. operations on their data memories. The common limitation of all present AP's is that they are all Single Instruction Multiple Data (SIMD) systems. Since the PE's are essentially only execution units, the total control in the Y system remains in a main CPU that broadcasts oper- t ations to the set of PE's from a single instruc- p tion stream. - l This work was supported jointly by Rome Air Development Center under contract F30602-76-C-0029 and (i) Army Ballistic Missile Defense Advanced Technology (ii) Center under contract DASG-60-78-C-0073. 25

1) A = (A[l], A[2],..., A[n]) is a memory array of n words. is the set of all possible values for a single word A[i] andJvis the set of all possible values of the array A. g t 1^ N 2) R = (R[l], R[2],... R[n]) is a response register of n bits. Ris the set of all possible values of the register R. Y F 3) tYis a set of search functions C F1) =- = {T1, T2, TN} (iv) T.:WXx4?-+ g, 1 < i < N (iv) (iii) where each Ti has a corresponding mapping ti, In flowchart (i), the operation p is a i {,} lapping and Ti((A[1], A[2],..., A[n), p:n.+O9 ~P —:~ w "~~~~~ ~(R[l], R[2],..., R[n])) = nd its execution causes the data item Ai to be eplaced by p(Ai). l r2 n In flowcharts (ii) and (iii), the operation where corresponds to a characteristic function i t (A[]) = 1 A R[] 1 St:O, + {0,11}. l^]J n0 otherwise < j <n'xecution.proceeds to the Y path of the flowchart The operation of a search function Ti f t(Ai) = 1, otherwise it proceeds to the N path. causes Ti(A,R) to become the new contents of the response register R. That is, each search function Ti performs the operation represented by Note that the set of flowcharts defined above the following APL statement: orresponds to the set of D-charts defined in [3]. A ( R A ( TI A )'he flowcharts describe the procedure carried out,n a single datum Ai. For the associative proLess, the same procedure is applied to all data 4) SPis a set of processing functions.tems in A independently. -= {P1, P2'..., P Conventional Associative Processing _ _ P.:x +, 1 < i < M, All present AP's can be classified as CAP's.'EPE [1] and STARAN [4] are two particular exam- Each processing function Pi has a corres4les. ponding function Pi such that A CAP consists of two major subsystems. The p leart of the CAP is an associative processing iemory (APM) that stores and manipulates the set and )f data items to be processed. The APM can;elect data items by associative search opera- P.((A[1], A[2],..., A[n]),:ions and can process the selected subset in )arallel. (R[1], R[2],..., R[n])) = The APM receives its instructions from the (a1, a,..., a), )ther subsystem of the CAP called the global con-:rol unit (GCU). Besides broadcasting operations where:o the APM, the GCU performs other functions to p.(A[j]) if R[j] = 1;ynchronize the APM elements. We will first a. = 1 lefine the APM and then describe the GCU and its A[j] otherwise functions. - -.Definition 2: The associative. The operation of a processing function Pi Definition 2: The associative processing nef theC s a four tuple g causes Pi(A,R) to become the new contents of the nemory (APM) of the CAP is a four tuple (A,R,", That is, each processing funcy1 where: tion Pi corresponds to the following APL statement, 26

A - (( x (P A)) + (( - x A)) (ii) If F is a flowchart of type (ii), then A + ((R x (PI A)) + (( ~R ) x A)) S (F) = PUSH, T,S (F1), COMP,S (F2), POP The global control unit (GCU) of the CAP performs several overall functions to control the (iii) If F is a flowchart of type (iii), activity of the APM. The following are the main then functions performed by the GCU: Sc(F) = PUSH, L1:T, IF NONE GOTO L2, (i) The GCU possesses a stack which it uses to store the contents of the response register R Sc(F ), GOTO L1, L2: POP of the APM. (iv) If F is a flowchart of type (iv), then Let S[O], S[l],... represent the stack on which the response register is saved. The GCU S (F) = S (F1), Sc(F2). can perform three operations involving the stack c c1 and the response register. Consider the flowchart representation of a (a) The PUSH operation causes the current program shown in Figure 1. value of R to be pushed on the stack. (b) The POP operation causes the value of the response register R to be replaced by the topmost value on the stack. (c) The COMP operation causes the value of v N R to be replaced by its complement relative to the top of the stack, i.e., R + ( -R ) A S[TOP] where S[TOP] is the most recently pushed value on A + A - 1 the stack. (ii) In addition to broadcasting the sequence of search and processing functions to the APM, the GCU can perform certain conditional A 3 N branches by testing all response bits of the APM. In particular the GCU can branch to a particular instruction if the current response register is all zeros. We will denote this jump as IF NONE GOTO x A 1 A x 2 where x is the label of the target instruction. (iii) The GCU can also perform unconditional branches. This instruction is denoted GOTO x. The above control functions are sufficient for executing a general associative process. Similar control mechanisms have been proposed [5] and are also present in existing AP's such as, PEPE [613. Using Algorithm 1, we can determine the CAP program for the above flowchart as: We can nowdefine an algorithm to determine the "program" for a CAP for any valid associative Sc(F = PUSH, T1, P, COMP, P2, PUSH, T2, procedure flowchart as described in Definition 1. COMP, P POP OP Algorithm 1: Given a flowchart F, we can determine the sequence of operations Sc(F) where the search operations T1, T^ and the prodetermine the sequence of operations Sc(F)' Z required to execute F on a CAP by applying the cessing operations P1, P2, P3, P4 are defined in following rules recursively. (i) If F is a flowchart of type (i), then Sc(F) = P 27

Figure 2 shows a block diagram of an MIAP system. T1- R ( R A < 3 ) ) 2 T -- R A- ( R A ( A = 3 ) ) p = A - ( R x 0 ) + (( ~ R ) x A ) Data memories Processors Program memories P = A + ( R x ( A - 1 )) + (( xR ) x A ). P3 A + ( R x 1 ) + (( ~R ) A ) M A ( R x ( A x 2 )) + (( ~R ) x A ) 4There are three major drawbacks in the con- A l ntional associative processing system. First, 2 | - 2 l M2 large percentage of the operations executed by I J I I j e GCU are overhead operations necessary to syn- ronize the activity of the APM. The POP, PUSH,. d COMP operations are required solely to ensure Element * at the proper subset of PE's is active when the I (PE) arch and processing operations are broadcast to * e APM. A- C --- n n _ n Second, since the CAP is a SIMD machine and der the control of a single control unit, only _. -- _ subset of the PE's in the APM can participate / instruction execution at any instant. As shown Flynn [7], conditional branching has a severe Global Multiprocessor trimental effect on the performance of SIMD Synchronizationray chines such as the CAP. Unit (GSU) Finally, since the GCU broadcasts a single struction stream, it must ensure that all ssible paths of the program are traversed. Figure 2: MIAP System Block Diagram One of the objectives of the extensions of One of the objectives of the extensions of We will first define two subsystems, namely, e basic CAP is to minimize the amount of syn- the multprocessor array (MA) and the global ronization required between PE's of the APM. tnchniutiprocessor arry ( ) and the condio he case of the CAP such synchronization is synchronization unit (GSU) and state conditions the case of the CAP, such synchronization is tal because of n l of., sc s ithe system lies under which they constitute a multiple instructal because the control of the system lies tirely in the GCU. This restriction is relaxed tlon associative processor. the multiple instruction associative processor D 3 A m a Definition 3: A multiprocessor array (MA) allowing each PE to fetch and execute search is a four tuple (A, R, M, C) where: d processing operations independently and ynchronously. Synchronization of the PE's by a ntrol unit is forced only when it is absolutely 1) A s a data memory array of n words. cessary. A = (A[1], A[2],..., A[n]) Multiple Instruction Associative Processing is the set of all possible contents of a single — is the set of all possible contents of a single memory word A[i]. The most important change from conventional multiple instruction associative processing is 2) R is a response register of n bits. at the restriction of SLID operation is relaxed. stead of PE's receiving the same associative,,, structions broadcast by the GCU, each PE is graded to a programmable processor so that it R[i] {0,1}, 1 < i < n n fetch, decode and execute instructions indendently. By providing each PE processor with a 3) M is a program memory array consisting py of the set of instructions for an associative - p rogram memorocessing program, the system can execute it with of n program memories M1, M2..., M. Each Mi eae fiinnincrease in can contain the three types of instructions eater efficiency and significant increase in e eed. Since some global control instructions are ill required, the MIAP also contains a control,.'~~~~~~. ^-..,. (i) A test instruction T can set the it corresponding to the GCU. However, in this se, global synchronization operations on the response bit R[i] depending on the result of a se, global synchronization operations on the's are less frequent than in the case of the CAP. corresponding binary valued function t. 28

t:*'9- {0,1} We can now define a corresponding algorithm for the multiple instruction associative processor The execution of a test instruction T causes to determine the GSU and MA "programs" for any t(A[i]) to become the new value of the corres- valid flowchart. ponding response register bit R[i]. Algorithm 2: Given a flowchart F, we can (ii) A processing instruction P can alter determine the sequence of instructions required the contents of the data word A[i] by applying a to execute F on an MIAP as follows: corresponding mapping p, The GCU program for a flowchart consists of P:'- +~ only two instructions, namely, INIT and WAIT. The INIT instruction initiates the MA program The execution of a processing instruction P causes corresponding to the flowchart F. The MA program p(A[i]) to become the new contents of the corres- Sp(F) can be determined by applying the following ponding data word A[i]. rules, recursively: (iii) A branch instruction can transfer con- (i) If F is a flowchart of type (i), then trol to a target instruction either unconditionally or based on a test of the response bit R[i]. s (F) = P The two instructions are denoted P (ii) If F is a flowchart of type (ii), then GOTO x, and S (F) = T, IF NOT GOTO L1, S (F1), IF NOT GOTO x, GOTO L2, Ll: S (F2), respectively, where x is the label of the target where L2 is the label of the instruction followinstruction and the conditional branch is exe- ing the sequenceS (F). cuted if R[i] = 0. P 4) C is a processor array consisting of (iii) If F is a flowchart of type (iii), n processors C, C,,...,C. Each C. can exe- then cute instructions is frogm memory Mi and thus manipulate its corresponding data S(F) = L1: T IF NOT GOTO L2, Sp(Fl memory A[i] and response bit R[i]. The processors can operate independently. GOTO L1 The global synchronization unit (GSU) per- where L2 is the same as above. forms two major functions to control the activity of the MA. It can initiate the execution of the (iv) If F is a flowchart of type (iv), then MA processor programs by the respective processors. We will denote this operation as INIT. The S_(F) = S(F) S (F2) GSU must also wait for all the MA processors to P 2 terminate their programs before continuing with [ any other operation. This operation will be denoted as WAIT. The GSU can perform normal We can now derive the MA program for the sequential processing operations but these do not flowchart in Figure 1 by using the above algorelate to the present discussion and will not be rithm. The program S (F) is given by: treated here. S (F) = T1, IF NOT GOTO L1, P, EXIT, Multiple instruction associative processing is performed by the MA and GSU pair when each L1: P, T, IF NOT GOTO L2, P EXIT, program memory array Mi in the MA is loaded with the identical sequence of instructions. Whereas L2: P the PE's of the CAP either execute or ignore broadcasted instructions depending on the value The test and processing instructions for of their data items, the processors of the MA the ith processor are defined as follows: select the appropriate sequence of instructions from the common program in their program memories depending on their data memory contents. Assum- T E ( AI < 3 ing that this program contains some conditional 1 branches, the instruction sequence, and hence the T - RE[I] + ( A[I] = 3 ) data transformation operations performed by the individual processing element of the MA is data 1 A dependent and, therefore, associative. P - A[I] ACI] - 1 P - AEI] []- We will refer to a system consisting of an 1 MA and GSU under the conditions described above P - ACEI - ACI] x 2 as a multiple instruction associative processor. 29

a use EXIT to denote GOTO instructions that ter- particular module processor or a selected subset inate execution of the MA program. of Pi's to begin execution of a program at a specified starting address stored within their Comparing the above program with the corres- Mi's. After executing the programs, each Pi can ending CAP program, we see a significant interrupt the Pc and pass its results or other ecrease in the length of the instruction termination data through the module memory Mi. equence. In particular, the longest sequence of nstructions executed by the PE's of the MA is 6 P backup storage ompared to the fixed CAP program of 12 instrucions. A comprehensive performance analysis study ^ M f these two systems for a broad class of associa- c ive processing problems is provided in a later P activat ection.M J l i M clines ^ Referring back to the three major deficien- -- ies of the CAP, we see that the MIAP system MI L lo P _ voids all of them. First, since the PE's execute ndependently, there is no need for global nstructions to synchronize activity during execu-I ion of a flowchart program. This also eliminatesI P - M he need for PE's to be idle while waiting for data and ther PE's to execute different instructions. address bus inally, since each PE selects its own path hrough the program flowchart, the execution time/ M P epends on the longest path and not on the total/ raversal of the flowchart as in the case of the AP. The cost of using this system as opposed to M he CAP is the cost of the program memory array n n o store copies of the PE programs. The cost of _ _ _ oading the programs into the memory array is not interrupt ignificant since program partitioning is a deter- iterru inistic problem and dynamic loading of programs Processing lines to P s not necessary. P - M data and Element address bus An Implementation of an MIAP System Figure 3: Multiprocessor Architecture for the MIAP We now describe a multiprocessor system on hich multiple instruction associative processing The Performance Measure an be implemented. This architecture is similar o that described by Kober [8]. o that described by Kober 8].The performance measure used to compare the Anoverl. b k de - CAP and MIAP systems will be the length of the An overall block diagram of the system archiAn overl bok diagrm of the system archi instruction sequences required to execute flowecture is shown in Figure 3. cture is shown in Fiure 3.charts. In order to obtain quantitative results central processor (control unit) Pc is con- of this analysis, certain assumptions will be ected to a memory Mc which is divided into equal made while modelling the characteristics of )locks M1, M2,..., Mn. Each Mi has an asso- associative procedures represented by flowcharts. iated processor Pi. The Mi-Pi pairs serve theThese assumptions will be explicitly stated in unction of processing elements, the following discussion as they are used. Each Pi can reference only its own Mi and For a given flowchart F, the performance an execute programs and operate on data from it values NCF and NMF for the CAP and MIAP, ndependently of and asynchronously with all respectively, are the number of instructions,ther Pi's. The Pc can also access all Mi's as executed by each system from the corresponding rell as its own private memory Mp. sequences Sc(F) and Sp(F). Since the actual number of instructions executed for a particular The Pc can access the Mi's in two modes. flowchart is dependent on data conditional Ine is the conventional mode where the set of branching, the performance values NF and N F [i's is perceived as one large contiguous memory are random variables. The assumptions we make lock. The second is the "multiwrite" mode where about conditional branching probabilities of he Pc can write data into the same relative certain flowchart structures will enable us to ocation of a selected subset of Mi's simultan- determine distributions and expected values of ously. the variables involved. Although there is no direct data link We will first give an algorithm by which etween PC and Pi's, there are direct control the performance can be computed for any given ines connecting them. The Pc can instruct any flowchart structure and then analyze some results 30

for certain families of structures. Due to the MIAP, respectively, during the ith iteration of lack of space, details of computations are not the loop. provided in this paper but may be found in [9]. X is the random variable that characterizes Algorithm for General Flowchart Structures the number of loop iterations required for a single datum. If we assume that at any iteration, In the following algorithm we assume that the loop termination condition has probability p the associative processors are operating on a of being satisfied, then we can represent X by a data set of cardinality n. It is assumed that random variable that is distributed geometrically the data items are randomly selected from the with parameter p. Hence, space of allowable values. Pr(X=i) = p(l-p) i = 0, 1, 2, Algorithm 3: Given a flowchart F, the performance values NCF and NM F for the CAP and (iv) If flowchart F is of type (iv), then MIAP, respectively, can be determined as follows: NC,F NCF + NCF NM F is the length of the longest sequence 1 2 from a set of n independent sequences executed by and the n PE's of the MIAP. To determine NM,F, the length of the sequence executed by a single inde- N N + N pendent PE, Np,F will be determined first. NM P,F P,F= P,F2 is given in terms of Np F by M,F = (NPF(n) Results where, Y(n) denotes the n-th order statistic [10] We will now analyze and compare the perforof the random variable Y. In other words, Y(n) mance of the CAP and MIAP for two families of is the largest of n independent samples of the flowchart structures. The two major families ~random variable Y~~. ^will consist of loop and loop-free programs. Finally, performance values for some mixed flowFor a given flowchart F, NC F and NpF can chart examples will be compared. be determined by applying the following rules recursively. Loop Programs (i) If the flowchart F is of type (i), then In this section we will analyze the performance values for two classes of programs containN. = N =1 CF P,F ing loop structures, namely, programs with a sequence of loops and programs with a nested ii) If the flowchart F is of type (ii), pair of loops. then ~N =-~4+N, +N,~LN Loop Sequences N,F C,F, + NC,F CF~~- CF1~ CFConsider a program that consists of L conIf we assume that each branch of the flow- secutive loop structures as shown in Figure 4 chart F has equal probability of being executed below, by an arbitrary PE, then Np,F has the following We assume that the sequence consists of L consecdistribution: utive loops and that the data set cardinality is Pr{Ni = x} = lrn. We also assume that an arbitrary datum PrNp F = x} = (Pr{N = x - 2} requires X iterations of a loop to be executed, 1 where X is distributed geometrically with parameter p. + Pr{Np = x - 2}). 2 The performance values for the flowchart F (iii) If the flowchart F is of type (iii), in Figure 4 are then: then L (n) NCF = 4L + 3 X(n) NC =4+ I (2+ (NCF)i) i=l 1 and L and NMF = 2L + 3 ( i X)(n) X M, i=l Np = 2 + i (2 + (NpF )i) PiFl'i Figure 5 shows the plot of the ratio of the where (NC,Fl)i and (Np,Fl )i are the performance expected values of NCF and NM F against L for p = 0.3 and various values of n. measures for the flowchart F1 for the CAP and 31

Figure 6 shows similar results for a fixed n = 200 and various values of L. T^, 1~~~~~ T^ ^ --— As shown by the plots in Figures 5 and 6, the relative performance of the MIAP over the CAP improves for longer sequences of loops and larger mr —I flP^~ r cardinality of the data set. Nested Loops Consider a program that consists of a pair of nested loop structures as shown in Figure 7. P2 TT J~~~~~~~~~T ~~~~Pt~~~~~~~~~~ ~P L Figure 4: Loop Sequence Program T i~~~~~; ~Data set [ ~~~'~I ~cardinality (n) 5 applying the rules of Algorithm 3, we 3500 - 2 00 70 Number of loops (L) can find the two performance values for flowchart F in Figure 7 as: 30 F0 Figure 7: Nested Loop Structure 5 0 10 20 30 40 50 60 70 80 By applying the rules of Algorithm 3, we Number of loops (L) can find the two performance values for flowchart F in Figure 7 as: Figure 5: Expected Performance Ratio for Loop X(n) Sequence Programs (p = 0.3) NC 4 + 7 X +3 X(,) and 100 X 70 NMF 2 + (5 X+ 3 X) (n) 50 i=l 4 Number of loops (L) Figure 8 shows the curves for the ratio of.....n30 -—'~30 othe expected values, (E[NC F]/E[NMF]), against C F _ 20 p for various values of n. M,F 3- Once again it can be seen that the relative -10 performance of the MIAP increases with n. Similar results for a greater number of nested loop structures may be found by using Algorithm 3. 5 However, computation of expected values in such 2 t I I cases is not tractable. 0.5 0.3 0.2 0.1 0.05 0.01 Loop termination probability (p) Figure 6: Expected Performance Ratio for Loop Sequence Programs (n = 200) 32

Data set cardinality (n) EN 2 i CF E[NMF] 100 ~~~~3 ~50 30 20 2.5 100.2 0.3 0.4 0.5 Figure 9: Full T-Chart of Height 2 Loop termination probability (p) Figure 8: Expected Performance Ratio for Nested Loop Programs Loop-free Programs The loop-free constructs considered here consist of nested conditional constructs called Tcharts. We define this family of flowcharts as P P2 follows. Definition 4: A T-chart is a flowchart that can be constructed using the following rules recursively: (i) The null structure is a T-chart. (ii) If F1 and F2 are T-charts, then the 3 p following structure is also a T-chart: Figure 10: Bare T-Chart of Height 2 [ ^^sj>^,For a full T-chart F, of height h, the performance values are: [ ENCF = 6(2 - 1) NM,F = (NP,F)(n)= (3h)(n) 3h t F1 J F2 ) For a bare T-chart B, the performance for V- V —— ythe CAP is: We are interested mainly in two subclasses N = 6h of T-charts, namely, the full and bare T-charts. Full T-charts are those which contain the maximum The performance of the MIAP for a bare number of conditional constructs at each level. T-chart B is non-deterministic and lies between Bare T-charts are those which contain only a the following bounds single conditional at each level. The height of a T-chart is the maximum depth of nesting of con- 3 < N B < 3h ditional constructs. Figures 9 and 10 show sample' full and bare T-charts of height 2. Figure 11 shows the curves for the performances for bare and full T-charts for different values of height h. For the performance value NMB, the bounds Np B MAX and NP,B,MIN are shown. 33

ratio. In fact, if we approximate the distribution of the performance value of a single ~~~~~~N ~PE, (Np), by a Gaussian distribution (which is 60 - IC,F reasonable for small values of p and large values n / of h), the expected ratio is o5 ~|~~~~~ ~~50"~ /nE[NC]/E[NM] = 106 40 m / C B Conclusions o 30 - / This paper has introduced the concept of X / / multiple instruction associative processing that E 20 - N N Mremoves the restrictions of SIMD operation from E 20 - M,F P, BMAX Z / 3 ^ ^ M'' conventional associative processing techniques. The concept is realizable through a highly modulo /^ ^ ^1lar and extendible multiprocessor computer archi5 - N tecture. A study of the relative execution ~ I\ 1^~ 1\~,~ Xspeeds of the conventional and multiple instruc1 2 3 4 5 6 tion associative processor shows that significant T-Chart Height (h) improvements can be expected depending on the structure of the associative process under conFigure 11: Performance for Bare and sideration. Full T-Charts An implementation of the MIAP using Texas xed Flowchart Structures Instruments TI990/4 microcomputer modules is in xed Flowchart Structures progress at the University of Michigan. The system is supported by several software faciliAll the programs considered so far contain ties, including a high level language oriented ther loops or conditional branches but not both. t incl ing a ig level l particular, the loop programs are made up of towards associative processing problems, a cross particular the loop programs are made up of p constructs that contain only a single pro-compiler for the language, and an operating op constructs that contain only a single pro- ssing operation in each loop body. T-chartsrov ng run-time support for the a built up of conditional structures that con- language. in single processing operations in each branch. R References The performance of the MIAP is dependent on [ 1] A.J. Evensen, and J.L. Troy, "Introduction a complexity of the flowchart. In general, the t te A e e a Eeent PE to the Architecture of a 288-Element PEPE," cformance of the MIAP relative to that of the?roc. 1973 Sagamore Computer Conference on? improves substantially if a greater number of Parallel Processing (1973C pp. r62-169. Parallel Processing, (1973), pp. 162-169. =mentary structures are composed or nested. To —- Lustrate this point, we give results for some,., " - -.-.,.....-. - -- [ 2] L. Gilman, and A. Rose, "APL —An Inter-iations of the basic families of flowcharts a ie A r,". i &,, active Approach," J. Wiley & Sons, (1974), isidered previously. In each case, a single P 384 p.ration is replaced by composite subflowchart. [(i) Considr a s n o oo r 3] S.R. Kosaraju, "Analysis of Structured (i) Consider a sequence of 20 loops where Programs," Jour. of Comp. Sys. Sciences,:h loop body contains a full T-chart of height (9 1974) pp. 232-255. Assume that the parameter p for each loop is i and the cardinality of the data set is 500.. tcr, AA r rc r [ 4] K. Batcher, "STARAN Parallel Processor this flowchart, System Hardware," AFIPS 1974 NCC, (43, E[NCI/E[NM] = 16. 1974), pp. 405-410...).C a full -c ofheight[ 5] H.J. Siegel, "Masking Scheme for Determin(ii) Consider a full T-chart of height 5ing the ctiveInacv- e Stats ofSIM ing the Active/Inactive Status of SIMD ire each branch of conditional statements con- e ce Uvsi —t Machine Processors, Purdue University,.ns a single loop construct. Assume parameter, 177 n if each loop is 0.2 and the cardinality of dataTR-EE-77-25, (May 1977), 40 pp. is 50. The expected performance ratio is is 50. The expected performance ratio is 6] J.R. Dingeldine, "Parallel Fortran (PFOR) E[N ]/E[N ] -24 PEPE Assembly Language (PAL) User's LCJ7L MJ Manual," Systems Development Corp., SDC Rep. No. TM-HU-046/400/01, (Aug, 1976), (iii) Consider the same problem as in (ii) - / ept that a nested pair of loops is substitutedpp. each branch. In this case, eachbranch. Inthis case, [ 7] M.J. Flynn, "Some Computer Organizations,E[N1 /E[r.N > 65 * and Their Effectiveness," IEEE Trans. Comp, E[NC]/E[NM] > 65(Sept, 1972), pp. 948-960. This is a conservative lower bound for the 34

[ 8] R. Kober, et al, "SMS 101 - A Structured Multimicroprocessor System with Deadlock Free Operation Scheme," Euromicro Newsletter, (2, 1976), pp. 56-64. [ 9] J.D. Mulla, "Associative Processing and its Extensions," (Ph.D. Thesis), The Univ. of Mich., (1978), (also SEL Tech. Rep. 119, Systems Engin. Lab., Dept. of Elec. & Comp. Engin., Univ. of Mich.). [10] H.A. David, "Order Statistics," J. Wiley & Sons, (1970). 35