THE UNIVERSITY OF MICHIGAN COMPUTING RESEARCH LABORATORY [1] A QUEUEING MODEL OF DELTA NETWORKS B. A. Makrucki T. N. Mudge CRL-TR-26-83 AUGUST 1983 Room 1079, East Engineering Building, Ann Arbor, Michigan 48109 USA Tel: (313) 763-8000. [1] Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the funding agencies. This report is a reissue of two Systems Engineering Laboratory (now defunct) reports: SEL-TR-159 and SEL-TR-167.

SEL-TR-159 A Queueing Model of Delta Networks B. A. Makrucki T. N. Mudge January 1982 This work was supported in part by National Science Foundation Grant MCS-8009315 DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING SYSTEMS ENGINEERING LABORATORY THE UNIVERSITY OF MICHIGAN, ANN ARBOR 48109

ABSTRACT This report describes an analysis of multistage interconnection networks in which queues are placed in the b×b crossbar switches on which the networks are based. A queueing analysis of the network is presented, and results are obtained using approximations that are appropriate for the network operation parameters of primary interest. From the analysis, communication delay time and network throughput are derived. Using these results, queue lengths may be chosen so that the network satisfies certain performance requirements.

1. Introduction. In high performance multiprocessor computer systems, a high capacity communication channel is typically required if processor communication/data movement is not to be a system bottleneck. Interconnection networks have been developed to provide such a high capacity channel [Clo53, Ben65, GoL73, Law75, Pat79]. The operating performance of these networks is a critical measure of their usefulness (for example, one type of network may yield totally unacceptable performance under certain operating conditions). This report describes an analysis of a class of packet switched multistage interconnection networks appropriate for use in multiple instruction, multiple data stream (MIMD) multiprocessor systems. Section 2 describes the multiprocessor model assumptions and the performance measures to be derived (several more are embedded in the analysis). Section 3 describes the queueing analysis (and approximations) of the model and derives expressions for the performance measures developed in Section 2. Section 4 is the conclusion.

2. System Model, Operation, and Assumptions. This section describes the multiprocessor system to be modeled. System operation in the context of the model is described, and simplifying operational assumptions are stated. The multiprocessor architecture is shown in Figure 1. The system consists of processing elements (PE's) connected by an interconnection network. The model described may also be used for the analysis of processor-to-memory-module connected systems if slight extensions to the model are made. The system to be modeled is a packet switched system in which n PE's ($n = b^z$, $b = 2^k$, k an integer, where z is the number of stages in the multistage interconnection network), or packet sources, emit communication packets of constant length. A packet consists of a destination address field and a data field. For example, a packet may be a memory word or several memory words.

Figure 1. System Architecture (n PE's connected through the interconnection network).
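As a concrete illustration of these structural parameters (a sketch added here, not part of the original report; the chosen values of n and b are only examples), the following Python fragment derives the number of stages z and the switch-size exponent k from n and b:

```python
import math

def network_parameters(n: int, b: int):
    """Return (z, k) for a delta network with n = b**z inputs built from
    b x b switches, b = 2**k.  Raises ValueError if n and b are incompatible."""
    k = int(round(math.log2(b)))
    z = int(round(math.log(n, b)))
    if 2 ** k != b or b ** z != n:
        raise ValueError("need b = 2**k and n = b**z for integers k, z")
    return z, k

# Example values: n = 64 PE's built from 4x4 switches (the sizes used in
# Figure 6 of the report) give z = 3 stages, each routed by a 2-bit field.
print(network_parameters(64, 4))   # -> (3, 2)
```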

When a source emits a packet, the packet is destined for one of the n sink devices (PE's here); i.e., packet broadcasting is not modeled here. There are two network performance measures to be found; they are measures of communication delay time and network throughput:
(1) Packet delay time (PDT) - the average time required for a packet to reach its destination. PDT is the average delay encountered by a packet from the time of emission from its source to the time of arrival at its destination.
(2) Network throughput (NTP) - the average rate of packet flow out of the network at the destination side.
The interconnection network for this analysis is assumed to have the unique transmission path property. That is, for any transmission path required there exists only one set of switch settings that will yield the transmission path; in other words, there is only one choice of transmission path for every route (from PE to PE) required. Multistage networks which satisfy this property are those that are bit controlled, i.e., those in which each stage of the network determines its switch settings by using a unique bit field from the destination address tag. Furthermore, these fields are mutually disjoint. Multistage networks which exhibit this property include Omega, Delta, and Generalized Cube networks. The requirement of unique transmission paths will be seen later in the network analysis.
2.1. Assumptions.
(1) All processors behave independently. In particular, this means interprocessor data dependencies are not modeled here.

(2) Each processor emits packets as a Poisson process with rate $\lambda$. Thus program behavior is modeled as follows: a program executing on processor i emits packets at random times, and the average time between packet emissions is $1/\lambda$. This is simply a continuous time extension of discrete time processor models [SkA69, Str70, Bha75, BaS76, Hoo77, SeD79, Pat79, Rau79, MaM81a] where the packet emission process is a Bernoulli process. In discrete time processor models, it is assumed that processors emit packets during system cycles with probability p (which may be viewed as the fraction of instructions that are communication instructions; such fractions might be characterized by Gibson mix relative frequencies). The continuous time model is simply an extension of this idea; in fact it is the process that arises if the system cycle time goes to zero (with an appropriate adjustment in p) in the discrete time model.
(3) When a program emits a packet, the selection of a destination sink is assumed to be uniformly distributed over all sink devices. This is an approximation, since a processor will not emit packets destined for itself. Removing this approximation makes the analysis unduly complex without gaining much.
From the bit controlled property and the uniform distribution assumption, it may be seen that for every packet traversing the network (at stage i), its position at the next stage is distributed uniformly over all outputs of the next stage switch to which it is routed (each bit of the address field is either 0 or 1 with equal probability). Thus packet routing at any stage is independent of all other stage routings. This allows a stage-by-stage queueing analysis to be done. Figure 2 shows the queueing network configuration of a b×b crossbar switch on which the multistage networks considered here are based.

Figure 2. A b×b Switch with Queues (each crossbar output feeds a queue of length L).
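Before turning to the analysis, here is a minimal simulation sketch of the traffic model in assumptions (2) and (3) (added as an illustration, not from the report; the rate and network-size values are hypothetical):

```python
import random

def generate_traffic(lam, n_sinks, n_packets, rng=random.Random(0)):
    """Sketch of assumptions (2) and (3): one processor emits packets as a
    Poisson process of rate lam (exponential inter-emission times), and each
    packet independently picks a destination uniformly over the n sinks."""
    t, packets = 0.0, []
    for _ in range(n_packets):
        t += rng.expovariate(lam)          # Poisson process: exponential gaps
        packets.append((t, rng.randrange(n_sinks)))
    return packets

pkts = generate_traffic(lam=0.05, n_sinks=64, n_packets=10_000)
gaps = [b[0] - a[0] for a, b in zip(pkts, pkts[1:])]
print(sum(gaps) / len(gaps))               # mean gap is close to 1/lam = 20
```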

3. Network Queueing Analysis. This section analyzes interconnection network behavior to find PDT and NTP. The analysis will be performed in a stage-wise fashion, starting with the source end of the network. The first stage will be analyzed (note that all queues in a given stage behave similarly due to symmetry in processor emission rates and uniform destination distributions) to find queue behavior. The results of the first stage may then be used to find second stage behavior, and likewise for all successive stages. Approximations will be made in order to make the analysis tractable. Again, each packet entering a b×b switch is randomly destined (with uniform distribution) for one of the b output queues. All inputs to all b×b switches in the first stage are Poisson processes with rate $\lambda$. Thus, by decomposition and superposition of Poisson processes, each queue in the first stage sees a Poisson process with rate $\lambda$ at its input (the superposition of b thinned streams, each of rate $\lambda/b$). Figure 2 shows the situation. The exponential servers model the randomness associated with the time required to move a packet from one stage to the next. Synchronous queue servers take a non-zero amount of time to move a packet to the next stage; thus, multiple packets may try to enter a queue simultaneously (in a synchronous design), some of which are delayed. The use of an exponential server is an attempt to model this interstage data movement delay without unduly complicating the analysis. All queues in the first stage behave identically and independently; hence, it suffices to analyze a single queue of this stage.
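The decomposition/superposition argument can be checked with a few lines of simulation (added here for illustration; the numerical values are arbitrary):

```python
import random

def first_stage_arrival_rate(lam, b, horizon, rng=random.Random(1)):
    """Split b Poisson(lam) input streams uniformly over b output queues and
    return the empirical arrival rate seen by output queue 0."""
    arrivals_at_queue0 = 0
    for _ in range(b):                       # one Poisson stream per switch input
        t = rng.expovariate(lam)
        while t < horizon:
            if rng.randrange(b) == 0:        # uniform routing: decomposition
                arrivals_at_queue0 += 1      # superposition at queue 0
            t += rng.expovariate(lam)
    return arrivals_at_queue0 / horizon

print(first_stage_arrival_rate(lam=0.05, b=4, horizon=200_000))  # close to 0.05
```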

3.1. First Stage, Single Queue Analysis. For an M/M/1/L queue, results from queueing theory are available [Coo72, GrH74, Kle75]. Let $p_k = \Pr\{k \text{ packets in the queueing station}\}$, $0 \le k \le L$, be the steady-state, general-time occupancy probabilities for a queueing station. Then,

$$p_k = \frac{(1-\rho)\rho^{k}}{1-\rho^{L+1}}, \quad 0 \le k \le L, \ \rho \ne 1; \qquad p_k = \frac{1}{L+1}, \quad \rho = 1, \qquad \text{M/M/1/L} \qquad (1.1)$$

where $\rho = \lambda/\mu$. Let N be the random variable representing the steady-state number of packets in a queueing station (N will be subscripted with a queue number later). Then,

$$E[N] = \sum_{k=1}^{L} k\,p_k = \frac{\rho\left[1-(L+1)\rho^{L}+L\rho^{L+1}\right]}{(1-\rho)(1-\rho^{L+1})}, \quad \rho \ne 1; \qquad E[N] = \frac{L}{2}, \quad \rho = 1, \qquad \text{M/M/1/L} \qquad (2)$$

Thus the first stage queue is easily analyzed.
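The finite-queue results (1.1) and (2) are easy to evaluate numerically; the following sketch (added for illustration) does so directly from $\rho$ and L:

```python
def mm1l_occupancy(rho, L):
    """Steady-state occupancy probabilities p_k, k = 0..L, of an M/M/1/L queue
    (equation (1.1)) and the mean number in the station E[N] (equation (2))."""
    if rho == 1.0:
        p = [1.0 / (L + 1)] * (L + 1)
        mean_n = L / 2.0
    else:
        norm = 1.0 - rho ** (L + 1)
        p = [(1.0 - rho) * rho ** k / norm for k in range(L + 1)]
        mean_n = rho * (1 - (L + 1) * rho ** L + L * rho ** (L + 1)) / ((1 - rho) * norm)
    return p, mean_n

p, mean_n = mm1l_occupancy(rho=0.5, L=8)
print(p[0], mean_n)        # p_0 is close to 1 - rho for a lightly loaded queue
```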

For successive stages the situation is generally more complicated. Consider the interdeparture process at the output of an M/M/1/* (* denotes any queue length) queueing station with Poisson input rate $\lambda$ and exponential service rate $\mu$; its probability density is found as follows [Kle75]. Let E be the event that the queue is empty after a departure and let T be the interdeparture time random variable. Let $f_T(t)$ be the probability density of T, $f_{T|E}(t)$ be the probability density of T given that E occurred, and $f_{T|\bar{E}}(t)$ be the probability density of T given that E did not occur. Then,

$$f_T(t) = f_{T|E}(t)\Pr\{E\} + f_{T|\bar{E}}(t)\left(1-\Pr\{E\}\right) = f_{T|E}(t)\,p_0 + f_{T|\bar{E}}(t)(1-p_0).$$

And $f_{T|\bar{E}}(t) = \mu e^{-\mu t}$ (the probability density of the service time), while $f_{T|E}(t)$ is the probability density of the service time plus the arrival time. Taking Laplace transforms,

$$f_T^{*}(s) = \frac{\lambda}{s+\lambda}\cdot\frac{\mu}{s+\mu}\,p_0 + \frac{\mu}{s+\mu}\,(1-p_0).$$

So

$$f_T(t) = \frac{1-\rho}{1-\rho^{L+1}}\cdot\frac{\lambda\mu}{\mu-\lambda}\left(e^{-\lambda t}-e^{-\mu t}\right) + \left(1-\frac{1-\rho}{1-\rho^{L+1}}\right)\mu e^{-\mu t}, \qquad \rho \ne 1 \qquad (3.1)$$

$$f_T(t) = \frac{1}{L+1}\,\mu^{2} t\, e^{-\mu t} + \frac{L}{L+1}\,\mu e^{-\mu t}, \qquad \rho = 1 \qquad (3.2)$$

Note that here, instead of using the departure-point probability of an empty queue, the asymptotic limit (as $L \to \infty$), i.e., the general-time probability, is used. This makes the density above approximate, but it allows bounds on the approximation regions (see later) to be derived in a simple manner. Note that the interdeparture process is a renewal process with a non-memoryless interdeparture time distribution. From this it may be seen that processes input to second stage queues are not renewal processes [Cin75] unless (to an approximation) $p_0 \approx 1-\rho$, $p_0 \approx 0$, or $\rho = 1$ with $L/(L+1) \approx 1$. If processes input to queueing stations are not renewal processes, the analysis is very complex and simulation may be a better approach for obtaining accurate results.
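As an added illustration (not from the report), the density (3.1)/(3.2) can be evaluated numerically and checked to integrate to one, using $p_0$ from (1.1); the parameter values below are arbitrary:

```python
import math

def interdeparture_density(t, lam, mu, L):
    """Approximate interdeparture-time density of an M/M/1/L stage,
    equations (3.1) (rho != 1) and (3.2) (rho == 1)."""
    rho = lam / mu
    if rho == 1.0:
        p0 = 1.0 / (L + 1)
        return p0 * mu * mu * t * math.exp(-mu * t) + (1 - p0) * mu * math.exp(-mu * t)
    p0 = (1 - rho) / (1 - rho ** (L + 1))
    empty_term = p0 * (lam * mu / (mu - lam)) * (math.exp(-lam * t) - math.exp(-mu * t))
    busy_term = (1 - p0) * mu * math.exp(-mu * t)
    return empty_term + busy_term

# Crude check that the density integrates to ~1 (midpoint rule out to t = 200).
lam, mu, L = 0.2, 1.0, 8
dt, total = 0.01, 0.0
for i in range(20_000):
    total += dt * interdeparture_density((i + 0.5) * dt, lam, mu, L)
print(total)   # close to 1.0
```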

The three cases $p_0 \approx 1-\rho$, $p_0 \approx 0$, and $\lambda = \mu$ correspond to the following situations:

Case I. $p_0 \approx 1-\rho$ corresponds to a lightly loaded M/M/1/L queueing station with an output process that is close to a Poisson process with rate $\lambda$. Note that $p_0 \approx 1-\rho$ is also an approximation to an M/M/1/$\infty$ queue, which has $p_0 = 1-\rho$. Since $1/\mu$ is the average time required for a packet transfer between stages and $1/\lambda$ is the average time between packet emissions (i.e., message emissions) from processors, $\mu \gg \lambda$ in a typical multiprocessor system running in an independent-processor, MIMD mode. $\mu$ models a high rate of operation while $\lambda$ models a low rate of operation (relative to $\mu$). Thus the lightly loaded situation is the one expected to be applicable in most systems.

Case II. $p_0 \approx 0$ corresponds to a saturated M/M/1/L queueing station; here the output process is close to a Poisson process with rate $\mu$, because the queue is empty with low probability.

Case III. $\rho = 1$ ($\lambda = \mu$) corresponds to an M/M/1/L queue where the input and output rates are about the same. In this case, the second expression (3.2) may be used to see that for L large (to be described later) the second term dominates the first, so the output process is approximately Poisson with rate $\mu$.

Notice that in these three cases, first stage output processes are approximately Poisson, so the second stage has approximate Poisson inputs. Obviously, there is some inaccuracy involved unless $p_0 = 1-\rho$, $p_0 = 0$, or $\rho = 1$ with $L \to \infty$.

The regions of validity for the approximations may be characterized in terms of allowable values for $p_0$, $\rho$, and L. First consider the region for which the lightly loaded M/M/1/L approximation applies. Since an M/M/1/$\infty$ queue has a Poisson interdeparture process with rate $\lambda$, a simple bound on the difference between $p_{0,M/M/1/\infty}$ and $p_{0,M/M/1/L}$ will suffice for a bound on the error (only $p_0$ affects $f_T(t)$ for both M/M/1/L and M/M/1/$\infty$ queues when $\rho \ne 1$). Define D as the maximum allowable difference between $p_{0,M/M/1/L}$ and $p_{0,M/M/1/\infty}$, in relative error with respect to $p_{0,M/M/1/\infty}$. Then,

$$D \ge \frac{p_{0,M/M/1/L} - p_{0,M/M/1/\infty}}{p_{0,M/M/1/\infty}} = \frac{\rho^{L+1}}{1-\rho^{L+1}}. \qquad (4)$$

That is,

$$\rho \le \left[\frac{D}{1+D}\right]^{\frac{1}{L+1}}$$

will satisfy (4). Thus selecting D and L places a bound on the region of validity (with respect to D) for the light loading approximation.
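A small helper (added here; it simply evaluates the bound (4) as written above) computes the largest $\rho$ admitted for a given D and L and verifies the resulting relative error in $p_0$:

```python
def light_load_rho_bound(D, L):
    """Largest rho for which the M/M/1/L empty-queue probability differs from
    the M/M/1/infinity value (1 - rho) by at most relative error D (bound (4))."""
    return (D / (1.0 + D)) ** (1.0 / (L + 1))

D, L = 0.05, 8
rho_max = light_load_rho_bound(D, L)
p0_finite = (1 - rho_max) / (1 - rho_max ** (L + 1))
p0_infinite = 1 - rho_max
print(rho_max, (p0_finite - p0_infinite) / p0_infinite)   # second value equals D
```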

Next consider the region of validity of the saturation approximation. Define $p_{0\max}$ as the maximum allowable value of $p_0$, i.e., the largest non-zero value deemed allowable. The requirement is then

$$p_{0\max} \ge \frac{1-\rho}{1-\rho^{L+1}}.$$

So for all $\rho \ge \hat\rho$ the saturation model will hold with $p_0 \le p_{0\max}$, where $\hat\rho$ is the solution to

$$p_{0\max}\left(1-\hat\rho^{\,L+1}\right) + \hat\rho = 1.$$

This establishes a space of saturation approximation validity where L and $p_{0\max}$ are the parameters.

Finally consider the $\rho = 1$ case. Here it is desirable to require an approximately Poisson output process because a renewal process is needed at the next queue input. This requirement is satisfied as follows. Let C be the minimum allowable coefficient of the $\mu e^{-\mu t}$ term in (3.2). Then

$$\frac{L}{L+1} \ge C, \quad \text{i.e.,} \quad L \ge \frac{C}{1-C},$$

for the $\rho = 1$ approximation to be accurate with respect to the chosen C. Figure 3 shows the regions of analysis validity for $p_{0\max} = D = 0.05$, $C = 0.95$. As can be seen from the graph, for L = 30 the analysis is relatively accurate for almost all $\rho$.

Figure 3. Approximation Validity Regions ($\rho$ versus L; the Case I and Case III regions are marked).

3.2. Network Analysis. Using the approximations from section 3.1, the network will now be analyzed to find PDT and NTP from section 2. Since queues are finite in length, packets will be rejected when they attempt to enter full queues. Due to this effect, packets may be lost at any stage of the network; that is, the network does not exhibit the blocking property [DiJ80]. When a packet is lost, it is routed back to its source to be resubmitted (alternatively, buffers could be placed between stages for lost packets, but this amounts to lengthening queues). Figure 4 shows the return path configuration. The submission/resubmission process resembles a Bernoulli process where the probability of event occurrence is

$$p = \Pr\{\text{a packet is not rejected at any queue}\}.$$

This approximation is supported by discrete time analysis and simulation [MaM81a]. The average number of trials before the first event occurrence in a Bernoulli process is

$$E[\text{number of trials before event occurrence}] = \frac{1-p}{p}.$$

Define the following:

$p_{k_i}$ = Pr{k packets in queue i at an arrival},

$T_i$ = random variable representing the time spent by a packet in queue i; service time is included in this measurement,

$T_{\text{reject}}$ = the average time from packet emission to return, when the packet is lost (rejected) somewhere in the network.

Figure 4. Resubmission Path (input paths, output paths, and rejected-packet return paths for stages 1 through z).

Then,

$$p = \prod_{i=1}^{z}\left(1-p_{L_i}\right).$$

With the Bernoulli submission/resubmission approximation,

$$\text{PDT} = \sum_{i=1}^{z} E[T_i] + \frac{1-p}{p}\,E[T_{\text{reject}}]. \qquad (5)$$

And

$$E[T_{\text{reject}}] = (1-p_{L_1})p_{L_2}E[T_1] + (1-p_{L_1})(1-p_{L_2})p_{L_3}\left(E[T_1]+E[T_2]\right) + \cdots + (1-p_{L_1})(1-p_{L_2})\cdots(1-p_{L_{z-1}})p_{L_z}\left(E[T_1]+\cdots+E[T_{z-1}]\right)$$
$$= \sum_{k=2}^{z}\left(\sum_{i=1}^{k-1}E[T_i]\right)p_{L_k}\prod_{j=1}^{k-1}\left(1-p_{L_j}\right), \qquad z \ge 2. \qquad (6)$$

Here resubmission processing by sources is assumed to be instantaneous. See

[MaM81b] for a similar approximation. The Bernoulli resubmission process approximation is expected to hold when: (1) $\rho$ is small enough that the $p_{L_i}$'s are small (this is Case I), and (2) the $p_{L_i}$'s are the same for all packet submissions and resubmissions.

Case I Analysis. For the light loading approximation, Figure 5.a shows a series of z queues that represents a transmission path through the network. This equivalent series of queues relies on the equal-rate, uniform destination distribution, and unique connection path assumptions. The analysis may be extended (by considering Poisson flow rates) to accommodate nonequal, general destination distributions, but the analysis is then specific to the network topology, network rates, and distributions considered. From Little's formula,

$$E[T_i] = \frac{E[N_i]}{\lambda\left(1-p_{L_i}\right)}, \qquad i = 1, 2, \ldots, z.$$

From (1.1),

$$p_{L_i} = \frac{(1-\rho)\rho^{L}}{1-\rho^{L+1}}, \qquad i = 1, 2, \ldots, z.$$

So from (2),

$$E[T_i] = E[T] = \frac{1-(L+1)\rho^{L}+L\rho^{L+1}}{\mu(1-\rho)\left(1-\rho^{L}\right)}, \qquad i = 1, 2, \ldots, z.$$

And,

$$p = \prod_{i=1}^{z}\left(1-p_{L_i}\right) = \left[\frac{1-\rho^{L}}{1-\rho^{L+1}}\right]^{z}.$$

Figure 5. Equivalent Queue Series (parts 5.a and 5.b).

From (6),

$$E[T_{\text{reject}}] = \sum_{k=2}^{z}(k-1)E[T]\,p_L\left(1-p_L\right)^{k-1} = p_L E[T]\sum_{k=1}^{z-1} k\left(1-p_L\right)^{k} = \frac{\beta}{1-\beta}\left(1 - z\beta^{z-1} + (z-1)\beta^{z}\right)E[T], \qquad z \ge 2, \qquad (\text{CI.1})$$

where

$$\beta = 1-p_L = \frac{1-\rho^{L}}{1-\rho^{L+1}}, \qquad \rho = \frac{\lambda}{\mu}.$$

Therefore,

$$\text{PDT} = \left[z + \frac{1-\beta^{z}}{\beta^{z}}\cdot\frac{\beta}{1-\beta}\left(1 - z\beta^{z-1} + (z-1)\beta^{z}\right)\right]E[T]. \qquad (\text{CI.2})$$

To find NTP, simply subtract the rate of network packet loss from $n\lambda$, the packet input rate:

$$\text{NTP} = n\lambda - n\lambda\,p_{L} = n\lambda\,\frac{1-\rho^{L}}{1-\rho^{L+1}}.$$

PDT and NTP for n = 64, b = 4, versus $\rho$ for several L, are shown in Figure 6. Note that as $L \to \infty$, NTP $\to n\lambda$, as would be expected because no packets are lost when L is very large. Note also that as L gets large the cost of the network becomes large for both delta networks and crossbars (i.e., the largest cost factor of the network becomes the queues), so it is actually less costly to use a crossbar because only n queues are required, whereas for multistage networks $n\log_b n$ queues are needed. With a single stage, PDT is considerably better also. For a single stage, the analysis is generally much more flexible [MaM81b].

Figure 6. PDT and NTP for n = 64, b = 4 (plotted versus $\rho$ for several values of L).
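The Case I expressions are straightforward to evaluate; the sketch below (added as an illustration, using the formulas as reconstructed above, with arbitrary parameter values) computes PDT and NTP for given $\lambda$, $\mu$, L, z, and n:

```python
def case1_pdt_ntp(lam, mu, L, z, n):
    """Case I (light loading): per-stage delay E[T], blocking probability p_L,
    packet delay time PDT (CI.2) and network throughput NTP."""
    rho = lam / mu
    p_L = (1 - rho) * rho ** L / (1 - rho ** (L + 1))
    beta = 1 - p_L
    # Mean time in one queueing station, from (2) and Little's formula.
    ET = (1 - (L + 1) * rho ** L + L * rho ** (L + 1)) / (mu * (1 - rho) * (1 - rho ** L))
    # Mean time lost on a rejected submission, equation (CI.1).
    ET_reject = (beta / (1 - beta)) * (1 - z * beta ** (z - 1) + (z - 1) * beta ** z) * ET
    p = beta ** z                              # probability a submission is never rejected
    pdt = z * ET + (1 - p) / p * ET_reject     # equation (5), i.e. (CI.2)
    ntp = n * lam * (1 - p_L)                  # input rate minus first-stage loss rate
    return pdt, ntp

print(case1_pdt_ntp(lam=0.2, mu=1.0, L=8, z=3, n=64))
```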

Case II Analysis. From Figure 5.b, the equivalent approximate series of queues,

$$E[T_1] = \frac{E[N_1]}{\lambda\left(1-p_{L_1}\right)} = \frac{1-(L+1)\rho^{L}+L\rho^{L+1}}{\mu(1-\rho)\left(1-\rho^{L}\right)}, \qquad E[T_i] = \frac{E[N_i]}{\mu\left(1-p_{L_2}\right)}, \qquad i = 2, 3, \ldots, z.$$

Therefore,

$$E[T_{\text{reject}}] = E[T_1]\left(1-p_{L_1}\right)\left(1-\left(1-p_{L_2}\right)^{z-1}\right) + E[T_2]\left(1-p_{L_1}\right)\left(1-p_{L_2}\right)\frac{1-(z-1)\left(1-p_{L_2}\right)^{z-2}+(z-2)\left(1-p_{L_2}\right)^{z-1}}{p_{L_2}},$$

where

$$1-p_{L_1} = \frac{1-\rho^{L}}{1-\rho^{L+1}}, \qquad p_{L_2} = \frac{1}{L+1}.$$

Finally, from (5),

$$\text{PDT} = E[T_1] + (z-1)E[T_2] + \frac{1-p}{p}\,E[T_{\text{reject}}].$$

Again, find NTP from

$$\text{NTP} = n\lambda - n\lambda\,p_{L_1} - n\sum_{i=2}^{z}\mu\,p_{L_i} = n\lambda - n\lambda\,p_{L_1} - n(z-1)\frac{\mu}{L+1},$$

provided NTP $\ge 0$ (incorrect use of the approximation can result in NTP < 0, in which case NTP $\approx$ 0 in actuality). If a detailed analysis were available, this requirement would be satisfied because the processes between stages would not be approximate.

Case III Analysis. Here the equivalent series of queues is again given by Figure 5.b with $\lambda = \mu$; each queue then operates at $\rho = 1$ and carries an accepted rate of $\mu L/(L+1)$, and

$$E[T_i] = E[T_1], \qquad i = 1, 2, \ldots, z.$$

$E[T_{\text{reject}}]$ may be found using (CI.1) with

$$\beta = 1 - p_L = \frac{L}{L+1},$$

and PDT is given by (CI.2). Finally,

$$\text{NTP} = n\lambda\left(1 - \frac{z}{L+1}\right),$$

provided $z/(L+1) < 1$.
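For Case III, the following sketch (added here; the explicit value $E[T] = (L+1)/(2\mu)$ is derived from $E[N] = L/2$ in (2) and Little's formula rather than quoted from the report, and the CI.1/CI.2 expressions are used as reconstructed above) evaluates PDT with $\beta = L/(L+1)$:

```python
def case3_pdt(mu, L, z):
    """Case III (rho = 1): PDT from (CI.1)/(CI.2) with beta = L/(L+1).
    E[T] = (L+1)/(2*mu) follows from E[N] = L/2 (equation (2) at rho = 1) and
    Little's formula with accepted rate mu*L/(L+1)."""
    beta = L / (L + 1.0)
    ET = (L + 1.0) / (2.0 * mu)
    ET_reject = (beta / (1 - beta)) * (1 - z * beta ** (z - 1) + (z - 1) * beta ** z) * ET
    p = beta ** z                       # probability of traversing all z stages unblocked
    return z * ET + (1 - p) / p * ET_reject

for L in (4, 8, 16, 32):
    print(L, case3_pdt(mu=1.0, L=L, z=3))   # delay grows roughly linearly with L
```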

4. Conclusion. An analysis of a class of packet switched, multistage interconnection networks that exhibit the bit controlled property was presented. The analysis (with approximations) allows network communication delay and network throughput to be evaluated for certain combinations of queue lengths, interstage transfer rates, and processor packet emission rates. From the analysis, queue lengths that satisfy performance requirements may be chosen.

5. References.
[BaS76] F. Baskett and A. J. Smith, "Interference in Multiprocessor Computer Systems with Interleaved Memory," CACM, Vol. 19, No. 6, June 1976, pp. 327-334.
[Ben65] V. E. Benes, Mathematical Theory of Connecting Networks and Telephone Traffic, Academic Press, New York, 1965.
[Bha75] D. P. Bhandarkar, "Analysis of Memory Interference in Multiprocessors," IEEE TC, Vol. C-24, No. 9, Sept. 1975, pp. 897-908.
[Cin75] E. Cinlar, Introduction to Stochastic Processes, Prentice-Hall Inc., Englewood Cliffs, N.J., 1975.
[Clo53] C. Clos, "A Study of Non-blocking Switching Networks," The Bell System Technical Journal, Vol. 32, March 1953, pp. 406-424.
[Coo72] R. B. Cooper, Introduction to Queueing Theory, The Macmillan Company, New York, 1972.
[DiJ80] D. M. Dias and J. R. Jump, "Analysis and Simulation of Buffered Delta Networks," Proc. Workshop on Interconnection Networks, Purdue University, April 21-22, 1980.
[GrH74] D. Gross and C. M. Harris, Fundamentals of Queueing Theory, John Wiley and Sons Inc., New York, 1974.
[Hoo77] C. H. Hoogendoorn, "A General Model for Memory Interference in Multiprocessors," IEEE TC, Vol. C-26, No. 10, Oct. 1977, pp. 998-1005.
[GoL73] L. R. Goke and G. J. Lipovski, "Banyan Networks for Partitioning Multiprocessor Systems," Proc. First Annual Symp. on Computer Architecture, IEEE, Dec. 1973, pp. 21-28.
[Kle75] L. Kleinrock, Queueing Systems Volume I: Theory, John Wiley & Sons Inc., New York, 1975.
[Law75] D. H. Lawrie, "Access and Alignment of Data in an Array Processor," IEEE TC, Vol. C-24, No. 12, Dec. 1975, pp. 1145-1155.

[MaG81] M. A. Marsan and M. Gerla, Markov Models for Multiple Bus Multiprocessor Systems, Report No. CSD 810304, Computer Science Department, UCLA, Feb. 1981.
[MaM81a] B. A. Makrucki and T. N. Mudge, Probabilistic Analysis of a Crossbar Switch, SEL Report No. 150, Department of Electrical and Computer Engineering, University of Michigan, March 1981.
[MaM81b] B. A. Makrucki and T. N. Mudge, A Multiple M/D/1/L Queueing Model of Crossbar-based Multiprocessors, SEL Report No. 157, Department of Electrical and Computer Engineering, University of Michigan, September 1981.
[Pat79] J. H. Patel, "Processor-Memory Interconnections for Multiprocessors," Proc. 6th Annual Symp. on Computer Architecture, IEEE, April 1979, pp. 166-177.
[Rau79] B. R. Rau, "Interleaved Memory Bandwidth in a Model of a Multiprocessor Computer System," IEEE TC, Vol. C-28, No. 9, Sept. 1979, pp. 678-681.
[SeD79] A. S. Sethi and N. Deo, "Interference in Multiprocessor Systems with Localized Memory Access Probabilities," IEEE TC, Vol. C-28, No. 2, Feb. 1979, pp. 157-163.
[Sie77] H. J. Siegel, "Analysis Techniques for SIMD Machine Interconnection Networks and the Effects of Processor Address Masks," IEEE TC, Vol. C-26, No. 2, Feb. 1977, pp. 153-161.
[Sie80] H. J. Siegel (Ed.), Proc. Workshop on Interconnection Networks, Purdue University, April 21-22, 1980.
[SkA69] C. E. Skinner and J. R. Asher, "Effects of storage contention on system performance," IBM Systems Journal, No. 4, 1969, pp. 319-333.
[Str70] W. D. Strecker, Analysis of the Instruction Execution Rate in Certain Computer Structures, Ph.D. dissertation, Carnegie-Mellon University, Pittsburgh, 1970.

SEL-TR-167 Note on a Queueing Model of Delta Networks B. A. Makrucki T. N. Mudge July 1982 This work was supported in part by National Science Foundation Grant MCS-8009315 DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING SYSTEMS ENGINEERING LABORATORY THE UNIVERSITY OF MICHIGAN, ANN ARBOR 48109

This is intended as an addendum to A Queueing Model of Delta Networks (SEL Report No. 159), where it was assumed that packets rejected from multistage interconnection networks (ICN's) are sent back to their source for resubmission. This approach "penalizes" packets lost in the latter stages of the network more severely and may be misleading in overestimating the actual PDT (packet delay time) for ICN's which do exhibit blocking. This addendum describes a modification which may be more realistic in predicting PDT when blocking occurs. As an approximation to the rejection/blocking phenomenon, assume that a packet lost at stage i ($2 \le i \le z$) is sent back to the input of its stage i-1 queue. This should be a feasible approximation because a packet leaving stage i-1 leaves an open position in its queue, and if it is rejected at the input of stage i, it may return to stage i-1 where its position will still be available with high probability. If the rejection is practically instantaneous, then by the approximation of Poisson processes, another packet will arrive at the same time with probability 0 (or close to it). Hence the assumption of a single stage rejection delay seems reasonable. Revising the Bernoulli resubmission approximation of equation (5) of SEL 159:

$$\text{PDT} \approx \sum_{i=1}^{z} E[T_i] + \sum_{i=2}^{z} \frac{p_{L_i}}{1-p_{L_i}}\,E[T_{i-1}].$$

Here $1/(1-p_{L_i})$ is the average number of times that a packet must try to enter a queue in stage i before it is accepted, and each rejection takes $E[T_{i-1}]$ time units as its mean retry time. Or,

$$\text{PDT} \approx E[T_1] + \sum_{i=2}^{z}\left[E[T_i] + \frac{p_{L_i}}{1-p_{L_i}}\,E[T_{i-1}]\right],$$

which leads to:

Case I:

$$E[T_i] = E[T], \quad i = 1, \ldots, z; \qquad p_{L_i} = p_L = \frac{(1-\rho)\rho^{L}}{1-\rho^{L+1}}, \quad i = 1, \ldots, z.$$

So

$$\text{PDT} = E[T] + \frac{z-1}{1-p_L}\,E[T] = \left[1 + (z-1)\,\frac{1-\rho^{L+1}}{1-\rho^{L}}\right]E[T].$$

Case II:

$$\text{PDT} = E[T_1] + (z-1)E[T_2] + \frac{p_{L_2}}{1-p_{L_2}}\left(E[T_1] + (z-2)E[T_2]\right), \qquad p_{L_2} = \frac{1}{L+1}.$$

Case III is handled in a similar manner, as discussed in SEL 159. This technique for approximating network blocking seems more reasonable than the one previously presented.
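As a quick consistency check (added for illustration; it relies on the per-stage quantities from SEL 159 as reconstructed above), the revised PDT can be evaluated directly from per-stage delays and blocking probabilities:

```python
def revised_pdt(ET, pL):
    """Revised (single-stage resubmission) PDT of this addendum:
    PDT = E[T_1] + sum_{i=2..z} ( E[T_i] + pL_i/(1 - pL_i) * E[T_{i-1}] ),
    where ET[i] and pL[i] are per-stage mean delays and blocking probabilities."""
    z = len(ET)
    pdt = ET[0]
    for i in range(1, z):
        pdt += ET[i] + pL[i] / (1.0 - pL[i]) * ET[i - 1]
    return pdt

# Case I style check: identical stages, so PDT = [1 + (z-1)/(1-pL)] * E[T].
lam, mu, L, z = 0.2, 1.0, 8, 3
rho = lam / mu
pL = (1 - rho) * rho ** L / (1 - rho ** (L + 1))
ET = (1 - (L + 1) * rho ** L + L * rho ** (L + 1)) / (mu * (1 - rho) * (1 - rho ** L))
print(revised_pdt([ET] * z, [pL] * z))
print((1 + (z - 1) / (1 - pL)) * ET)      # same value, via the closed form
```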