034796-3-T (RL-2479)

Parallel Electromagnetic Solvers for High Frequency Antenna and Circuit Design

Linda P.B. Katehi, Donghoon Chen, Eray Yasan, Jong-Gwan Yook

April 1998

1998 CEN4 Review & Testing at WPAFB, OH: April 14, 1998

Parallel Electromagnetic Solvers for High Frequency Antenna and Circuit Design

Graduate Students: Donghoon Chen and Eray Yasan
Research Fellow: Jong-Gwan Yook
Principal Investigator: Linda Katehi
Radiation Laboratory, Department of Electrical Engineering and Computer Science, The University of Michigan, Ann Arbor, MI 48109-2122, U.S.A.

Project Leaders: Leo DiDomenico and Dr. Barry Perlman
Applied Communications-Electronics Laboratory, US Army Communications-Electronics Command (CECOM), Command Control & Systems Integration Directorate, AMSEL-RD-C2, Ft. Monmouth, NJ 07703

Contents

1 Introduction ... 2
2 Project Goal ... 2
3 Integral Equation with Multiresolution Analysis [1, 2] ... 3
4 Parallelized FEM for MMIC Simulation [1, 2, 5] ... 4
5 Active Circuits in FEM [6, 7] ... 6
A User Manual for FEM Part ... 7
  A.1 Compilation and Execution ... 7
  A.2 User Manual for Phase I - Mesh Generation ... 9
  A.3 Sample Output Data Files for Phase I ... 17
  A.4 User Manual for Phase II - FEM Main Program ... 20
B Benchmarking ... 23
C User Manual for Hybrid (FEM + MoM) Method ... 28
  C.1 Procedure ... 28
  C.2 File Flow Chart ... 29
  C.3 File Description ... 30
  C.4 Compilation and Execution ... 31
  C.5 User Manual for FEM Mesh ... 34
  C.6 User Manual - MoM ... 41
D Application of Parallel Computing in MoM ... 43
  D.1 Introduction ... 43
  D.2 Parallelization Model of MoM Matrix Fill Process ... 44
  D.3 Parallelization Model of FWT ... 48
  D.4 Conclusion ... 49
E The Implementation of Parallel Programming in C++ ... 51
  E.1 Dynamic Job Allocation ... 51
  E.2 Static Job Allocation ... 53
F Publications ... 59

1 Introduction

This report summarizes the effort on the development of accurate and computationally efficient codes for large-scale electromagnetic problems, with emphasis on three-dimensional monolithic circuits for high-frequency applications. This development is being accomplished by two approaches:

* increasingly sparse matrix techniques, also known as near-matrix diagonalization (NMD) techniques;
* code parallelization using the Message Passing Interface (MPI).

In the following, a description of the CHSSI scalable software project goals and progress summaries for the various efforts are provided.

2 Project Goal

High-frequency radar systems for communication, detection, and surveillance incorporate MMIC modules which are characterized by high density and substantial geometrical complexity. In most cases these modules are packaged for electrical and/or environmental protection, and the resulting 3D structures exhibit a considerable complexity that impedes design and degrades electrical performance through unwanted parasitics. The design of these complex three-dimensional monolithic circuits (MMIC modules), with hundreds of vias and lines interconnecting a large number of active and passive circuit components to a rather large number of radiating elements, has been a problem of critical importance to DoD. The design and characterization of these circuits requires numerical approaches which can fully characterize the excited electromagnetic fields. Among the various techniques, the method of moments has demonstrated superiority in solving radiation problems, but it has been hindered by long computational times and is limited, for this reason, to small circuits not exceeding more than a few components and interconnecting lines. In this study, we have concentrated on the development of codes with the capability to accelerate numerical computations in electromagnetic problems with large computational domains and/or computationally intensive tasks. Among the existing full-wave methods developed for the solution of EM problems are the Integral Equation (IE) technique and the Finite Element Method (FEM). These numerical techniques solve Maxwell's equations in the frequency domain using large computational volumes and intensive numerical calculations which have to be performed repeatedly at all frequency points of interest. In addition, the numerical solution of Maxwell's equations results in full (IE case) or very large sparse (FEM case) matrices, which further limits the applicability of these techniques.

To substantially increase the computational efficiency of these techniques, we concentrate on improving the IE method through near-matrix diagonalization techniques (multiresolution analysis) as well as code parallelization, and the FEM through task parallelization. The project leverages the existing synergism between DoD and University of Michigan efforts on large-scale simulations with advanced software systems. The team will concentrate on the development, validation, documentation, visualization, and demonstration of scalable algorithms for modeling critical DoD problems in high-frequency electromagnetic circuits. Technology transfer to the DoD user community is an integral component of the project. In addition, cross-platform portability and software reusability will be emphasized using the Message Passing Interface (MPI) standard. In our effort, the development of scalable software is based on the IE and FEM techniques to sustain their high accuracy while, at the same time, solving complex planar circuits, including their packages, in very short times, allowing for real-time simulations. Such software can eventually lead to real-time design and optimization.

3 Integral Equation with Multiresolution Analysis [1, 2]

Our previous study indicated that the use of multiresolution expansions in integral-equation-based formulations leads to the generation of highly sparse linear systems whose numerical solution is fast and efficient with minimal memory usage. This finding suggests that the application of wavelet techniques to the method of moments makes the modeling of large-scale electromagnetic problems possible. The capability of the multiresolution analysis (MRA) to compete with the fastest up-to-date techniques, such as the Fast Multipole Method (FMM) and the Multilevel Fast Multipole Method (MFMM), has been demonstrated through very thorough computer experiments. We expect this trend to persist as the matrix size increases to hundreds of thousands or millions. The rooftop multiresolution basis is used for the expansion of planar currents, in a way that the relevant boundary and edge conditions are satisfied to an appropriate degree of accuracy. Through this project it is shown that wavelet-dominated expansion bases generate highly sparse moment matrices, and the resulting sparse linear systems can be solved numerically using very efficient computational tools. The fast wavelet transform (FWT) is exploited for the numerical integration of the Green's function to speed up the matrix fill process. This work emphasizes results for microstrip patch antennas and finite-size arrays of such antennas. In addition, special emphasis is placed on the solution of large-scale problems which could not otherwise be handled by conventional moment method implementations. These efforts should yield a unique design capability for maintaining technology leadership. The MRA allows the analysis software to perform a thresholding process on the entire matrix.

An additional benefit in the development of this software is the available theoretical foundation for the thresholding required for matrix sparsification. The result of this code development will be the use of scalable parallel high-performance computing systems to solve electrically large electromagnetics problems of critical importance to DoD. While in these problems we have been able to achieve matrix sparsities of the order of 99.9% (a perfectly diagonal matrix has a sparsity of (1 - 1/N) x 100%, which equals 99.9% for N = 1000), the numerical computations have become more complex and lengthy. A large portion of these computations takes the form of multiple nested loops in which similar types of intensive numerical computations are repeated. Furthermore, a large number of branched or cascaded loops are performed in a serial fashion. Parallelization of these code segments improves the speed and efficiency to a large extent. Another approach to improving the computation time is the fast wavelet algorithm. This algorithm performs all the computations at the highest resolution level only once and then reconstructs the computations at the remaining resolution levels very efficiently. In addition to code parallelization, domain decomposition techniques and task parallelization have great potential with this technique and are applied in addition to near-matrix diagonalization. Demonstration of scalable simulations will address the critical DoD application area of electrically large antennas integrated with multi-layer RF front-end electronics which are themselves electrically large. Such systems need to be analyzed at the phenomenological level as a whole in order to understand the details of device performance in the large-subsystem environment.

4 Parallelized FEM for MMIC Simulation [1, 2, 5]

The numerical analysis of MMICs on parallel computers becomes powerful especially in FEM-based codes. The frequency parallelization of the FEM achieves linearly scalable performance with the number of processors used. The frequency parallelization is implemented using the MPI standard for distributed-memory machines such as the IBM SP2. The frequency-parallelized FEM code has been applied to a microstrip feeding network for a patch array antenna, and its field plots are available through a recently developed Matlab post-processing file. Parallelization schemes will be further applied to patch antenna structures, including their microstrip feeding networks, using a hybrid method (MoM/FEM). This work will involve the parallelization of the matrix generation for this hybrid method, the parallelization of the far-field calculation for each array antenna, and the development of post-processor files for radiation patterns. The parallelized FEM scheme is also appropriate for the characterization of microwave/millimeter-wave packages. As efficient packaging technologies are emphasized for high-frequency, high-performance microwave/millimeter-wave circuits, it is very important to provide design rules to avoid unwanted electromagnetic effects of the package on the circuit performance. In this study, the electromagnetic effects of a via fence on a microstrip/stripline and the suppression of higher-order modes in the cavity are investigated.

Three different types of via fences are considered: a continuous metal-filled via fence on both sides of the circuit, a short section of via fence on both sides of the circuit, and a via fence crossing the circuit. In each case, the fence is moved between 1h and 4h from the circuit, where h is the height between the ground plane and the center conductor of the circuit. As the scattering parameters and radiation loss of the circuits do not vary by more than a few percent over the simulation frequency range (10 GHz - 40 GHz), average values over this range are calculated for each case. Field distributions are also generated. The results show that when the via fence is closer to the circuit the EM effects are stronger, resulting in increased reflection coefficient, insertion loss, and radiation loss. Furthermore, the results show that a closely spaced via fence confines the electric field more than a widely spaced via fence, giving less radiation loss. It is noted that the stripline shows better EM performance than the microstrip with respect to the via fence in the package. The suppression of higher-order packaging modes is studied for a cavity with two different types of striplines: (a) a cavity with two isolated striplines and (b) a cavity with one bent stripline. In the former case, a via fence is placed between the two isolated striplines and the via gap is varied from 57.228h to 13.932h. In order to quantify the suppression of the higher-order modes, only one stripline is excited and the ratio of the maximum magnitude of the total electric field on the unexcited stripline to that on the excited stripline is calculated. Results show that the via gap should be less than 0.394 λc for a 1h x 1h via to suppress the unwanted cavity mode below -50 dB, where λc is the cutoff wavelength of the dominant mode in the cavity. The field distribution over the cavity is shown in the results. The cavity with the bent stripline is simulated for different via gaps, from 56.76h to 2.6h. The scattering parameters of the stripline are calculated for each case and field distributions over the cavity are presented. It is seen that the via gap should not be greater than 0.248 λc for a 1h x 1h via to suppress the cavity mode in the package. All simulations have been carried out with a parallelized 3D FEM. The frequency parallelization scheme is implemented in this study because of the independent nature of the problem in the frequency domain. The parallel computer used for the simulations is the distributed-memory IBM SP2 at the University of Michigan. The performance of the parallelized 3D FEM program improves nearly linearly with the number of processes used. This study shows the electromagnetic effects of a via fence on a microstrip/stripline and the suppression of higher-order modes in the package. The results can be used as guidelines for designing high-performance, high-frequency packages.

5 Active Circuits in FEM [6, 7]

Based on our previous research, we have launched an effort to analyze active as well as passive circuits, such as a mixer circuit (with 4 diodes), by use of the FEM. This extends our previous effort in that we have added to our codes the capability of handling nonlinear elements. Using the FEM, the Z-parameters (or Y-parameters) of the passive part of the circuit are computed, and these values are then fed into a software package such as Libra in the form of a linear data element with inner and outer ports. The linear or nonlinear elements are connected to the inner ports, and the whole circuit is analyzed by harmonic balance in order to find the input/output characteristics. The goal is to verify the experimental results of the mixer circuits under consideration and make further improvements. In the next step, the harmonic balance code will be integrated into the FEM method and parallelized. The direct inclusion of linear passive elements (resistors, capacitors, etc.) into the FEM code has already been accomplished.

A User Manual for FEM Part

A.1 Compilation and Execution

There are two important files, one for meshing and the other for the FEM main program, as illustrated in Fig. 1; both are written in the high-level language Fortran. The first file, fem-mesh.f, is a serial code, since meshing takes only a small fraction of the overall solution process. However, the second Fortran code, fem-main.f, has been parallelized using the Message Passing Interface (MPI) to enhance portability of the code across different parallel computer architectures and vendors. Here are some examples of how to compile these two codes on parallel computers, for example the IBM SP2 and the SGI PowerChallenge Array (PCA):

Compile meshing program
  on a conventional workstation:
    f77 -O -o fem-mesh.x fem-mesh.f

Compile main program
  on IBM SP2:
    mpxlf -us -O3 -lmpi -o fem-main.x fem-main.f
  on SGI PCA:
    f77 -64 -O2 -o fem-main.x fem-main.f -L/usr/lib64 -lmpi

Executing the compiled codes is fairly straightforward for the meshing code, but not for the FEM main part, where the procedure differs between parallel platforms. On the IBM SP2, one can use LoadLeveler to submit jobs in batch mode, or utilize interactive nodes for small, short-running jobs and for debugging purposes. Here are some examples of executing the codes on the IBM and SGI machines:

Executing meshing program (Phase I)
  on a conventional workstation:
    fem-mesh.x < input-mesh-file

Executing main program (Phase II) with 8 CPUs
  on IBM SP2:
    fem-main.x -rmpool 0 -procs 8 < input-main-file
  on SGI PCA:
    mpirun -np 8 fem-main.x < input-main-file

Note that these scripts are specific to the environment and operating system. For example, the Maui High Performance Computing Center (MHPCC) and the Center for Parallel Computing (CPC) at the University of Michigan have similar IBM SP2 machines, but they have different batch systems and scheduling arrangements. Therefore, users need to be aware of the system and environment they work with.
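For batch execution under LoadLeveler, a minimal command file might look like the following sketch. The "# @" directives are standard LoadLeveler keywords, but the node count, task count, and time limit are illustrative assumptions; sites such as MHPCC and CPC each require their own class and queue settings, so consult the local documentation.

  # FEM main run: 2 nodes x 4 tasks = 8 MPI processes (assumed sizes)
  # @ job_type         = parallel
  # @ output           = fem_main.out
  # @ error            = fem_main.err
  # @ node             = 2
  # @ tasks_per_node   = 4
  # @ wall_clock_limit = 01:00:00
  # @ queue
  fem-main.x < input-main-file

Such a file is submitted with llsubmit and its status monitored with llq.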

[Flow chart: Start -> Phase I: Mesh Generation -> Mesh Data Files; Input File I and Input File II -> Phase II: FEM Main Routine -> EM Field Data Files -> Phase III: Visualization -> Finish.]

Figure 1: Flow chart of the finite element method (FEM) solution process.

A.2 User Manual for Phase I - Mesh Generation

In this section, a brief description of the input file for Phase I (mesh generation) is given. As an example, we have chosen a simple patch antenna on a duroid substrate operating at 4.0 GHz. The schematic of the antenna is shown in Fig. 2. To model this geometry, one first needs to define an outer terminating boundary as shown in Fig. 3. This terminating boundary is crucial for mesh termination and for control of the overall problem size. As shown in the figure, the distances of the boundary from the structure to be modeled, Dside and Dtop, need to be at least 4 to 5 times the substrate thickness, or 0.2 x λ, for correct results. The outer boundaries are assigned to be perfect electric or magnetic conductor (PEC or PMC) surfaces depending on the application.

Figure 2: Single-element patch antenna on duroid substrate: L = 11.2 mm, W = 8.0 mm, h = 2.25 mm, εr,duroid = 10.8.

In addition, for the simulation of open-boundary problems, such as antennas, artificial absorbing materials need to be placed inside the outer terminating boundaries as shown in Fig. 4. The material constants (εr and μr) are chosen to minimize the reflections at the interface and to maximize the electromagnetic field attenuation in the material. Please refer to [3] for more details. After the outer boundaries are defined, one needs to specify the locations of the conducting surfaces, such as the antenna and its feeding network. As shown in Fig. 5, the antenna and feeding network are divided into a number of rectangular sections, each of which can be defined by any two diagonal points.
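As a quick worked check of this rule of thumb for the 4.0 GHz example (numbers added here for clarity, assuming λ is the free-space wavelength): λ0 = c/f ≈ 75 mm, so 0.2 x λ0 ≈ 15 mm, whereas 4 to 5 times the substrate thickness h = 2.25 mm gives only about 9 to 11 mm; the larger of the two values is the safer choice. The sample outer boundary below (Zmax = 20 mm above a dielectric region extending to z = 2.55 mm) is consistent with this spacing.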

Figure 3: Illustration of the total solution space and outer boundary.

Outer Boundary Definitions

0.0, 24.0      Xmin, Xmax
-10.0, 54.0    Ymin, Ymax
0.0, 20.0      Zmin, Zmax

Note 1. One has to define all of the above parameters regardless of whether the solution space is an open or closed boundary.
Note 2. For closed-boundary problems, such as cavity problems, one can use all the outside surfaces as PEC.
Note 3. For open-boundary problems, such as antenna problems, one needs to design and place artificial absorbing material inside the outermost boundary.
Note 4. When there are symmetries in the problem, one can take advantage of them to reduce the overall problem size.

Figure 4: Artificial absorbing layers (AA for free space, AA for substrate) for the simulation of an open-boundary problem.

Figure 5: Example of surface boundary condition application with appropriate rectangulation.

Discretization

2              discretization method: [1] uniform discretization, [2] non-uniform discretization
5              number of divisions in the x direction
1.00000, 4     Δx1, Nx1
1.00000, 16    Δx2, Nx2
0.80000, 3     Δx3, Nx3
0.50000, 1     Δx4, Nx4
0.55000, 2     Δx5, Nx5
6              number of divisions in the y direction
1.00000, 4     Δy1, Ny1
1.00000, 26    Δy2, Ny2
0.76625, 8     Δy3, Ny3
0.84500, 6     Δy4, Ny4
0.94000, 20    Δy5, Ny5
1.00000, 4     Δy6, Ny6
3              number of divisions in the z direction
0.85000, 3     Δz1, Nz1
1.24500, 10    Δz2, Nz2
1.00000, 5     Δz3, Nz3

Note. Δxi, Δyi, Δzi: size of the ith subsection in the x, y, z direction. Nxi, Nyi, Nzi: number of divisions in the ith subsection.

Material Property: Real and Imaginary Parts of Each Material

9              number of materials
10.8, 0.0      Re(εr1), Im(εr1): duroid substrate
1.0, 0.0       Re(μr1), Im(μr1)
10.8, 15.0     Re(εr2), Im(εr2): AA for duroid substrate
1.0, 1.3889    Re(μr2), Im(μr2)
10.8, 15.0     Re(εr3), Im(εr3): AA for duroid substrate
1.0, 1.3889    Re(μr3), Im(μr3)
10.8, 15.0     Re(εr4), Im(εr4): AA for duroid substrate
1.0, 1.3889    Re(μr4), Im(μr4)
1.0, 5.0       Re(εr5), Im(εr5): AA for free space
1.0, 5.0       Re(μr5), Im(μr5)
1.0, 5.0       Re(εr6), Im(εr6): AA for free space
1.0, 5.0       Re(μr6), Im(μr6)
1.0, 5.0       Re(εr7), Im(εr7): AA for free space
1.0, 5.0       Re(μr7), Im(μr7)
1.0, 5.0       Re(εr8), Im(εr8): AA for free space
1.0, 5.0       Re(μr8), Im(μr8)
1.0, 0.0       Re(εr9), Im(εr9): free space
1.0, 0.0       Re(μr9), Im(μr9)

Note 1. For each material, the real and imaginary parts of the relative permittivity (εr) and permeability (μr) need to be defined as above.
Note 2. Artificial absorbers (AA) for free space and for any arbitrary material have to be chosen by the user. The formulas can be found in [3].

Material Allocations: (Xmin, Ymin, Zmin)i and (Xmax, Ymax, Zmax)i
Two points define the diagonal of a cube containing the given material.

4.0, -6.0, 0.0      (Xmin, Ymin, Zmin)1
24.0, 50.0, 2.55    (Xmax, Ymax, Zmax)1
0.0, -10.0, 0.0     (Xmin, Ymin, Zmin)2
4.0, 54.0, 2.55     (Xmax, Ymax, Zmax)2
4.0, -10.0, 0.0     (Xmin, Ymin, Zmin)3
24.0, -6.0, 2.55    (Xmax, Ymax, Zmax)3
4.0, 50.0, 0.0      (Xmin, Ymin, Zmin)4
24.0, 54.0, 2.55    (Xmax, Ymax, Zmax)4
0.0, -10.0, 2.55    (Xmin, Ymin, Zmin)5
4.0, 54.0, 15.0     (Xmax, Ymax, Zmax)5
4.0, -10.0, 2.55    (Xmin, Ymin, Zmin)6
24.0, -6.0, 15.0    (Xmax, Ymax, Zmax)6
4.0, 50.0, 2.55     (Xmin, Ymin, Zmin)7
24.0, 54.0, 15.0    (Xmax, Ymax, Zmax)7
0.0, -10.0, 15.0    (Xmin, Ymin, Zmin)8
24.0, 54.0, 20.0    (Xmax, Ymax, Zmax)8
4.0, -6.0, 2.55     (Xmin, Ymin, Zmin)9
24.0, 50.0, 15.0    (Xmax, Ymax, Zmax)9

Note. Any 2 diagonal points can be used to define a cube whose material property is homogeneous.

Surface Boundary Conditions: (Xmin, Ymin, Zmin)i and (Xmax, Ymax, Zmax)i
Two points define the diagonal of a rectangle for the ith surface.

20.0, 20.0, 2.55     (Xmin, Ymin, Zmin)1: patch 1
22.4, 31.2, 2.55     (Xmax, Ymax, Zmax)1
1                    "1" for PEC, "0" for PMC
y                    "y" for more surfaces, "n" for no more surfaces to define
22.4, 26.13, 2.55    (Xmin, Ymin, Zmin)2: patch 2
24.0, 31.2, 2.55     (Xmax, Ymax, Zmax)2
1                    PEC
y                    more surfaces to define
22.7, -10.0, 2.55    (Xmin, Ymin, Zmin)3: input feeding line
24.0, 26.13, 2.55    (Xmax, Ymax, Zmax)3
1                    PEC
y                    more surfaces to define
0.0, -10.0, 0.0      (Xmin, Ymin, Zmin)4: bottom surface
24.0, 54.0, 0.0      (Xmax, Ymax, Zmax)4
1                    PEC
y                    more surfaces to define
0.0, -10.0, 20.0     (Xmin, Ymin, Zmin)5: top surface
24.0, 54.0, 20.0     (Xmax, Ymax, Zmax)5
1                    PEC
y                    more surfaces to define
0.0, -10.0, 0.0      (Xmin, Ymin, Zmin)6: left surface
0.0, 54.0, 20.0      (Xmax, Ymax, Zmax)6
1                    PEC
y                    more surfaces to define
0.0, -10.0, 0.0      (Xmin, Ymin, Zmin)7: front surface
24.0, -10.0, 20.0    (Xmax, Ymax, Zmax)7
1                    PEC
y                    more surfaces to define

Boundary Conditions (Surface): continued

24.0, -10.0, 0.0     (Xmin, Ymin, Zmin)8: right surface
24.0, 54.0, 20.0     (Xmax, Ymax, Zmax)8
0                    PMC
y                    more surfaces to define
0.0, 54.0, 0.0       (Xmin, Ymin, Zmin)9: back surface
24.0, 54.0, 20.0     (Xmax, Ymax, Zmax)9
1                    PEC
y                    more surfaces to define
23.45, -5.0, 1.7     (Xmin, Ymin, Zmin)10: excitation place
23.45, -5.0, 2.55    (Xmax, Ymax, Zmax)10
1                    PEC
n                    NO more surfaces to define

Note 1. PEC - perfect electric conductor; PMC - perfect magnetic conductor.
Note 2. These boundary conditions can be utilized to define symmetric boundary conditions depending on the electromagnetic field distribution.

Excitation Points: define 2 end points (Xmin, Ymin, Zmin)i and (Xmax, Ymax, Zmax)i

1                    number of excitation points
23.45, -5.0, 1.7     (Xmin, Ymin, Zmin)1
23.45, -5.0, 2.55    (Xmax, Ymax, Zmax)1

Note. One has to define a PEC line for the excitation edges. You may need to define more than one line, as in the case of a coplanar waveguide (CPW).

Impedance Load Definition

0                    "0" for no impedance loads, "1" for impedance loads

Note. One can use this option to define lumped circuit elements, such as R, L, and C.

A.3 Sample Output Data Files for Phase I

In this section, the first few lines of each of the output data files are shown for better understanding. In general, the user does not need to modify or look into these mesh output data files. These files are read by the FEM main program as described in Phase II. In terms of storage requirements, these output data files range from several megabytes to 100 megabytes depending on the overall mesh size.

File Name: DAT_EDGE
Description: edge label and x, y, and z components of the edge vectors.

1 0.0000000000000000E+00 1.000000000000000 0.8500000000000000
2 1.000000000000000 1.000000000000000 0.0000000000000000E+00

File Name: DAT_EGLB
Description: material constants, number of nodes and tetrahedrons, and node labels for non-overlapping edges.

1 10.80000 0.0000000000000000E+00 1.000000 0.0000000000000000E+00
2 10.80000 15.00000000000000 1.000000 1.388900000000000
8 1.000000 5.000000000000000 1.000000 5.000000000000000
9 1.000000 0.0000000000000000E+00 1.000000 0.0000000000000000E+00
35397 159120
1 1 1 1891 2
1 2 1 29 2
1 3 1 28 2

File Name: DAT_ELMT
Description: element labels, 4 nodes for each tetrahedron, and material index.

1 1 1891 29 28 2
2 29 1891 1865 1892 2
3 1 2 29 1865 2
4 1 1891 1865 1864 2
5 1 1891 1865 29 2
6 29 1865 1893 1892 2
7 1865 3 1893 1866 2
8 29 1865 3 2 2
9 29 3 1893 30 2
10 29 3 1893 1865 2

File Name: DAT_ESUR
Description: surface edge definitions.

31996 6420 6421 1
32151 6420 6447 1
32004 6421 6422 1
32156 6421 6447 1
32158 6421 6448 1
32160 6421 6449 1
32328 6450 6477 1

File Name: DAT_EXCI
Description: excitation position definition.

28045

File Name: DAT_GRID
Description: surface edge definitions for later visualization.

1 20.000000 20.000000 2.5500000 20.800000 20.000000 2.5500000
1 20.000000 20.000000 2.5500000 20.000000 20.766250 2.5500000

File Name: DAT_GRUP
Description: node labels on each surface definition.

PEC NODE ON SURFACE 1
1 1 2 3 4 5 60
6420 6421 6422 6423 6447

File Name: DAT_NODE
Description: node coordinate table.

1 0.00000000E+00 -10.000000 0.00000000E+00
2 1.0000000 -10.000000 0.00000000E+00
3 2.0000000 -10.000000 0.00000000E+00
4 3.0000000 -10.000000 0.00000000E+00

File Name: DAT_NUMB
Description: total number of surfaces, edges, and edges on surfaces.

9 201436 21181

A.4 User Manual for Phase II - FEM Main Program

In this section, a brief description of the input file for Phase II, the FEM main program, is given. This input file contains parameters for geometry selection and some other miscellaneous data, as described below. Also, the operating frequency range and the output data file names for each frequency point are required. For electromagnetic field sampling in the solution domain, the user has to provide an appropriate cross-section and the coordinates of two diagonal points of the circuit cross-section. Finally, the user can select an appropriate iterative solver convergence criterion to specify the residual norm. In most cases, a tolerance of 10^-3 to 10^-5 will be enough.

Problem Definition

1      Which problem will be solved?
       [1] PEC/PMC/artificial absorber
       [2] 1st-order absorbing boundary condition
       [3] 2nd-order absorbing boundary condition
1      Which excitation mechanism (only for 2-port networks)?
       [1] Even
       [2] Odd
       Note: For a 1-port network, you can choose either 1 or 2.

Miscellaneous Parameters: these parameters provide optional information for discretization

0.05   Global mesh size
0.5    Length of the structure
7.5    Approximate effective dielectric constant

Note: These parameters do not affect the accuracy or performance of the (parallel) FEM code.

Operating Frequency Range and Output File Names

3.80                   Start frequency (fstart)
4.20                   Stop frequency (fstop)
0.02                   Frequency step (Δf)
ant-array-3.80ghz.m    Output data file names, one per frequency point
ant-array-3.82ghz.m
ant-array-3.84ghz.m
ant-array-3.86ghz.m
ant-array-3.88ghz.m
ant-array-3.90ghz.m
ant-array-3.92ghz.m
ant-array-3.94ghz.m
ant-array-3.96ghz.m
ant-array-3.98ghz.m
ant-array-4.00ghz.m
ant-array-4.02ghz.m
ant-array-4.04ghz.m
ant-array-4.06ghz.m
ant-array-4.08ghz.m
ant-array-4.10ghz.m
ant-array-4.12ghz.m
ant-array-4.14ghz.m
ant-array-4.16ghz.m
ant-array-4.18ghz.m
ant-array-4.20ghz.m

Note 1. The number of frequency points is determined by (fstop - fstart)/Δf + 1 (21 in this example).
Note 2. For the task parallelization scheme, the best linear scalability is achieved when the number of CPUs equals the number of frequency points.
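To make Note 2 concrete, the sketch below shows one way a frequency loop can be distributed over MPI processes in the SPMD style used throughout this report. It is an illustrative round-robin assignment, not the actual fem-main.f source (which is Fortran); the frequency values mirror the sample input above, and solve_fem stands in for the real solver.

  #include <mpi.h>
  #include <cstdio>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);
      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      const double fstart = 3.80, fstop = 4.20, df = 0.02; /* GHz, sample input */
      const int npts = (int)((fstop - fstart) / df + 1.5);  /* 21 frequency points */

      /* Round-robin assignment: point i goes to process i % size.
         With size == npts every process solves exactly one frequency,
         which is the best-scalability case of Note 2. */
      for (int i = rank; i < npts; i += size) {
          double f = fstart + i * df;
          /* solve_fem(f) would assemble and solve the FEM system here
             (hypothetical routine standing in for the Fortran solver). */
          std::printf("process %d: solving f = %.2f GHz\n", rank, f);
      }

      MPI_Finalize();
      return 0;
  }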

Electromagnetic Field Sampling

2               Which cross-section do you want to cut?
                [1] X-Z plane (constant-y plane)
                [2] X-Y plane (constant-z plane)
                [3] Y-Z plane (constant-x plane)
0.010, 23.99    Xmin, Xmax
-9.98, 55.99    Ymin, Ymax
2.56            Constant z

Note. If you choose [1], you need to provide (Xmin, Xmax) first, (Zmin, Zmax) second, and the constant y. If you choose [3], you need to provide (Ymin, Ymax) first, (Zmin, Zmax) second, and the constant x.

Resistive Material

2               Resistive material? [1] Yes, [2] No

Iterative Solver (BiCG or QMR) Convergence Criterion

1.0e-4          Tolerance (residual) < 10^-4

B Benchmarking

In this section, some benchmarking examples of our parallel electromagnetic solvers are illustrated. As shown in Fig. 6, a 2-element patch array antenna has been simulated using different numbers of CPUs, from 1 to 25. The direct output of the simulator provides electromagnetic field data, as illustrated in Fig. 7. From these data, one can easily extract important circuit parameters, such as the S-parameters shown in Fig. 8.

Figure 6: 2-element patch array antenna with corporate feeding network.

Figure 7: Electromagnetic field distributions for the 2-element patch array antenna with corporate feeding network.

[Plot: scattering parameters in dB versus frequency, 22-36 GHz.]

Figure 8: Scattering parameters for the 2-element patch array antenna with corporate feeding network.

In this particular example, we have computed the electromagnetic field distribution at 25 different frequency points using our task-parallelized solver, and we also present speedup curves for various numbers of CPUs (Fig. 9). As shown in the figure, we have achieved linear scalability on a distributed-memory machine using the MPI (Message Passing Interface) standard. It should be pointed out that with our task parallelization strategy the best performance is achieved when there are p = M/2^n processors, where M is the total number of frequency points to be computed and n = {0, 1, 2, 3, ...}. When M is not a multiple of 2, p = int(M/2^n) + mod(M, M/2^n). The scalability of a parallelized code can be studied by examining the load balancing between the different processors. As shown in Fig. 10, the best performance is achieved when we have good load balancing between the CPUs. For example, when one needs to obtain data at 40 frequency points, the best scalability is achieved when 40 CPUs are available. However, if there are only 30 CPUs available, the use of 30 or 20 CPUs will not make a big difference. As a last example, Fig. 11 shows the coupling between two radiating antenna elements in an array environment. For this study, conventional and micromachined silicon substrates are used, as depicted in the figure. Apart from the coupling issues, this example demonstrates the capability of our parallel electromagnetic codes to solve and visualize very complicated electromagnetic field distributions. This will be greatly useful to RF circuit and antenna designers and even to engineers in the DoD community.
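The 40-point example can be checked with a simple makespan count (a worked number added here for clarity): with p CPUs and M equal-cost frequency points, the computation takes about ceil(M/p) sequential passes. For M = 40, ceil(40/40) = 1 pass, while ceil(40/30) = ceil(40/20) = 2 passes, so 30 CPUs finish no sooner than 20 CPUs (ten of them simply idle during the second pass), whereas 40 CPUs halve the turnaround time.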

[Plot: speedup versus number of CPUs (0 to 35) for two problem sizes, M = 130,573 and M = 183,816, compared against the linear (ideal) curve.]

Figure 9: Speedup curves for two different problem sizes. M is the number of unknowns to be solved.

[Six panels of bar charts: computation load per processor label for each CPU count.]

Figure 10: Load balancing characteristics for different numbers of CPUs (N): (a) N = 5, (b) N = 10, (c) N = 15, (d) N = 16, (e) N = 20, (f) N = 25.

Figure 11: Coupling between two adjacent radiating elements: (a) antenna on conventional substrate; (b) antenna on micromachined substrate.

C User Manual for Hybrid (FEM + MoM) Method

C.1 Procedure

In this section, a brief description of the hybrid method is given. As can be seen in Fig. 12, the FEM and MoM solutions are combined to yield the circuit parameters or the electromagnetic field distribution.

[Flow chart: Start -> FEM Routine (FEM input file) -> impedance matrix and EM data from FEM; MoM Routine (MoM input file) -> impedance matrix from MoM; both feed the Hybrid Method Routine -> input impedance and scattering parameters -> Finish.]

Figure 12: Flow chart of the hybrid method solution process.

C.2 File Flow Chart

The hybrid method requires several different input files for the FEM and MoM solution processes, as described in Fig. 13. Some of the output files are used in the next stage; others contain the final solution.

[File flow: hybrid_fem.dat and hybrid_fem_input.dat -> (FEM) hybrid_fem.c -> current.*.dat, yps.*.dat; zmnin.dat and hybridMoM.dat -> (MoM) hybridMoM_yb.c, hybridMoM_zmnb.c, hybridMoM_tmb.c -> yb.tem, zmnb.tem, tmb.tem; (Hybrid Method) hybrid.c -> current.dat, sum.dat -> Finish.]

Figure 13: File flow chart of the hybrid method solution process.

C.3 File Description

Here are descriptions of the source code, input, and output files for the FEM, MoM, and hybrid method.

FEM files
hybrid_fem.dat          FEM mesh input file.
hybrid_fem_input.dat    FEM input file.
hybrid_fem.c            FEM program.
current.*.dat           Output file: current information on the microstrip for each excitation. File format: Re(Ex), Im(Ex), Re(Ey), Im(Ey), Re(Ez), Im(Ez).
yps.*.tem               Output file: impedance matrix component for each excitation.

Note. * = 0, 1, 2, ..., numbered according to each excitation.

MoM files
zmnin.dat               MoM input file.
hybridMoM.dat           MoM input file.
hybridMoM_yb.c          MoM program for the slot.
hybridMoM_zmnb.c        MoM program for the patch.
hybridMoM_tmb.c         MoM program for the interaction between slot and patch.
yb.tem                  Output file: impedance matrix for the slot.
zmnb.tem                Output file: impedance matrix for the patch.
tmb.tem                 Output file: impedance matrix for the interaction between slot and patch.

Hybrid method files
hybrid.c                Hybrid program.
current.dat             Output file: current distribution on the input microstrip, which is the final solution of the problem. The input impedance and scattering parameters are calculated from this information.
sum.dat                 Output file: current profile on the slot and patch. File format: Map is the magnetic basis function profile on the slot; Ib is the current basis function profile on the patch.
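The roles of these files suggest the block structure typical of aperture-coupled hybrid formulations. As a hedged sketch (inferred from the file descriptions above and not stated explicitly in this report, so the exact signs and normalization may differ), the slot coefficients Map and patch coefficients Ib are obtained from a coupled system of the form

    [ [Yb] + [Yfem]   [T]  ] [ Map ]   [ Vexc ]
    [ [T]^t           [Zb] ] [ Ib  ] = [  0   ]

where [Yfem] is assembled from yps.*.tem (the FEM side of the aperture), [Yb] from yb.tem, [Zb] from zmnb.tem, and [T] from tmb.tem; hybrid.c would solve this system to produce current.dat and sum.dat.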

C.4 Compilation and Execution

The hybrid method involves 3 important steps: FEM, Method of Moments (MoM), and the hybrid simulation which combines the FEM and MoM results. All the programs are written in the C++ language, which enables dynamic memory allocation and, as a result, memory savings. The programs are also parallelized using MPI to provide portability across different types of parallel computers. As a first step, the FEM program, hybrid_fem.c, is run using its data file, hybrid_fem.dat. Compilation and execution examples are shown as follows.

Compile FEM program
  on a conventional workstation:
    g++ hybrid_fem.c -DDOMAINS -D__SUNC__ -o hybrid_fem.x -O2 -lm
  on IBM SP2:
    mpCC -+ hybrid_fem.c -DDOMAINS -DPARALLEL -D__GNUC__ -o hybrid_fem.x -O3 -lm

If the structure is symmetric, one can take advantage of this by adding the option -DSYMMETRY when compiling, on both the conventional workstation and the IBM SP2. As mentioned before, the FEM part of the hybrid method can be run in two modes on the IBM SP2: LoadLeveler (batch) mode and interactive mode. Here are some examples of executing the FEM code on a conventional workstation and on the IBM SP2.

Executing FEM program
  on a conventional workstation:
    hybrid_fem.x < hybrid_fem_input.dat (FEM input file)

Executing FEM program with 8 CPUs
  on IBM SP2:
    hybrid_fem.x -rmpool 0 -procs 8 < hybrid_fem_input.dat (FEM input file)

Compile MoM programs
  on IBM SP2:
    mpCC -+ hybridMoM_yb.c -DPARALLEL -D__GNUC__ -o hybridMoM_yb.x -O3 -lm
    mpCC -+ hybridMoM_zmnb.c -DPARALLEL -D__GNUC__ -o hybridMoM_zmnb.x -O3 -lm
    mpCC -+ hybridMoM_tmb.c -DPARALLEL -D__GNUC__ -o hybridMoM_tmb.x -O3 -lm

Executing MoM programs with 4 CPUs
  on IBM SP2:
    hybridMoM_yb.x -rmpool 0 -procs 4 < zmnin.dat (MoM input file)
    hybridMoM_zmnb.x -rmpool 0 -procs 4 < zmnin.dat
    hybridMoM_tmb.x -rmpool 0 -procs 4 < zmnin.dat

Compile Hybrid program
  on a conventional workstation:
    g++ hybrid.c -D__SUNC__ -o hybrid.x -lm
  on IBM SP2:
    mpCC -+ hybrid.c -DPARALLEL -D__GNUC__ -o hybrid.x -O3 -lm

Executing Hybrid program
  on a conventional workstation:
    hybrid.x
  on IBM SP2:
    hybrid.x -rmpool 0 -procs 1

As an example, the structure to be solved using the hybrid method is shown below. The FEM is used for the microstrip feeding network, where an analytic Green's function is not known. The method of moments (MoM) is used for the patch antenna, where the Green's function is available. By combining the results from FEM and MoM, we can solve the whole structure; this is the hybrid method.

Figure 14: Single patch antenna with microstrip feeding network.

C.5 User Manual for FEM Mesh

In this section, a brief description of the input files for the FEM simulation is given.

Figure 15: Microstrip feeding network.

Figure 16: Illustration of the total solution space and outer (termination) boundary.

Figure 17: Artificial absorbing layers (AA for free space, AA for substrate) for the simulation of an open-boundary problem.

Figure 18: Meshing schemes (top mesh and bottom mesh) for each layer.

FEM Part

Discretization: file name (hybrid_fem.dat)

2                                                     Number of layers

*** Mesh the bottom section of the first layer ***
5                                                     NS_bx(1)
-39.45e-3 -27.45e-3 -6e-3 6e-3 27.45e-3               x_b1(1), ..., x_b5(1)
39.45e-3                                              Xmax, end position
5 8 16 8 5                                            N_bx1(1), ..., N_bx5(1)
6                                                     NS_by(1)
-202.55e-3 -182.55e-3 -0.55e-3 0.55e-3 20e-3 42e-3    y_b1(1), ..., y_b6(1)
54e-3                                                 Ymax, end position
5 70 2 8 8 5                                          N_by1(1), ..., N_by6(1)
1                                                     NS_bz(1)
0                                                     z_b1(1)
1.6e-3                                                Zmax, end position
2                                                     N_bz1(1)

*** Mesh the top section of the first layer ***
7                                                     NS_tx(1)
-39.45e-3 -27.45e-3 -6e-3 -2.475e-3 2.475e-3 6e-3 27.45e-3    x_t1(1), ..., x_t7(1)
39.45e-3                                              Xmax, end position
5 8 5 6 5 8 5                                         N_tx1(1), ..., N_tx7(1)
6                                                     NS_ty(1)
-202.55e-3 -182.55e-3 -0.55e-3 0.55e-3 20e-3 42e-3    y_t1(1), ..., y_t6(1)
54e-3                                                 Ymax, end position
5 70 2 8 8 5                                          N_ty1(1), ..., N_ty6(1)

Discretization (continued)

*** Mesh the bottom section of the second layer ***
7                                                     NS_bx(2)
-39.45e-3 -27.45e-3 -6e-3 -2.475e-3 2.475e-3 6e-3 27.45e-3    x_b1(2), ..., x_b7(2)
39.45e-3                                              Xmax, end position
5 8 5 6 5 8 5                                         N_bx1(2), ..., N_bx7(2)
6                                                     NS_by(2)
-202.55e-3 -182.55e-3 -0.55e-3 0.55e-3 20e-3 42e-3    y_b1(2), ..., y_b6(2)
54e-3                                                 Ymax, end position
5 70 2 8 8 5                                          N_by1(2), ..., N_by6(2)
2                                                     NS_bz(2)
1.6e-3 22.4e-3                                        z_b1(2), z_b2(2)
34.4e-3                                               Zmax, end position
8 5                                                   N_bz1(2), N_bz2(2)

*** Mesh the top section of the second layer ***
7                                                     NS_tx(2)
-39.45e-3 -27.45e-3 -6e-3 -2.475e-3 2.475e-3 6e-3 27.45e-3    x_t1(2), ..., x_t7(2)
39.45e-3                                              Xmax, end position
5 8 5 6 5 8 5                                         N_tx1(2), ..., N_tx7(2)
6                                                     NS_ty(2)
-202.55e-3 -182.55e-3 -0.55e-3 0.55e-3 20e-3 42e-3    y_t1(2), ..., y_t6(2)
54e-3                                                 Ymax, end position
5 70 2 8 8 5                                          N_ty1(2), ..., N_ty6(2)

Note 1. NS_bx,by,bz(i) / NS_tx,ty,tz(i): number of uniform mesh subsections of the bottom/top of the ith layer in the x, y, z direction.

Note 2. x_bj(i) / x_tj(i) (and similarly for y, z): starting position of the jth uniform mesh subsection in the x, y, z direction for the bottom/top of the ith layer.
Note 3. N_bxj(i) / N_txj(i) (and similarly for y, z): number of divisions of the jth uniform mesh subsection in the x, y, z direction for the bottom/top of the ith layer.
Note 4. Top and bottom sections should be meshed for each layer, and the number of top mesh divisions should equal the number of bottom mesh divisions.
Note 5. For open-boundary problems, such as antenna problems, one needs to design and place artificial absorbing material inside the outermost boundary.
Note 6. When there are symmetries in the problem, one can take advantage of them to reduce the overall problem size.

Surface Boundary Conditions: (Xmin, Xmax, Ymin, Ymax, z)i
Two points define the diagonal of a rectangle for the ith x-y surface.

1                       Number of connected PEC domains
1                       Number of PEC subdomains per connected PEC domain
-2.475e-3 2.475e-3      (Xmin, Xmax)1: microstrip
-200e-3 20e-3           (Ymin, Ymax)1
1.6e-3                  z
1                       Add edge above PEC domain (default)

Material Allocations: (Xmin, Xmax)i, (Ymin, Ymax)i, and (Zmin, Zmax)i
Two points define the diagonal of a cube containing the given material.
Material Property: real and imaginary parts, Re(εri), Im(εri), Re(μri), Im(μri).

11                      number of materials
-27.45e-3 27.45e-3      (Xmin, Xmax)1
-182.55e-3 42e-3        (Ymin, Ymax)1
0 1.6e-3                (Zmin, Zmax)1
2.54, 0.0               Re(εr1), Im(εr1): substrate
1.0, 0.0                Re(μr1), Im(μr1)
-27.45e-3 27.45e-3      (Xmin, Xmax)2
-182.55e-3 42e-3        (Ymin, Ymax)2
1.6e-3 22.4e-3          (Zmin, Zmax)2
1.0, 0.0                Re(εr2), Im(εr2): free space
1.0, 0.0                Re(μr2), Im(μr2)

Material Allocations (continued)

-39.45e-3 39.45e-3      (Xmin, Xmax)3
-202.55e-3 54e-3        (Ymin, Ymax)3
22.4e-3 34.4e-3         (Zmin, Zmax)3
1.0, -5.0               Re(εr3), Im(εr3): AA for free space
1.0, -5.0               Re(μr3), Im(μr3)
-39.45e-3 -27.45e-3     (Xmin, Xmax)4
-202.55e-3 54e-3        (Ymin, Ymax)4
0 1.6e-3                (Zmin, Zmax)4
2.54, -8                Re(εr4), Im(εr4): AA for substrate
1, -3.14961             Re(μr4), Im(μr4)
27.45e-3 39.45e-3       (Xmin, Xmax)5
-202.55e-3 54e-3        (Ymin, Ymax)5
0 1.6e-3                (Zmin, Zmax)5
2.54, -8                Re(εr5), Im(εr5): AA for substrate
1, -3.14961             Re(μr5), Im(μr5)
-27.45e-3 27.45e-3      (Xmin, Xmax)6
-202.55e-3 -182.55e-3   (Ymin, Ymax)6
0 1.6e-3                (Zmin, Zmax)6
2.54, -8                Re(εr6), Im(εr6): AA for substrate
1, -3.14961             Re(μr6), Im(μr6)
-27.45e-3 27.45e-3      (Xmin, Xmax)7
42e-3 54e-3             (Ymin, Ymax)7
0 1.6e-3                (Zmin, Zmax)7
2.54, -8                Re(εr7), Im(εr7): AA for substrate
1, -3.14961             Re(μr7), Im(μr7)
-39.45e-3 -27.45e-3     (Xmin, Xmax)8
-202.55e-3 54e-3        (Ymin, Ymax)8
1.6e-3 22.4e-3          (Zmin, Zmax)8
1, -5                   Re(εr8), Im(εr8): AA for free space
1, -5                   Re(μr8), Im(μr8)
27.45e-3 39.45e-3       (Xmin, Xmax)9
-202.55e-3 54e-3        (Ymin, Ymax)9
1.6e-3 22.4e-3          (Zmin, Zmax)9
1, -5                   Re(εr9), Im(εr9): AA for free space

Material Allocations (continued)

1, -5                   Re(μr9), Im(μr9)
-27.45e-3 27.45e-3      (Xmin, Xmax)10
-202.55e-3 -182.55e-3   (Ymin, Ymax)10
1.6e-3 22.4e-3          (Zmin, Zmax)10
1, -5                   Re(εr10), Im(εr10): AA for free space
1, -5                   Re(μr10), Im(μr10)
-27.45e-3 27.45e-3      (Xmin, Xmax)11
42e-3 54e-3             (Ymin, Ymax)11
1.6e-3 22.4e-3          (Zmin, Zmax)11
1, -5                   Re(εr11), Im(εr11): AA for free space
1, -5                   Re(μr11), Im(μr11)

Note 1. Any 2 diagonal points can be used to define a cube whose material property is homogeneous.
Note 2. Artificial absorbers (AA) for free space and for any arbitrary material have to be chosen by the user. The formulas can be found in Dr. Yook's Ph.D. thesis.

Field Interpolation: define 2 end points (Xmin, Ymin, Zmin)i and (Xmax, Ymax, Zmax)i

70                          Number of interpolation points
0.001e-3 -162e-3 0.801e-3   (Xmin, Ymin, Zmin)1
0.001e-3 19.9e-3 0.801e-3   (Xmax, Ymax, Zmax)1

Excitation Points: define the source as a rectangular patch: (Xcenter, Ycenter)i and (Δx, Δy)i

0 -177.35e-3            (Xcenter, Ycenter)1: center of the source
6e-3 5.2e-3             (Δx, Δy)1: width of the source

FEM input file: file name (hybrid_fem_input.dat)

1e-3                    Tolerance of the integration routine
1e-5                    Tolerance of the iterative solver
50000                   Maximum number of iterations

C.6 User Manual - MoM

In this section, a brief description of the input files for the MoM simulation is given.

Figure 19: Geometry and parameters of the patch antenna (microstrip).

Figure 20: Dimension parameters of the slot (substrate and ground plane).

MoM Part: file name (hybridMoM.dat)

5           Nb: number of basis functions on the patch
3           Nap: number of basis functions on the slot
2.54        εrb: Re(εr) of the microstrip substrate
1.6e-3      db: substrate thickness of the microstrip
40e-3       Lp: patch length
30e-3       Wp: patch width
0 0         Xos, Yos: center position of the patch
12e-3       Lap: slot length
1.1e-3      Wap: slot width
20e-3       Ls: distance between the center of the slot and the microstrip end
4.95e-3     Wf: width of the microstrip
2.225e9     freq: simulation frequency

MoM Part: file name (zmnin.dat)

1e-5        Integration tolerance
1000        Maximum number of divisions for integration
1           Integration parameter: default
150         Truncate the integration range at 150 k0 in place of infinity
3           Use 3 λg for the input traveling wave on the microstrip
5           Integration parameter: default

This appendix has been excerpted from Juiching Cheng's Ph.D. dissertation, Chapter 6.

D Application of Parallel Computing in MoM

D.1 Introduction

Due to advances in the semiconductor industry, the price of high-performance processors has fallen rapidly, making it economical to build parallel computers with large numbers of processors. The need for high-performance computing beyond what a single processor can achieve also drives the development of parallel computing technology. During the past decade we have seen many commercial parallel computers made by IBM, SGI, and Intel, to name a few. Parallel computers are no longer the property of government or university research laboratories, but a commercial reality. However, one obstacle has hindered their widespread adoption: to tap into the power of parallel computing, users have to modify their programs according to the programming interface provided by the vendor, and every time they change parallel computers, they need to modify their programs again. There was no common programming interface among parallel computers until the Message Passing Interface (MPI) was standardized in 1994. Since then, MPI has been widely adopted by the computer industry.

To utilize parallel computers, a programmer first needs to design a parallelization model for the program. The idea of parallelization is to divide a single task into many independent jobs. A simple and easy model to implement is the Single Program Multiple Data (SPMD) model, in which different data sets are processed by the same program on different processors. These data sets have to be independent of each other, otherwise they cannot be processed in parallel. One simple example in electromagnetic problems is the frequency response: in FEM or MoM, the computation at one frequency point does not affect the outcome at other frequencies. Thus, we can simply submit the whole computation process of one frequency point to one processor. However, this may not achieve the best efficiency in terms of processor utilization. For instance, if there are 8 frequency points and 7 processors, while the last frequency point is computed on one processor, the other 6 processors will be idling. To achieve better efficiency, it is necessary to go inside the computation process to seek out other independent operations which can be parallelized.

Many numerical techniques for solving electromagnetic problems have utilized parallel computers. In particular, the FDTD method has attracted a lot of parallelization effort due to its computation-intensive nature; comprehensive studies of its implementation, a detailed description of a parallelized MoM code, and a comparison of linear equation solvers for integral-equation formulations can be found in the literature. In the following sections, two processes in MoM are parallelized using MPI: one is the impedance

matrix fill process; the other is the FWT process. The details of the implementation and measurements of the turnaround time are included.

D.2 Parallelization Model of MoM Matrix Fill Process

The formulation of MoM always leads to a matrix equation of the form

    A x = B,    (1)

where A is a two-dimensional matrix, B is a known column vector, and x is the unknown column vector, which we solve for by direct inversion of A or by an iterative linear equation solver. The elements of the matrix A are often the result of the numerical integration of two basis functions with an integration kernel. The following equation is an example in one-dimensional space:

    a_ij = ∫∫ f_i(x) g(x, x') f_j(x') dx dx',

where a_ij is the element of the matrix at the i-th row and j-th column, f_i(x) and f_j(x') are the basis functions, and g(x, x') is the kernel. Depending on the form of, and the distance between, the basis functions, the time needed to perform the numerical integration of one element may vary greatly. To fully utilize the processors of a parallel computer, a dynamic job allocation scheme is necessary. As shown in Fig. 21, one processor is reserved as a dispatcher to send jobs to the workers and receive their results. The other processors serve as workers performing the numerical computation. Four types of messages are passed between the dispatcher and the workers:

FREE: sent by a worker to inform the dispatcher that it is ready to accept a new job;
RESULT: sent by a worker to the dispatcher, containing the result of a numerical integration;
TERMINATION: sent by the dispatcher to inform a worker that all jobs have been sent;
JOB: sent by the dispatcher, containing the information of a job.

The following are the procedures performed by the dispatcher:

1. create a list of the unique elements,
2. wait for messages from the workers,
3. if the message is a RESULT message, store the result; otherwise the message is a FREE message and a JOB message is sent to the worker,
4. repeat steps 2 and 3 until all jobs are sent and returned,
5. send a TERMINATION message to inform each worker to quit.
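As a concrete illustration of these message types, a minimal, hedged sketch of the worker side of the protocol is given below; the dispatcher's side appears in the code excerpt of Appendix E.1. The mpidata buffer and TAG_REQUEST mirror that excerpt, while TAG_RESULT, TAG_TERMINATION, the i/j/value fields, and the integrate() routine are hypothetical stand-ins for the actual numerical integration.

  /* Worker loop for the dynamic job-allocation scheme (sketch). */
  void worker()
  {
      mpidata mdata;              /* common message buffer, as in Appendix E.1 */
      MPI_Status status;
      for (;;) {
          /* tell the dispatcher (rank 0) that we are free */
          MPI_Send(&mdata, sizeof(mdata), MPI_BYTE, 0, TAG_REQUEST,
                   MPI_COMM_WORLD);
          /* wait for a JOB or TERMINATION message */
          MPI_Recv(&mdata, sizeof(mdata), MPI_BYTE, 0, MPI_ANY_TAG,
                   MPI_COMM_WORLD, &status);
          if (status.MPI_TAG == TAG_TERMINATION)
              break;              /* all jobs have been sent; quit */
          /* perform the numerical integration for element (i, j) */
          mdata.value = integrate(mdata.i, mdata.j);
          MPI_Send(&mdata, sizeof(mdata), MPI_BYTE, 0, TAG_RESULT,
                   MPI_COMM_WORLD);
      }
  }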

For the worker, the procedures are:

1. send a FREE message to the dispatcher,
2. wait for messages from the dispatcher,
3. if it is a JOB message, perform the numerical integration and return the result; otherwise it is a TERMINATION message and the worker quits,
4. go to step 2.

[Diagram: the dispatcher sends commands to the workers, receives data from them, and stores the data in the impedance matrix; each worker receives commands from the dispatcher, performs the computation, and sends the result back. Solid lines: command paths; dashed lines: data paths.]

Figure 21: A dynamic allocation scheme for the matrix fill process.

Fig. 22 shows the flow chart of the above process. The dashed lines in the figure indicate the communication paths. Two kinds of processes, the dispatcher and the workers, are executed concurrently. First, the dispatcher process, after it starts and reads in the necessary parameters, sorts out the unique matrix elements and builds a job list. Then it enters a loop, waiting for messages from the workers. The loop terminates only after all the jobs have been sent and all the data have been collected from the workers. Inside the loop, the dispatcher checks the messages from the workers. If a message contains data, the dispatcher stores it and then waits for new messages. If a message is a FREE message, the dispatcher checks whether all the jobs have been sent. If not, the dispatcher sends a job to the worker which sent the message. If all jobs have been sent, the dispatcher sends a TERMINATION message to the worker.

[Flow chart: the dispatcher starts, inputs parameters, finds the unique matrix elements, builds a job list, and loops on messages from the workers, storing data, dispatching jobs, or terminating workers until all jobs are done and no workers remain active, then outputs the results; each worker starts, inputs parameters, sends a request message to the dispatcher, and loops processing jobs and sending data until it receives a termination message.]

Figure 22: The flow chart of the dynamic allocation scheme for the matrix fill process.

If all workers have been terminated, the dispatcher exits the loop, outputs the results, and quits. If there are still active workers, the dispatcher continues waiting for new messages. The worker process, after it starts and reads in the necessary parameters, sends a FREE message to the dispatcher to indicate that it is ready to accept a new job. Then it waits for a reply from the dispatcher. If the reply is a TERMINATION message, the worker quits. Otherwise, the message contains a job and the worker starts to execute it. After the worker has finished the job, the result is sent back to the dispatcher via a RESULT message. After sending the RESULT message, the worker sends a FREE message to the dispatcher and again waits for a reply. A short program section is included in Appendix E.1 to show the details of the implementation.

The turnaround time was measured as a function of the number of processors. Here the "turnaround time" means the elapsed time from the start of the program until completion. The measurement is performed on a 48-processor IBM SP2 machine. The theoretical result is curve-fitted to the experimental result according to the following formula:

    time(i) = P + Comm x n + Comp/(i - 1),  for i > 1,    (2)

where time(i) is the turnaround time, P is the preprocessing time, Comm is the communication time per message, Comp is the total computation time of the jobs, i is the number of processors, and n is the number of transmitted messages. Note that Comm includes the communication latency and the actual transmission time of one message. Since the message size is fixed, Comm is a constant. Let N be the total number of jobs. From the flow chart, there are N + i - 1 FREE, N JOB, N RESULT, and i - 1 TERMINATION messages. Thus, the total number of transmitted messages is n = 3N + 2(i - 1). When N is much larger than i, the total communication time is proportional to N. From Eq. (2), it is seen that when the number of processors increases, only the third term in the formula decreases. This means that the preprocessing time plus the total communication time governs the limit. The formula also indicates that the curve of the turnaround time should approximate a hyperbola when N is much larger than i. The measured results show very good agreement with the above theoretical model. The improvement in turnaround time is larger when the number of processors is small. However, when the number of processors is large, only a small improvement is achieved, because the preprocessing and communication time dominate. Note that the turnaround time with 2 processors is higher than with 1 processor. This is due to the higher preprocessing time of the parallelized program and the communication cost. Also note that, due to the programming model, one processor is always reserved for coordinating the other processors.

When the communication cost is small, this processor will mostly be idle. It is possible to also utilize the idle time on this processor to perform numerical integration, at the cost of more sophisticated programming.

D.3 Parallelization Model of FWT

As discussed earlier, the FWT process transforms the matrix derived in MoM from one set of basis functions to another. The transformation equations are repeated below for convenience:

    φ'_i = Σ_j p_ij φ_j,    (3)

    [Z'] = [P]^t [Z] [P].    (4)

In (3), φ_i and φ'_i are the members of the bases Φ and Φ' respectively. In (4), [Z] and [Z'] are the MoM matrices evaluated in the bases Φ and Φ' respectively, and p_ij in (3) is the i-th row, j-th column element of the matrix [P] in (4). Let z'_ij be the i-th row, j-th column element of [Z'] and N be the rank of the basis. From the above two equations, we can derive

    z'_ij = Σ_{m=1}^{N} Σ_{n=1}^{N} p_mi z_mn p_nj,    (5)

which is the form of the single operation in the FWT. The FWT process fits the SPMD model, which is characterized by repeating the same operation, for example (5), on different data. Compared to the previous matrix-fill process, each operation of the FWT is shorter and of fixed cost; however, the total number of operations is much larger. If the same algorithm as before were used, the communication overhead would be too great. Instead, a fixed job-allocation scheme is devised. At the beginning, each processor computes the job list and divides it equally by the total number of processors. Each processor can determine its share of the jobs from its "rank"(1). The only communication takes place when each processor sends its final computed results to one designated processor for post-processing. The following are the procedures performed by each processor:

1. create a list of the operations,
2. divide the list equally into n shares, where n is the total number of processors,
3. use the rank of the processor as an index to determine which share belongs to itself,
4. perform the FWT,

5. send the computed result to the processor with rank equal to 0.

Figure 25 shows the flow chart of the above process. A short program section is included in Appendix E.2 to reveal the details of the implementation. A measurement of turn-around time vs. number of processors is shown in Figure 26. The definition of "turn-around time" is the same as in the previous section, and the measurement is performed on the same parallel computer. In the figure, the theoretical result is curve-fitted to the experimental result according to the following formula

time(i) = P + Tran x N + L x i + Comp x N / i,   (6)

where time(i) is the turn-around time, P the preprocessing time, Tran the actual transmission time of one unit of data, L the communication latency, Comp the computation time per matrix element, i the number of processors, and N the number of unique matrix elements. Since each processor sends its computed results only once, the total latency time equals L x i. Compared to the other terms in Eq. (6), the total latency is negligible, since the latency of the SP2 is only 40 µs. As in the previous section, when the number of processors increases only the last term in the formula decreases, while the first and second terms remain the same. This means that the preprocessing time and the total communication time set the limit. The formula also indicates that the curve of the turn-around time should be a hyperbola. The measured results show very good agreement with the above theoretical model. The improvement in turn-around time is larger when the number of processors is small; when the number of processors is large, only diminishing returns are obtained because the preprocessing and communication time dominate.

D.4 Conclusion

In this section, the matrix-fill process and the FWT in MoM have been parallelized. The measured improvement is consistent with the theoretical model of the turn-around time. Another bottleneck not addressed in this chapter is the process of solving the resulting matrix equation. When the size of the matrix is large, direct inversion becomes impossible, and iterative solvers must be used. An iterative solver is sequential by nature: each iteration relies on the result of the last iteration. In this dissertation, a bi-conjugate gradient [?] iterative solver is used when the matrix size is large. In this solver, the most time-consuming part is the multiplication of a matrix with a column vector. Every iteration produces an improved guess of the solution vector. Thus, the communication overhead of

at least one solution vector in every iteration cannot be avoided. When the dimension of the matrix is large, this communication overhead outweighs the improvement gained by parallelizing the linear solver, since the communication speed is much slower than the computation speed. This problem still needs to be addressed if further improvement in the speed of MoM is necessary.
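To make the tradeoff concrete, the toy model below compares the per-iteration computation and communication costs. The problem sizes are those quoted elsewhere in this report, but the machine constants are purely illustrative, not measured SP2 values:

#include <cstdio>

/* Rough per-iteration time for a sparse matrix-vector product with nnz
   nonzeros spread over i processors, when one solution vector of n entries
   must also be communicated each iteration. */
double iteration_time(long nnz, long n, int i, double t_flop, double t_word)
{
    return (double)nnz / i * t_flop + (double)n * t_word;
}

int main()
{
    const long n = 17280, nnz = 1756240;   /* sizes quoted in this report */
    const double t_flop = 10e-9, t_word = 1e-6; /* assumed machine constants */
    for (int i = 1; i <= 32; i *= 2)
        std::printf("i=%2d  t=%.4f s  speedup=%.2f\n", i,
                    iteration_time(nnz, n, i, t_flop, t_word),
                    iteration_time(nnz, n, 1, t_flop, t_word) /
                    iteration_time(nnz, n, i, t_flop, t_word));
    return 0;
}

With these numbers the speedup saturates near 2 regardless of i: the fixed vector exchange per iteration dominates, which is the behavior described above.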

This Appendix has been excerpted from Juiching Cheng's Ph.D. Dissertation, Appendix C.

E The Implementation of Parallel Programming in C++

The following two sections show the implementation of the dynamic and static job allocation schemes in the C++ language. The definitions of the MPI functions can be found in [?, ?].

E.1 Dynamic job allocation

int rank, process_size;
mpidata mdata; /* common buffer for MPI messages */

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &process_size);

if (rank == 0)
{ /* Dispatcher process */
  int send_count = 0, receive_count = 0, actual_count = 0;
  int *process = new int[process_size - 1]; /* status of each worker */
  for (int i = 0; i < process_size - 1; i++) process[i] = 0;
  intpair *ip;   /* pointer to the job list (count jobs, filled in elsewhere) */
  int finished = 0;
  do {
    MPI_Status status;
    /* receive a message from any worker */
    MPI_Recv(&mdata, sizeof(mdata), MPI_BYTE, MPI_ANY_SOURCE,
             MPI_ANY_TAG, MPI_COMM_WORLD, &status);
    int sent = 0;
    switch (status.MPI_TAG)  /* check the message type */
    {
    case TAG_REQUEST:        /* request for a job */
      if (finished)
      { /* no jobs left: inform the worker to end */
        MPI_Send(&mdata, sizeof(mdata), MPI_BYTE, status.MPI_SOURCE,

                 TAG_END, MPI_COMM_WORLD);
        process[status.MPI_SOURCE - 1] = 1; /* mark worker as terminated */
        sent = 1;
      };
      while (!finished && !sent)
      { /* send the next job to the worker */
        mdata.job = ip[send_count];
        mdata.i = send_count++;
        MPI_Send(&mdata, sizeof(mdata), MPI_BYTE, status.MPI_SOURCE,
                 TAG_DATA, MPI_COMM_WORLD);
        actual_count++;
        if (send_count >= count) finished = 1; /* count: total number of jobs */
        sent = 1;
      };
      break;
    case TAG_DATA:           /* returned results */
      ip[mdata.i].v = mdata.v;
      receive_count++;
      break;
    };
  } while (receive_count < actual_count || finished == 0);
  /* inform the remaining workers to end */
  for (int i = 1; i < process_size; i++)
  {
    MPI_Status status;
    if (process[i - 1] == 0)
    {
      MPI_Recv(&mdata, sizeof(mdata), MPI_BYTE, i, TAG_REQUEST,
               MPI_COMM_WORLD, &status);
      MPI_Send(&mdata, sizeof(mdata), MPI_BYTE, i, TAG_END,
               MPI_COMM_WORLD);
    };
  };
  /* output the results */

}
else
{ /* Worker process */
  MPI_Status status;
  while (MPI_Send(&mdata, sizeof(mdata), MPI_BYTE, 0, TAG_REQUEST,
                  MPI_COMM_WORLD),
         MPI_Recv(&mdata, sizeof(mdata), MPI_BYTE, 0, MPI_ANY_TAG,
                  MPI_COMM_WORLD, &status),
         status.MPI_TAG == TAG_DATA)
  {
    /* compute the job stored in mdata.job */
    /* store the result in mdata.v */
    /* return the result to the dispatcher */
    MPI_Send(&mdata, sizeof(mdata), MPI_BYTE, 0, TAG_DATA,
             MPI_COMM_WORLD);
  };
};
MPI_Finalize();

E.2 Static Job Allocation

int size;          /* total number of jobs */
int rank, process_size;
mpidata mdata;     /* common buffer for MPI messages */

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &process_size);

/* compute the size of the job share for each processor */
int count = size / process_size;
int remain = size % process_size; /* remaining jobs */
int *recvcount, *displs, offset;

if (rank == 0)
{ /* processor 0 is also in charge of collecting the results,
     so it needs to prepare the place to store them */

  recvcount = new int[process_size];
  displs = new int[process_size];
  for (int i = 0; i < process_size; i++)
  {
    if (i < remain)
    {
      displs[i] = i * (count + 1) * sizeof(intpair);
      recvcount[i] = (count + 1) * sizeof(intpair);
    }
    else
    {
      displs[i] = (i * count + remain) * sizeof(intpair);
      recvcount[i] = count * sizeof(intpair);
    };
  };
};
/* adjustment for the remaining jobs */
if (rank < remain)
{
  count++;
  offset = count * rank;
}
else
{
  offset = count * rank + remain;
};
/* compute the jobs */
/* return the results; z.ia is the intpair array holding the jobs and results */
MPI_Gatherv(&(z.ia[offset]), count * sizeof(intpair), MPI_BYTE, &(z.ia[0]),
            recvcount, displs, MPI_BYTE, 0, MPI_COMM_WORLD);
if (rank == 0)
  /* processor 0 outputs the total results */
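The fragments above assume surrounding declarations that the excerpt omits. A minimal sketch of what must exist for them to compile is given below; the struct layouts and tag values are assumptions for illustration, not the dissertation's actual definitions:

#include <mpi.h>

/* Tags used by the dispatcher/worker protocol (values are arbitrary). */
enum { TAG_REQUEST = 1, TAG_DATA = 2, TAG_END = 3 };

/* One job/result pair; 'v' holds the computed matrix element. */
struct intpair { int i, j; double v; };

/* Common message buffer: job description, job index, and result. */
struct mpidata { intpair job; int i; double v; };

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    /* ... dispatcher/worker code of Section E.1 or E.2 goes here ... */
    MPI_Finalize();
    return 0;
}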

[Plot: impedance matrix fill, 240 matrix elements; turn-around time (s) versus number of CPUs (1 to 30); measured and theory curves.]
Figure 23: The measured turn-around time of the impedance matrix fill process. The theoretical result is plotted by curve-fitting the measured data to Eq. (2).

[Diagram: parallelizing model of the fast wavelet transform; data paths from the impedance matrix to the transformed matrix. The matrix is evenly divided by N; each share is sent to a CPU; each CPU performs the FWT; the results are gathered from each CPU.]
Figure 24: A static allocation scheme for the FWT process.

[Flow chart: each worker (0, 1, ..., N) starts, inputs parameters, builds the job list, and processes its own jobs; the data are then gathered and the results output.]
Figure 25: The flow chart of a static allocation scheme for the FWT process.

[Plot: fast wavelet transform, 3568 matrix elements; turn-around time versus number of CPUs (2 to 16); measured and theory curves.]
Figure 26: The measured turn-around time of the FWT process. The theoretical result is plotted by curve-fitting the measured data to Eq. (6).

F Publications

References

[1] J.-G. Yook, J. Cheng, D. Chen and Linda Katehi, "Parallel Electromagnetic Solvers for High Frequency Antenna and Circuit Design," DoD HPCMP User Group Meeting, San Diego, CA, 1997.

[2] J.-G. Yook, J. Cheng and Linda Katehi, "Parallel Electromagnetic Solvers for High Frequency Antenna and Circuit Design," DoD Mission Success Story from High Performance Computing.

[3] J.-G. Yook, "Electromagnetic Modeling of High-speed High-frequency Interconnects," Ph.D. Dissertation, The University of Michigan, Ann Arbor, Michigan, 1996.

[4] J. Cheng, "Theoretical Modeling of MMICs Using Wavelets, Parallel Computing and a Hybrid MoM/FEM Method," Ph.D. Dissertation, The University of Michigan, Ann Arbor, Michigan, 1998.

[5] G. Ponchak, D.-H. Chen, J.-G. Yook and Linda Katehi, "Characterization of Plated Via Hole Fences for Isolation between Stripline Circuits in LTCC Packages," IEEE MTT-S 1998 Symposium (accepted).

[6] E. Yasan, J.-G. Yook and Linda Katehi, "A Generalized Method for Including Two Port Networks in Microwave Circuits Using the Finite Element Method," ACES 97 (presented).

[7] E. Yasan and Linda Katehi, "An FEM Based Method on Full Wave Characterization of Distributed Circuits Including Linear and Nonlinear Elements," IEEE AP-S/URSI 1998 Symposium (accepted).

"DoD Mission Success Story from High Performance Computing: 1998 ed." PARALLEL ELECTROMAGNETIC SOLVERS FOR HIGH FREQUENCY ANTENNA AND CIRCUIT DESIGN J.-G. Yook, J. Cheng and L. Katehi Radiation Laboratory Department of Electrical Engineering and Computer Science The University of Michigan, Ann Arbor, MI 48109-2122, U. S. A. Tel: 313-764-0608, Fax: 313-647-2106 DoD partner: Barry S. Perlman, Army Communications- Electronics Commands. Computer Resource: IBM SP2 [MHPCC] Research Objective: The purpose of this effort is development of linearly scalable parallel electromagnetic solvers for large array antenna and complicated microwave/millimeter-wave circuits which have not been successfully characterized by other means. Fast and accurate characterization of these electromagnetic structures are of critical importance for a successful DoD mission in digitizing the battlefield and developing secure communication systems. Methodology: For parallel electromagnetic tools, we have developed wavelet-based method of moment (MoM) and tetrahedral finite element method (FEM) codes for arbitrary three-dimensional structures, and a message passing interface (MPI) has been utilized for parallelization on distributed memory parallel computers. Furthermore, fast wavelet transform and thresholding techniques are applied to achieve accelerated computation of the MoM solution. The linear scalable performance of the parallelized MoM and FEM is also investigated for real time design and optimization of the large array antenna and high frequency circuits. Results: Preliminary results of the parallel impedance matrix filling and fast wavelet transform for MoM and task parallelization strategy for FEM show linearly scalable performance improvement. This truly scalable parallel MoM and FEM code performs successfully due the minimal communication overhead between the computing nodes and is not subject to the bandwidth of the network or switches. The number of unknowns of the problem ranges from 15,000 to 20,000 in the MoM case and from 150,000 to 200,000 in the FEM case. Figures below shows three-dimensional electric field distributions at far region from the 6x6 rectangular patch array antenna which are computed on parallel computers very efficiently. Significance: Design and optimization of large electromagnetic structures, such as array antennas and high density circuits and packages, are of critical importance in the development of the advanced sensors and tactical communication systems. Through our research, we have demonstrated linearly scalable parallel strategies for various numerical electromagnetic codes and solved various class of problems which were not possible or extremely time consuming with conventional method and computers. This project has been funded by NASA Lewis Research Center, Army Research Office and the DoD High Performance Computing Program.,I * * Figure: Far field radiation patterns for 6x6 rectangular patch array antenna. Authors: Jong-Gwan Yook, Juiching Cheng and Linda P..B. Katehi in the University of Michigan. Tel: 313-764-0608.

"Contributions to DoD Mission Success from High Performance Computing: 1996 ed." CO-FIRED CERAMIC PACKAGE FOR A KA-BAND MIMIC PHASE SHIFTER Jong-Gwan Yook and Linda P. B. Katehi Radiation Laboratory Department of Electrical Engineering and Computer Science The University of Michigan, Ann Arbor, MI 48109-2122, U. S. A. Tel: 313-764-0502, Fax: 313-747-2122 Email:yookjong@engin.umich.edu DoD partner: J. Harvey, Army Research Office, RTP, NC. Computer Resource: IBM SP2 [MHPCC] Research Objective: The purpose of this effort is the full-wave characterization of Microwave/Millimeter wave packages for discrete MMIC components. While these packages find widespread use in MMICs, their modeling has been limited by the large size and complexity of the problem leading into long design cycles, low yield and high cost. Methodology: To model the MMIC package we have parallelized a full-wave finite element (FEM) code on the distributed memory parallel computer, IBM SP2, in Maui High performance Computing Center (MHPCC). Among the many parallelization schemes, we have selected "task parallelization" based on the considerations that FEM is a frequency domain technique and each frequency point needs to be computed independently in order to obtain the complete package response over the desired frequency spectrum. Results: Preliminary results of the "task parallelization" strategy show linearly scalable performance improvement. This truly scalable parallel FEM code performs successfully due the minimal communication overhead between the computing nodes and it is not subject to the bandwidth of the network or switches. The number of unknowns of the problem ranges from 150,000 to 200,000 and the global matrix equations are simultaneously built and solved at all frequency points within an hour. The same code would normally require 40 hours of CPU time if it were run in serial computer. Significance: Hermetic packages are frequently utilized in microwave integrated circuits to provide protection from a hostile environment, reduced electromagnetic interference and minimal radiation leakage. A variety of packages based on a cofired ceramic technology have developed for use in high frequency application by DoD. The presented effort has allowed, for the first time, the complete and accurate high-frequency characterization of a MMIC package and the understanding of the electromagnetic interference it produces in the form of parasitic resonances. This package has been developed by Hughes at Torrance under a contract from NASA Lewis Research Center in Cleveland, OH. Electnc Field Distribution in Package at 29.5 GHz -30 so Li ~-40 Va Hole. -50 RF In/Out -8o Ground Plane Figure: The 18 - 40 GHz MMIC package for phase shifter chip and electric field distribution in the package. Authors: Jong-Gwan Yook and Linda P.B. Katehi in the University of Michigan. Tel: 313-747-1796.

"Contributions to DoD Mission Success from High Performance Computing: 1996 ed." W-BAND COPLANAR WAVEGUIDE PROBE STRUCTURES Emmanouil M. Tentzeris, Jong-Gwan Yook, and Linda P. B. Katehi Radiation Laboratory Department of Electrical Engineering and Computer Science The University of Michigan, Ann Arbor, MI 48109-2122, U. S. A. Telephone: 313-764-0502 Fax: 313-747-2122 Email: yookjong@engin.umich.edu DoD partner: J. Harvey, Army Research Office, RTP, NC. Computer Resource: Convex SPP-1 [NCCOSC San Diego] Research Objective: Significant attention is being devoted now-a-days to the analysis and design of W-band waveguide structures which are used either to probe the modal propagation inside the waveguides or as active-element mounts. This effort aims at the full-wave characterization of these highly complex waveguide probe structures. Methodology: We have applied the Finite Difference Time Domain (FDTD) technique in the calculation of the S-parameters of a W-band coplanar waveguide-to-rectangulat waveguide probe. The FDTD code has been parallelized on the Convex SPP-1 [NCCOSC] using "task parallelization". Results: A parametrical study has been performed in an effort to maximize the coupling between the probe and the waveguide over the widest possible frequency range. A waveguide absorber based on an analytical Green's -function has been developed to minimize the reflections over a wide band of frequencies. As an application, a membrane coplanar waveguide-to-rectangular waveguide probe has been characterized and designed for less than -20 dB return loss from 75 GHz to 115 GHz. The mesh size is 480x 477x 52 and it would require about 250 hours of CPU time if the FDTD code were to run on an HP 735 Workstation, and 72 hours in a Cray C90 supercomputer. The parallelized version of the same program on Convex SPP-1 requires about 31 hours CPU time for completion. Significance: The presented effort has allowed, for the first time, the complete and accurate highfrequency characterization and design of a W-band back-to-back waveguide probe configuration. Until recently, W-band waveguide probes for diode mounting or waveguide probing have only been designed experimentally through very long and costly design cycles. I E I at t - 6000 ' dt on const-Z plane 0 s0 5-10 10 -20 15 -30 10 1-40 FE5 -50 -60 35 OUT-80 -90 -100 100 120 1 [,v ]ni 160 180 Figure: The membrane cplanar waveguide probe and the electric field distribution for the structure. Authors: Emmanouil M. Tentzeris, Jong-Gwan Yook, and Linda P.B. Katehi in the University of Michigan. Tel: 313-747-1796.

DoD HPC 1998 User Group Meeting (June 1-5, 1998, at Rice University, Houston, TX)

Parallel Electromagnetic Solvers for High Frequency Antenna and Circuit Design

J.-G. Yook, D. Chen, J. Cheng and L. Katehi
Radiation Laboratory, Department of Electrical Engineering and Computer Science
The University of Michigan, Ann Arbor, MI 48109-2122
Tel: 734-764-0608, Fax: 734-647-2106

B. Perlman
US Army Communications-Electronics Command (CECOM)
Command/Control & System Integration Directorate
AMSEL-RD-C2, Ft. Monmouth, N.J. 07703
Tel: 723-427-4883

Introduction

The purpose of this effort is the development of linearly scalable parallel electromagnetic solvers for large scale array antennas and complicated microwave/millimeter-wave circuits which have not been successfully characterized by other means. Fast and accurate characterization of these electromagnetic structures is of critical importance for a successful DoD mission in digitizing the battlefield and developing secure communication systems.

Methodology

For parallel electromagnetic tools, we have developed wavelet-based method of moments (MoM) and tetrahedral finite element method (FEM) codes for arbitrary three-dimensional structures, and a message passing interface (MPI) has been utilized for parallelization on distributed memory parallel computers. Furthermore, fast wavelet transform and thresholding techniques are applied to achieve accelerated computation of the MoM solution. The linearly scalable performance of the parallelized MoM and FEM is also investigated for real time design and optimization of large array antennas and high frequency circuits.

Results

Preliminary results of the parallel impedance matrix filling and fast wavelet transform for MoM and the task parallelization strategy for FEM show linearly scalable performance improvement. This truly scalable parallel MoM and FEM code performs successfully due to the minimal communication overhead between the computing nodes and is not subject to the bandwidth of the network or switches. The number of unknowns of the problem ranges from 15,000 to 20,000 in the MoM case and from 150,000 to 200,000 in the FEM case. The figure below shows three-dimensional electric field distributions in the far region of the 6x6 rectangular patch array antenna, which are computed on parallel computers very efficiently.

Significance

Design and optimization of large electromagnetic structures, such as array antennas and high density circuits and packages, are of critical importance in the development of advanced sensors and tactical communication systems. Through our research, we have demonstrated linearly scalable parallel strategies for various numerical electromagnetic codes and solved various classes of problems which were not possible or extremely time consuming with conventional methods and computers. This project has been funded by NASA Lewis Research Center, the Army Research Office and the DoD High Performance Computing Program.

Figure 1: Far field radiation patterns for 6x6 rectangular patch array antenna

PARALLEL ELECTROMAGNETIC SOLVERS FOR HIGH FREQUENCY ANTENNA/CIRCUIT DESIGN

Barry S. Perlman
Army Research Laboratory
Sensors and Electronics Directorate
Fort Monmouth, NJ 07703

Jong-Gwan Yook*, Juiching Cheng, Donghoon Chen, and Linda P. B. Katehi
Radiation Laboratory
The University of Michigan, Ann Arbor, MI 48109
yookjong@engin.umich.edu

Presented/Published in DoD HPCMP User Group Meeting, San Diego, CA, 1997

Abstract

In this paper, several different parallelization strategies for the method of moments (MoM) and the finite element method (FEM) are implemented on a distributed memory parallel computer (IBM SP2) having 48 CPUs. For the portability of the parallel codes on various platforms, the standard message passing paradigm (MPI) has been adopted. Having in mind that a major bottleneck of parallelized codes is the interprocessor communication overhead, data independence is fully exploited for efficient parallelization, and as a result near linear scalability has been achieved.

I. Introduction

High-frequency radar systems for communication, detection and surveillance incorporate MMIC modules which are characterized by high density and substantial geometrical complexity. In most cases these modules are packaged for electrical and/or environmental protection and the resulting 3D structures exhibit a considerable complexity that impedes design and influences electrical performance due to unwanted parasitics. The design of these complex three-dimensional monolithic circuits (MMIC modules), with hundreds of vias and lines interconnecting a large number of active and passive circuit components to a rather large number of radiating elements, has been a problem of critical importance to DoD. The design and characterization of these circuits requires numerical approaches which can fully characterize the excited electromagnetic fields. Among the various techniques, the method of moments has demonstrated superiority in solving radiation problems but it has been hindered by long computational times and is limited, for this reason, to small circuits not exceeding more than a few components and interconnecting lines. In this paper, a novel technique based on the multiresolution expansion and the hybrid combination of the integral equation technique and the finite element method is parallelized to solve very large problems.

II. Parallelization Strategies

In this section, several different parallelization strategies for MoM and FEM have been presented. In the former case, parallel MoM impedance matrix generation and parallel fast wavelet transform (FWT) are

implemented, while the parallelization of an iterative linear equation solver and a frequency parallelization scheme are tested for the latter case.

[Figure 1 diagrams: (a) parallel impedance matrix fill: a dispatcher sends commands to the workers, receives data from the workers, and stores the data in the impedance matrix; each worker receives commands from the dispatcher, performs the computation, and sends the result to the dispatcher. (b) parallel fast wavelet transform: the matrix is evenly divided by N, each share is sent to a CPU, each CPU performs the FWT, and the results are gathered from each CPU into the transformed matrix.]
Figure 1: Parallelization schemes for MoM: (a) Parallel impedance matrix filling. (b) Parallel fast wavelet transform.

A. Parallel MoM Matrix Element Generation

In the method of moments, each matrix element is produced by the integration of a source function and a test function with an integration kernel. When a closed form representation is not available, numerical integration, which is time-consuming, must be used. However, the procedure of the integration of each matrix element fits quite well to the Single Instruction Multiple Data (SIMD) parallelization model, enabling the utilization of parallel computers to speed up the process. Because the computation time of each matrix element can vary over a great range, an algorithm that dynamically assigns jobs to each CPU is devised, as illustrated in Figure 1(a). In this algorithm, one CPU is reserved as a dispatcher to send jobs to and gather results from the other CPUs. Jobs are sent whenever there are free CPUs. To achieve a performance gain, the communication time between the CPUs must be much smaller than the computation time, which is true for most impedance matrix calculations.

B. Fast Wavelet Transform

In order to take advantage of multiresolution analysis [1], the MoM matrix is further transformed by the FWT to its equivalent representation in wavelet bases. The FWT also fits the SIMD model, but compared to the generation of the MoM matrix elements, the computation time for each element is much smaller and the total number of iterations is much higher. If the same algorithm as for the MoM matrix element generation were used, the communication overhead would be too high. Instead, another algorithm is devised to reduce the communication overhead, as shown in Figure 1(b). In this algorithm, jobs are evenly divided among the CPUs. Only one communication is needed, when each CPU completes its entire job list and sends the results back to the designated CPU for further processing.

C. Task Parallelization

The generation of the matrix entries in the finite element method is in general a much easier and faster process compared to that of the MoM [2, 3]. Under most practical situations, FEM matrix generation takes less than 10% of the total FEM solution time. As a result, the parallelization of the linear equation solver may result in an efficient method (see Fig. 2(a)). On the other hand, based on the property of the frequency domain finite element method, each frequency point can be parallelized with minimal inter-process communication overhead, which might lead to a scalable FEM code, as shown in Figure 2(b). In the following section, the performance of the two different parallelization schemes for FEM is investigated and the speedup curves are presented.

[Figure 2 flow charts: (a) geometry definition; pre-processing: mesh generation; frequency sweep f1 ... fn; element matrix computation; matrix assembly Ax = b; linear equation solver: BiCG (parallelized); S-parameters output. (b) the same flow with the frequency sweep itself parallelized.]
Figure 2: Parallelization schemes for FEM: (a) Parallelization of the linear equation solver (BiCG). (b) Task parallelization.

III. Numerical Results

A. Parallelized MoM Code for Antenna Applications

In order to demonstrate the performance gain of the parallelization models discussed in the previous section, the impedance matrix of a 6 x 6 microstrip patch array antenna is calculated and solved by taking the FWT. The individual patch size, the dielectric constant of the substrate, and the thickness of the substrate are 59.5 mm x 50 mm, 2.62, and 1.59 mm, respectively. The rooftop function is used as the basis function of the field expansion. The number of cells is 16 x 16 on each patch and the resulting total number of unknowns is 17280 for the whole array. The radiation patterns for Eθ and Eφ are calculated at 1.536 GHz. Figures 3(a) and (b) show the measured computation time vs. number of CPUs and the theoretical result for both the impedance matrix filling process and the FWT using the models presented in Section II. As can be seen from the figures, the measured result matches the theoretical prediction very well. The radiation patterns of the antenna are computed as shown in Figure 4 for the Eθ and Eφ plane patterns. The thresholding is taken after

the FWT and the resulting sparse matrix is solved with an iterative linear equation solver, such as the biconjugate gradient method.

[Figure 3 plots: (a) Impedance matrix fill, 240 matrix elements: computation time vs. number of CPUs (1 to 30), measured and theory, with Time(i) = P + Comm x N + Comp x N/(i - 1) for i > 1, where P is the preprocessing time, Comm the communication time, Comp the computation time, i the number of CPUs, and N the number of matrix elements. Note: the turn-around time with 2 CPUs is higher than with 1 CPU because of the higher preprocessing time of the parallelized program and the communication cost; one CPU is always reserved for coordinating the other CPUs, which leaves only one CPU to perform the computation when only two CPUs are used. (b) Fast wavelet transform, 3568 matrix elements: turn-around time vs. number of CPUs (2 to 18), measured and theory, with Time(i) = O + Comm x N + Comp x N/i, where O is the overhead, which includes the extra function calls for measuring time and parallel processing.]
Figure 3: Parallelization performances for MoM: (a) Parallel impedance matrix generation, (b) Parallel fast wavelet transform.

B. Parallelized FEM Code for Packaging Applications

To measure the performance of the parallelized Bi-CG routine on a distributed memory machine, two problems of different size are solved. As shown in Figure 5(a), the parallelization efficiencies are minimal, even though the larger problem size shows a slightly better result. With this observation, the task parallelization scheme is tested as shown in Figure 5(b) and reveals excellent results. The poor performance of the parallel Bi-CG routine is mainly due to heavy communication overhead. On the other hand, the linearly scalable performance of the task parallelization scheme comes from the minimal data dependency between the CPUs. As an application of the parallelized FEM code, we have characterized two typical microwave/millimeter-wave packages and the internal electric field distributions are calculated as presented in Figure 6.

IV. Conclusions

In this paper, the single instruction multiple data (SIMD) model has been fully implemented for the finite element method and the method of moments using the MPI standard on a distributed memory machine. Due to the data independence between the processes, linearly scalable parallel codes are achieved. In particular, the matrix element generation in MoM has been parallelized in view of its heavy computational burden. In the FEM code, the frequency parallelization scheme has been implemented after the observation that the performance of the parallelized linear equation solver was poor.
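The task (frequency) parallelization of Figure 2(b) amounts to assigning independent frequency points to processors and combining the results once at the end. A minimal C++/MPI sketch follows, with a stub standing in for the actual FEM solve (all names and sizes are illustrative):

#include <mpi.h>
#include <cstdio>
#include <vector>

/* Stand-in for one complete FEM solution (meshing, assembly, BiCG) at a
   single frequency; a stub here so the sketch is self-contained. */
double solve_fem_at(double freq_hz) { return freq_hz * 1.0e-9; }

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int nf = 32;                     /* frequency points in the sweep */
    std::vector<double> partial(nf, 0.0), result(nf, 0.0);
    for (int k = rank; k < nf; k += size)  /* round-robin assignment */
        partial[k] = solve_fem_at(10.0e9 + k * 1.0e9);

    /* One reduction at the very end: no inter-frequency communication,
       which is why the scheme scales almost linearly. */
    MPI_Reduce(partial.data(), result.data(), nf, MPI_DOUBLE, MPI_SUM, 0,
               MPI_COMM_WORLD);
    if (rank == 0)
        for (int k = 0; k < nf; ++k)
            std::printf("f=%2d GHz  result=%g\n", 10 + k, result[k]);

    MPI_Finalize();
    return 0;
}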

[Figure 4 plots: Eθ (φ = 0) and Eφ (φ = 90) patterns for the 6x6 patch array, comparing threshold = 10^-5 with the ideal array, versus θ from 0 to 90 degrees.]
Figure 4: Radiation patterns for 6 x 6 rectangular patch array: (a) Eθ pattern, (b) Eφ pattern

Acknowledgment

This work was partially supported by ARL under contract QK8820. Also, the authors would like to thank the Maui High Performance Computing Center (MHPCC) and the University of Michigan Center for Parallel Computing (CPC), which is partially funded by NSF grant CDA-92-14296 and the Ford Motor Company, for the use of their computational facilities.

References

[1] Kazem F. Sabet, Jui-Ching Cheng and Linda P. B. Katehi, "Efficient Wavelet-based Modeling of Printed Circuit Antennas and Arrays," submitted to IEEE Transactions on Antennas and Propagation.

[2] Jong-Gwan Yook and Linda Katehi, "Characterization of MIMICs Packages Using A Parallelized 3D FEM Code," 1996 ACES Symposium, Monterey, CA, pp. 954-961.

[3] Jong-Gwan Yook and Linda Katehi, "Parallelization of the Finite Difference Time Domain Code on Shared Memory Machine," PIERS 1996, Innsbruck, Austria.

[Figure 5 plots: (a) turn-around time vs. number of processors (0 to 20) for the parallel Bi-CG routine with M = 38,109 and M = 76,709; (b) speedup vs. number of processors (0 to 35) for the frequency parallelization scheme.]
Figure 5: Parallelization performances for FEM ("M" is the number of unknowns to be solved): (a) Parallel Bi-CG routine, (b) Frequency parallelization scheme.

[Figure 6 images: two packaging structures and their internal electric field distributions.]
Figure 6: Two different packaging structures and electric field distributions inside of the structures.

IEEE MTT-S 1998 (Accepted)

Characterization of Plated Via Hole Fences for Isolation Between Stripline Circuits in LTCC Packages

George E. Ponchak¹, Donghoon Chen², Jong-Gwan Yook², and Linda P. B. Katehi²
1. NASA Lewis Research Center, 21000 Brookpark Rd., MS 54/5, Cleveland, OH 44135
2. University of Michigan, 3240 EECS Building, Ann Arbor, MI 48109-2122

Abstract

Reduced coupling between adjacent striplines in LTCC packages is commonly accomplished by walls made of plated via holes. In this paper, a 3D FEM electromagnetic simulation of stripline with filled via fences on both sides is presented. It is shown that the radiation loss of the stripline and the coupling between striplines increase if the fence is placed too close to the stripline.

Introduction

Smaller packages with more circuitry are required for the advanced RF systems being built today and planned for tomorrow. These new packages house a variety of high density circuits for data processing, biasing, and memory in addition to the RF circuits. While the size is being reduced and the complexity increased, the cost of the package must also decrease. To accomplish these contradictory goals, new packaging technologies are required. Low Temperature Cofired Ceramics (LTCC) is an ideal packaging technology. The material has a moderate dielectric constant, 4 < εr < 8, which permits wider strips and thus lower conductor loss than circuits on Si, GaAs, or Alumina. In addition, the loss tangent is on the order of 0.002 at 10 GHz, which yields an acceptably low dielectric loss. LTCC packages are comprised of many 0.1-0.15 mm thick ceramic layers with transmission lines on each layer. This increases the level of integration by allowing bias, digital routing, and RF transmission lines and interconnects to be built up in three dimensions. The multilayer character of these circuits leads to RF transmission lines which are not of microstrip type but of stripline type. Even with the high levels of integration LTCC offers, designers are required to decrease the spacing between striplines to meet the new size and cost requirements. In doing so, coupling between adjacent striplines severely limits the overall packaged circuit performance. As a solution to this problem, package designers have included filled via fences adjacent to the stripline to confine the electromagnetic fields around the center strip, partition the package and reduce electrical couplings [1-2]. In this paper, we utilize a 3D Finite Element Method (FEM) [3] to evaluate stripline structures in the vicinity of filled via fences. It is shown for the first time that filled via fences placed too close to the strip increase the radiation loss of the stripline and the coupling between adjacent striplines. Design guidelines are given to minimize these parasitic effects which degrade overall performance.

[Fig. 1 sketch: stripline with metal-filled via fences; labeled dimensions include Δh, Dv, W, and h; PEC ground planes.]
Fig. 1 (a) Stripline with a continuous filled via fence on both sides. (b) Cross section.

Results

The cross section of the stripline structure is shown in Figure 1. Throughout this paper: the relative permittivity of the LTCC material is 5.3; the via diameter, Dv, is 0.25 mm; the width, W, of the stripline is 0.19 mm to yield a 50 Ohm characteristic impedance; and the thickness, h, is 0.25 mm. The distance between the stripline and the vias is kept greater than 0.25 mm and the via-to-via spacing ranges between 1.3 and 5 times the via diameter. These values are all standard for typical LTCC

processes and thus make these results directly applicable to the packages being designed today.

The first structure investigated is a single stripline with a via fence on both sides of the strip, as shown in Figure 1. This is a typical structure to ensure suppression of the parasitic parallel-plate waveguide modes. The transmission lines are first analyzed over the frequency range of 10 to 40 GHz. It is found that the scattering parameters do not vary by more than a few percent over this frequency range as long as 2ΔS + W < λd/2, where λd is the wavelength in the dielectric. This condition is necessary to avoid the excitation of dielectric-filled rectangular waveguide modes. Based on this finding, the results are presented as a function of the line geometry with the average of the scattering parameters plotted.

[Fig. 2 plots: |S11|, |S21|, and radiation loss of stripline with a continuous via fence versus ΔS/h for Δh/h = 5.2 and Δh/h = 1.3.]
Fig. 2 (a) Return loss and radiation loss, 1 - |S11|² - |S21|², of stripline with a continuous via fence as a function of Δh/h and ΔS/h. (b) Insertion loss.

[Fig. 3 image: electric field distribution.]
Fig. 3 Electric field distribution of stripline with a continuous via fence at 25 GHz with ΔS/h = 1 and Δh/h = 5.2.

[Fig. 4 image: electric field distribution.]
Fig. 4 Electric field distribution of stripline with a continuous via fence at 25 GHz with ΔS/h = 1 and Δh/h = 1.3.

Figures 2 (a) and (b) show the magnitude of S11, S21, and the radiation loss, 1 - |S11|² - |S21|², as a function of the transmission line geometry. It is seen that the reflection coefficient, the insertion loss, and the radiation loss all decrease as the separation between the strip and the via holes, ΔS/h, increases. Reduction of the parasitic effects levels off as ΔS/h approaches 2. Furthermore, it is seen that the insertion loss decreases as the distance between the via holes decreases. Field distributions for striplines with large and narrow via hole spacing are shown in Figures 3 and 4 respectively. These plots indicate that the closely spaced via fence completely confines the electromagnetic field while the wider spaced via holes permit a significant leakage of power. Note that ΔS/h = 1 in both of these field plots.

In cases where we need to locally improve isolation, the continuous via fence may be replaced by a short via fence as shown in Figure 5 [2]. Figures 6 (a) and (b) show the simulated performance for this line. It is seen that the scattering parameters and radiation loss vary with ΔS/h in the same way as for the continuous via fence. However, upon comparison with Figures 2 and 3, there is a 3 dB degradation in the line characteristics. Figure 7 shows that this is due to

[Fig. 5 sketch: stripline with a short section of metal-filled via fence; labels include PEC, substrate, and ground plane.]
Fig. 5 Stripline with a short section of a filled via fence on both sides.

[Fig. 6 plots: |S11|, |S21|, and radiation loss of stripline with a short via fence versus ΔS/h, Δh/h = 1.3.]
Fig. 6 (a) Return loss and radiation loss, 1 - |S11|² - |S21|², of stripline with a short via fence as a function of ΔS/h, Δh/h = 1.3. (b) Insertion loss of stripline with a continuous via fence as a function of ΔS/h, Δh/h = 1.3.

[Fig. 7 image: electric field distribution.]
Fig. 7 Electric field distribution of stripline with a short via fence at 25 GHz with ΔS/h = 1 and Δh/h = 1.3.

[Fig. 8 sketch: adjacent striplines separated by a via fence.]
Fig. 8 Adjacent striplines separated by via fences.

the perturbation of the fields upon reaching the via fence on either side of the strip. The electromagnetic fields that escape the confinement of the via fences and result in radiation loss cause coupling between adjacent striplines, as shown in Figure 8. Figure 9 shows the magnitude of the total electric fields for two striplines separated by a via fence with ΔS/h = 0.615 and Δh/h = 0.764 (h = 0.7366 mm). The stripline on the left has an imposed voltage on it while the stripline on the right is coupled. Separation of the field into its components shows that the electric field in the plane of the strips is approximately 10 dB greater than the electric field normal to the strips near the via fence. The large magnitude of this parallel field component is due to the small via size and large value of Δh/h, resulting in -10 dB coupling between the strips. Although perfect conductors were assumed in this study, the current distribution on the strip with a close via fence is expected to cause an increase in conductor loss compared to the conventional stripline case.

Conclusions

It has been shown that when filled via fences are placed too close to the stripline, radiation loss increases, resulting in coupling between adjacent striplines. To minimize the radiation loss, the via fence should be kept at least four times the ground plane separation away from the strip, or ΔS/h > 2. Furthermore, the use of via fences locally, only where isolation is required, can actually degrade the stripline characteristics by causing a large perturbation in the electric fields that normally extend on either side of the strip.

Acknowledgement

The authors would like to thank the University of Michigan Center for Parallel Computing (CPC), which is partially funded by NSF grant CDA-92-14296 and Ford Motor Company, for the use of computational facilities.

References

1. J. W. Gipprich, "EM modeling of via wall structures for high isolation stripline," Digest for Workshop WWFC, Interconnect and Packaging Technologies and Issues, IEEE Int. Microwave Symp., San Diego, CA, pp. 78-114, June 1994.
2. J. Gipprich, C. Wilson, and D. Stevens, "Multilayered 3-D packaging issues," Digest for Workshop WFC, Interconnects and Packaging for RF Wireless Communications Systems, IEEE Int. Microwave Symp., Denver, CO, June 8-13, 1997.
3. J.-G. Yook, "Electromagnetic modeling of high-speed high-frequency Interconnects," Ph.D. dissertation, University of Michigan, Ann Arbor, MI, 1996.

[Fig. 9 image: electric field distribution.]
Fig. 9 Electric field distribution of two striplines separated by a continuous fence (f = 10 GHz, ΔS/h = 0.45, and Δh/h = 0.5).
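The design guidelines in the conclusions above can be condensed into a small check. In the sketch below, the thresholds ΔS/h > 2 and 2ΔS + W < λd/2 come from the paper, while the numeric values in main() are illustrative assumptions only:

#include <cstdio>

/* Via-fence placement check per the paper's guidelines:
   - keep the fence at least twice the ground-plane separation from the
     strip (dS/h > 2), and
   - keep 2*dS + W below half a wavelength in the dielectric to avoid
     exciting dielectric-filled rectangular waveguide modes. */
bool fence_ok(double dS, double h, double W, double lambda_d)
{
    const bool far_enough   = (dS / h) > 2.0;
    const bool below_cutoff = (2.0 * dS + W) < lambda_d / 2.0;
    return far_enough && below_cutoff;
}

int main()
{
    /* Assumed example: h = 0.25 mm, W = 0.19 mm, 25 GHz, er = 5.3. */
    const double h = 0.25e-3, W = 0.19e-3;
    const double lambda_d = 3.0e8 / (25e9 * 2.302); /* sqrt(5.3) ~ 2.302 */
    std::printf("dS = 0.6 mm: %s\n",
                fence_ok(0.6e-3, h, W, lambda_d) ? "ok" : "violates guideline");
    return 0;
}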

"presented at ACES 97" A GENERALIZED METHOD FOR INCLUDING TWO PORT NETWORKS IN MICROWAVE CIRCUITS USING THE FINITE ELEMENT METHOD Eray Yasan, Jong-Gwan Yook and Linda P. Katehi Radiation Laboratory Dept. of Electrical Engineering and Computer Sci. The University of Michigan, Ann Arbor, MI 48109-2122, U.S.A. Tel: 313-764-0502, Fax: 313-747-2106, Email: eray@engin.umich.edu Abstract This paper discusses the methodologies required to introduce basic passive elements into an FEM modelled microstrip circuit. Three different approaches have been investigated and are presented herein. The first two methods use the basic principles of circuit theory while the third employs the so called zeroth order approximation or volume current method. One of the EM-circuit method uses the voltage-current relations applying at the passive element(s) in the circuit. This is done through the impedance of that element and requires a modification of the functional of the FEM equations. The second method uses the S-parameters of the passive element(s) to provide the required circuit relation. By using these methods the effect of the presence of the passive element(s) on the microstrip circuit can be observed. 1 Introduction The Finite Element Method (FEM) has been established during the past ten years as an accurate and versatile frequency-domain technique for passive circuit problems. Despite the capability of the technique to treat a broad variety of circuit geometries, it has been limited to only distributed elements that are mostly passive and linear. To be able to make the technique applicable to more complete microwave and millimeter wave circuits, its capability needs to be extended to handle passive and active elements. Some studies about these issues have been made by using different techniques such as FDTD, TLM and FEM [1]-[3]. The techniques presented herein are divided into circuit element methods and volume current methods. In the following, a short description of each technique is given.

1.1 Circuit Method # 1

This method uses circuit concepts such as the current-voltage (I-V) relations through the nodes connected to the element [6]. From circuit theory the following relations for the resistor, inductor and capacitor hold respectively:

V = RI,   V = jωLI,   I = jωCV.   (1)

In terms of field quantities the voltage and current are expressed in the following way:

V = ∫ E · dl   (2)

I = ∮ H · dl   (3)

where the integral is evaluated along the element for the voltage and around the element for the current flowing through the nodal points. Since tetrahedral based FEM is used throughout the analysis, lumped elements are located along the edges of the tetrahedra.

1.2 Circuit Method # 2

In the second circuit method, the S-parameters of the lumped elements are known and given to the system a priori. In order to use the (I-V) relations, the S-parameters are converted to Z-parameters. Referring to Fig. 1, the following relations hold:

V1 = Z11 I1 + Z12 I2   (4)

V2 = Z21 I1 + Z22 I2   (5)

In the microstrip case V1 is the voltage between the signal line and the ground. If this is a symmetric microwave circuit, then V1 is equal to V2. I1 and I2 are the currents flowing through the conductors towards the element as shown in Fig. 1. In terms of field expressions, V1 and V2 are the line integrals of the electric fields evaluated along the edges from the conductors to the ground. I1 and I2 are the closed line integrals of the magnetic fields around the conductors.

1.3 Volume Current Method

The third method [7] uses the relation

J = σE   (6)

in order to account for the lumped elements' contribution by introducing fictitious conductivity regions in the volume. The conductivity σ in these regions is given by

σ = l / (Z_L s)   (7)

where Z_L, s and l represent the load impedance in ohms (Ω), the cross-sectional area, and the length of the load, respectively.
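As a quick numeric illustration of Eq. (7), the sketch below computes the fictitious conductivity for an assumed load and element geometry; the dimensions are examples for the sketch, not values from the paper:

#include <cstdio>

/* Eq. (7): equivalent conductivity of the fictitious load region,
   sigma = l / (Z_L * s).  All values below are assumed. */
int main()
{
    const double ZL = 100.0;     /* load impedance [ohm]            */
    const double l  = 0.4e-3;    /* element (edge) length [m]       */
    const double s  = 0.1e-6;    /* cross-sectional area [m^2]      */
    const double sigma = l / (ZL * s);
    std::printf("sigma = %.3e S/m\n", sigma);  /* 4.000e+01 S/m here */
    return 0;
}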

[Figure 1 sketch: a two-port network [Z] with port voltages V1, V2 and port currents I1, I2.]
Figure 1: Two port network with Z parameters

2 FEM Formulation

Starting with Maxwell's equations, the following wave equation is weighted according to Galerkin's method and discretized:

∇ × ∇ × E − ω²με E = −jωμ J_i   (8)

where J_i is the impressed current source. Discretization is done in a conventional way using tetrahedral elements. The following weak form of Maxwell's equations is obtained:

∫ [(∇ × E) · (∇ × P) − ω²με E · P] dv = jωμ ∮ (H × P) · n̂ ds − jωμ ∫ P · J_i dv.   (9)

Expanding the electric field as a summation of linear basis functions and choosing the weighting function as the same edge basis function give

E = Σ_{i=1}^{6} a_i W_i   (10)

P = W_j,   j = 1, ..., 6   (11)

where a_i are the unknowns to be determined, and the W_i are the first order edge based Whitney functions. For circuit method #1 the first term on the right hand side of the FEM equation (9) is expressed in such a way that the effect of the lumped element is included. The resulting term in the FEM equations becomes

(1/Z_L) ∫ [ê_12 δ(r − r′)] [∫ E · dl] ds   (12)

where ê_12 is the unit vector pointing from node 1 to node 2 along which the element is placed and Z_L is the load impedance. According to circuit method #2, having an additional set of equations from the I-V relations enables us to write another matrix equation having N unknowns, N not necessarily being equal to the number of additional equations. By careful examination, elimination of some of the

unknowns, especially those on the edges along which the element is located, by use of these equations decreases the size of the original FEM matrix. This matrix equation is solved using iterative matrix-solving algorithms. The third method uses the second term on the right hand side of the FEM equation (9) in order to introduce the effect of the lumped element into the equations. The term takes the form

−jωμ ∫ σ E · P dv.   (13)

Similar to the first method, this term only contributes to the diagonal elements of the FEM matrix. The most efficient way to solve these matrix equations is found to be the Pre-Conditioned Conjugate Orthogonal Conjugate Gradient (COCG) Method.

3 Results

A shielded symmetric two port microstrip structure shown in Fig. 2 was examined. Using circuit method #1 and the volume current method, the case of a 100 Ω resistor between the lines was studied. The S-parameters obtained from these two methods are shown in Fig. 3. For S11 the results agreed quite reasonably with the theoretical ones (|S11| = |S22| = -6.02 dB), but for S12 they are degraded in the high frequency region. This might arise because of the circuit discontinuity at the element and also insufficient discretization. For circuit method #2, the same geometry with a PEC wire in place of the resistor is used (Fig. 4). The results are shown in Fig. 5 and are in good agreement with the regular FEM results. A variety of results involving complex arrangements of passive elements will be presented and discussed at the conference.

4 Conclusions

Three different approaches have been proposed and applied to the shielded two port microstrip geometry. These approaches can be applied to similar geometries which include inductors and capacitors as well. Generalization of these approaches might lead us to analyze structures having active lumped elements and to utilize the harmonic balance technique with FEM.

Acknowledgement

This work has been partially supported by the Army Research Lab at Fort Monmouth and the Army Research Office under Contract # DAAH04-96-1-0390.

References

[1] T. Shibata and H. Kimura, "Computer-aided engineering for microwave and millimeter-wave circuits using the FDTD technique of field simulations," Int. Journal of Microwave and

Millimeter-Wave Computer-Aided Engineering, Vol. 3, No. 3, pp. 238-250, 1993.

[2] V. A. Thomas, M. E. Jones, M. Piket-May, A. Taflove and E. Harrigan, "The use of SPICE lumped circuits as sub-grid models for FDTD analysis," IEEE Microwave Guided Wave Letters, vol. 4, pp. 141-143, May 1994.

[3] A. Jahnsen and V. Hansen, "Spectral analysis of multiport microstrip circuit with active and passive lumped elements," EMC, p. 1053, 1990.

[4] R. H. Voelker and R. Lomax, "A Finite Difference Transmission Line Matrix method incorporating a nonlinear device model," IEEE Trans. Microwave Theory Tech., vol. 38, pp. 302-312, March 1990.

[5] K. Guillouard, M.-F. Wong, V. Fouad Hanna, J. Citerne, "A new global Finite Element analysis of microwave circuits including lumped elements," Digest of the 1996 IEEE Microwave Theory Tech. Symposium, vol. 1.

[6] E. Yasan, J.-G. Yook and L. Katehi, "Modeling the interface between fields and devices using the Finite Element Method," Digest of 1996 Progress In Electromagnetic Research Symposium (PIERS).

[7] Jong-Gwan Yook, "Electromagnetic modeling of high-speed high frequency interconnects," PhD Thesis, The University of Michigan, 1996.

[Figure 2 sketch: shielded microstrip circuit.]
Figure 2: Shielded microstrip circuit with Z0 = 50 Ω, εr = 10.5, w = 0.34 mm, R = 100 Ω, h = 0.38 mm.

[Figure 3 plot: S-parameters (dB) versus frequency, 10 to 25 GHz; S11 and S12 from circuit method #1 and from the volume current method.]
Figure 3: S-parameters of the shielded microstrip circuit with a resistor

[Figure 4 sketch: shielded microstrip circuit with PEC walls.]
Figure 4: Shielded microstrip circuit with a PEC wire between the conductors

[Figure 5 plot: S-parameters versus frequency; S11 and S12 from circuit method #2 and from the regular FEM.]
Figure 5: S-parameters of the shielded microstrip circuit with a PEC wire

"accepted for URSI-98" An FEM Based Method on Full Wave Characterization of Distributed Circuits Including Linear and Nonlinear Elements Eray Yasan* and Linda P. Katehi Radiation Laboratory Dept. of Electrical Engineering and Computer Sci. The University of Michigan, Ann Arbor, MI 48109-2122, U.S.A. Tel: 313-764-0502, Fax: 313-747-2106, Email: eray@engin.umich.edu Theoritical characterization of microwave circuits including linear and nonlinear elements has been an interest in many ways using numerical techniques including Finite Element Method (FEM). Since the lenghts of elements connected to the microwave circuit are very short so that Kirchoff's laws can be applied, the conventional scheme to find S parameters by adding extra line lengths is no more applicable. In this case the circuit is treated as a combination of inner (internal) ports where the linear or nonlinear elements are connected and the outer (external) ports where the input and output of the circuit are taken. Work has been done in time domain using "Compression pproach"(J. Kunish,M. Rittweger, S. Heinen and I. Wolff,Proc. 21st European Microwave Conf.,1296-1301,1991). The linear passive part of the circuit is analyzed using the FEM. The ports are excited one at a time while the others are left open to derive the appropriate voltages and currents at all ports. The linear part of the circuit is thus characterized by the impedance matrix (Z-matrix). To reduce the computational cost, circuit symmetries are used and assuming circuit lossless, S parameters can be found using the simple relation ([Z}+[I]) [Z - [I) =1 [S] Following the computation of scattering matrix, a simulation software such as HP EEsof Libra and connecting the linear or nonlinear elements to the inner ports yields the input and output S-parameters of the microwave circuit. As an example a microstrip line with a gap where a lumped element is connected has been analyzed. Simple lumped elements like resistors,capacitors and inductors have been connected and the resulting circuits have been modeled successfully. A diode mixer on a coplanar waveguide (cpw) designed for higher power handling capability has been analyzed using the above method in connection with Harmonic Balance Method (HBM). Results from these studies will be presented and discussed extensively.

CEN4 / ELECTROMAGNETIC SOLVERS FOR HIGH-FREQUENCY CIRCUITS
* Jong-Gwan Yook and Linda P.B. Katehi, The University of Michigan
* Barry Perlman, Army Research Lab
* Jim Harvey, Army Research Office
* Leo DiDomenico, Army Research Lab
* Contributors: Jui-Ching Cheng and Donghoon Chen, The University of Michigan

CEN4 / ELECTROMAGNETIC SOLVERS FOR HIGH-FREQUENCY CIRCUITS
> Goal
  > Real Time Design and Optimization of Large Scale Electromagnetic Problems with Emphasis on Three-Dimensional MMICs/Antennas
> Approach
  > Develop Computationally Efficient Codes by Use of MRA (Multi-Resolution Analysis)
  > FWT (Fast Wavelet Transform) for Accelerated Computations
  > Develop Code Parallelization Schemes for Maximum Speed

CEN4 / ELECTROMAGNETIC SOLVERS FOR HIGH-FREQUENCY CIRCUITS
Relevance to Army/DoD and Impact
> BCIS-DS (Battlefield Combat Identification Systems - Dismounted Soldier)
  o Need for Better Design
  o Need for Faster Turnaround
  o Need for Lower Cost
  o Need for Better System Evaluation

CEN4 / ELECTROMAGNETIC SOLVERS FOR HIGH-FREQUENCY CIRCUITS
Relevance to Army/DoD and Impact (cont.)
> LongBow Point-to-Point Communications
  o Need for Conformal Flat Array Design
  o Mixed Analog/Digital Circuits
  o Need for Single Iteration Designs and Verification

CEN4 / ELECTROMAGNETIC SOLVERS FOR HIGH-FREQUENCY CIRCUITS
Relevance to Army/DoD and Impact (cont.)
> 21st Century Land Warrior - Digitized Battlefield
  o Need for Conformal Small High-Efficiency Antennas for PCS
  o Need for Very High-Density Low Weight Circuit Designs

CEN4 / Goal: Three-Dimensional Integrated Circuits
[Slide diagram: a stacked MMIC module showing radiating elements, coupling slots, a feeding network, and a power/source layer.]

CEN4 / Goal - Size and Memory Budget
* Microstrip Patch Array Layer: 6 x 6 Elements
  o 9,500 unknowns
  o 9,500 x 9,500 matrix = 90,250,000 entries
  o 99% sparsity: 902,500 non-zero elements
* Coupling Slot Layer: 6 x 6 slots
  o 540 unknowns
  o 540 x 540 matrix = 291,600 entries
  o 85% sparsity: 43,740 non-zero elements
* Distribution Layer:
  o 9,000 unknowns
  o 9,000 x 9,000 matrix = 81,000,000 entries
  o 99% sparsity: 810,000 non-zero elements
* Total Budget with Coupling Terms:
  o 19,040 unknowns
  o 362,521,600 matrix entries
  o 1,756,240 non-zero elements

CEN4 / Goal - Size and Memory Budget
* Parallelization Strategies
  o Parallel Computation of Matrix Entries Using FWT: compute 45,000 non-zero elements
  o Parallel Matrix Solver
  o Frequency Parallelization
* Computational Requirements
  o Required Core Memory > 512 Mbyte
  o Required Communication Latency < 10 µsec
  o Required Bandwidth > a few hundred Mbytes/sec
  o Preferred Computers: IBM SP2, Cray T3D, Cray YMP, C90, SGI Powerchallenge

CEN4/ Theory: Hybrid MoM/FEM Method

* Separate Problem Domain into Subdomains
* Replace Green's Function with FEM Solution in Areas where Green's Function is Unknown
* Match Boundary Conditions Between Domains
* Apply Galerkin's Moment Method
* Build Impedance Matrix
* Apply FWT and Thresholding
* Solve the Resulting Sparse Matrix Equation

CEN4/ Theory: Hybrid MoM/FEM Method

[Figure: problem domain split into two regions - where the Green's function is known, it is used directly; where it is unknown, an FEM solution replaces it; boundary conditions are matched at the interface]

CEN4/ Array Layer: Near Matrix-Diagonalization

FWT and Thresholding
* Elements to be computed: 45,105

[Figure: impedance matrix sparsity pattern after FWT and thresholding; sparsity ~99%]

CEN4/ Array Layer: Near Matrix-Diagonalization

Parallelized FWT

[Figure: block diagram of the parallelized fast wavelet transform]
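To make the FWT-and-thresholding step concrete, here is a minimal sketch using the Haar wavelet (our choice for illustration; the report does not specify the wavelet family, and the production code operates on the rows and columns of the full 2-D impedance matrix rather than a single vector):

    // Sketch: one level of a 1-D Haar fast wavelet transform plus
    // thresholding of small coefficients.
    #include <algorithm>
    #include <cmath>
    #include <vector>

    // One Haar level: scaled averages in the front half, scaled
    // differences (details) in the back half. n must be even.
    void haarStep(std::vector<double>& v, int n) {
        const double s = 1.0 / std::sqrt(2.0);
        std::vector<double> tmp(n);
        for (int i = 0; i < n / 2; ++i) {
            tmp[i]         = s * (v[2 * i] + v[2 * i + 1]);  // average
            tmp[n / 2 + i] = s * (v[2 * i] - v[2 * i + 1]);  // detail
        }
        for (int i = 0; i < n; ++i) v[i] = tmp[i];
    }

    // Zero all coefficients below tau * max|v| (relative threshold);
    // returns the number of surviving non-zero coefficients.
    int threshold(std::vector<double>& v, double tau) {
        double vmax = 0.0;
        for (size_t i = 0; i < v.size(); ++i)
            vmax = std::max(vmax, std::fabs(v[i]));
        int kept = 0;
        for (size_t i = 0; i < v.size(); ++i) {
            if (std::fabs(v[i]) < tau * vmax) v[i] = 0.0;
            else ++kept;
        }
        return kept;
    }

With a relative threshold such as 0.001 (the value used on the next slide), the transformed matrix becomes sparse enough for a sparse iterative solver.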

CEN4/ Array Layer: Near Matrix-Diagonalization

Thresholding = 0.001: Sparsity = 99.19%

[Figure: sparsity pattern of the 6 x 6 array impedance matrix, threshold 0.001, sparsity 0.9919; nz = 730,200]

CEN4/ Parallelization Model of Impedance Matrix Fill

[Figure: dispatcher/worker diagram - command path from CPU #0 (dispatcher) to CPUs #1 ... #N (workers), data path back into the impedance matrix]

Dispatcher (CPU #0):
1. Send Commands to Workers
2. Receive Data from Workers
3. Store Data in Impedance Matrix

Worker (CPU #1 ... CPU #N):
1. Receive Commands from Dispatcher
2. Perform Computation
3. Send Result to Dispatcher
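A minimal MPI sketch of the dispatcher/worker model above, written in C++ against the C MPI bindings (the tags, the packed {index, value} message layout, and the computeEntry callback are illustrative assumptions, not the project's actual protocol; real entries would be complex-valued, and the sketch assumes at least as many jobs as workers):

    #include <mpi.h>

    const int TAG_JOB = 1, TAG_RESULT = 2, TAG_STOP = 3;

    // Rank 0: hand out one matrix index at a time, store results as
    // they arrive, and retire each worker when no work remains.
    void dispatcher(int nWorkers, int nElements, double* matrix) {
        int next = 0, active = 0;
        for (int w = 1; w <= nWorkers && next < nElements; ++w) {
            MPI_Send(&next, 1, MPI_INT, w, TAG_JOB, MPI_COMM_WORLD);
            ++next; ++active;
        }
        while (active > 0) {
            double result[2];                    // {index, value}
            MPI_Status st;
            MPI_Recv(result, 2, MPI_DOUBLE, MPI_ANY_SOURCE, TAG_RESULT,
                     MPI_COMM_WORLD, &st);
            matrix[(int)result[0]] = result[1];  // store in impedance matrix
            if (next < nElements) {              // send the next command
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_JOB,
                         MPI_COMM_WORLD);
                ++next;
            } else {                             // no work left: stop worker
                MPI_Send(&next, 0, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                         MPI_COMM_WORLD);
                --active;
            }
        }
    }

    // Ranks 1..N: receive an index, compute that entry, send it back.
    void worker(double (*computeEntry)(int)) {
        for (;;) {
            int idx;
            MPI_Status st;
            MPI_Recv(&idx, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            double result[2] = { (double)idx, computeEntry(idx) };
            MPI_Send(result, 2, MPI_DOUBLE, 0, TAG_RESULT, MPI_COMM_WORLD);
        }
    }

Because jobs are handed out one at a time, slow or uneven entries balance themselves across workers; this is the dynamic job allocation scheme described in Appendix E.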

CEN4/ Parallelization Model of Fast Wavelet Transform

[Figure: data paths from the impedance matrix to the CPUs]

* The impedance matrix is evenly divided by N
* Each share is sent to one CPU
* Each CPU performs the FWT on its share

CEN4/ Parallelization of Linear Equation Solver

* Per frequency point, 80 to 90% of solution time is spent in the linear equation solution routine
* In the BiCG routine, more than 90% of the time is spent in matrix-vector multiplication
* This approach resolves memory and time requirements due to problem size and complexity
* It performs frequency computations serially

[Flowchart: Geometry Definition -> Pre-processing: Mesh Generation -> Frequency Sweep f1 ... fn -> Element Matrix Computation -> Matrix Assembly: Ax=b -> Linear Equation Solver: BiCG (parallelized) -> S-Parameters Output]
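A minimal sketch of the static FWT decomposition described above (row-block partitioning, an even divide, and reuse of the haarStep routine from the earlier sketch are all our assumptions; n is taken to be a power of two and divisible by the number of CPUs):

    #include <mpi.h>
    #include <vector>

    void haarStep(std::vector<double>& v, int n);  // from the earlier sketch

    // Scatter row blocks of the dense matrix, transform each share
    // locally, and gather the result back on the root.
    void parallelRowFWT(double* matrix, int n, int nProcs) {
        const int rowsPerCpu = n / nProcs;
        std::vector<double> share(rowsPerCpu * n);

        // Evenly divide the matrix by N and send each share to each CPU.
        MPI_Scatter(matrix, rowsPerCpu * n, MPI_DOUBLE,
                    &share[0], rowsPerCpu * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        // Perform the full multilevel FWT on each locally owned row.
        for (int r = 0; r < rowsPerCpu; ++r) {
            std::vector<double> row(&share[r * n], &share[r * n] + n);
            for (int len = n; len >= 2; len /= 2)
                haarStep(row, len);
            for (int c = 0; c < n; ++c) share[r * n + c] = row[c];
        }

        // Collect the transformed shares back on the root.
        MPI_Gather(&share[0], rowsPerCpu * n, MPI_DOUBLE,
                   matrix, rowsPerCpu * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    }

Unlike the matrix fill, the FWT has a uniform per-row cost, so a static, even split needs no dispatcher.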

CEN4/ Approach: Parallel Matrix Multiplication

* Broadcast matrix A and vector b once
* Broadcast vector x in every iteration
* Perform the partial matrix-vector multiplication A_i x on every slave node
* Gather the partial results to form Ax
* Find the gradient in BiCG and repeat

CEN4/ Task Parallelization

* Uses the frequency-domain nature of FEM:
  - No data dependency
  - No communication overheads
  - Unlimited communication bandwidth
* This approach is limited by the size and complexity of the problem.

[Flowchart: Geometry Definition -> Pre-processing: Mesh Generation -> Frequency Sweep f1 ... fn (parallelized across frequency points) -> Element Matrix Computation -> Matrix Assembly: Ax=b -> Linear Equation Solver: BiCG -> S-Parameters Output]
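The broadcast/gather kernel above might look like the following sketch (row-block data layout and real-valued entries are our assumptions; in the actual solver this sits inside the BiCG iteration, with A and b distributed once up front):

    #include <mpi.h>
    #include <vector>

    // Compute y = A x with A distributed by row blocks over all ranks.
    // Ablock holds myRows x n local rows; y needs length n on the root.
    void parallelMatVec(const std::vector<double>& Ablock,
                        std::vector<double>& x,
                        std::vector<double>& y,
                        int n, int myRows) {
        // Broadcast the current iterate x from the root to every node.
        MPI_Bcast(&x[0], n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        // Partial matrix-vector product on the locally owned rows.
        std::vector<double> yLocal(myRows, 0.0);
        for (int r = 0; r < myRows; ++r)
            for (int c = 0; c < n; ++c)
                yLocal[r] += Ablock[r * n + c] * x[c];

        // Gather the partial results on the root to form the full Ax.
        MPI_Gather(&yLocal[0], myRows, MPI_DOUBLE,
                   &y[0], myRows, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    }

Only x travels every iteration, which is why communication latency and switch bandwidth, not arithmetic, become the limits reported in the performance slides below.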

CEN4/ Performance Measure

Impedance Matrix Fill for Single Patch Antenna (240 matrix elements, Toeplitz matrix)

[Figure: measured vs. theoretical fill time against number of CPUs, 1 to 30]

    Time(i) = P + Comm x N + Comp x N / (i - 1),  for i > 1

  P: preprocessing time
  Comm: communication time
  Comp: computation time
  i: number of CPUs
  N: number of matrix elements

CEN4/ Performance Measure

Fast Wavelet Transform for Single Patch Antenna (3,568 matrix elements, not a Toeplitz matrix)

[Figure: measured vs. theoretical FWT time against number of CPUs, 2 to 18]

    Time(i) = O + Comm x N + Comp x N / i

  O: overhead
  Comm: communication time
  Comp: computation time
  i: number of CPUs
  N: number of matrix elements

Note: the overhead includes the extra function calls for measuring time and parallel processing.
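For readers who want to reproduce the "Theory" curves, a small sketch that evaluates the two models (the constants passed in main are made-up placeholders, not the measured values behind the plots):

    #include <cstdio>

    // Dispatcher model: one CPU dispatches, (i - 1) CPUs compute.
    double fillTime(int i, double P, double comm, double comp, long N) {
        return P + comm * N + comp * N / (i - 1);
    }

    // Static model: all i CPUs share the transform equally.
    double fwtTime(int i, double O, double comm, double comp, long N) {
        return O + comm * N + comp * N / i;
    }

    int main() {
        for (int i = 2; i <= 32; i *= 2)
            std::printf("%2d CPUs: fill %.3f s, fwt %.3f s\n", i,
                        fillTime(i, 0.1, 1e-6, 1e-3, 240),
                        fwtTime(i, 0.2, 1e-6, 1e-4, 3568));
        return 0;
    }

Both models flatten toward the fixed cost P (or O) plus Comm x N as i grows, which is the saturation visible in the measured curves.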

CEN4/ 3x3 Patch Array Radiation Patterns: Eθ

[Figure: Eθ (φ = 0) pattern for 3x3 patch array vs. θ in degrees]

* Number of Unknowns = 2,376
* Total Number of Matrix Entries = 5,645,376
* Computation Time = 1.5 hours on 32 CPUs

CEN4/ 3x3 Patch Array Radiation Patterns: Eθ

[Figure: Eθ (φ = 90) pattern for 3x3 patch array vs. θ in degrees]

* Number of Unknowns = 2,376
* Total Number of Matrix Entries = 5,645,376
* Computation Time = 1.5 hours on 32 CPUs

CEN4/ Current Distribution on a Patch

[Figure: current distribution over the patch surface; X and Y axes in mm]

CEN4/ Near Field Distribution for Single Patch Antenna

[Figure: near-field distribution for a single patch antenna]

CEN4/ 6x6 Patch Array Radiation Patterns: Eθ

[Figure: Eθ (φ = 0) pattern for 6x6 patch array vs. θ in degrees]

* Number of Unknowns = 17,280
* Total Number of Matrix Entries = 298,598,400
* Computation Time = 4 hours on 32 CPUs

CEN4/ 6x6 Patch Array Radiation Patterns: Eθ

[Figure: Eθ (φ = 90) pattern for 6x6 patch array vs. θ in degrees]

* Number of Unknowns = 17,280
* Total Number of Matrix Entries = 298,598,400
* Computation Time = 4 hours on 32 CPUs

CEN4/ 6x6 Patch Array Radiation Patterns

[Figure: Etotal and Eφ patterns]

CEN4/ 6x6 Patch Array Radiation Patterns

[Figure: Eφ, Eθ and Etotal patterns]

CEN4/ Performance Measure

Parallel Matrix-Vector Multiplication

This approach is limited by:
* Communication Latency (40 usec)
* Bandwidth of Switch
* Node Synchronization

[Figure: solution time vs. number of processors, up to 20, for M = 38,109 and M = 76,709]

CEN4/ Performance Measure

Task (Frequency) Parallelization

* Linearly scalable
* Parallelization limited by the size and complexity of the problem

[Figure: speedup vs. number of processors, up to 35; legend M = 160,…]

CEN4/ Packaging Application (I)

* Operating Frequency Range = 5 to 40 GHz
* Number of Unknowns = 150,000
* Solution Time = 2.0 hours

[Figure: package cross-section showing via holes and microstrip line]
  εr = 5.3, Δh = 0.4 mm, ΔS = 0.25 mm, H = 0.25 mm, W = 0.414 mm

CEN4/ Electric Field Distribution in the Package

Operating Frequency = 33 GHz

[Figure: electric field distribution in the package]

CEN4/ Packaging Application (II)

* Operating Frequency Range = 5 to 40 GHz
* Number of Unknowns = 100,000
* Solution Time = 1.3 hours

[Figure: package cross-section]
  εr = 5.3, Δh = 0.326 mm, ΔS = 0.25 mm, H = 0.5 mm, h = 0.25 mm, W = 0.19 mm

CEN4/ Electric Field Distribution in the Package

Operating Frequency = 28 GHz

[Figure: electric field distribution in the package]

CEN4/ Near Field Distribution for Single Patch Antenna

* Vertical Vias are Placed Near the Microstrip Line
* Via Spacing = 1.3h (h = substrate thickness)
* Number of Unknowns = 400,000 (~3.5 hr to solve)

[Figure: 50-Ω microstrip with metal-filled vias over a ground plane]

CEN4/ Electric Field Distribution in Package

* Operating Frequency = 25 GHz
* Separation = 1.3h (h = substrate thickness)

[Figure: electric field distribution in the package]

CEN4/ Package with Open Striplines

* Via Hole Spacing < 2.5h (h = substrate height)
* Number of Unknowns = 350,000 (~3.0 hr to solve)

[Figure: package with open striplines]

CEN4/ Near Field Distribution for Single Patch Antenna

[Figure: near-field distribution without via holes vs. with via holes]

CEN4/ Package with Bent Stripline

* Number of Unknowns = 400,000 (~3.5 hr to solve)

[Figure: 50-Ω bent stripline in a package]

CEN4/ Electric Field Distribution in Package

[Figure: field distribution below the cutoff frequency; above cutoff without via holes; above cutoff with via holes]

CEN4/ ELECTROMAGNETIC SOLVERS FOR HIGH-FREQUENCY CIRCUITS - Conclusions

> Milestones and Conclusions
  > Completion of a Frequency-Parallelized Code - 12/96
  > Modeling of a 2in x 2in Ka-Band Single-Layer Power Combining Network - About 10^5 Unknowns - 12/96
  > Completion of a Matrix-Element-Parallelized Code - 12/97
  > Modeling of a 3in x 3in Ka-Band Double-Layer 3-D Co-Fired Ceramic Circuit - About 5x10^5 Unknowns - 12/97
  > Completion of a Fully Parallelized Code - 12/98
  > Modeling of a 4in x 4in Ka-Band Multi-Layer 3-D Co-Fired Ceramic Circuit Including Interconnects, Filters and Radiating Structures - About 10^6 Unknowns - 12/98

CEN4/ ELECTROMAGNETIC SOLVERS FOR HIGH-FREQUENCY CIRCUITS - Teaming

> Teaming
  > ARL
  > Univ. of Michigan
  > NASA Lewis Research Center
  > Air Force Wright Patterson Laboratory
> POC
  > Leo DiDomenico, ARL, (908) 427-2377, ldomenico@ftmon.arl.mil
  > Linda P.B. Katehi, U of M, (313) 647-1796, katehi@eecs.umich.edu

CEN4/ ELECTROMAGNETIC SOLVERS FOR HIGH-FREQUENCY CIRCUITS - Module Definitions

Open-Space Radiation Problem Modules
1) IMP_SORT: This module uses the symmetries or antisymmetries of the impedance matrix to sort out a list of unique matrix elements. In the non-parallelized version only.
2) IMP_MAT, IMP_MAT.P: This module (in regular and parallel versions) generates the elements of the impedance matrix for a microstrip patch fed by a slot using the single-layer Green's function. This program creates a vector of unique non-zero elements.
3) FWT, FWT.P: This module applies the fast wavelet transform to generate the interactions (impedance matrix) for a given module.
4) MAT_SOLVE, MAT_SOLVE.P: This module performs thresholding of the matrix, prepares the matrix for solution and solves it using either Gaussian elimination or BiCG, as requested by the user.
5) MoM_POST: This module plots the data calculated by the previous modules in a variety of ways (near field, far field, S-parameters, etc.).
6) MoM_LIB: This is a library module that contains supporting functions for the above modules.

CEN4/ ELECTROMAGNETIC SOLVERS FOR HIGH-FREQUENCY CIRCUITS - Circuit Problem Modules
1) FEM_MESH: Generates a tetrahedral mesh using bricks.
2) FEM_MAT, FEM_MAT.P: Generates the FEM matrix for shielded circuit problems.
3) FEM_MAT_ABS, FEM_MAT_ABS.P: Generates the FEM matrix for open circuit problems.
4) FEM_SOLVE_BiCG, FEM_SOLVE_BiCG.P: Solves the FEM matrix for lossy circuit problems using a Conjugate Orthogonal Conjugate Gradient method.
5) FEM_SOLVE_CG, FEM_SOLVE_CG.P: Solves the FEM matrix for lossy circuit problems using a Conjugate Gradient method.
6) IMP_LOAD, IMP_LOAD.P: Creates the FEM matrix for circuit problems with lumped elements.
7) FEM_POST_FIELD: Plots field values in a variety of ways, as requested by the user.
8) FEM_POST_SCAT: Plots scattering parameters.

CEN4/ ELECTROMAGNETIC SOLVERS FOR HIGH-FREQUENCY CIRCUITS

[Diagram: MODULE blocks - Analyzer | Partitioner | Simulation | User Interface | Parallelizing]

CEN4/ ELECTROMAGNETIC SOLVERS FOR HIGH-FREQUENCY CIRCUITS

[Table: MODULE | EXPECTED COMPLETION | ACTUAL]