034796-4-T
Parallel Electromagnetic Solvers for High Frequency Antenna and Circuit Design
Linda P.B. Katehi
January 1999
34796-4-T = RL-2478

Parallel Electromagnetic Solvers for High Frequency Antenna and Circuit Design

Graduate Students: Donghoon Chen and Eray Yasan
Research Fellow: Jong-Gwan Yook
Principal Investigator: Linda Katehi
Radiation Laboratory
Department of Electrical Engineering and Computer Science
The University of Michigan, Ann Arbor, MI 48109-2122, U.S.A.

Project Leaders: Leo DiDomenico and Dr. Barry Perlman
Applied Communications-Electronics Laboratory
US Army Communications-Electronics Command (CECOM)
Command Control & Systems Integration Directorate
AMSEL-RD-C2, Ft. Monmouth, N.J. 07703

Contents

1 Introduction  2
2 Project Goal  2
3 Integral Equation with Multiresolution Analysis [1, 2]  3
4 Parallelized FEM for MMIC Simulation [1, 2, 5]  4
5 Active Circuits in FEM [6, 7]  6
A User Manual for FEM part  7
  A.1 Compilation and Execution  7
  A.2 User Manual for Phase I - Mesh Generation  9
  A.3 Sample Output Data Files for Phase I  17
  A.4 User Manual for Phase II - FEM Main Program  20
B Benchmarking  23
C User Manual for Hybrid (FEM + MoM) Method  28
  C.1 Procedure  28
  C.2 File flow chart  29
  C.3 File description  30
  C.4 Compilation and Execution  31
  C.5 User Manual for FEM Mesh  34
  C.6 User Manual - MoM  41
D Application of Parallel Computing in MoM  43
  D.1 Introduction  43
  D.2 Parallelization Model of MoM Matrix Fill Process  44
  D.3 Parallelization Model of FWT  48
  D.4 Conclusion  49
E The Implementation of Parallel Programming in C++  51
  E.1 Dynamic job allocation  51
  E.2 Static Job Allocation  53
F Publications  59

1 Introduction

This report summarizes the effort on the development of accurate and computationally efficient codes for large-scale electromagnetic problems, with emphasis on three-dimensional monolithic circuits for high-frequency applications. This development is being accomplished by two approaches:

* increasingly sparse matrix techniques, also known as near matrix diagonalization (NMD) techniques
* code parallelization using the Message Passing Interface (MPI).

In the following, a description of the CHSSI scalable software project goals and progress summaries for the various efforts are provided.

2 Project Goal

High-frequency radar systems for communication, detection and surveillance incorporate MMIC modules which are characterized by high density and substantial geometrical complexity. In most cases these modules are packaged for electrical and/or environmental protection, and the resulting 3D structures exhibit a considerable complexity that impedes design and influences electrical performance due to unwanted parasitics. The design of these complex three-dimensional monolithic circuits (MMIC modules), with hundreds of vias and lines interconnecting a large number of active and passive circuit components to a rather large number of radiating elements, has been a problem of critical importance to DoD. The design and characterization of these circuits requires numerical approaches which can fully characterize the excited electromagnetic fields. Among the various techniques, the method of moments has demonstrated superiority in solving radiation problems, but it has been hindered by long computational times and is limited, for this reason, to small circuits not exceeding a few components and interconnecting lines. In this study, we have concentrated on the development of codes which have the capability to accelerate numerical computations in electromagnetic problems with large computational domains and/or computationally intensive tasks. Among the existing full-wave methods which have been developed for the solution of EM problems are the Integral Equation (IE) technique and the Finite Element Method (FEM). These methods are numerical techniques which solve Maxwell's equations in the frequency domain using large computational volumes and intensive numerical calculations which have to be performed repeatedly at all frequency points of interest. In addition, the numerical solution of Maxwell's equations results in full (IE case) or huge sparse (FEM case) matrices which further limit the applicability of the technique. With

regard to substantially increasing the computational efficiency of these techniques, we concentrate on improving the IE method through near-matrix diagonalization techniques (multiresolution analysis) as well as code parallelization, and the FEM through task parallelization. The project leverages the existing synergism between DoD and University of Michigan efforts on large-scale simulations of advanced software systems. The team will concentrate on the development, validation, documentation, visualization, and demonstration of scalable algorithms for modeling critical DoD problems in high frequency electromagnetic circuits. Technology transfer to the DoD user community is an integral component of the project. In addition, cross-platform portability and software reusability will be emphasized using the Message Passing Interface (MPI) standard. In our effort, the development of scalable software is based on the IE and FEM techniques to sustain high accuracy while, at the same time, solving complex planar circuits, including their packages, in very short times, allowing for real-time simulations. Such software can eventually lead to real-time design and optimization.

3 Integral Equation with Multiresolution Analysis [1, 2]

Our previous study indicated that the use of multiresolution expansions in integral-equation-based formulations leads to the generation of highly sparse linear systems whose numerical solution is fast and efficient with minimal memory usage. This finding leads us to believe that the application of wavelet techniques to the method of moments has made the modeling of large-scale electromagnetic problems possible. The capability of the multiresolution analysis (MRA) to compete with the fastest up-to-date techniques, such as the Fast Multipole Method (FMM) and the Multilevel Fast Multipole Method (MFMM), has been demonstrated through very thorough computer experiments. We expect this trend to be sustained as the matrix size increases to hundreds of thousands or millions. The rooftop multiresolution basis is used for the expansion of planar currents, in a way that the relevant boundary and edge conditions are satisfied to an appropriate degree of accuracy. Through this project it is shown that wavelet-dominated expansion bases generate highly sparse moment matrices, and the resulting sparse linear systems can be solved numerically using very efficient computational tools. The fast wavelet transform is exploited for the numerical integration of the Green's function to speed up the matrix fill process. This work emphasizes results for microstrip patch antennas and finite-size arrays of such antennas. In addition, special emphasis is being placed on the solution of large-scale problems which could not otherwise be handled by conventional moment method implementations. These efforts should yield a unique design capability for maintaining technology leadership. The MRA allows the analysis software to perform a thresholding process on the entire matrix.
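A minimal C++ sketch of such a thresholding pass is shown below; the dense storage and the relative-threshold policy are assumptions for illustration, not the project's actual implementation:

  #include <algorithm>
  #include <complex>
  #include <vector>

  typedef std::complex<double> cplx;

  /* Zero all entries of a dense n x n moment matrix whose magnitude is
     below tau * max|z_kl| and return the resulting sparsity
     (fraction of zeroed entries). */
  double threshold_matrix(std::vector<cplx>& z, int n, double tau)
  {
    double zmax = 0.0;
    for (int k = 0; k < n * n; k++)
      zmax = std::max(zmax, std::abs(z[k]));
    int zeros = 0;
    for (int k = 0; k < n * n; k++)
      if (std::abs(z[k]) < tau * zmax) { z[k] = 0.0; zeros++; }
    return double(zeros) / double(n * n);
  }

At the sparsity levels reported below (on the order of 99.9%), a sparse storage format would of course replace the dense array.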

An additional benefit in the development of this software will be the available theoretical foundation for the thresholding required for matrix sparsification. The result of this code development will be the use of scalable parallel high performance computing systems to solve electrically large electromagnetics problems of critical importance to DoD. While in these problems we have been able to achieve matrix sparsities of the order of 99.9% (a perfectly diagonal N x N matrix has a sparsity of (1 - 1/N) x 100%, which is equal to 99.9% for N = 1000), the numerical computations have become more complex and lengthy. A large portion of these computations are in the form of multiple nested loops where similar types of intensive numerical computations are repeated. Furthermore, a large number of branched or cascaded loops are performed in a serial fashion. Parallelization of these code segments improves the speed and efficiency to a large extent. Another approach to improving the computation time is the fast wavelet algorithm. This algorithm performs all the computations at the highest resolution level only once and then reconstructs the computations at the remaining resolution levels very efficiently. In addition to code parallelization, domain decomposition techniques and task parallelization have great potential with this technique and are applied in addition to near-matrix diagonalization. Demonstration of scalable simulations will address the critical DoD application area of electrically large antennas integrated with multi-layer RF front-end electronics, which will itself be electrically large. Such systems need to be analyzed at the phenomenological level all at once in order to understand the details of device performance in the large-subsystem environment.

4 Parallelized FEM for MMIC Simulation [1, 2, 5]

The numerical analysis of MMICs on parallel computers becomes powerful especially in FEM-based codes. The frequency parallelization of the FEM achieves linearly scalable performance with the number of processors used. The frequency parallelization is implemented using the MPI standard for distributed memory machines such as the IBM SP2. The frequency-parallelized FEM code has been applied to a microstrip feeding network for a patch array antenna, and its field plot is available through a recently developed Matlab post-processing file for field plotting. Parallelization schemes will be further applied to patch antenna structures, including their microstrip feeding networks, using a hybrid method (MoM/FEM). This work will involve the parallelization of matrix generation for this hybrid method, the parallelization of the far-field calculation for each array antenna, and the development of post-processor files for radiation patterns. The parallelized FEM scheme is also appropriate for the characterization of microwave/millimeter-wave packages. As efficient packaging technologies are emphasized for high frequency and high performance microwave/millimeter-wave circuits, it is very important to provide design rules to avoid unwanted electromagnetic effects of the package on the circuit performance. In this study, the electromagnetic effects of a via fence on a microstrip/stripline and the suppression of higher order modes in the cavity are investigated.

Three different types of via fences are considered: a continuous metal-filled via fence on both sides of the circuit, a short section of via fence on both sides of the circuit, and a via fence crossing the circuit. In each case, the fence is moved between 1h and 4h from the circuit, where h is the height between the ground plane and the center conductor of the circuit. As the scattering parameters and radiation loss of the circuits do not vary by more than a few percent over the simulation frequency range (10 GHz - 40 GHz), average values over this frequency range are calculated for each case. Moreover, field distributions are also generated. The results show that when the via fence is closer to the circuit the EM effects are stronger, resulting in increased reflection coefficient, insertion loss, and radiation loss. Furthermore, the results show that a closely spaced via fence confines the electric field more than a widely spaced via fence, giving less radiation loss. It is noted that the stripline shows better EM performance than the microstrip over the via fence in the package. The suppression of higher order packaging modes is studied for the cavity with two different types of striplines: (a) a cavity with two isolated striplines and (b) a cavity with one bent stripline. In the former case, a via fence is placed between the two isolated striplines and the via gap is changed from 57.228h to 13.932h. In order to visualize the suppression of higher order modes in a quantitative way, only one stripline is excited and the ratio of the maximum magnitude of the total electric field on the unexcited stripline to that on the excited stripline is calculated. Results show that the via gap should be less than 0.394 λc for a 1h x 1h via to suppress the unwanted cavity mode below -50 dB, where λc is the cutoff wavelength of the dominant mode in the cavity. The field distribution over the cavity is shown in the results. The cavity with the bent stripline is simulated for different via gaps, from 56.76h to 2.6h. The scattering parameters of the stripline are calculated for each case and field distributions over the cavity are presented. It is seen that the via gap should not be greater than 0.248 λc for a 1h x 1h via to suppress the cavity mode in the package. All simulations have been carried out with a parallelized 3D FEM code. The frequency parallelization scheme is implemented in this study, due to the independent nature of the problem in the frequency domain. The parallel computer used for the simulations is the distributed memory IBM SP2 at the University of Michigan. The performance of the parallelized 3D FEM program improves nearly linearly with respect to the number of processors used; a minimal sketch of this scheme is given at the end of this section. This study shows the electromagnetic effects of a via fence on microstrip/stripline and the suppression of higher order modes in the package. The results can be used as guidelines for designing high performance and high frequency packages.
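A minimal sketch of this frequency parallelization is given below; it is illustrative only (solve_fem is a hypothetical stand-in for the per-frequency FEM solution, and the cyclic distribution is one simple choice), not an excerpt from the actual code:

  #include <mpi.h>
  #include <cstdio>

  /* Each process handles its own subset of frequency points; the solves
     are independent, so no communication is needed during the sweep. */
  int main(int argc, char** argv)
  {
    MPI_Init(&argc, &argv);
    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    const double fstart = 10.0e9, fstop = 40.0e9, df = 1.0e9; /* illustrative */
    const int npts = (int)((fstop - fstart) / df) + 1;

    for (int i = rank; i < npts; i += nproc)   /* cyclic distribution */
    {
      double f = fstart + i * df;
      /* solve_fem(f);  -- hypothetical per-frequency FEM solve */
      std::printf("process %d: frequency %.2f GHz\n", rank, f * 1.0e-9);
    }
    MPI_Finalize();
    return 0;
  }

With as many processes as frequency points, every process performs exactly one solve, which is where the near-linear scaling comes from.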

5 Active Circuits in FEM [6, 7]

Based on our previous research, we have launched an effort to analyze active as well as passive circuits, such as a mixer circuit (with 4 diodes), by use of the FEM. This is an extension of our previous effort in the sense that we have added the capability of handling nonlinear elements to our codes. Using the FEM, the Z parameters (or Y parameters) of the passive part of the circuit are computed, and then these values are fed into a software package such as Libra in the form of a linear data element with inner and outer ports. The linear or nonlinear elements are connected to the inner ports and the whole circuit is analyzed by harmonic balance in order to find its input/output characteristics. The goal is to verify the experimental results of the mixer circuits under consideration and make some further improvements. In the next step, the harmonic balance code will be integrated into the FEM method and will be parallelized. The direct inclusion of linear passive elements (resistors, capacitors, etc.) into the FEM code has already been accomplished.
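Since the passive part is exported as Z parameters while results such as those in Appendix B are compared as S parameters, the standard 2-port conversion is worth recording. The sketch below uses the textbook formulas for a real reference impedance Z0; it is a generic utility, not an excerpt from the project codes:

  #include <complex>

  typedef std::complex<double> cplx;

  /* Convert 2-port Z parameters to S parameters for a real reference
     impedance Z0 (standard formulas). */
  void z_to_s(const cplx z[2][2], double Z0, cplx s[2][2])
  {
    cplx d = (z[0][0] + Z0) * (z[1][1] + Z0) - z[0][1] * z[1][0];
    s[0][0] = ((z[0][0] - Z0) * (z[1][1] + Z0) - z[0][1] * z[1][0]) / d;
    s[0][1] = 2.0 * z[0][1] * Z0 / d;
    s[1][0] = 2.0 * z[1][0] * Z0 / d;
    s[1][1] = ((z[0][0] + Z0) * (z[1][1] - Z0) - z[0][1] * z[1][0]) / d;
  }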

A User Manual for FEM part

A.1 Compilation and Execution

There are two important files, one for meshing and the other for the FEM main program, as illustrated in Fig. 1; these files are written in the high-level language Fortran. The first file, fem-mesh.f, is a serial code, since meshing takes only a small fraction of the overall solution process. However, the second Fortran code, fem-main.f, has been parallelized using the Message Passing Interface (MPI) to enhance portability of the code across different parallel computer architectures and vendors. Here are some examples of how to compile these two codes on parallel computers, for example the IBM SP2 and the SGI Power Challenge Array (PCA):

Compile meshing program
  conventional workstation
    f77 -O -o fem-mesh.x fem-mesh.f

Compile main program
  on IBM SP2
    mpxlf -us -O3 -lmpi -o fem-main.x fem-main.f
  on SGI PCA
    f77 -64 -O2 -o fem-main.x fem-main.f -L/usr/lib64 -lmpi

Executing the compiled meshing code is fairly straightforward, but not so for the FEM main part: there are different methods for different parallel platforms. For the IBM SP2, one can use LoadLeveler to submit jobs in batch mode, or one can utilize interactive nodes for small, short-running jobs and for debugging purposes. Here are some examples of executing the codes on the IBM and SGI machines:

Executing meshing program (phase I)
  conventional workstation
    fem-mesh.x < input-mesh-file

Executing main program (phase II) with 8 CPUs
  on IBM SP2
    fem-main.x -rmpool 0 -procs 8 < input-main-file
  on SGI PCA
    mpirun -np 8 fem-main.x < input-main-file

Note that these scripts are specific to the environment and operating system. For example, the Maui High Performance Computing Center (MHPCC) and the Center for Parallel Computing (CPC) at the University of Michigan have similar IBM SP2 machines, but they have different batch systems and scheduling arrangements. Therefore, users need to be aware of what kind of system and environment they have to work with.
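For batch execution under LoadLeveler, a job command file along the following lines is submitted with llsubmit. Keyword names and required fields vary by site and LoadLeveler version, so this is an illustrative sketch (the file name fem-main.cmd and the resource values are made up), not a script verified at MHPCC or CPC:

  # @ job_type         = parallel
  # @ output           = fem-main.out
  # @ error            = fem-main.err
  # @ node             = 2
  # @ tasks_per_node   = 4
  # @ wall_clock_limit = 1:00:00
  # @ queue
  fem-main.x < input-main-file

  llsubmit fem-main.cmd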

[Flow chart: Start -> Phase I: Mesh Generation (Input File I) -> Mesh Data Files -> Phase II: FEM Main Routine (Input File II) -> EM Field Data Files -> Phase III: Visualization -> Finish.]

Figure 1: Flow chart of the finite element method (FEM) solution process.

A.2 User Manual for Phase I - Mesh Generation

In this section, a brief description of the input file for Phase I (mesh generation) is given. As an example, we have chosen a simple patch antenna on a duroid substrate operating at 4.0 GHz. The schematic of the antenna is shown in Fig. 2. To model this geometry, one first needs to define an outer terminating boundary as shown in Fig. 3. This terminating boundary is crucial for mesh termination and control of the overall problem size. As shown in the figure, the distances of the boundary from the structure to be modeled, Dside and Dtop, need to be at least 4 to 5 times the substrate thickness, or 0.2 λ, for correct results. The outer boundaries are assigned to be perfect electric or magnetic (PEC or PMC) surfaces depending on the application.

Figure 2: Single element patch antenna on duroid substrate: L = 11.2 mm, W = 8.0 mm, h = 2.55 mm, εr,duroid = 10.8.

In addition, for the simulation of open boundary problems, such as antennas, artificial absorbing materials need to be placed inside the outer terminating boundaries as shown in Fig. 4. The material constants (εr and μr) are chosen to minimize the reflections at the interface and to maximize electromagnetic field attenuation in the material. Please refer to [3] for more details. After the outer boundaries are defined, one needs to specify the location of the conducting surfaces, such as the antenna and its feeding network. As shown in Fig. 5, the antenna and feeding network are divided into a number of rectangular sections, each of which can be defined by any two diagonal points.

Figure 3: Illustration of total solution space and outer boundary (Dside, Dtop).

Outer Boundary Definitions

  0.0, 24.0      xmin, xmax
  -10.0, 54.0    ymin, ymax
  0.0, 20.0      zmin, zmax

Note 1. One has to define all of the above parameters regardless of whether the solution space is an open or closed boundary.
Note 2. For closed boundary problems, such as cavity problems, one can use all the outside surfaces as PEC.
Note 3. For open boundary problems, such as antenna problems, one needs to design and place artificial absorbing material inside the outermost boundary.
Note 4. When there are symmetries in the problem, one can take advantage of the fact to reduce the overall problem size.

Figure 4: Artificial absorbing layers (AA for free space, AA for substrate, around the antenna and substrate) for simulation of an open boundary problem.

Figure 5: Example of surface boundary condition application with appropriate rectangulation.

Discretization

  2                                               discretization method: [1] uniform discretization, [2] non-uniform discretization
  5                                               number of subsections in x direction
  1.00000, 1.00000, 0.80000, 0.50000, 0.55000     Δx1, ..., Δx5
  4 16 3 1 2                                      Nx1, ..., Nx5
  6                                               number of subsections in y direction
  1.00000, 1.00000, 0.76625, 0.84500, 0.94000, 1.00000   Δy1, ..., Δy6
  4 26 8 6 20 4                                   Ny1, ..., Ny6
  3                                               number of subsections in z direction
  0.85000, 1.24500, 1.00000                       Δz1, ..., Δz3
  3 10 5                                          Nz1, ..., Nz3

Note. Δxi, Δyi, Δzi: size of the ith subsection in the x, y, z direction. Nxi, Nyi, Nzi: number of divisions in the ith subsection.

Material Property: Real and Imaginary Parts of Each Material

  9             number of materials
  10.8, 0.0     Re(εr1), Im(εr1): duroid substrate
  1.0, 0.0      Re(μr1), Im(μr1)
  10.8, 15.0    Re(εr2), Im(εr2): AA for duroid substrate
  1.0, 1.3889   Re(μr2), Im(μr2)
  10.8, 15.0    Re(εr3), Im(εr3): AA for duroid substrate
  1.0, 1.3889   Re(μr3), Im(μr3)
  10.8, 15.0    Re(εr4), Im(εr4): AA for duroid substrate
  1.0, 1.3889   Re(μr4), Im(μr4)
  1.0, 5.0      Re(εr5), Im(εr5): AA for free space
  1.0, 5.0      Re(μr5), Im(μr5)
  1.0, 5.0      Re(εr6), Im(εr6): AA for free space
  1.0, 5.0      Re(μr6), Im(μr6)
  1.0, 5.0      Re(εr7), Im(εr7): AA for free space
  1.0, 5.0      Re(μr7), Im(μr7)
  1.0, 5.0      Re(εr8), Im(εr8): AA for free space
  1.0, 5.0      Re(μr8), Im(μr8)
  1.0, 0.0      Re(εr9), Im(εr9): free space
  1.0, 0.0      Re(μr9), Im(μr9)

Note 1. For each material, the real and imaginary parts of the relative permittivity (εr) and permeability (μr) need to be defined as above.
Note 2. Artificial absorbers (AA) for free space and for any arbitrary material have to be chosen by the user. The formulas can be found in [3].
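The exact design formulas for the absorber constants are given in [3]; it is worth noting, though, that the sample values above follow the usual impedance-matching pattern: εr and μr of the background material are scaled by the same complex factor, so the wave impedance is continuous at the absorber interface while the medium becomes lossy. In the notation of this table,

  \frac{\epsilon_{AA}}{\epsilon_r} \;=\; \frac{\mu_{AA}}{\mu_r}
  \quad\Longrightarrow\quad
  \eta_{AA} \;=\; \sqrt{\mu_{AA}/\epsilon_{AA}} \;=\; \sqrt{\mu_r/\epsilon_r} \;=\; \eta .

For the duroid entries, (10.8 + j15)/10.8 = 1 + j1.3889, equal to the μ ratio (1 + j1.3889)/1; for the free-space entries, both ratios are 1 + j5.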

Material Allocations: (xmin, ymin, zmin)i and (xmax, ymax, zmax)i

The 2 points are the diagonal of a cube containing the given material.

  4.0, -6.0, 0.0      (xmin, ymin, zmin)1
  24.0, 50.0, 2.55    (xmax, ymax, zmax)1
  0.0, -10.0, 0.0     (xmin, ymin, zmin)2
  4.0, 54.0, 2.55     (xmax, ymax, zmax)2
  4.0, -10.0, 0.0     (xmin, ymin, zmin)3
  24.0, -6.0, 2.55    (xmax, ymax, zmax)3
  4.0, 50.0, 0.0      (xmin, ymin, zmin)4
  24.0, 54.0, 2.55    (xmax, ymax, zmax)4
  0.0, -10.0, 2.55    (xmin, ymin, zmin)5
  4.0, 54.0, 15.0     (xmax, ymax, zmax)5
  4.0, -10.0, 2.55    (xmin, ymin, zmin)6
  24.0, -6.0, 15.0    (xmax, ymax, zmax)6
  4.0, 50.0, 2.55     (xmin, ymin, zmin)7
  24.0, 54.0, 15.0    (xmax, ymax, zmax)7
  0.0, -10.0, 15.0    (xmin, ymin, zmin)8
  24.0, 54.0, 20.0    (xmax, ymax, zmax)8
  4.0, -6.0, 2.55     (xmin, ymin, zmin)9
  24.0, 50.0, 15.0    (xmax, ymax, zmax)9

Note. Any 2 diagonal points can be used to define a cube whose material property is homogeneous.
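As an illustration of how a mesher can use these diagonal points, the short C++ sketch below tags a point (for example an element centroid) with the index of the first cube that contains it. The Cube record and the materialAt function are hypothetical names for illustration, not part of the actual mesh generator:

  #include <vector>

  struct Cube { double x0, y0, z0, x1, y1, z1; int material; };

  /* Return the material index of the first cube containing (x, y, z),
     or 0 if no cube contains it. */
  int materialAt(const std::vector<Cube>& cubes, double x, double y, double z)
  {
    for (int i = 0; i < (int)cubes.size(); i++)
    {
      const Cube& c = cubes[i];
      if (x >= c.x0 && x <= c.x1 &&
          y >= c.y0 && y <= c.y1 &&
          z >= c.z0 && z <= c.z1)
        return c.material;
    }
    return 0;
  }

Because the first matching cube wins, the order of the allocation list resolves any overlaps.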

Surface Boundary Conditions: (xmin, ymin, zmin)i and (xmax, ymax, zmax)i

The 2 points define the diagonal of a rectangle for the ith surface.

  20.0, 20.0, 2.55     (xmin, ymin, zmin)1: patch 1
  22.4, 31.2, 2.55     (xmax, ymax, zmax)1
  1                    "1" for PEC, "0" for PMC
  y                    "y" for more surfaces, "n" for no more surfaces to define
  22.4, 26.13, 2.55    (xmin, ymin, zmin)2: patch 2
  24.0, 31.2, 2.55     (xmax, ymax, zmax)2
  1                    PEC
  y                    more surfaces to define
  22.9, -10.0, 2.55    (xmin, ymin, zmin)3: input feeding line
  24.0, 26.13, 2.55    (xmax, ymax, zmax)3
  1                    PEC
  y                    more surfaces to define
  0.0, -10.0, 0.0      (xmin, ymin, zmin)4: bottom surface
  24.0, 54.0, 0.0      (xmax, ymax, zmax)4
  1                    PEC
  y                    more surfaces to define
  0.0, -10.0, 20.0     (xmin, ymin, zmin)5: top surface
  24.0, 54.0, 20.0     (xmax, ymax, zmax)5
  1                    PEC
  y                    more surfaces to define
  0.0, -10.0, 0.0      (xmin, ymin, zmin)6: left surface
  0.0, 54.0, 20.0      (xmax, ymax, zmax)6
  1                    PEC
  y                    more surfaces to define
  0.0, -10.0, 0.0      (xmin, ymin, zmin)7: front surface
  24.0, -10.0, 20.0    (xmax, ymax, zmax)7
  1                    PEC
  y                    more surfaces to define

Boundary Conditions (Surface): continued

  24.0, -10.0, 0.0     (xmin, ymin, zmin)8: right surface
  24.0, 54.0, 20.0     (xmax, ymax, zmax)8
  0                    PMC
  y                    more surfaces to define
  0.0, 54.0, 0.0       (xmin, ymin, zmin)9: back surface
  24.0, 54.0, 20.0     (xmax, ymax, zmax)9
  1                    PEC
  y                    more surfaces to define
  23.45, -5.0, 1.7     (xmin, ymin, zmin)10: excitation place
  23.45, -5.0, 2.55    (xmax, ymax, zmax)10
  1                    PEC
  n                    no more surfaces to define

Note 1. PEC - perfect electric conductor; PMC - perfect magnetic conductor.
Note 2. These boundary conditions can be utilized to define symmetric boundary conditions depending on the electromagnetic field distribution.

Excitation Points: define 2 end points (xmin, ymin, zmin)i and (xmax, ymax, zmax)i

  1                    number of excitation points
  23.45, -5.0, 1.7     (xmin, ymin, zmin)1
  23.45, -5.0, 2.55    (xmax, ymax, zmax)1

Note. One has to define a PEC line for the excitation edges. You may need to define more than 1 line, as in the case of a coplanar waveguide (CPW).

Impedance Load Definition

  0                    "0" for no impedance loads, "1" for impedance loads

Note. One can use this option to define lumped circuit elements, such as R, L, and C.

A.3 Sample Output Data Files for Phase I

In this section, the first few lines of each of the output data files are shown for better understanding. In general, the user does not need to modify or look into these mesh output data files. These files will be read by the FEM main program as described in Phase II. In terms of storage space requirements, these output data files range from several megabytes to about 100 megabytes depending on the overall mesh size.

File Name: DAT-EDGY
Description: edge label and x, y, and z components of the edge vectors.

  1 0.0000000000000000E+00 1.000000000000000 0.8500000000000000
  2 1.000000000000000 1.000000000000000 0.0000000000000000E+00

File Name: DAT-EGLB
Description: material constants, number of nodes and tetrahedra, and node labels for non-overlapping edges.

  1 10.80000 0.0000000000000000E+00 1.000000 0.0000000000000000E+00
  2 10.80000 15.00000000000000 1.000000 1.388900000000000
  8 1.000000 5.000000000000000 1.000000 5.000000000000000
  9 1.000000 0.0000000000000000E+00 1.000000 0.0000000000000000E+00
  35397 159120
  1 1 1 1891 2
  1 2 1 29 2
  1 3 1 28 2

File Name: DAT-ELMT
Description: element labels, 4 nodes for each tetrahedron, and material index.

  1 1 1891 29 28 2
  2 29 1891 1865 1892 2
  3 1 2 29 1865 2
  4 1 1891 1865 1864 2
  5 1 1891 1865 29 2
  6 29 1865 1893 1892 2
  7 1865 3 1893 1866 2
  8 29 1865 3 2 2
  9 29 3 1893 30 2
  10 29 3 1893 1865 2

File Name: DAT-ESUR
Description: surface edge definitions.

  31996 6420 6421 1
  32151 6420 6447 1
  32004 6421 6422 1
  32156 6421 6447 1
  32158 6421 6448 1
  32160 6421 6449 1
  32328 6450 6477 1

File Name: DAT-EXCI
Description: excitation position definition.

  28045

File Name: DAT-GRID
Description: surface edge definitions for later visualization.

  1 20.000000 20.000000 2.5500000 20.800000 20.000000 2.5500000
  1 20.000000 20.000000 2.5500000 20.000000 20.766250 2.5500000

File Name: DAT-GRUP
Description: node labels on each surface definition.

  PEC NODE ON SURFACE 1
  1 1 2 3 4 5 60
  6420 6421 6422 6423 6447

File Name: DAT-NODE
Description: node coordinate table.

  1 0.00000000E+00 -10.000000 0.00000000E+00
  2 1.0000000 -10.000000 0.00000000E+00
  3 2.0000000 -10.000000 0.00000000E+00
  4 3.0000000 -10.000000 0.00000000E+00

File Name: DAT-NUMB
Description: total number of surfaces, edges, and edges on surfaces.

  9 201436 21181
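For post-processing outside the provided tools, these listings are plain whitespace-separated text. The following minimal C++ sketch reads the DAT-NODE node coordinate table, assuming exactly the four-column layout (label, x, y, z) shown above:

  #include <fstream>
  #include <iostream>
  #include <vector>

  struct Node { int label; double x, y, z; };

  int main()
  {
    std::ifstream in("DAT-NODE");
    std::vector<Node> nodes;
    Node n;
    /* each line: node label followed by the x, y, z coordinates */
    while (in >> n.label >> n.x >> n.y >> n.z)
      nodes.push_back(n);
    std::cout << nodes.size() << " nodes read\n";
    return 0;
  }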

A.4 User Manual for Phase II - FEM Main Program

In this section, a brief description of the input file for Phase II, the FEM main program, is given. This input file contains parameters for geometry selection and some other miscellaneous data, as described below. Also, the operating frequency range and output data file names for each frequency point are required. For electromagnetic field sampling in the solution domain, the user has to provide an appropriate cross-section and the coordinates of two diagonal points of the circuit cross-section. Finally, the user can select an appropriate iterative solver convergence criterion by specifying the residual norm. In most cases, 10^-3 to 10^-4 will be enough.

Problem Definition

  1      What problem will be solved?
         [1] PEC/PMC/artificial absorber
         [2] 1st order absorbing boundary condition
         [3] 2nd order absorbing boundary condition
  1      What excitation mechanism (only for 2-port networks)?
         [1] Even
         [2] Odd

Note: For a 1-port network, you can choose either 1 or 2.

Miscellaneous Parameters: these parameters provide optional information for discretization

  0.05   Global mesh size
  0.5    Length of the structure
  7.5    Approximate effective dielectric constant

Note: These parameters do not affect the accuracy or performance of the (parallel) FEM code.

Operating Frequency Range and Output File Names

  3.80   Start frequency (fstart)
  4.20   Stop frequency (fstop)
  0.02   Frequency step (Δf)
  ant-array-3.80ghz.m    Output data file names
  ant-array-3.82ghz.m
  ant-array-3.84ghz.m
  ant-array-3.86ghz.m
  ant-array-3.88ghz.m
  ant-array-3.90ghz.m
  ant-array-3.92ghz.m
  ant-array-3.94ghz.m
  ant-array-3.96ghz.m
  ant-array-3.98ghz.m
  ant-array-4.00ghz.m
  ant-array-4.02ghz.m
  ant-array-4.04ghz.m
  ant-array-4.06ghz.m
  ant-array-4.08ghz.m
  ant-array-4.10ghz.m
  ant-array-4.12ghz.m
  ant-array-4.14ghz.m
  ant-array-4.16ghz.m
  ant-array-4.18ghz.m
  ant-array-4.20ghz.m

Note 1. The number of frequency points is determined by (fstop - fstart)/Δf + 1; here (4.20 - 3.80)/0.02 + 1 = 21, matching the file list above.
Note 2. For the task parallelization scheme, the best linear scalability is achieved when the number of CPUs is equal to the number of frequency points.

Electromagnetic Field Sampling

  2              Which cross section do you want to cut?
                 [1] X-Z plane (constant y plane)
                 [2] X-Y plane (constant z plane)
                 [3] Y-Z plane (constant x plane)
  0.010, 23.99   xmin, xmax
  -9.98, 55.99   ymin, ymax
  2.56           Constant z

Note. If you choose [1], you need to provide (xmin, xmax) first, (zmin, zmax) second, and the constant y. If you choose [3], you need to provide (ymin, ymax) first, (zmin, zmax) second, and the constant x.

Resistive Material

  2        Resistive material? [1] Yes, [2] No

Iterative Solver (BiCG or QMR) Convergence Criterion

  1.0e-4   Tolerance (residual) < 10^-4
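The criterion can be read as requiring the relative residual norm to drop below the tolerance. A minimal sketch of such a test is given below, assuming the solver makes the residual r = b - Ax available at each iteration (a generic check, not the program's internal one):

  #include <cmath>
  #include <complex>
  #include <vector>

  typedef std::complex<double> cplx;

  /* True when ||r||_2 / ||b||_2 < tol, the usual BiCG/QMR stopping rule. */
  bool converged(const std::vector<cplx>& r, const std::vector<cplx>& b,
                 double tol)
  {
    double rn = 0.0, bn = 0.0;
    for (int i = 0; i < (int)r.size(); i++) rn += std::norm(r[i]);
    for (int i = 0; i < (int)b.size(); i++) bn += std::norm(b[i]);
    return std::sqrt(rn) < tol * std::sqrt(bn);
  }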

B Benchmarking

In this section, some benchmarking examples of our parallel electromagnetic solvers are illustrated. As shown in Fig. 6, a 2-element patch array antenna has been simulated using different numbers of CPUs, from 1 to 25. The direct output of the simulator provides electromagnetic field data, as illustrated in Fig. 7. From these data, important circuit parameters, such as S-parameters, can easily be extracted, as shown in Fig. 8.

Figure 6: 2-element patch array antenna with corporate feeding network.

Figure 7: Electromagnetic field distributions for the 2-element patch array antenna with corporate feeding network.

[S-parameter plot; vertical axis in dB, horizontal axis 24-36.]

Figure 8: S-parameters of the 2-element patch array antenna with corporate feeding network.

In this particular example, we have computed the electromagnetic field distribution at 25 different frequency points using our task-parallelized solver, and we also present speedup curves for various numbers of CPUs (Fig. 9). As shown in the figure, we have achieved linear scalability on a distributed memory machine using the MPI (Message Passing Interface) standard. It needs to be pointed out that with our task parallelization strategy the best performance is achieved when there are p = M/2^n processors, where M is the total number of frequency points to be computed and n = {0, 1, 2, 3, ...}. When M is not a multiple of 2, p = int(M/2^n) + mod(M, M/2^n). The scalability of a parallelized code can be studied by examining the load balancing between different processors. As shown in Fig. 10, the best performance is achieved when we have good load balancing between the CPUs. For example, when one needs to obtain data at 40 frequency points, the best scalability is achieved when there are 40 CPUs available. However, if there are only 30 CPUs available, the use of 30 or 20 CPUs will not make a big difference; a small sketch of this "rounds" argument follows below. As a last example, Fig. 11 shows the coupling between two radiating antenna elements in an array environment. For this study, conventional and micromachined silicon substrates are used, as depicted in the figure. Apart from the coupling issues, this example demonstrates the capability of our parallel electromagnetic codes to solve and visualize very complicated electromagnetic field distributions. This will be greatly useful to RF circuit and antenna designers and even to engineers in the DoD community.
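The "rounds" argument behind the 40-point example is simple integer arithmetic: with M frequency points and N CPUs, the sweep takes ceil(M/N) rounds of solves. The small illustrative program below tabulates this (it is a sketch of the reasoning, not part of the solver):

  #include <cstdio>

  int main()
  {
    const int M = 40;                  /* frequency points to compute */
    const int cpus[] = {10, 16, 20, 30, 40};
    for (int k = 0; k < 5; k++)
    {
      int N = cpus[k];
      int rounds = (M + N - 1) / N;    /* ceil(M/N) sequential rounds */
      std::printf("N = %2d CPUs -> %d round(s)\n", N, rounds);
    }
    return 0;
  }

Both N = 20 and N = 30 give 2 rounds, while N = 40 gives a single round, which is why 30 CPUs buy little over 20 in this case.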

[Speedup plot: measured curves for M = 130,573 and M = 183,816 unknowns against the linear (ideal) curve; speedup versus number of CPUs, 0 to 35.]

Figure 9: Speedup curves for two different problem sizes. M is the number of unknowns to be solved.

[Six load-balancing plots, panels (a)-(f): load per processor for each processor label.]

Figure 10: Load balancing characteristics for different numbers of CPUs (N): (a) N = 5, (b) N = 10, (c) N = 15, (d) N = 16, (e) N = 20, (f) N = 25.

Figure 11: Coupling between two adjacent radiating elements: (a) antenna on conventional substrate; (b) antenna on micromachined substrate.

C User Manual for Hybrid (FEM + MoM) Method

C.1 Procedure

In this section, a brief description of the hybrid method is given. As can be seen in Fig. 12, the FEM and MoM solutions are combined to yield the circuit parameters or the electromagnetic field distribution.

[Flow chart: Start -> FEM Routine (FEM input file) -> impedance matrix and EM data from FEM -> MoM Routine (MoM input file) -> impedance matrix from MoM -> Hybrid Method Routine -> input impedance and scattering parameters -> Finish.]

Figure 12: Flow chart of the hybrid method solution process.

C.2 File flow chart

The hybrid method requires several different input files for the FEM and MoM solution processes, as described in Fig. 13. Some of the output files are used in the next stage, while others contain the final solution.

[File flow chart: (FEM) hybrid-fem.dat and hybrid-feminput.dat -> hybrid-fem.c -> current.*.dat, yps.*.tem; (MoM) zmnin.dat and hybridMoM.dat -> hybridMoM-yb.c, hybridMoM-zmnb.c, hybridMoM-tmb.c -> yb.tem, zmnb.tem, tmb.tem; (Hybrid Method) hybrid.c -> current.dat, sum.dat.]

Figure 13: File flow chart of the hybrid method solution process.

C.3 File description

Here are descriptions of the files (source codes, input and output files) for the FEM, MoM, and hybrid method.

FEM files
  hybrid-fem.dat       FEM mesh input file.
  hybrid-feminput.dat  FEM input file.
  hybrid-fem.c         FEM program.
  current.*.dat        Output file: current information on the microstrip for each excitation. File format: Re(Ex), Im(Ex), Re(Ey), Im(Ey), Re(Ez), Im(Ez).
  yps.*.tem            Output file: impedance matrix component for each excitation.

Note. * = 0, 1, 2, ..., numbered according to each excitation.

MoM files
  zmnin.dat            MoM input file.
  hybridMoM.dat        MoM input file.
  hybridMoM-yb.c       MoM program for the slot.
  hybridMoM-zmnb.c     MoM program for the patch.
  hybridMoM-tmb.c      MoM program for the interaction between slot and patch.
  yb.tem               Output file: impedance matrix for the slot.
  zmnb.tem             Output file: impedance matrix for the patch.
  tmb.tem              Output file: impedance matrix for the interaction between slot and patch.

Hybrid method files
  hybrid.c             Hybrid program.
  current.dat          Output file: current distribution on the input microstrip, which is the final solution of the problem. The input impedance and scattering parameters are calculated from this information.
  sum.dat              Output file: current profiles on the slot and patch. File format: Map is the magnetic basis function profile on the slot; Ib is the current basis function profile on the patch.

C.4 Compilation and Execution

The hybrid method involves 3 important steps: FEM, Method of Moments (MoM), and the hybrid simulation which combines the FEM and MoM results. All the programs are written in the C++ language, which enables dynamic memory allocation and, as a result, memory savings. The programs are also parallelized using MPI to provide portability across different types of parallel computers. As a first step, the FEM program, hybrid-fem.c, is run using its data file, hybrid-fem.dat. Compilation and execution examples are shown as follows.

Compile FEM program
* conventional workstation
  g++ hybrid-fem.c -DDOMAINS -D__SUNC__ -o hybrid-fem.x -O2 -lm
* IBM SP2
  mpCC -+ hybrid-fem.c -DDOMAINS -DPARALLEL -D__GNUC__ -o hybrid-fem.x -O3 -lm

If the structure is symmetric, we can take advantage of this by adding the option -DSYMMETRY when compiling, on both the conventional workstation and the IBM SP2. As mentioned before, the FEM part of the hybrid method can be run in two modes on the IBM SP2: LoadLeveler (batch) mode and interactive mode. Here are some examples of executing the FEM code on a conventional workstation and on the IBM SP2.

Executing FEM program
* conventional workstation
  hybrid-fem.x < hybrid-feminput.dat (FEM input file)

Executing FEM program with 8 CPUs
* IBM SP2
  hybrid-fem.x -rmpool 0 -procs 8 < hybrid-feminput.dat (FEM input file)

Compile MoM programs
* IBM SP2
  mpCC -+ hybridMoM-yb.c -DPARALLEL -D__GNUC__ -o hybridMoM-yb.x -O3 -lm
  mpCC -+ hybridMoM-zmnb.c -DPARALLEL -D__GNUC__ -o hybridMoM-zmnb.x -O3 -lm
  mpCC -+ hybridMoM-tmb.c -DPARALLEL -D__GNUC__ -o hybridMoM-tmb.x -O3 -lm

Executing MoM programs with 4 CPUs
* IBM SP2
  hybridMoM-yb.x -rmpool 0 -procs 4 < zmnin.dat (MoM input file)
  hybridMoM-zmnb.x -rmpool 0 -procs 4 < zmnin.dat
  hybridMoM-tmb.x -rmpool 0 -procs 4 < zmnin.dat

Compile Hybrid program
* conventional workstation
  g++ hybrid.c -D__SUNC__ -o hybrid.x -lm
* IBM SP2
  mpCC -+ hybrid.c -DPARALLEL -D__GNUC__ -o hybrid.x -O3 -lm

Executing Hybrid program
* conventional workstation
  hybrid.x
* IBM SP2
  hybrid.x -rmpool 0 -procs 1

The example structure to be solved with the hybrid method is shown below. The FEM is used for the microstrip feeding network, where an analytic Green's function is not known. The method of moments (MoM) is used for the patch antenna, where the Green's function is available. By combining the results from the FEM and the MoM, we can solve the whole structure; this is the hybrid method.

Figure 14: Single patch antenna (patch, substrate, microstrip, ground plane) with microstrip feeding network.

C.5 User Manual for FEM Mesh

In this section, a brief description of the input files for the FEM simulation is given.

Figure 15: Microstrip feeding network (width Wf) on substrate.

Figure 16: Illustration of total solution space and outer boundary (termination boundary, Dside).

Figure 17: Artificial absorbing layers (AA for free space, AA for substrate, around the microstrip and substrate) for simulation of an open boundary problem.

Figure 18: Meshing schemes for each layer (top mesh, bottom mesh).

FEM Part

Discretization: file name (hybrid-fem.dat)

  2                   Number of layers.

* Mesh the bottom section of the first layer. *

  5                                                                     NS_bx^1
  -39.45e-3 -27.45e-3 -6e-3 6e-3 27.45e-3 39.45e-3                      x_b1^1, ..., x_b5^1, xmax (end position)
  5 8 16 8 5                                                            N_bx1^1, ..., N_bx5^1
  6                                                                     NS_by^1
  -202.55e-3 -182.55e-3 -0.55e-3 0.55e-3 20e-3 42e-3 54e-3              y_b1^1, ..., y_b6^1, ymax (end position)
  5 70 2 8 8 5                                                          N_by1^1, ..., N_by6^1
  1                                                                     NS_bz^1
  0 1.6e-3                                                              z_b1^1, zmax (end position)
  2                                                                     N_bz1^1

* Mesh the top section of the first layer. *

  7                                                                     NS_tx^1
  -39.45e-3 -27.45e-3 -6e-3 -2.475e-3 2.475e-3 6e-3 27.45e-3 39.45e-3   x_t1^1, ..., x_t7^1, xmax (end position)
  5 8 5 6 5 8 5                                                         N_tx1^1, ..., N_tx7^1
  6                                                                     NS_ty^1
  -202.55e-3 -182.55e-3 -0.55e-3 0.55e-3 20e-3 42e-3 54e-3              y_t1^1, ..., y_t6^1, ymax (end position)
  5 70 2 8 8 5                                                          N_ty1^1, ..., N_ty6^1

Discretization (continued)

* Mesh the bottom section of the second layer. *

  7                                                                     NS_bx^2
  -39.45e-3 -27.45e-3 -6e-3 -2.475e-3 2.475e-3 6e-3 27.45e-3 39.45e-3   x_b1^2, ..., x_b7^2, xmax (end position)
  5 8 5 6 5 8 5                                                         N_bx1^2, ..., N_bx7^2
  6                                                                     NS_by^2
  -202.55e-3 -182.55e-3 -0.55e-3 0.55e-3 20e-3 42e-3 54e-3              y_b1^2, ..., y_b6^2, ymax (end position)
  5 70 2 8 8 5                                                          N_by1^2, ..., N_by6^2
  2                                                                     NS_bz^2
  1.6e-3 22.4e-3 34.4e-3                                                z_b1^2, z_b2^2, zmax (end position)
  8 5                                                                   N_bz1^2, N_bz2^2

* Mesh the top section of the second layer. *

  7                                                                     NS_tx^2
  -39.45e-3 -27.45e-3 -6e-3 -2.475e-3 2.475e-3 6e-3 27.45e-3 39.45e-3   x_t1^2, ..., x_t7^2, xmax (end position)
  5 8 5 6 5 8 5                                                         N_tx1^2, ..., N_tx7^2
  6                                                                     NS_ty^2
  -202.55e-3 -182.55e-3 -0.55e-3 0.55e-3 20e-3 42e-3 54e-3              y_t1^2, ..., y_t6^2, ymax (end position)
  5 70 2 8 8 5                                                          N_ty1^2, ..., N_ty6^2

Note 1. NS_bx^i, NS_by^i, NS_bz^i / NS_tx^i, NS_ty^i, NS_tz^i: number of uniform mesh subsections of the bottom/top of the ith layer in the x, y, z direction.

Note 2. x_bj^i / x_tj^i (similarly for y and z): starting position of the jth uniform mesh subsection in the x, y, z direction for the bottom/top of the ith layer.
Note 3. N_bxj^i / N_txj^i (similarly for y and z): number of divisions of the jth uniform mesh subsection in the x, y, z direction for the bottom/top of the ith layer.
Note 4. The top section and bottom section should be meshed for each layer, and the number of top mesh divisions should be equal to the number of bottom mesh divisions.
Note 5. For open boundary problems, such as antenna problems, one needs to design and place artificial absorbing material inside the outermost boundary.
Note 6. When there is symmetry in the problem, one can take advantage of the fact to reduce the overall problem size.

Surface Boundary Conditions: (xmin, xmax, ymin, ymax, z)i

The 2 points are the diagonal of a rectangle for the ith x-y surface.

  1                      Number of connected PEC domains
  1                      Number of PEC subdomains per connected PEC domain
  -2.475e-3 2.475e-3     (xmin, xmax)1: microstrip
  -200e-3 20e-3          (ymin, ymax)1
  1.6e-3                 z
  1                      Add edge above PEC domain (default)

Material Allocations and Properties: (xmin, xmax)i, (ymin, ymax)i, and (zmin, zmax)i define the diagonal of a cube containing the given material, followed by Re(εri), Im(εri) and Re(μri), Im(μri).

  11                     number of materials
  -27.45e-3 27.45e-3     (xmin, xmax)1
  -182.55e-3 42e-3       (ymin, ymax)1
  0 1.6e-3               (zmin, zmax)1
  2.54 0.0               Re(εr1), Im(εr1): substrate
  1.0 0.0                Re(μr1), Im(μr1)
  -27.45e-3 27.45e-3     (xmin, xmax)2
  -182.55e-3 42e-3       (ymin, ymax)2
  1.6e-3 22.4e-3         (zmin, zmax)2
  1.0 0.0                Re(εr2), Im(εr2): free space
  1.0 0.0                Re(μr2), Im(μr2)

Material Allocations and Properties (continued)

  -39.45e-3 39.45e-3     (xmin, xmax)3
  -202.55e-3 54e-3       (ymin, ymax)3
  22.4e-3 34.4e-3        (zmin, zmax)3
  1.0 -5.0               Re(εr3), Im(εr3): AA for free space
  1.0 -5.0               Re(μr3), Im(μr3)
  -39.45e-3 -27.45e-3    (xmin, xmax)4
  -202.55e-3 54e-3       (ymin, ymax)4
  0 1.6e-3               (zmin, zmax)4
  2.54 -8                Re(εr4), Im(εr4): AA for substrate
  1 -3.14961             Re(μr4), Im(μr4)
  27.45e-3 39.45e-3      (xmin, xmax)5
  -202.55e-3 54e-3       (ymin, ymax)5
  0 1.6e-3               (zmin, zmax)5
  2.54 -8                Re(εr5), Im(εr5): AA for substrate
  1 -3.14961             Re(μr5), Im(μr5)
  -27.45e-3 27.45e-3     (xmin, xmax)6
  -202.55e-3 -182.55e-3  (ymin, ymax)6
  0 1.6e-3               (zmin, zmax)6
  2.54 -8                Re(εr6), Im(εr6): AA for substrate
  1 -3.14961             Re(μr6), Im(μr6)
  -27.45e-3 27.45e-3     (xmin, xmax)7
  42e-3 54e-3            (ymin, ymax)7
  0 1.6e-3               (zmin, zmax)7
  2.54 -8                Re(εr7), Im(εr7): AA for substrate
  1 -3.14961             Re(μr7), Im(μr7)
  -39.45e-3 -27.45e-3    (xmin, xmax)8
  -202.55e-3 54e-3       (ymin, ymax)8
  1.6e-3 22.4e-3         (zmin, zmax)8
  1 -5                   Re(εr8), Im(εr8): AA for free space
  1 -5                   Re(μr8), Im(μr8)
  27.45e-3 39.45e-3      (xmin, xmax)9
  -202.55e-3 54e-3       (ymin, ymax)9
  1.6e-3 22.4e-3         (zmin, zmax)9
  1 -5                   Re(εr9), Im(εr9): AA for free space

Material Allocations and Properties (continued)

  1 -5                   Re(μr9), Im(μr9)
  -27.45e-3 27.45e-3     (xmin, xmax)10
  -202.55e-3 -182.55e-3  (ymin, ymax)10
  1.6e-3 22.4e-3         (zmin, zmax)10
  1 -5                   Re(εr10), Im(εr10): AA for free space
  1 -5                   Re(μr10), Im(μr10)
  -27.45e-3 27.45e-3     (xmin, xmax)11
  42e-3 54e-3            (ymin, ymax)11
  1.6e-3 22.4e-3         (zmin, zmax)11
  1 -5                   Re(εr11), Im(εr11): AA for free space
  1 -5                   Re(μr11), Im(μr11)

Note 1. Any 2 diagonal points can be used to define a cube whose material property is homogeneous.
Note 2. Artificial absorbers (AA) for free space and for any arbitrary material have to be chosen by the user. The formulas can be found in Dr. Yook's Ph.D. thesis.

Field Interpolation: define 2 end points (xmin, ymin, zmin)i and (xmax, ymax, zmax)i

  70                         Number of interpolation points.
  0.001e-3 -162e-3 0.801e-3  (xmin, ymin, zmin)1
  0.001e-3 19.9e-3 0.801e-3  (xmax, ymax, zmax)1

Excitation Points: define the source as a rectangular patch, (xcenter, ycenter)i and (Δx, Δy)i

  0 -177.35e-3           (xcenter, ycenter)1: center of the source
  6e-3 5.2e-3            (Δx, Δy)1: width of the source

FEM input file: file name (hybrid-feminput.dat)

  1e-3                   Tolerance of the integration routine
  1e-5                   Tolerance of the iterative solver
  50000                  Maximum number of iterations

C.6 User Manual - MoM

In this section, a brief description of the input files for the MoM simulation is given.

Figure 19: Geometry and parameters of the patch antenna (patch, equivalent magnetic source, microstrip, ground plane).

Figure 20: Dimension parameters of the slot (substrate, ground plane).

MoM Part: file name (hybridMoM.dat)

  5         Nb: number of basis functions on the patch.
  3         Nap: number of basis functions on the slot.
  2.54      εrb: Re(εr) of the microstrip substrate
  1.6e-3    db: substrate thickness of the microstrip
  40e-3     Lp: patch length
  30e-3     Wp: patch width
  0 0       Xos, Yos: center position of the patch
  12e-3     Lap: slot length
  1.1e-3    Wap: slot width
  20e-3     Ls: distance between the center of the slot and the microstrip end
  4.95e-3   Wf: width of the microstrip
  2.225e9   freq: simulation frequency

MoM Part: file name (zmnin.dat)

  1e-5      Integration tolerance
  1000      Maximum number of divisions for integration
  1         Integration parameter: default
  150       Truncate the integration range at 150 k0 for ∞
  3         Use 3 λg for the input traveling wave in the microstrip
  5         Integration parameter: default

This Appendix has been excerpted from Juiching Cheng's Ph.D. Dissertation, Chapter 6.

D Application of Parallel Computing in MoM

D.1 Introduction

Due to advances in the semiconductor industry, the price of high performance processors has fallen rapidly, making it economical to build parallel computers with large numbers of processors. The need for high performance computing beyond what a single processor can achieve also drives the development of parallel computing technology. During the past decade we have seen many commercial parallel computers made by IBM, SGI and Intel, to name a few. Parallel computers are no longer the property of government or university research laboratories, but a reality. However, one obstacle hinders their widespread adoption: to tap into the power of parallel computing, the user has to modify his program according to the programming interface provided by the vendor, and every time the user changes parallel computers, he needs to modify his program again. There was no common programming interface among parallel computers until the Message Passing Interface (MPI) was standardized in 1994. Since then, MPI has been widely adopted by the computer industry.

To utilize parallel computers, a programmer first needs to design a parallelization model for his program. The idea of parallelization is to divide a single task into many independent jobs. A simple and easy model to implement is the Single Program Multiple Data (SPMD) model. In this model, different data sets are processed by the same program on different processors. These data sets have to be independent of each other, otherwise they cannot be processed in parallel. One simple example in electromagnetic problems is the frequency response. In the FEM or MoM, the computation process at one frequency point does not affect the outcome at other frequencies. Thus, we can simply submit the whole computation process of one frequency point to one processor. However, this may not achieve the best efficiency in terms of processor utilization. For instance, if there are 8 frequency points and 7 processors, when the last frequency point is computed on one processor, the other 6 processors will be idling. To achieve better efficiency, it is necessary to go inside the computation process to seek out other independent operations which can be parallelized.

Many numerical techniques for solving electromagnetic problems have utilized parallel computers. In particular, the FDTD method has attracted a lot of parallelization effort due to its computer-intensive nature; comprehensive studies of its implementation, detailed descriptions of parallelized MoM codes, and comparisons of linear equation solvers for integral-equation formulations can be found in the literature.

In the following sections, two processes in the MoM are parallelized using MPI. One is the impedance

matrix fill process; the other is the FWT process. The details of the implementation and measurements of the turnaround time are included.

D.2 Parallelization Model of MoM Matrix Fill Process

The formulation of the MoM always leads to a matrix equation of the form

    Ax = B,    (1)

where A is a two-dimensional matrix, B is a known column vector, and x is the unknown column vector we want to find by direct inversion of A or by an iterative linear equation solver. The elements of the matrix A are often the result of numerical integration of two basis functions with an integration kernel. The following equation is an example in one-dimensional space:

    a_ij = ∫∫ f_i(x) g(x, x') f_j(x') dx dx',

where a_ij is the element of the matrix at the i-th row and j-th column, f_i(x) and f_j(x') are the basis functions, and g(x, x') is the kernel. Depending on the form of and the distance between the basis functions, the time needed to perform the numerical integration of one element may vary greatly. To fully utilize the processors of a parallel computer, a dynamical job allocation scheme is necessary. As shown in Fig. 21, one processor is reserved as a dispatcher to send jobs to and receive results from the workers. The other processors serve as workers to perform the numerical computation. Four types of messages are passed between the dispatcher and the workers:

FREE: sent by a worker to inform the dispatcher that it is ready to accept new jobs;
RESULT: sent by a worker to the dispatcher, containing the result of a numerical integration;
TERMINATION: sent by the dispatcher to inform a worker that all jobs have been sent;
JOB: sent by the dispatcher, containing the information of a job.

The procedures performed by the dispatcher are the following:

1. create a list of the unique elements,
2. wait for messages from the workers,
3. if the message is a RESULT message, store the result; otherwise the message is a FREE message and a JOB message is sent to the worker,
4. repeat steps 2 and 3 until all jobs are sent and returned,
5. send a TERMINATION message to inform each worker to quit.

For the worker, the procedures are:

1. send a FREE message to the dispatcher,
2. wait for messages from the dispatcher,
3. if it is a JOB message, perform the numerical integration and return the result; otherwise it is a TERMINATION message and the worker quits,
4. go to step 2.

[Diagram: dispatcher connected to several workers by command paths and data paths. Dispatcher: sends commands to workers, receives data from workers, stores data in the impedance matrix. Worker: receives commands from the dispatcher, performs the computation, sends the result to the dispatcher.]

Figure 21: A dynamical allocation scheme for the matrix fill process.

Fig. 22 shows the flow chart of the above process. The dashed lines in the figure indicate the communication paths. Two kinds of processes, dispatcher and worker, are executed concurrently. First, the dispatcher process sorts out the unique matrix elements and builds a job list after it starts and reads in the necessary parameters. Then it enters a loop waiting for messages from the workers. The loop terminates only after all the jobs are sent and all the data are collected from the workers. Inside the loop, the dispatcher checks the messages from the workers. If the message contains data, the dispatcher will store it and then wait for new messages. If the message is a FREE message, the dispatcher checks whether all the jobs have been sent. If not, the dispatcher will send a job to the worker which sent the message. If all jobs have been sent, the

[Flow chart panels for the dispatcher and worker processes.]

Figure 22: The flow chart of the dynamical allocation scheme for the matrix fill process.

dispatcher will send a TERMINATION message to the worker. If all workers have been terminated, the dispatcher exits the loop, outputs the results and quits. If there are still active workers, the dispatcher continues waiting for new messages. The worker process, after it starts and reads in the necessary parameters, sends a FREE message to the dispatcher to indicate that it is ready to accept a new job. Then it waits for a reply from the dispatcher. If the reply is a TERMINATION message, the worker quits. Otherwise, the message contains a job and the worker starts to execute it. After the worker has finished the job, the result is sent back to the dispatcher via a RESULT message. After sending the RESULT message, the worker sends a FREE message to the dispatcher and waits for the reply from the dispatcher again. A short program section is included in Appendix E.1 to reveal the details of the implementation.

The turnaround time was measured as a function of the number of processors. Here the "turnaround time" means the elapsed time from the start of the program until completion. The measurement was performed on a 48-processor IBM SP2 machine. The theoretical result is curve-fitted to the experimental result according to the following formula:

    time(i) = P + Comm x n + Comp/(i - 1),  for i > 1,    (2)

where
    time(i): turnaround time,
    P: preprocessing time,
    Comm: communication time per message,
    Comp: total computation time of the jobs,
    i: number of processors,
    n: number of transmitted messages.

Note that Comm includes the communication latency and the actual transmission time of one message. Since the message size is fixed, Comm is a constant. Let N be the total number of jobs. From the flow chart, there are N + i - 1 FREE, N JOB, N RESULT, and i - 1 TERMINATION messages. Thus, the total number of transmitted messages is 3N + 2(i - 1). When N is much larger than i, the total communication time is proportional to N. From (2), it is seen that when the number of processors increases, only the third term in the formula decreases. This means that the preprocessing time plus the total communication time governs the limit. The formula also indicates that the curve of the turnaround time should approximate a hyperbola when N is much larger than i.
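The shape this model predicts can be seen by evaluating (2) directly; the constants below are purely illustrative, not the fitted values from the measurement:

  #include <cstdio>

  int main()
  {
    /* illustrative constants (seconds): preprocessing, per-message
       communication time, and total computation time of the jobs */
    const double P = 20.0, Comm = 1.0e-3, Comp = 3600.0;
    const int N = 10000;               /* number of jobs */
    for (int i = 2; i <= 32; i *= 2)
    {
      double n = 3.0 * N + 2.0 * (i - 1);        /* transmitted messages */
      double t = P + Comm * n + Comp / (i - 1);  /* model (2) */
      std::printf("i = %2d processors: time = %8.1f s\n", i, t);
    }
    return 0;
  }

Since n is approximately 3N independently of i, the curve flattens toward P + Comm x n as i grows, which is the limit discussed above.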

The measured results show very good agreement with the above theoretical model. The improvement in turnaround time is larger when the number of processors is small. However, when the number of processors is large, only a small improvement is achieved because the preprocessing and communication time dominate. Note that the turnaround time for 2 processors is higher than for 1 processor. This is due to the higher preprocessing time of the parallelized program and the communication cost. Also note that, due to the programming model, one processor is always reserved for coordinating the other processors. When the communication cost is small, this processor will mostly be idle. It is possible to also utilize the idle time on this processor to perform numerical integration by more sophisticated programming.

D.3 Parallelization Model of FWT

The FWT process transforms the matrix derived in the MoM from one set of basis functions to another. The transform and its effect on the matrix are

    φ'_i = Σ_{j=1}^{N} p_ij φ_j,    (3)

    [Z'] = [P]^t [Z] [P],    (4)

where φ_j and φ'_i are the members of the bases Φ and Φ' respectively, N is the rank of the basis, [Z] and [Z'] are the MoM matrices evaluated in the bases Φ and Φ' respectively, and p_ij in (3) is the i-th row, j-th column element of the matrix [P] in (4). Let z'_ij be the i-th row, j-th column element of [Z']; from the above two equations we can derive

    z'_ij = Σ_{m=1}^{N} Σ_{n=1}^{N} p_mi z_mn p_nj,    (5)

which is the form of the single operation in the FWT. The FWT process fits the SPMD model, which is characterized by repeating the same operation, for example (5), on different data. Compared to the previous matrix-fill process, each operation of the FWT is shorter and of fixed length; however, the total number of operations is much larger. If the same algorithm as the previous one were used, the overhead of communication would be too great. Instead, a fixed job-allocation scheme is devised (its implementation is given in Appendix E.2). At the beginning, each processor computes the job list and divides it equally by the total number of processors. Each processor can determine its share of jobs from its "rank"¹. The only communication that takes place is when each processor sends its final computed results to one designated processor for post-processing. The procedures performed by each processor are:

1. create a list of the operations,
2. divide the list equally into n shares, where n is the total number of processors,
3. use the rank of the processor as an index to determine which share belongs to itself,
4. perform the FWT,

¹ A terminology used by MPI to number the processors. If there are n processors, the ranks of the processors will be 0, 1, 2, ..., n - 1.

5. send the computed result to the processor with rank equal to 0.

A short program section is included in Appendix E.2 to reveal the details of the implementation. The turnaround time was again measured as a function of the number of processors; the definition of "turnaround time" is the same as in the previous section, and the measurement was performed on the same parallel computer. The theoretical result is curve-fitted to the experimental result according to the following formula:

    time(i) = P + Tran x N + L x i + Comp x N/i,    (6)

where
    time(i): turnaround time,
    P: preprocessing time,
    Tran: actual transmission time of one unit of data,
    L: communication latency,
    Comp: computation time per matrix element,
    i: number of processors,
    N: number of unique matrix elements.

Since each processor sends its computed results only once, the total latency time equals L x i. Compared to the other terms in (6), the total latency is negligible, since the latency of the SP2 is only 40 μs. As in the previous section, when the number of processors increases, only the last term in the formula decreases, while the first and second terms remain the same. This means that the preprocessing time and the total communication time set the limit. The formula also indicates that the curve of the turnaround time should be a hyperbola. The measured results show very good agreement with the above theoretical model. The improvement in turnaround time is larger when the number of processors is small. However, when the number of processors is large, only diminishing returns are obtained because the preprocessing and communication time dominate.

D.4 Conclusion

In this section, the matrix-fill process and the FWT in the MoM were parallelized. The measured improvement is consistent with the theoretical model of the turnaround time. Another bottleneck not addressed in this chapter is the process of solving (1). When the size of the matrix A is large, direct inversion of A becomes impossible, and iterative solvers need to be used. An iterative solver is by nature sequential: each iteration relies on the result of the last iteration. In this dissertation, a bi-conjugate gradient [?] iterative solver is used when the matrix size is large. In this solver, the most time-consuming part is the multiplication of a matrix with a column vector. Every iteration produces an improved guess of the solution vector. Thus, the communication overhead of

at least one solution vector in every iteration is unavoidable. When the dimension of the matrix is large, this communication overhead outweighs the gain obtained by parallelizing the linear solver, since communication is much slower than computation. This problem still needs to be addressed if further improvement in the speed of MoM is required.

50
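To make this per-iteration communication concrete, the following C++ sketch shows a row-partitioned matrix-vector product of the kind that dominates each bi-conjugate gradient iteration; after the local products, every rank must obtain the complete updated vector. This is a minimal illustration under assumed data layouts, not the solver actually used in this work.

#include <mpi.h>
#include <vector>

/* Row-partitioned product y = A*x inside an iterative solver.  Each
   rank owns local_rows rows of A; after computing its piece of y it
   must gather the full vector before the next iteration, which is
   the per-iteration communication discussed above. */
void parallel_matvec(const std::vector<double>& A_local, /* local_rows x n */
                     std::vector<double>& x,             /* full vector */
                     int local_rows, int n,
                     std::vector<int>& counts,           /* rows per rank */
                     std::vector<int>& displs)           /* row offsets */
{
    std::vector<double> y_local(local_rows, 0.0);
    for (int r = 0; r < local_rows; r++)
        for (int c = 0; c < n; c++)
            y_local[r] += A_local[r * n + c] * x[c];

    /* every rank receives the complete result: at least one full
       solution vector is exchanged per iteration */
    MPI_Allgatherv(y_local.data(), local_rows, MPI_DOUBLE,
                   x.data(), counts.data(), displs.data(),
                   MPI_DOUBLE, MPI_COMM_WORLD);
}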

This Appendix has been excerpted from Juiching Cheng's Ph.D. Dissertation, Appendix C.

E The Implementation of Parallel Programming in C++

The following two sections show the implementation of the dynamic and static job allocation schemes in the C++ language. The definitions of the MPI functions can be found in the MPI standard documents.

E.1 Dynamic job allocation

int rank, process_size;
mpidata mdata;  /* common buffer for MPI messages */

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &process_size);

if (rank==0) {  /* Dispatcher Process */
    int send_count=0, receive_count=0, actual_count=0;
    /* status of the worker processors: 1 = told to end */
    int *process=new int[process_size-1];
    for (int i=0;i<process_size-1;i++) process[i]=0;
    intpair* ip;  /* pointer to the job list */
    int finished=0;
    do {
        MPI_Status status;
        /* receive a message from any worker */
        MPI_Recv(&mdata, sizeof(mdata), MPI_BYTE, MPI_ANY_SOURCE,
                 MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        int sent=0;
        switch (status.MPI_TAG)  /* check the message type */
        {
        case TAG_REQUEST:        /* request for a job */
            if (finished) {      /* inform the worker to end */
                MPI_Send(&mdata, sizeof(mdata), MPI_BYTE, status.MPI_SOURCE,

51

                         TAG_END, MPI_COMM_WORLD);
                process[status.MPI_SOURCE-1]=1;
                sent=1;
            };
            while (!finished && !sent)  /* send a job to the worker */
            {
                mdata.job=ip[send_count];
                mdata.i=send_count++;
                MPI_Send(&mdata, sizeof(mdata), MPI_BYTE, status.MPI_SOURCE,
                         TAG_DATA, MPI_COMM_WORLD);
                actual_count++;
                /* count: total number of jobs, set elsewhere */
                if (send_count>=count) { finished=1; };
                sent=1;
            };
            break;
        case TAG_DATA:  /* returned results */
            ip[mdata.i].v=mdata.v;
            receive_count++;
            break;
        };
    } while (receive_count<actual_count || finished==0);
    /* inform the workers that never received the end message */
    for (int i=1;i<process_size;i++) {
        MPI_Status status;
        if (process[i-1]==0) {
            MPI_Recv(&mdata, sizeof(mdata), MPI_BYTE, i, TAG_REQUEST,
                     MPI_COMM_WORLD, &status);
            MPI_Send(&mdata, sizeof(mdata), MPI_BYTE, i, TAG_END,
                     MPI_COMM_WORLD);
        }
    };
    /* output the results */

52

} else {  /* Worker Process */
    MPI_Status status;
    /* keep requesting jobs until the dispatcher answers with TAG_END */
    while ( MPI_Send(&mdata, sizeof(mdata), MPI_BYTE, 0, TAG_REQUEST,
                     MPI_COMM_WORLD),
            MPI_Recv(&mdata, sizeof(mdata), MPI_BYTE, 0, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status),
            status.MPI_TAG==TAG_DATA )
    {
        /* compute the job stored in mdata.job */
        /* store the result in mdata.v */
        /* return the result in mdata.v */
        MPI_Send(&mdata, sizeof(mdata), MPI_BYTE, 0, TAG_DATA,
                 MPI_COMM_WORLD);
    };
};
MPI_Finalize();

E.2 Static Job Allocation

int size;          /* total number of jobs */
int rank, process_size;
mpidata mdata;     /* common buffer for MPI messages */

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &process_size);

/* compute the number of jobs for each processor */
int count=size/process_size;
int remain=size % process_size;  /* remaining jobs */
int *recvcount, *displs, offset;
if (rank==0) {
    /* processor 0 is also in charge of collecting the results,
       so it needs to prepare the place to store them */

53

    recvcount=new int[process_size];
    displs=new int[process_size];
    for (int i=0; i<process_size; i++) {
        if (i<remain) {
            displs[i]=i*(count+1)*sizeof(intpair);
            recvcount[i]=(count+1)*sizeof(intpair);
        } else {
            displs[i]=(i*count+remain)*sizeof(intpair);
            recvcount[i]=count*sizeof(intpair);
        };
    };
};
/* adjustment for the remaining jobs */
if (rank<remain) {
    count++;
    offset=count*rank;
} else {
    offset=count*rank+remain;
};
/* compute the jobs */
/* return the results */
MPI_Gatherv(&(z.ia[offset]), count*sizeof(intpair), MPI_BYTE, &(z.ia[0]),
            recvcount, displs, MPI_BYTE, 0, MPI_COMM_WORLD);
if (rank==0) {
    /* processor 0 outputs the total results */
};

54
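The listings above use the message tags TAG_REQUEST, TAG_DATA and TAG_END, the types mpidata and intpair, and the result array z.ia without declaring them, and they assume MPI_Init has already been called; those pieces were not part of the excerpt. The following is a minimal sketch of what the missing declarations might look like, given purely as an assumption so that the fragments read as self-contained code:

/* Assumed declarations -- not part of the original excerpt. */
enum { TAG_REQUEST, TAG_DATA, TAG_END };  /* message types */

struct intpair {     /* one job and its computed value */
    int    job;      /* job description, e.g. a packed index pair */
    double v;        /* computed matrix element */
};

struct mpidata {     /* common message buffer */
    int     i;       /* index of the job in the job list */
    intpair job;     /* the job itself */
    double  v;       /* the returned result */
};

With these in place, the code would typically be compiled with an MPI-aware C++ compiler wrapper and launched through the installation's job launcher; the exact commands are site-specific.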

[Plot: impedance matrix fill, 240 matrix elements; turn-around time versus number of CPUs (1 to 30), measured points and theoretical curve.] Figure 23: The measured turn-around time of the impedance matrix fill process. The theoretical result is plotted by curve-fitting the measured data to the theoretical model of the matrix-fill process.

55

[Diagram: parallelizing model of the fast wavelet transform, showing the data paths; the job list is evenly divided by N, each share is sent to a CPU, each CPU performs the FWT, and the results are gathered from each CPU.] Figure 24: A static allocation scheme for the FWT process.

56

[Flow chart: Worker 0, Worker 1, ..., Worker n.] Figure 25: The flow chart of a static allocation scheme for the FWT process.

57

[Plot: fast wavelet transform, 3568 matrix elements; turn-around time versus number of CPUs (2 to 16), measured points and theoretical curve.] Figure 26: The measured turn-around time of the FWT process. The theoretical result is plotted by curve-fitting the measured data to (6).

58

F Publications

References

[1] J.-G. Yook, J. Cheng, D. Chen and Linda Katehi, "Parallel Electromagnetic Solvers for High Frequency Antenna and Circuit Design," DoD HPCMP User Group Meeting, San Diego, CA, 1997.

[2] J.-G. Yook, J. Cheng and Linda Katehi, "Parallel Electromagnetic Solvers for High Frequency Antenna and Circuit Design," DoD Mission Success Story from High Performance Computing.

[3] J.-G. Yook, "Electromagnetic Modeling of High-speed High-frequency Interconnects," Ph.D. Dissertation, The University of Michigan, Ann Arbor, Michigan, 1996.

[4] J. Cheng, "Theoretical Modeling of MMICs Using Wavelets, Parallel Computing and a Hybrid MoM/FEM Method," Ph.D. Dissertation, The University of Michigan, Ann Arbor, Michigan, 1998.

[5] G. Ponchak, D.-H. Chen, J.-G. Yook and Linda Katehi, "Characterization of Plated Via Hole Fences for Isolation between Stripline Circuits in LTCC Packages," IEEE MTT-S 1998 Symposium.

[6] E. Yasan, J.-G. Yook and Linda Katehi, "Generalized Method for Including Two Port Networks in Microwave Circuits Using the Finite Element Method," ACES 97 (presented).

[7] E. Yasan and Linda Katehi, "An FEM Based Method on Full Wave Characterization of Distributed Circuits Including Linear and Nonlinear Elements," IEEE AP-S/URSI 1998 Symposium.

[8] J.-G. Yook, D.-H. Chen, J. Cheng and Linda Katehi, "Electromagnetic Solvers for High Frequency Circuits," DoD HPCMP Presentation, Nov. 1997.

[9] D.-H. Chen, J.-G. Yook, and Linda Katehi, "Electromagnetic Solvers for High Frequency Circuits," DoD HPCMP Presentation, Nov. 1998.

[10] D.-H. Chen and Linda Katehi, "Parallel Electromagnetic Solvers for High Frequency Antenna and Circuit Design," DoD HPCMP Progress Report, Jan. 1999.

59

"DoD Mission Success Story from High Performance Computing: 1998 ed." PARALLEL ELECTROMAGNETIC SOLVERS FOR HIGH FREQUENCY ANTENNA AND CIRCUIT DESIGN J.-G. Yook, J. Cheng and L. Katehi Radiation Laboratory Department of Electrical Engineering and Computer Science The University of Michigan, Ann Arbor, MI 48109-2122, U. S. A. Tel: 313-764-0608, Fax: 313-647-2106 DoD partner: Barry S. Perlman, Army Communications - Electronics Commands. Computer Resource: IBM SP2 [MHPCC] Research Objective: The purpose of this effort is development of linearly scalable parallel electromagnetic solvers for large array antenna and complicated microwave/millimeter-wave circuits which have not been successfully characterized by other means. Fast and accurate characterization of these electromagnetic structures are of critical importance for a successful DoD mission in digitizing the battlefield and developing secure communication systems. Methodology: For parallel electromagnetic tools, we have developed wavelet-based method of moment (MoM) and tetrahedral finite element method (FEM) codes for arbitrary three-dimensional structures, and a message passing interface (MPI) has been utilized for parallelization on distributed memory parallel computers. Furthermore, fast wavelet transform and thresholding techniques are applied to achieve accelerated computation of the MoM solution. The linear scalable performance of the parallelized MoM and FEM is also investigated for real time design and optimization of the large array antenna and high frequency circuits. Results: Preliminary results of the parallel impedance matrix filling and fast wavelet transform for MoM and task parallelization strategy for FEM show linearly scalable performance improvement. This truly scalable parallel MoM and FEM code performs successfully due the minimal communication overhead between the computing nodes and is not subject to the bandwidth of the network or switches. The number of unknowns of the problem ranges from 15,000 to 20,000 in the MoM case and from 150,000 to 200,000 in the FEM case. Figures below shows three-dimensional electric field distributions at far region from the 6x6 rectangular patch array antenna which are computed on parallel computers very efficiently. Significance: Design and optimization of large electromagnetic structures, such as array antennas and high density circuits and packages, are of critical importance in the development of the advanced sensors and tactical communication systems. Through our research, we have demonstrated linearly scalable parallel strategies for various numerical electromagnetic codes and solved various class of problems which were not possible or extremely time consuming with conventional method and computers. This project has been funded by NASA Lewis Research Center, Army Research Office and the DoD High Performance Computing Program., * ui - -- an - Figure: Far field radiation patterns for 6x6 rectangular patch array antenna. Authors: Jong-Gwan Yook, Juiching Cheng and Linda P.B. Katehi in the University of Michigan. Tel: 313-764-0608.

"Contributions to DoD Mission Success from High Performance Computing: 1996 ed." CO-FIRED CERAMIC PACKAGE FOR A KA-BAND MIMIC PHASE SHIFTER Jong-Gwan Yook and Linda P. B. Katehi Radiation Laboratory Department of Electrical Engineering and Computer Science The University of Michigan, Ann Arbor, MI 48109-2122, U. S. A. Tel: 313-764-0502, Fax: 313-747-2122 Email:yookjong@engin.umich.edu DoD partner: J. Harvey, Army Research Office, RTP, NC. Computer Resource: IBM SP2 [MHPCC] Research Objective: The purpose of this effort is the full-wave characterization of Microwave/Millimeter wave packages for discrete MMIC components. While these packages find widespread use in MMICs, their modeling has been limited by the large size and complexity of the problem leading into long design cycles, low yield and high cost. Methodology: To model the MMIC package we have e parallelized a full-wave finite element (FEM) code on the distributed memory parallel computer, IBM SP2, in Maui High performance Computing Center (MHPCC). Among the many parallelization schemes, we have selected "task parallelization" based on the considerations that FEM is a frequency domain technique and each frequency point needs to be computed independently in order to obtain the complete package response over the desired frequency spectrum. Results: Preliminary results of the "task parallelization" strategy show linearly scalable performance improvement. This truly scalable parallel FEM code performs successfully due the minimal communication overhead between the computing nodes and it is not subject to the bandwidth of the network or switches. The number of unknowns of the problem ranges from 150,000 to 200,000 and the global matrix equations are simultaneously built and solved at all frequency points within an hour. The same code would normally require 40 hours of CPU time if it were run in serial computer. Significance: Hermetic packages are frequently utilized in microwave integrated circuits to provide protection from a hostile environment, reduced electromagnetic interference and minimal radiation leakage. A variety of packages based on a cofired ceramic technology have developed for use in high frequency application by DoD. The presented effort has allowed, for the first time, the complete and accurate high-frequency characterization of a MMIC package and the understanding of the electromagnetic interference it produces in the form of parasitic resonances. This package has been developed by Hughes at Torrance under a contract from NASA Lewis Research Center in Cleveland, OH. ' Electic Field stribution in Package at 29.5 G-0 -10 -20 -30 s 0 -40 Via Holo ' By -50 -50, A- t-80 Ground Plane Figure: The 18 - 40 GHz MMIC package for phase shifter chip and electric field distribution in the package. Authors: Jong-Gwan Yook and Linda P.B. Katehi in the University of Michigan. Tel: 313-747-1796.

"Contributions to DoD Mission Success from High Performance Computing: 1996 ed." W-BAND COPLANAR WAVEGUIDE PROBE STRUCTURES Emmanouil M. Tentzeris, Jong-Gwan Yook, and Linda P. B. Katehi Radiation Laboratory Department of Electrical Engineering and Computer Science The University of Michigan, Ann Arbor, MI 48109-2122, U. S. A. Telephone: 313-764-0502 Fax: 313-747-2122 Email: yookjong@engin.umich.edu DoD partner: J. Harvey, Army Research Office, RTP, NC. Computer Resource: Convex SPP-1 [NCCOSC San Diego] Research Objective: Significant attention is being devoted now-a-days to the analysis and design of W-band waveguide structures which are used either to probe the modal propagation inside the waveguides or as active-element mounts. This effort aims at the full-wave characterization of these highly complex waveguide probe structures. Methodology: We have applied the Finite Difference Time Domain (FDTD) technique in the calculation of the S-parameters of a W-band coplanar waveguide-to-rectangulat waveguide probe. The FDTD code has been parallelized on the Convex SPP-1 [NCCOSC] using "task parallelization". Results: A parametrical study has been performed in an effort to maximize the coupling between the probe and the waveguide over the widest possible frequency range. A waveguide absorber based on an analytical Green's function has been developed to minimize the reflections over a wide band of frequencies. As an application, a membrane coplanar waveguide-to-rectangular waveguide probe has been characterized and designed for less than -20 dB return loss from 75 GHz to 115 GHz. The mesh size is 480x 477x 52 and it would require about 250 hours of CPU time if the FDTD code were to run on an HP 735 Workstation, and 72 hours in a Cray C90 supercomputer. The parallelized version of the same program on Convex SPP-1 requires about 31 hours CPU time for completion. Significance: The presented effort has allowed, for the first time, the complete and accurate highfrequency characterization and design of a W-band back-to-back waveguide probe configuration. Until recently, W-band waveguide probes for diode mounting or waveguide probing have only been designed experimentally through very long and costly design cycles. I E I att - 6000 ' dt on const-Z plane 0 0 5 — 10 NPUT - i20 1 5 -30 20 -40 — L -50 -60 -70 O=T 3-80 -90 45 -100 100 120 'Of[mm] 1=0 180 Figure: The membrane coplanar waveguide probe and the electric field distribution for the structure Authors: Emmanouil M. Tentzeris, Jong-Gwan Yook, and Linda P.B. Katehi in the University of Michigan. Tel: 313-747-1796

DoD HPC 1998 User Group Meeting (June 1-5th, 1998 at Rice University, Houston, TX)

Parallel Electromagnetic Solvers for High Frequency Antenna and Circuit Design

J.-G. Yook, D. Chen, J. Cheng and L. Katehi, Radiation Laboratory, Department of Electrical Engineering and Computer Science, The University of Michigan, Ann Arbor, MI 48109-2122. Tel: 734-764-0608, Fax: 734-647-2106. B. Perlman, US Army Communications-Electronics Command (CECOM), Command/Control & System Integration Directorate, AMSEL-RD-C2, Ft. Monmouth, N.J. 07703. Tel: 723-427-4883

Introduction

The purpose of this effort is the development of linearly scalable parallel electromagnetic solvers for large scale array antennas and complicated microwave/millimeter-wave circuits which have not been successfully characterized by other means. Fast and accurate characterization of these electromagnetic structures is of critical importance for a successful DoD mission in digitizing the battlefield and developing secure communication systems.

Methodology

For parallel electromagnetic tools, we have developed wavelet-based method of moments (MoM) and tetrahedral finite element method (FEM) codes for arbitrary three-dimensional structures, and a message passing interface (MPI) has been utilized for parallelization on distributed memory parallel computers. Furthermore, fast wavelet transform and thresholding techniques are applied to achieve accelerated computation of the MoM solution. The linearly scalable performance of the parallelized MoM and FEM is also investigated for real-time design and optimization of large array antennas and high frequency circuits.

Results

Preliminary results of the parallel impedance matrix filling and fast wavelet transform for MoM, and of the task parallelization strategy for FEM, show linearly scalable performance improvement. This truly scalable parallel MoM and FEM code performs successfully due to the minimal communication overhead between the computing nodes and is not subject to the bandwidth of the network or switches. The number of unknowns ranges from 15,000 to 20,000 in the MoM case and from 150,000 to 200,000 in the FEM case. The figures below show the three-dimensional electric field distributions in the far region of the 6x6 rectangular patch array antenna, computed very efficiently on parallel computers.

Significance

Design and optimization of large electromagnetic structures, such as array antennas and high density circuits and packages, are of critical importance in the development of advanced sensors and tactical communication systems. Through our research, we have demonstrated linearly scalable parallel strategies for various numerical electromagnetic codes and solved various classes of problems which were not possible, or were extremely time consuming, with conventional methods and computers. This project has been funded by NASA Lewis Research Center, the Army Research Office and the DoD High Performance Computing Program.

Figure 1: Far field radiation patterns for the 6x6 rectangular patch array antenna

PARALLEL ELECTROMAGNETIC SOLVERS FOR HIGH FREQUENCY ANTENNA/CIRCUIT DESIGN

Barry S. Perlman, Army Research Laboratory, Sensors and Electronics Directorate, Fort Monmouth, NJ 07703; Jong-Gwan Yook*, Juiching Cheng, Donghoon Chen, and Linda P. B. Katehi, Radiation Laboratory, The University of Michigan, Ann Arbor, MI 48109, yookjong@engin.umich.edu

Presented/Published in DoD HPCMP User Group Meeting, San Diego, CA, 1997

Abstract

In this paper, several different parallelization strategies for the method of moments (MoM) and the finite element method (FEM) are implemented on a distributed memory parallel computer (IBM SP2) having 48 CPUs. For portability of the parallel codes across various platforms, the standard message passing paradigm (MPI) has been adopted. Keeping in mind that a major bottleneck of parallelized codes is the interprocessor communication overhead, data independence is fully exploited for efficient parallelization, and as a result near linear scalability has been achieved.

I. Introduction

High-frequency radar systems for communication, detection and surveillance incorporate MMIC modules which are characterized by high density and substantial geometrical complexity. In most cases these modules are packaged for electrical and/or environmental protection, and the resulting 3D structures exhibit a considerable complexity that impedes design and influences electrical performance due to unwanted parasitics. The design of these complex three-dimensional monolithic circuits (MMIC modules), with hundreds of vias and lines interconnecting a large number of active and passive circuit components to a rather large number of radiating elements, has been a problem of critical importance to DoD. The design and characterization of these circuits requires numerical approaches which can fully characterize the excited electromagnetic fields. Among the various techniques, the method of moments has demonstrated superiority in solving radiation problems, but it has been hindered by long computational times and is limited, for this reason, to small circuits not exceeding more than a few components and interconnecting lines. In this paper, a novel technique based on the multiresolution expansion and the hybrid combination of the integral equation technique and the finite element method is parallelized to solve very large problems.

II. Parallelization Strategies

In this section, several different parallelization strategies for MoM and FEM are presented. In the former case, parallel MoM impedance matrix generation and parallel fast wavelet transform (FWT) are

1

[Diagrams: (a) dispatcher/worker command and data paths for the impedance matrix fill; the dispatcher sends commands to workers, receives data, and stores it in the impedance matrix, while each worker receives commands, performs the computation, and sends results back. (b) Static data paths for the fast wavelet transform; the matrix is evenly divided, each share is sent to a CPU, each CPU performs the FWT, and the transformed matrix is gathered.] Figure 1: Parallelization schemes for MoM: (a) Parallel impedance matrix filling. (b) Parallel fast wavelet transform.

implemented, while parallelization of the iterative linear equation solver and a frequency parallelization scheme are tested for the latter case.

A. Parallel MoM Matrix Element Generation

In the method of moments, each matrix element is produced by the integration of a source function and a test function with an integration kernel. When a closed-form representation is not available, numerical integration, which is time-consuming, must be used. However, the integration of each matrix element fits quite well the Single Instruction Multiple Data (SIMD) parallelization model, enabling the use of parallel computers to speed up the process. Because the computation time of each matrix element can vary over a wide range, an algorithm that dynamically assigns jobs to each CPU is devised, as illustrated in figure 1(a). In this algorithm, one CPU is reserved as a dispatcher to send jobs to and gather results from the other CPUs. Jobs are sent whenever there are free CPUs. To achieve a performance gain, the communication time between the CPUs must be much smaller than the computation time, which is true in most impedance matrix calculations.

B. Fast Wavelet Transform

In order to take advantage of multiresolution analysis [1], the MoM matrix is further transformed by the FWT to its equivalent representation in wavelet bases. The FWT also fits the SIMD model, but compared to the generation of the MoM matrix elements, the computation time for each element is much smaller and the total number of iterations is much higher. If the same algorithm as for the MoM matrix element generation were used, the communication overhead would be too high. Instead, another algorithm is devised to reduce the communication overhead, as shown in figure 1(b). In this algorithm, jobs are evenly divided among the CPUs. Only one communication is needed, when each CPU completes its entire job list and sends the results back to the designated CPU for further processing.

C. Task Parallelization

2

[Flow diagrams: (a) pre-processing (mesh generation), element matrix computation, matrix assembly Ax = b, parallelized BiCG linear equation solver, S-parameter output; (b) the same chain with the frequency loop f1, f2, ..., fn parallelized.] Figure 2: Parallelization schemes for FEM: (a) Parallelization of the linear equation solver (BiCG). (b) Task parallelization.

The generation of the matrix entries in the finite element method is in general a much easier and faster process compared to that of the MoM [2, 3]. Under most practical situations, FEM matrix generation takes less than 10% of the total FEM solution time. As a result, parallelization of the linear equation solver may result in an efficient method (see fig. 2(a)). On the other hand, owing to the frequency domain nature of the finite element method, each frequency point can be computed in parallel with minimal inter-process communication overhead, which can lead to a scalable FEM code, as shown in figure 2(b). In the following section, the performance of the two parallelization schemes for FEM is investigated and the speedup curves are presented.

III. Numerical Results

A. Parallelized MoM Code for Antenna Applications

In order to demonstrate the performance gain of the parallelization models discussed in the previous section, the impedance matrix of a 6 x 6 microstrip patch array antenna is calculated and solved after taking the FWT. The individual patch size, the dielectric constant of the substrate, and the thickness of the substrate are 59.5 mm x 50 mm, 2.62, and 1.59 mm, respectively. The rooftop function is used as the basis function for the field expansion. The number of cells is 16 x 16 on each patch, and the resulting total number of unknowns is 17280 for the whole array. The radiation patterns for Eθ and Eφ are calculated at 1.536 GHz. Figures 3(a) and (b) show the measured computation time versus the number of CPUs together with the theoretical result, for both the impedance matrix filling process and the FWT, using the models presented in Section II. As can be seen from the figures, the measured results match the theoretical prediction very well. The radiation patterns of the antenna are computed as shown in figure 4 for the Eθ- and Eφ-plane patterns. The thresholding is taken after

3

the FWT, and the resulting sparse matrix is solved with an iterative linear equation solver, such as the biconjugate gradient method.

[Plots: (a) Impedance Matrix Fill, 240 matrix elements: measured points and the theoretical curve Time(i) = P + Comm x N + Comp x N/(i-1), for i > 1, where P is the preprocessing time, Comm the communication time, Comp the computation time, i the number of CPUs, and N the number of matrix elements. Note: the turn-around time for 2 CPUs is higher than for 1 CPU; this is due to the higher preprocessing time of the parallelized program and the communication cost, and also because, in this programming model, one CPU is always reserved for coordinating the others, leaving only one CPU to perform the computation when two CPUs are used. (b) Fast Wavelet Transform, 3568 matrix elements: measured points and the theoretical curve Time(i) = O + Comm x N + Comp x N/i, where O is the overhead, which includes the extra function calls for measuring time and for parallel processing.] Figure 3: Parallelization performances for MoM: (a) Parallel impedance matrix generation, (b) Parallel fast wavelet transform.

B. Parallelized FEM Code for Packaging Applications

To measure the performance of the parallelized Bi-CG routine on a distributed memory machine, two problems of different size are solved. As shown in figure 5(a), the parallelization efficiencies are minimal, even though the larger problem size shows a slightly better result. With this observation, the task parallelization scheme is tested, as shown in figure 5(b), and reveals excellent results. The poor performance of the parallel Bi-CG routine is mainly due to heavy communication overhead. On the other hand, the linearly scalable performance of the task parallelization scheme stems from the minimal data dependency between the CPUs. As an application of the parallelized FEM code, we have characterized two typical microwave/millimeter-wave packages; the internal electric field distributions are calculated as presented in figure 6.

IV. Conclusions

In this paper, the single instruction multiple data (SIMD) model has been fully implemented for the finite element method and the method of moments using the MPI standard on a distributed memory machine. Due to the data independence between the processes, linearly scalable parallel codes are achieved. In particular, matrix element generation in MoM has been parallelized in view of its heavy computational burden. In the FEM code, a frequency parallelization scheme has been implemented after the observation that the performance of the parallelized linear equation solver was poor.

4
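The frequency ("task") parallelization of Section II.C amounts to striding the frequency loop by MPI rank, so that each processor assembles and solves its own frequency points with no communication until the S-parameters are collected. The following C++ sketch illustrates the idea; solve_fem_at() is a hypothetical placeholder, not a routine from the actual code.

#include <mpi.h>

/* Frequency ("task") parallelization: rank k handles frequency
   points k, k+nproc, k+2*nproc, ...  solve_fem_at() is assumed. */
void frequency_sweep(const double* freq, int nfreq)
{
    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    for (int k = rank; k < nfreq; k += nproc) {
        /* assemble and solve the full FEM system at freq[k]:
           solve_fem_at(freq[k]); */
    }
    /* rank 0 can gather the computed S-parameters afterwards */
}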

[Plots: Eθ and Eφ patterns for the 6 x 6 patch array versus θ in degrees, with and without thresholding.] Figure 4: Radiation patterns for the 6 x 6 rectangular patch array: (a) Eθ pattern, (b) Eφ pattern

Acknowledgment

This work was partially supported by ARL under contract QK8820. Also, the authors would like to thank the Maui High Performance Computing Center (MHPCC) and the University of Michigan Center for Parallel Computing (CPC), which is partially funded by NSF grant CDA-92-14296 and the Ford Motor Company, for the use of their computational facilities.

References

[1] Kazem F. Sabet, Jui-Ching Cheng and Linda P. B. Katehi, "Efficient Wavelet-based Modeling of Printed Circuit Antennas and Arrays," submitted to IEEE Transactions on Antennas and Propagation.

[2] Jong-Gwan Yook and Linda Katehi, "Characterization of MMIC Packages Using a Parallelized 3D FEM Code," 1996 ACES Symposium, Monterey, CA, pp. 954-961.

[3] Jong-Gwan Yook and Linda Katehi, "Parallelization of the Finite Difference Time Domain Code on Shared Memory Machines," PIERS 1996, Innsbruck, Austria.

5

[Plots: (a) turn-around time versus number of processors for the parallel Bi-CG routine, for M = 38109 and M = 76709 unknowns; (b) turn-around time versus number of processors for the frequency parallelization scheme.] Figure 5: Parallelization performances for FEM ("M" is the number of unknowns to be solved): (a) Parallel Bi-CG routine, (b) Frequency parallelization scheme.

[Field plots: a microstrip-line package and a via/substrate package with the computed internal fields.] Figure 6: Two different packaging structures and the electric field distributions inside the structures.

6

Characterization of Plated Via Hole Fences for Isolation Between Stripline Circuits in LTCC Packages

George E. Ponchak¹, Donghoon Chen², Jong-Gwan Yook², and Linda P. B. Katehi²
1. NASA Lewis Research Center, 21000 Brookpark Rd., MS 54/5, Cleveland, OH 44135
2. University of Michigan, 3240 EECS Building, Ann Arbor, MI 48109-2122

Abstract

Reduced coupling between adjacent striplines in LTCC packages is commonly accomplished by walls made of plated via holes. In this paper, a 3D FEM electromagnetic simulation of a stripline with filled via fences on both sides is presented. It is shown that the radiation loss of the stripline and the coupling between striplines increase if the fence is placed too close to the stripline.

Introduction

Smaller packages with more circuitry are required for the advanced RF systems being built today and planned for tomorrow. These new packages house a variety of high density circuits for data processing, biasing, and memory in addition to the RF circuits. While the size is being reduced and the complexity increased, the cost of the package must also decrease. To accomplish these contradictory goals, new packaging technologies are required. Low Temperature Cofired Ceramic (LTCC) is an ideal packaging technology. The material has a moderate dielectric constant, 4 < εr < 8, which permits wider strips and thus lower conductor loss than circuits on Si, GaAs, or Alumina. In addition, the loss tangent is on the order of 0.002 at 10 GHz, which yields an acceptably low dielectric loss. LTCC packages are comprised of many 0.1-0.15 mm thick ceramic layers with transmission lines on each layer. This increases the level of integration by allowing bias, digital routing, and RF transmission lines and interconnects to be built up in three dimensions. The multilayer character of these circuits leads to RF transmission lines which are not of microstrip type but of stripline type. Even with the high levels of integration LTCC offers, designers are required to decrease the spacing between striplines to meet the new size and cost requirements. In doing so, coupling between adjacent striplines severely limits the overall packaged circuit performance. As a solution to this problem, package designers have included filled via fences adjacent to the stripline to confine the electromagnetic fields around the center strip, partition the package and reduce electrical coupling [1-2].

[Diagrams: (a) stripline with a metal-filled via fence; (b) cross section with Δh, h, and PEC walls indicated.] Fig. 1 (a) Stripline with a continuous filled via fence on both sides. (b) Cross section.

In this paper, we utilize a 3D Finite Element Method (FEM) [3] to evaluate stripline structures in the vicinity of filled via fences. It is shown for the first time that filled via fences increase the radiation loss of the stripline and decrease the coupling between adjacent striplines. Design guidelines are given to minimize these parasitic effects which degrade overall performance.

Results

The cross section of the stripline structure is shown in Figure 1. Throughout this paper: the relative

Presented at ACES 97.

A GENERALIZED METHOD FOR INCLUDING TWO PORT NETWORKS IN MICROWAVE CIRCUITS USING THE FINITE ELEMENT METHOD

Eray Yasan, Jong-Gwan Yook and Linda P. Katehi, Radiation Laboratory, Dept. of Electrical Engineering and Computer Sci., The University of Michigan, Ann Arbor, MI 48109-2122, U.S.A. Tel: 313-764-0502, Fax: 313-747-2106, Email: eray@engin.umich.edu

Abstract

This paper discusses the methodologies required to introduce basic passive elements into an FEM-modelled microstrip circuit. Three different approaches have been investigated and are presented herein. The first two methods use the basic principles of circuit theory, while the third employs the so-called zeroth order approximation or volume current method. One of the EM-circuit methods uses the voltage-current relations applied at the passive element(s) in the circuit. This is done through the impedance of that element and requires a modification of the functional of the FEM equations. The second method uses the S-parameters of the passive element(s) to provide the required circuit relation. By using these methods, the effect of the presence of the passive element(s) on the microstrip circuit can be observed.

1 Introduction

The Finite Element Method (FEM) has been established during the past ten years as an accurate and versatile frequency-domain technique for passive circuit problems. Despite the capability of the technique to treat a broad variety of circuit geometries, it has been limited to only distributed elements that are mostly passive and linear. To make the technique applicable to more complete microwave and millimeter wave circuits, its capability needs to be extended to handle passive and active elements. Some studies of these issues have been made using different techniques such as FDTD, TLM and FEM [1]-[3]. The techniques presented herein are divided into circuit element methods and volume current methods. In the following, a short description of each technique is given.

1.1 Circuit Method # 1

This method uses circuit concepts such as the current-voltage (I-V) relations through the nodes connected to the element [6]. From circuit theory the following relations hold for the resistor, inductor and capacitor, respectively:

    V = RI,    V = jωLI,    I = jωCV.    (1)

In terms of field quantities the voltage and current are expressed in the following way:

    V = ∫ E · dl    (2)

    I = ∮ H · dl    (3)

where the integral is evaluated along the element for the voltage and around the element for the current flowing through the nodal points. Since a tetrahedral-based FEM is used throughout the analysis, lumped elements are located along the edges of the tetrahedra.

1.2 Circuit Method # 2

In the second circuit method, the S-parameters of the lumped elements are known and given to the system a priori. In order to use the (I-V) relations, the S-parameters are converted to Z-parameters. Referring to Fig. 1, the following relations hold:

    V1 = Z11 I1 + Z12 I2    (4)

    V2 = Z21 I1 + Z22 I2    (5)

In the microstrip case V1 is the voltage between the signal line and the ground. If this is a symmetric microwave circuit, then V1 is equal to V2. I1 and I2 are the currents flowing through the conductors towards the element, as shown in Fig. 1. In terms of field expressions, V1 and V2 are the line integrals of the electric field evaluated along the edges from the conductors to the ground. I1 and I2 are the closed line integrals of the magnetic field around the conductors.

1.3 Volume Current Method

The third method [7] uses the relation

    Ji = σE    (6)

in order to account for the lumped elements' contribution by introducing fictitious conductivity regions in the volume. The conductivity σ in these regions is given by

    σ = l/(ZL S)    (7)

where ZL, S and l represent the load impedance in ohms (Ω), the cross-sectional area, and the length of the load, respectively.
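As a hypothetical worked example of (7), not taken from the paper: a 100 Ω load of length l = 0.38 mm and cross-sectional area S = 0.01 mm² would be modeled by a fictitious conductivity σ = l/(ZL S) = (0.38 x 10⁻³ m)/(100 Ω x 0.01 x 10⁻⁶ m²) = 380 S/m.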

[Diagram: ports 1-1' and 2-2' with voltages V1 and V2 across the network [Z].] Figure 1: Two-port network with Z-parameters

2 FEM Formulation

Starting with Maxwell's equations, the following wave equation is weighted according to Galerkin's method and discretized:

    ∇ × ∇ × E - ω²με E = -jωμ Ji    (8)

where Ji is the impressed current source. Discretization is done in the conventional way using tetrahedral elements. The following weak form of Maxwell's equations is obtained:

    ∫∫∫ [(∇ × E) · (∇ × P) - ω²με E · P] dv = jωμ ∮ (H × P) · n ds - ∫∫∫ jωμ P · Ji dv.    (9)

Expanding the electric field as a summation of linear basis functions and choosing the weighting function as the same edge basis function give

    E = Σ_{i=1}^{6} a_i W_i    (10)

    P = W_j,  j = 1, ..., 6    (11)

where the a_i are the unknowns to be determined, and the W's are the first order edge-based Whitney functions. For circuit method #1 the first term on the right hand side of the FEM equation (9) is expressed in such a way that the effect of the lumped element is included. The resulting term in the FEM equations becomes

    -jωμ ∫∫ [ê12 δ(r - r₀)] [(∫ E · dl)/ZL] ds    (12)

where ê12 is the unit vector pointing from node 1 to node 2, along which the element is placed, and ZL is the load impedance. According to circuit method #2, having an additional set of equations from the I-V relations enables us to write another matrix equation having N unknowns, not necessarily equal to the number of additional equations. By careful examination, elimination of some of the

unknowns, especially those along the edges on which the element is located, by use of these equations decreases the size of the original FEM matrix. This matrix equation is solved using iterative matrix-solving algorithms. The third method uses the second term on the right hand side of the FEM equation (9) in order to introduce the effect of the lumped element into the equations. The term takes the form

    -∫∫∫ jωμ σ E · P dv.    (13)

Similar to the first method, this term only contributes to the diagonal elements of the FEM matrix. The most efficient way to solve these matrix equations is found to be the Pre-Conditioned Conjugate Orthogonal Conjugate Gradient (COCG) Method.

3 Results

A shielded symmetric two-port microstrip structure, shown in Fig. 2, was examined. Using circuit method #1 and the volume current method, the case of a 100 Ω resistor between the lines was studied. The S-parameters obtained from these two methods are shown in Fig. 3. For S11 the results agreed quite reasonably with the theoretical values (|S11| = |S22| = -6.02 dB), but for S12 they are degraded in the high frequency region. This might arise because of the circuit discontinuity at the element and also insufficient discretization. For circuit method #2, the same geometry with a PEC wire in place of the resistor is used (Fig. 4). The results are shown in Fig. 5 and are in good agreement with the regular FEM results. A variety of results involving complex arrangements of passive elements will be presented and discussed at the conference.

4 Conclusions

Three different approaches have been proposed and applied to the shielded two-port microstrip geometry. These approaches can be applied to similar geometries which include inductors and capacitors as well. Generalization of these approaches might lead us to analyze structures having active lumped elements and to utilize the harmonic balance technique with FEM.

Acknowledgement

This work has been partially supported by the Army Research Lab at Fort Monmouth and the Army Research Office under Contract # DAAH04-96-1-0390.

References

[1] T. Shibata and H. Kimura, "Computer-aided engineering for microwave and millimeterwave circuits using the FDTD technique of field simulations," Int. Journal of Microwave and

Millimeter-Wave Computer-Aided Engineering, Vol. 3, No. 3, pp. 238-250, 1993.

[2] V. A. Thomas, M. E. Jones, M. Piket-May, A. Taflove and E. Harrigan, "The use of SPICE lumped circuits as sub-grid models for FDTD analysis," IEEE Microwave Guided Wave Letters, vol. 4, pp. 141-143, May 1994.

[3] A. Jahnsen and V. Hansen, "Spectral analysis of multiport microstrip circuit with active and passive lumped elements," EMC, p. 1053, 1990.

[4] R. H. Voelker and R. Lomax, "A Finite Difference Transmission Line Matrix method incorporating a nonlinear device model," IEEE Trans. Microwave Theory Tech., vol. 38, pp. 302-312, March 1990.

[5] K. Guillouard, M.-F. Wong, V. Fouad Hanna, J. Citerne, "A new global Finite Element analysis of microwave circuits including lumped elements," Digest of the 1996 IEEE Microwave Theory Tech. Symposium, vol. 1.

[6] E. Yasan, J.-G. Yook and L. Katehi, "Modeling the interface between fields and devices using the Finite Element Method," Digest of 1996 Progress In Electromagnetics Research Symposium (PIERS).

[7] Jong-Gwan Yook, "Electromagnetic modeling of high-speed high frequency interconnects," Ph.D. Thesis, The University of Michigan, 1996.

[Diagram: shielded microstrip circuit geometry.] Figure 2: Shielded microstrip circuit with Z0 = 50 Ω, εr = 10.5, w = 0.34 mm, R = 100 Ω, h = 0.38 mm

[Plot: S-parameters in dB versus frequency (10-25 GHz); S11 and S12 for circuit method #1 (curves) and the volume current method (markers).] Figure 3: S-parameters of the shielded microstrip circuit with a resistor

[Diagram: shielded microstrip circuit with PEC walls.] Figure 4: Shielded microstrip circuit with a PEC wire between the conductors

[Plot: S-parameters versus frequency; S11 and S12 for circuit method #2 compared with the regular FEM.] Figure 5: S-parameters of the shielded microstrip circuit with a PEC wire

"accepted for ('RSI-98" An FEM Based Method on Full Wave Characterization of Distributed Circuits Including Linear and Nonlinear Elements Eray Yasan' and Linda P. Katehi Radiation Laboratory Dept. of Electrical Engineering and Computer Sci. The University of Michigan, Ann Arbor, MI 48109-2122, U.S.A. Tel: 313-764-0502, Fax: 313-747-2106, Email: eray@engin.umich.edu Theoritical characterization of microwave circuits including linear and nonlinear elements has been an interest in many ways using numerical techniques including Finite Element Method (FEM). Since the lenghts of elements connected to the microwave circuit are very short so that Kirchoff's las can be applied, the conventional scheme to find S parameters by adding extra line lengths is no more applicable. In this case the circuit is treated as a combination of inner (internal) ports where the linear or nonlinear elements are connected and the outer (external) ports where the input and output of the circuit are taken. Work has been done in time domain using "Compression Approach"(J. Kunish,M. Rittweger, S. Heinen and I. WolffProc. 21st European Microwave Conf.,1296-1301,1991). The linear passive part of the circuit is analyzed using the FEM. The ports are excited one at a time while the others are left open to derive the appropriate voltages and currents at all ports. The linear part of the circuit is thus characterized by the impedance matrix (Z-matrix). To reduce the computational cost, circuit symmetries are used and assuming circuit lossless, S parameters can be found using the simple relation Following the computation of scattering matrix, a simulation software such as HP EEsof Libra and connecting the linear or nonlinear elements to the inner ports yields the input and output S-parameters of the microwave circuit. As an example a microstrip line with a gap where a lumped element is connected has been analyzed. Simple lumped elements like resistors,capacitors and inductors have been connected and the resulting circuits have been modeled successfully. A diode mixer on a coplanar waveguide (cpw) designed for higher power handling capability lias been analyzed using the above method in connection with Harmonic Balance Method (HBM). Results from these studies will be presented and discussed extensively.

electromagnetic field, while the wider spaced via holes permit a significant leakage of power. Note that ΔS/h = 1 in both of these field plots. In cases where we need to locally improve isolation, the continuous via fence may be replaced by a short via fence, as shown in Figure 5 [2]. Figures 6 (a) and (b) show the simulated performance for this line. It is seen that the scattering parameters and radiation loss vary with ΔS/h in the same way as for the continuous via fence. However, upon comparison with Figures 2 and 3, there is a 5 dB degradation in the line characteristics. Figure 7 shows that this is due to the perturbation of the fields upon reaching the via fence on either side of the strip. The electromagnetic fields that escape the confinement of the via fences and result in radiation loss cause coupling between adjacent striplines, as shown in Figure 8. Figure 9 shows the magnitude of the total electric field for two striplines separated by a via fence with ΔS/h = 0.615 and Δh/h = 0.764 (h = 0.7366 mm). The stripline on the left has an imposed voltage on it while the stripline on the right is coupled. Separation of the field into its components shows that the electric field in the plane of the strips is approximately 10 dB greater than the electric field normal to the strips near the via fence. The large magnitude of this parallel field component is due to the small via size and large value of Δh/h, resulting in -10 dB coupling between the strips. Although perfect conductors were assumed in this study, the current distribution on the strip with a close via fence is expected to cause an increase in conductor loss compared to the conventional stripline case.

Conclusions

It has been shown that when filled via fences are placed too close to the stripline, the radiation loss increases, resulting in coupling between adjacent striplines. To minimize the radiation loss, the via fence should be kept at least four times the ground plane separation away from the strip, or ΔS/h > 2. Furthermore, the use of via fences locally, only where isolation is required, can actually degrade the stripline characteristics by causing a large perturbation in the electric fields that normally extend on either side of the strip.

Acknowledgement

The authors would like to thank the University of Michigan Center for Parallel Computing (CPC), which is partially funded by NSF grant CDA-92-14296 and Ford Motor Company, for the use of computational facilities.

References

1. J. W. Gipprich, "EM modeling of via wall structures for high isolation stripline," Digest for Workshop WWFC, Interconnect and Packaging Technologies and Issues, IEEE Int. Microwave Symp., San Diego, CA, pp. 78-114, June 1994.

2. J. Gipprich, C. Wilson, and D. Stevens, "Multilayered 3-D packaging issues," Digest for Workshop WFC, Interconnects and Packaging for RF Wireless Communications Systems, IEEE Int. Microwave Symp., Denver, CO, June 8-13, 1997.

3. J.-G. Yook, N. Dib, and L. Katehi, "Characterization of High Frequency Interconnects using Finite Difference Time Domain and Finite Element Methods," IEEE Trans. Microwave Theory Tech., pp. 1727-1736, vol. 42, Sep. 1994.

[Field plot.] Fig. 9 Electric field distribution of two striplines separated by a continuous fence (f = 10 GHz, ΔS/h = 0.45, and Δh/h = 0.5).

[Diagram: stripline with metal-filled vias between ground planes.] Fig. 5 Stripline with a short section of a filled via fence on both sides.

[Plots: (a) |S11| and radiation loss (Δh = 1.3h) versus ΔS/h; (b) |S21| (Δh = 1.3h) versus ΔS/h, compared with a PEC wall.] Fig. 6 (a) Return loss and radiation loss, 1-|S11|²-|S21|², of stripline with a short via fence as a function of ΔS/h, Δh/h = 1.3. (b) Insertion loss of stripline with a continuous via fence as a function of ΔS/h, Δh/h = 1.3.

[Field plot.] Fig. 7 Electric field distribution of stripline with a short via fence at 25 GHz with ΔS/h = 1 and Δh/h = 1.3.

[Diagram.] Fig. 8 Adjacent striplines separated by via fences.

[Plots and field plots:] Fig. 2 (a) Return loss and radiation loss, 1-|S11|²-|S21|², of stripline with a continuous via fence as a function of Δh/h and ΔS/h; (b) Insertion loss, |S21| for Δh = 5.2h and Δh = 1.3h. Fig. 3 Electric field distribution of stripline with a continuous via fence at 25 GHz with ΔS/h = 1 and Δh/h = 5.2. Fig. 4 Electric field distribution of stripline with a continuous via fence at 25 GHz with ΔS/h = 1 and Δh/h = 1.3.

diameter, Dv, is 0.25 mm; the width, W, of the stripline is 0.19 mm to yield a 50 Ohm characteristic impedance; and the thickness, h, is 0.25 mm. The distance between the stripline and the vias is kept greater than 0.25 mm, and the via-to-via spacing ranges between 1.3 and 5 times the via diameter. These values are all standard for typical LTCC processes and thus make these results directly applicable to the packages being designed today. The first structure investigated is a single stripline with a via fence on both sides of the strip, as shown in Figure 1. This is a typical structure to ensure suppression of the parasitic parallel-plate waveguide modes. The transmission lines are first analyzed over the frequency range of 10 to 40 GHz. It is found that the scattering parameters do not vary by more than a few percent over this frequency range as long as 2ΔS + W < λd/2, where λd is the wavelength in the dielectric. This condition is necessary to avoid the excitation of dielectric-filled rectangular waveguide modes. Based on this finding, the results are presented as a function of the line geometry with the average of the scattering parameters plotted. Figures 2 (a) and (b) show the magnitude of S11, S21, and the radiation loss, 1-|S11|²-|S21|², as a function of the transmission line geometry. It is seen that the reflection coefficient, the insertion loss, and the radiation loss all decrease as the separation between the strip and the via holes, ΔS/h, increases. Reduction of the parasitic effects levels off as ΔS/h approaches 2. Furthermore, it is seen that the insertion loss decreases as the distance between the via holes decreases. Field distributions for a stripline with large and narrow via hole spacing are shown in Figures 3 and 4, respectively. These plots indicate that the closely spaced via fence completely confines the

[Presentation slides: title and overview.] Parallel Electromagnetic Solvers for High Frequency Antenna and Circuit Design. Jong-Gwan Yook and Linda P. Katehi, The University of Michigan; Barry Perlman, Army Research Lab; Jim Harvey, Army Research Office; Leo DiDomenico, Army Research Lab. Contributors: Juiching Cheng and Donghoon Chen, The University of Michigan.

Goal: real-time design and optimization of electromagnetic problems with emphasis on three-dimensional MMICs/antennas. Approach: develop computationally efficient codes by use of MRA (Multi-Resolution Analysis) and the FWT (Fast Wavelet Transform) for accelerated computation, together with parallelization schemes for maximum speed.

[Slides: relevance to Army/DoD.] BCIS-DS (Battlefield Combat Identification Systems - Dismounted Soldier): need for better design, faster turnaround, lower cost, and system evaluation. LongBow point-to-point communications: need for conformal flat array designs, analog/digital circuits, and single-iteration designs and verification.

[Slide: relevance to Army/DoD.] 21st Century Land Warrior - Digitized Battlefield: need for conformal, small, high-efficiency antennas and for high-density, low-weight circuits.

[Slide: problem size for a multilayer array with coupling terms.] Radiating patch layer: 9,500 unknowns; a 9,500 x 9,500 matrix = 90,250,000 entries; 99.9% sparsity leaves 90,250 non-zero elements. Coupling slot layer: 6x6 slots, 540 unknowns; a 540 x 540 matrix = 291,600 entries; 85% sparsity leaves 43,740 non-zero elements. Distribution layer: 9,000 unknowns; a 9,000 x 9,000 matrix = 81,000,000 entries; 99% sparsity leaves 810,000 non-zero elements. Parallelization strategies: parallel computation of matrix entries, parallel matrix solver, frequency parallelization. Computational requirements: required core memory > 512 Mbyte; required communication latency < 10 µs; required bandwidth > a few hundred Mbytes/sec. Computers: IBM SP2, Cray T3D, Cray Y-MP, SGI Challenge.

[Slides: hybrid FEM/MoM method.] Separate the problem domain into subdomains. Replace the Green's function with an FEM solution in areas where the Green's function is unknown. Match boundary conditions between the domains. Apply Galerkin's moment method: build the impedance matrix, apply the FWT and thresholding, and solve the sparse matrix equation. [Diagram: an FEM region where the Green's function is unknown coupled, through matched boundary conditions, to a region where the Green's function is known and used.]


[Slides: thresholding study and the parallelization model of the impedance matrix fill.] [Plot: effect of the thresholding level on the 6x6 array results.] Dispatcher (CPU #0): sends commands to the workers, receives data from the workers, and stores the data in the impedance matrix. Worker (CPU #1 ... #N): receives commands from the dispatcher, performs the computation, and sends the result to the dispatcher.

[Slide: parallelizing model of the fast wavelet transform.] Data paths: the impedance matrix is distributed to CPUs #1 ... #N, each CPU performs the FWT on its share, and the transformed impedance matrix is formed by gathering the results from each CPU.

[Slide: FEM parallelization approach.] Use the frequency domain nature of FEM: no data dependency and no communication between frequency points. Flow: element matrix computation; matrix assembly Ax = b; linear equation solver (BiCG); repeated for each frequency point f1 ... fn.

[Slides: measured performance.] Impedance matrix fill (240 matrix elements): turn-around time versus number of CPUs, with the model Time(i) = P + Comm x N + Comp x N/(i-1), for i > 1 (P: preprocessing time; Comm: communication time; Comp: computation time; i: number of CPUs; N: number of matrix elements). Fast wavelet transform (3568 matrix elements): turn-around time versus number of CPUs, with the model Time(i) = O + Comm x N + Comp x N/i (O: overhead, which includes the extra function calls for measuring time and for parallel processing).



[Slide: parallel matrix-vector product.] The approach is limited by communication latency. [Plot: turn-around time versus number of processors, 4 to 20 processors.]

[Slide: packaging structure example; operating frequency range, number of unknowns, and solution time, with a cross-section diagram showing the via (Δh) and the substrate (εr).]


[Slide: milestones and conclusions.] Completion of a frequency-parallelized code and modeling of a 2 in x 2 in Ka-band single layer power network (about 10^5 unknowns), 12/96. Completion of a matrix-element-parallelized code, 12/97. Modeling of a 3 in x 3 in Ka-band double layer 3-D co-fired ceramic circuit (about 5 x 10^5 unknowns), 12/97. Completion of a fully parallelized code, 12/98.

[Slides: title and goal.] The University of Michigan; Jim Harvey, Army Research Office; Leo DiDomenico, Army Research Lab. Goal: design and optimize large scale electromagnetic structures with emphasis on three-dimensional MMICs and antennas. The code is capable of efficiently solving problems that a commercial solver cannot process easily, and its performance is benchmarked against actual hardware.

[Slides: payoff and technical approach.] Quick analysis time for designers lowers development cost; new capability for exploration of exotic structures such as multi-layer fractal antennas for ultra-broadband EW and high-efficiency antennas for soldier systems. Approach: use a hybrid method (Finite Element Method for the antenna feeding network and Method of Moments for the patch array antenna); apply parallelization schemes and optimize memory and speed for maximum efficiency on large scale problems.

1 CEN4/ ELE STATUS ANDM ' 9 * Completion of frequency and excitation parallelized 2x2 array patch antenna 9/98 - = * Optimization of the speed of code (FEM solver & MOM nuni routine), and about 45% memory saving FEM code 10/98 Mioeng of 3-D double layer 2x2 array patch antenna - about half 3 n ' nmu 11102 parallelization for solving 1.5 million unknowns (An er current value) 6/99 0 qitay.er multi-array patch antenna - about 1-1.5 million re u frequeency circuit - about 1.5 million or more CEN4/ ELECT m I__ _ ___ This effort can greatly contribute to a number under study: - DARPA/HUGHES (1996-1999) ($1 M) ' i "W-Band Power Cube" -L (1998-2003) ($2.5M) n -_on a chip" fire full-wave characterization of fairly complex he use of the parallelized codes developed by this ifirstime theoretical solutions to these

- CEN4/ EL I Frequency parallelization of FEM code (Electric field I X Modeling of 3-D single layer high frequency circuits - r iunknowns T (Fast Wavelet Transform) code pedance matrix calculation code r 6x6 patch array antenna using parallelized ation code. ] m!. w. CEN4/ ELEC CAPABILITY IMPROi 1998 I Frequency and source parallelization of FEM code for 3-1D array patch antenna (Magnetic field based, C++):-:Modeling of 3-D double layer high frequency circuits - about m 1999 solving 1-1.5 million unknowns layer high frequency circuit - about 1.5 million or

CEN4/ ELEC P Prograin name Function Latig uaagc A Aray-FEvv~cpp FEM4 C++ Programing modle/librarav I I I caicuh-mon for array feeding TM-twnrl MPI Standard 1. - ACSII files - -I - I - - Mctxod of moment calculation Fftx Vatch C++ I MPI Standard ACSII files IBM SP2 - I I i. - - I - - - - i. I C++ i MPI Standard ACSH files IBM SP2 Jukichg Cheng, Donhomn Chen uii. f.4 ii ---— I CEN4/ ELECTROMAI.I." -- -- - -- - - Teaming * ARIL, Univ. of Michigan, NASA Lewis Researd Center, Air Force Wright Patterson Laboratory — Leo iDomenico, ARL, (734) 764-4239 _ @engide ~ niich.edu atehi, UofMl, (313)647-1796.edu

CEN -—. - - I..-. - -.......... 1.1- -....... I.. I --- Wl:*- 0.405mm I I k I a. II I 15mm I12.945mm7T 1.1WI 3.282mm::5.455mm 1.168mm

i
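As a rough numerical illustration of the impedance matrix fill timing model above, the short program below evaluates T(i) for a range of CPU counts. The values of P, Comm, and Comp are assumed for illustration only (the slides report measured curves, not these constants).

    #include <cstdio>

    // Predicted wall-clock time of the dispatcher/worker impedance matrix fill:
    // T(i) = P + Comm*N + Comp*N/(i - 1), valid for i > 1 (CPU #0 dispatches).
    double fillTime(double P, double comm, double comp, int N, int cpus) {
        return P + comm * N + comp * N / double(cpus - 1);
    }

    int main() {
        // Illustrative constants only; not taken from the report.
        const double P = 1.0;      // preparation time, s
        const double comm = 0.002; // communication time per matrix element, s
        const double comp = 0.050; // computation time per matrix element, s
        const int N = 240;         // matrix elements, as in the slide

        for (int i = 2; i <= 20; i += 2)
            std::printf("CPUs = %2d  predicted T = %6.2f s\n",
                        i, fillTime(P, comm, comp, N, i));
        // The Comm*N term does not shrink with i, so the speed-up saturates:
        // beyond some CPU count communication dominates, which is the
        // "limited by communication latency" behavior noted on the slides.
        return 0;
    }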

CEN-Project #4, Parallel Electromagnetic Solvers for High Frequency Antenna and Circuit Design
January 1999

Scope

This document is an update on the progress of CEN-Project #4 for electromagnetic solvers based on parallel processing and multi-resolution analysis. NOTE: The information in this report supersedes all previous monthly reports. The following paragraphs identify a software system of accurate and computationally efficient codes for the solution of large scale electromagnetics problems, with an emphasis on planar circuits for high frequency applications. The codes will be developed for use in both stand-alone and parallel processing environments.

Identification

The required software system is an analysis program for accelerated numerical computations in electromagnetics where there exist large computational domains and/or computationally intensive tasks. The output of the software will provide critical information that a designer may use to develop complex RF systems. Typical outputs include current distributions, scattering parameters, and field distributions.

The overall objective of the system is to use multi-resolution analysis (MRA) in computational electromagnetics to achieve large sparsity in the integral equation (IE) matrix formulation. Multi-resolution expansions in integral-equation-based formulations lead to the generation of highly sparse linear systems whose solution is fast and efficient, with minimal memory usage. In addition, the application of wavelet techniques to the method of moments (MoM) will allow the modeling of large-scale electromagnetics problems to be undertaken. It is expected that matrix sparsities on the order of 99.99% or better will be achievable with the MRA method. A large number of the computations associated with the MRA are in the form of nested loops in which similar types of calculations are performed many times. Therefore, parallelization of the code will also be performed in order to improve the speed and efficiency of this electromagnetics solver. From a user's point of view, the output of this development effort will be software able to analyze, in real time, complex three-dimensional monolithic circuits (MMIC modules) with hundreds of vias and lines interconnecting a number of active and passive circuit components to a large number of radiating elements.
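To illustrate the transform-and-threshold mechanism behind the MRA sparsification, the following is a minimal serial sketch: a one-level Haar wavelet transform applied to the rows and columns of a smooth model kernel, followed by thresholding. It is illustrative only; the project codes use multilevel transforms, complex-valued impedance matrices, and parallel implementations, and the kernel below is a stand-in for an actual MoM matrix.

    #include <algorithm>
    #include <cmath>
    #include <cstdio>
    #include <utility>
    #include <vector>

    using Matrix = std::vector<std::vector<double>>;

    // One level of the Haar wavelet transform applied to every row of a square
    // matrix with an even dimension: each pair of entries becomes a scaled
    // average (coarse part) and a scaled difference (detail part).
    void haarTransformRows(Matrix& a) {
        const std::size_t n = a.size();
        const double s = 1.0 / std::sqrt(2.0);
        for (auto& row : a) {
            std::vector<double> t(n);
            for (std::size_t k = 0; k < n / 2; ++k) {
                t[k]         = s * (row[2 * k] + row[2 * k + 1]);
                t[n / 2 + k] = s * (row[2 * k] - row[2 * k + 1]);
            }
            row = t;
        }
    }

    void transposeInPlace(Matrix& a) {
        for (std::size_t i = 0; i < a.size(); ++i)
            for (std::size_t j = i + 1; j < a.size(); ++j)
                std::swap(a[i][j], a[j][i]);
    }

    int main() {
        const std::size_t n = 64;
        // Smooth model kernel standing in for a MoM impedance matrix.
        Matrix z(n, std::vector<double>(n));
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                z[i][j] = 1.0 / (1.0 + std::fabs(double(i) - double(j)));

        haarTransformRows(z);   // transform the rows ...
        transposeInPlace(z);
        haarTransformRows(z);   // ... then the columns
        transposeInPlace(z);

        // Thresholding: zero out coefficients below a fraction of the largest.
        double maxAbs = 0.0;
        for (const auto& row : z)
            for (double v : row) maxAbs = std::max(maxAbs, std::fabs(v));
        const double tau = 1e-4 * maxAbs;
        std::size_t zeros = 0;
        for (auto& row : z)
            for (double& v : row)
                if (std::fabs(v) < tau) { v = 0.0; ++zeros; }

        std::printf("sparsity after thresholding: %.2f%%\n",
                    100.0 * double(zeros) / double(n * n));
        return 0;
    }

Smooth interactions compress into a few large wavelet coefficients, and the rest fall below the threshold; this is the mechanism that makes the very high sparsities quoted above plausible for well-separated interactions.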

Module Definitions

Open-Space Radiation Problems

1) IMP_SORT: This module uses the symmetries or anti-symmetries of the impedance matrix to sort out a list of unique matrix elements. In the non-parallelized version only.
2) IMP_MAT, IMP_MAT.P: This module (in regular and parallel versions) generates the elements of the impedance matrix for a microstrip patch fed by a slot using the single-layer Green's function. This program creates a vector of unique non-zero elements.
3) FWT, FWT.P: This module applies the fast wavelet transform to generate the interactions (impedance matrix) for a given module.
4) MAT_SOLVE, MAT_SOLVE.P: This module performs thresholding of the matrix, prepares the matrix for solution, and solves it using either Gaussian elimination or BiCG, as requested by the user.
5) MoM_POST: This module plots the data calculated by the previous modules in a variety of ways (near field, far field, S-parameters, etc.).
6) MoM_LIB: This is a library module that contains supporting functions for the above modules.

Circuit Problems

1) FEM_MESH: Generates a tetrahedron mesh using bricks.
2) FEM_MAT, FEM_MAT.P: Generates the FEM matrix for shielded circuit problems.
3) FEM_MAT_ABS, FEM_MAT_ABS.P: Generates the FEM matrix for open circuit problems.
4) FEM_SOLVE_BiCG, FEM_SOLVE_BiCG.P: Solves the FEM matrix for lossy circuit problems using a Conjugate Orthogonal Conjugate Gradient (COCG) method (a sketch of the iteration is given below).
5) FEM_SOLVE_CG, FEM_SOLVE_CG.P: Solves the FEM matrix for lossy circuit problems using a Conjugate Gradient method.
6) IMP_LOAD, IMP_LOAD.P: Creates the FEM matrix for circuit problems with lumped elements.
7) FEM_POST_FIELD: Plots field values in a variety of ways as requested by the user.
8) FEM_POST_SCAT: Plots scattering parameters.
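As a point of reference for the solver modules above, the following is a minimal serial sketch of the Conjugate Orthogonal Conjugate Gradient (COCG) variant of BiCG for a complex symmetric system, the matrix type produced by the FEM discretization of lossy problems. It is an illustrative dense-matrix version, not the production module (which works on distributed sparse storage); note the unconjugated inner products, which are what distinguish COCG from ordinary CG.

    #include <complex>
    #include <cstdio>
    #include <vector>

    using cplx = std::complex<double>;
    using Vec = std::vector<cplx>;
    using Mat = std::vector<Vec>;

    // Unconjugated inner product x^T y (not x^H y) -- the defining choice of COCG.
    cplx dotu(const Vec& x, const Vec& y) {
        cplx s = 0.0;
        for (std::size_t i = 0; i < x.size(); ++i) s += x[i] * y[i];
        return s;
    }

    Vec matvec(const Mat& A, const Vec& x) {
        Vec y(x.size(), 0.0);
        for (std::size_t i = 0; i < A.size(); ++i)
            for (std::size_t j = 0; j < x.size(); ++j) y[i] += A[i][j] * x[j];
        return y;
    }

    // COCG for complex symmetric A (A == A^T, not Hermitian), as arises in
    // lossy FEM problems. Returns the solution estimate after convergence.
    Vec cocg(const Mat& A, const Vec& b, double tol = 1e-10, int maxIt = 1000) {
        const std::size_t n = b.size();
        Vec x(n, 0.0), r = b, p = r;
        cplx rho = dotu(r, r);
        for (int it = 0; it < maxIt; ++it) {
            Vec Ap = matvec(A, p);
            cplx alpha = rho / dotu(p, Ap);
            double res2 = 0.0;
            for (std::size_t i = 0; i < n; ++i) {
                x[i] += alpha * p[i];
                r[i] -= alpha * Ap[i];
                res2 += std::norm(r[i]);
            }
            if (res2 < tol * tol) break;
            cplx rhoNew = dotu(r, r);
            cplx beta = rhoNew / rho;
            rho = rhoNew;
            for (std::size_t i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];
        }
        return x;
    }

    int main() {
        // Small complex symmetric test system.
        Mat A = {{{4.0, 1.0}, {1.0, 0.5}},
                 {{1.0, 0.5}, {3.0, 2.0}}};
        Vec b = {{1.0, 0.0}, {0.0, 1.0}};
        Vec x = cocg(A, b);
        std::printf("x = (%g,%g), (%g,%g)\n",
                    x[0].real(), x[0].imag(), x[1].real(), x[1].imag());
        return 0;
    }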

Module Progress Matrix

MODULE                    | Start date | 1st ver. complete | 1st ver. debug | 2nd ver. complete | 2nd ver. debug | 3rd ver. complete | Complete
IMP_SORT                  | 3/1/96   | 5/1/96   | 7/1/96   | 9/1/96   | NA       | NA       | Y
IMP_MAT.P                 | 7/15/96  | 2/25/96  | 4/1/96   | 3/4/97   | 3/18/97  | NA       | Y
FWT.P                     | 5/1/96   | 6/1/96   | 8/2/96   | 3/25/97  | 4/3/97   | NA       | Y
MAT_SOLVE.P               | 5/15/96  | 6/25/96  | 10/20/96 | 12/17/96 | NA       | NA       | Y
MoM_POST                  | 5/25/96  | 6/15/96  | 7/15/96  | 9/30/96  | 12/1/96  | NA       | Y
MoM_LIB                   | 7/15/96  | 6/25/96  | 8/20/96  | 12/1/96  | 2/1/97   | NA       | Y
FEM_MESH                  | 5/6/95   | 12/3/95  | 3/5/96   | 5/15/96  | NA       | NA       | Y
FEM_MAT.P                 | 3/1/96   | 4/1/96   | 5/15/96  | 11/1/96  | NA       | NA       | Y
FEM_MAT_ABS.P             | 4/20/95  | 12/1/95  | 4/16/96  | 9/12/96  | 3/1/97   | NA       | Y
FEM_SOLVE_CG.P            | 7/1/95   | 8/30/95  | 11/30/95 | 1/15/96  | NA       | NA       | Y
FEM_SOLVE_BiCG.P          | 1/1/96   | 3/20/96  | 5/1/96   | 9/13/96  | NA       | NA       | Y
IMP_LOAD.P                | 4/23/96  | 7/2/96   | 12/1/96  | 1/15/97  | NA       | NA       | Y
FEM_POST_FIELD            | 2/13/96  | 4/17/96  | 6/18/96  | 10/17/96 | 10/10/97 | 10/30/97 | Y
FEM_POST_SCAT             | 2/15/96  | 5/1/96   | 7/1/96   | 10/25/96 | 9/30/97  | 11/20/97 | Y
IMP_SORT_SLOT             | 8/15/96  | 10/11/96 | 11/29/96 | 1/19/97  | NA       | NA       | Y
IMP_MAT_SLOT.P            | 9/25/96  | 12/1/96  | 2/27/97  | 3/11/97  | 12/15/98 |          | Y
MoM_FEM_MAT.P             | 12/15/96 | 2/28/97  | 4/15/97  | 6/3/97   | 2/19/98  | 5/30/98  | Y
MoM_FEM_MAT_ABS.P         | 3/12/97  | 3/27/97  | 4/15/97  | 5/26/97  | 8/30/98  | 12/26/98 |
MoM_FEM_MAT_SOLVE.P       | 6/3/97   | 12/27/97 | 2/25/98  | 5/30/98  | 8/18/98  | 1/4/99   |
FEM_LUMPED                | 7/20/97  | 9/30/97  |          |          |          |          |
FEM_LUMPED.P              | 10/15/97 |          |          |          |          |          |
MoM_FEM_LUMPED.P          | 12/20/96 |          |          |          |          |          |
MoM_FEM_HARM.P            |          |          |          |          |          |          |
MoM_FEM_POST_FIELD        | 9/3/97   | 9/21/97  | 10/10/97 | 12/20/97 | 2/20/98  | NA       | Y
MoM_FEM_POST_SCAT         | 3/15/97  | 3/29/97  | 4/4/97   | 5/12/97  | 5/26/97  | 1/30/98  | Y
MoM_FEM_ARRAY.P           | 9/15/97  | 12/30/97 | 3/1/98   | 11/20/98 |          |          |
MoM_FEM_ARRAY_SORT        | 10/7/97  |          |          |          |          |          |
MoM_FEM_ARRAY_MAT.P       | 10/3/97  | 3/15/98  | 6/30/98  | 11/20/98 |          |          |
MoM_FEM_ARRAY_MAT_SOLVE.P |          |          |          |          |          |          |
MoM_FEM_ARRAY_HARM.P      |          |          |          |          |          |          |
MoM_FEM_ARRAY_POST        | 10/15/97 | 3/1/98   | 7/1/98   | NA       | NA       | NA       | Y
FEM_SOLVE_QMR             | 3/15/97  | 4/20/97  | 6/30/97  | NA       | NA       | NA       | Y
FEM_SOLVE_CGS             | 3/10/97  | 4/10/97  | 6/15/97  | NA       | NA       | NA       | Y

Summary of Progress

1) MoM Part

- A new integration scheme/routine is applied for better convergence. As a result, the calculation time for the numerical integration is shorter than before.
- A scheme that generates only the non-zero unique elements of the MoM matrix is applied, achieving a speed-up of more than a factor of two.
- A new parallelization of the MoM matrix generation routine is implemented. The new version includes new MPI commands to improve calculation time and optimize the parallelized MoM codes. The non-zero unique MoM matrix elements are assigned to the processors, generated on each processor, and gathered into the root processor: CPU #0 assigns element ranges to CPUs #1 through #N and gathers the results (a code sketch of this pattern is given at the end of this section).

2) FEM Part

- When the feeding network of an array structure is solved by FEM, magnetic sources are placed in the coupling slots and an impedance matrix is calculated for each magnetic source. Parallelizing the FEM solution over the magnetic sources is therefore necessary, and it is implemented using a task parallelization scheme in which each processor handles one source (CPU #0 handles source #0, CPU #1 handles source #1, and so on through CPU #N and source #N).
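The assign/gather pattern described above can be sketched as follows, assuming the unique non-zero elements are indexed 0..M-1 and using a placeholder computeElement() in place of the actual Green's function integration; real matrix entries are complex, but a real-valued stand-in keeps the sketch short.

    #include <mpi.h>
    #include <cstdio>
    #include <vector>

    // Placeholder for the actual impedance-element integration (Green's
    // function quadrature in the real code).
    double computeElement(int k) { return 1.0 / (1.0 + k); }

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int M = 1000;              // number of unique non-zero elements
        // Block assignment of the unique elements to the processors.
        const int base = M / size, rem = M % size;
        const int myCount = base + (rank < rem ? 1 : 0);
        const int myStart = rank * base + (rank < rem ? rank : rem);

        std::vector<double> local(myCount);
        for (int k = 0; k < myCount; ++k)
            local[k] = computeElement(myStart + k);

        // Gather the per-processor pieces into the root processor.
        std::vector<int> counts(size), displs(size);
        for (int r = 0; r < size; ++r) {
            counts[r] = base + (r < rem ? 1 : 0);
            displs[r] = r * base + (r < rem ? r : rem);
        }
        std::vector<double> all(rank == 0 ? M : 0);
        MPI_Gatherv(local.data(), myCount, MPI_DOUBLE,
                    all.data(), counts.data(), displs.data(), MPI_DOUBLE,
                    0, MPI_COMM_WORLD);

        if (rank == 0)
            std::printf("gathered %d unique elements on root\n", M);
        MPI_Finalize();
        return 0;
    }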

Because the FEM matrix elements have a symmetry property, we can save about 40% of the memory by storing only the diagonal and upper-triangle elements of the FEM matrix. With this FEM matrix storage scheme, we could solve about half a million unknowns.

In order to solve more than half a million unknowns, we needed to parallelize the FEM routine further, in addition to the above scheme. Two major pieces of work had to be completed: the FEM matrix must be decomposed across the processors, and communication between the processors must be set up for whenever a matrix-vector or vector-vector multiplication/addition occurs in the iterative FEM solver. After implementing this parallelization scheme, we could solve 1 million unknowns. We are continuing this effort, and 1.5 million unknowns will be tested very soon. In the FEM matrix decomposition scheme applied in this project, each of CPU #0 through CPU #N stores its assigned portion of the FEM matrix. (A sketch of the packed triangular storage idea follows below.)
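The saving comes from keeping only the diagonal and upper triangle, i.e., n(n+1)/2 values instead of n^2 (the report's 40% figure, rather than the ideal 50%, reflects the bookkeeping of the actual sparse data structures). A minimal dense-storage sketch of the packed indexing, and of a matrix-vector product that uses only the stored triangle, follows; it illustrates the idea rather than the production sparse scheme.

    #include <algorithm>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Packed row-major storage of the diagonal and upper triangle of an
    // n x n symmetric matrix: row i keeps only columns i..n-1.
    struct PackedSymmetric {
        std::size_t n;
        std::vector<double> a;   // n*(n+1)/2 entries instead of n*n
        explicit PackedSymmetric(std::size_t dim)
            : n(dim), a(dim * (dim + 1) / 2) {}

        // Offset of entry (i, j), i <= j: rows 0..i-1 hold n, n-1, ... entries.
        std::size_t idx(std::size_t i, std::size_t j) const {
            return i * n - i * (i - 1) / 2 + (j - i);
        }
        void set(std::size_t i, std::size_t j, double v) {
            a[idx(std::min(i, j), std::max(i, j))] = v;
        }
        double get(std::size_t i, std::size_t j) const {
            return (i <= j) ? a[idx(i, j)] : a[idx(j, i)];  // A(i,j) = A(j,i)
        }
    };

    // y = A x using only the stored triangle.
    std::vector<double> matvec(const PackedSymmetric& A,
                               const std::vector<double>& x) {
        std::vector<double> y(A.n, 0.0);
        for (std::size_t i = 0; i < A.n; ++i)
            for (std::size_t j = 0; j < A.n; ++j)
                y[i] += A.get(i, j) * x[j];
        return y;
    }

    int main() {
        PackedSymmetric A(3);
        A.set(0, 0, 2.0); A.set(0, 1, 1.0); A.set(0, 2, 0.0);
        A.set(1, 1, 2.0); A.set(1, 2, 1.0); A.set(2, 2, 2.0);
        std::vector<double> x = {1.0, 1.0, 1.0};
        auto y = matvec(A, x);
        std::printf("y = %g %g %g\n", y[0], y[1], y[2]);   // expect 3 4 3
        return 0;
    }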

3) Application of Progress to Modules

- IMP_MAT_SLOT.P
A new scheme has been applied to the evaluation of integrals involving singularities when the MoM impedance matrix is generated. The scheme mainly involves a new pole extraction method for the Green's functions that have a singularity over the defined integration range. This implementation is important for the reduction of computation time, since the new scheme improves convergence when the integration is calculated numerically.
In addition, the Toeplitz property of the MoM matrix elements can be used to reduce the MoM matrix computation time, because all of the matrix elements can be obtained from the non-zero unique matrix elements. The new version of the code determines the non-zero unique matrix elements and computes only those; the speed-up achieved with this scheme is more than a factor of two, which is quite significant (a code sketch of the Toeplitz fill is given below). As a result of the above two schemes, the MoM part is much improved in terms of computation time.

- MoM_FEM_MAT.P
This was an effort to save computer memory, and it produced an improvement in the software's efficiency. The saving is achieved by exploiting the fact that the FEM matrix has a symmetry property: the new program stores only half of the actual FEM matrix, so the memory needed to store the FEM matrix is reduced by half. This is a sizable advantage when the problem becomes large, and we could generate the FEM matrix for half a million meshes with this scheme. A further effort was made recently to generate the FEM matrix for 1 million meshes. It involves decomposing the FEM matrix across processors: the root processor divides the FEM matrix domain and assigns the matrix domains to the processors, and each processor generates its assigned part of the FEM matrix. The scheme is successfully implemented, and we can generate the FEM matrix for 1 million meshes; the 1.5 million mesh case is under test now.

- MoM_FEM_MAT_ABS.P
This module is an extension of MoM_FEM_MAT.P to the open FEM problem. We can adopt the same memory saving scheme as before and achieve the same improvement as in the case of MoM_FEM_MAT.P: the FEM matrix for half a million meshes can be generated with this scheme. Similarly, the FEM matrix decomposition scheme has been applied to the open FEM problem, and the new code can generate the FEM matrix for 1 million meshes.
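A minimal sketch of the Toeplitz fill mentioned under IMP_MAT_SLOT.P: since Z(i, j) depends only on i - j, an N x N block is reproduced from its 2N - 1 unique values. computeUnique() is a hypothetical stand-in for the actual numerical integration.

    #include <complex>
    #include <cstdio>
    #include <vector>

    using cplx = std::complex<double>;

    // Hypothetical stand-in for the numerical integration that yields one
    // unique impedance value for index separation d = i - j (the real code
    // integrates the Green's function here).
    cplx computeUnique(int d) { return cplx(1.0 / (1.0 + double(d) * d), 0.0); }

    // Fill an N x N Toeplitz impedance block from its 2N-1 unique values:
    // Z(i, j) = t[(i - j) + (N - 1)], so 2N-1 integrations replace N^2.
    std::vector<std::vector<cplx>> fillToeplitz(int N) {
        std::vector<cplx> t(2 * N - 1);
        for (int d = -(N - 1); d <= N - 1; ++d)
            t[d + N - 1] = computeUnique(d);
        std::vector<std::vector<cplx>> Z(N, std::vector<cplx>(N));
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                Z[i][j] = t[(i - j) + (N - 1)];
        return Z;
    }

    int main() {
        auto Z = fillToeplitz(4);
        std::printf("Z(2,0) = %g, Z(3,1) = %g (equal, as expected)\n",
                    Z[2][0].real(), Z[3][1].real());
        return 0;
    }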

- MoM_FEM_MAT_SOLVE.P
When the FEM part of a computation is being simulated, most of the solution time is consumed in solving the FEM matrix equation, so it is very important to speed up the FEM iterative solver in order to reduce the CPU time needed to solve a problem. Most of the operations in the FEM solver are, by the nature of the physics and of the computational algorithm, multiplications and additions of vectors with vectors and of vectors with matrices. Therefore, a smart memory management strategy for the matrix-vector multiplications and the iterative loops was realized, using an assortment of techniques applied under the conditions that are optimal for the particular algorithm. Currently these improvements allow computations at twice the previous rate.
When 1 million unknowns are to be solved, the decomposed FEM matrices are generated on the individual processors. These per-processor matrices must be communicated to the other processors whenever a multiplication/addition of a vector with a vector, or of a vector with a matrix, occurs in the iterative FEM solver. This communication takes most of the FEM computation time when the problem size becomes large; therefore, careful communication schemes must be applied to reduce the overall FEM computation time for large FEM problems. Using the most effective communication schemes between processors, the FEM iterative solver has been optimized for the least communication time, and 1 million unknowns can now be solved with this scheme (a code sketch of this communication step is given below).

- MoM_FEM_POST_FIELD
When the field distribution of the FEM part is plotted, we need to control the plot, for example by zooming to a certain area, flipping up/down or left/right, and smoothing. The new program is equipped with all of these functions, implemented using MATLAB's internal functions. In addition, the program can be customized by the user.

- MoM_FEM_ARRAY.P
The FEM solver for the patch antenna problem has been optimized. Most of the optimization involved the efficient management of iteration loops and of the arithmetic for vector/matrix operations. As a result of this optimization, the speed of the FEM solver has been improved by an additional factor of two.
When the array antenna problem is simulated by the hybrid computational technique, a number of sources are placed in the slots for the FEM simulation, and the feeding network is simulated for each source using FEM. Therefore, in addition to frequency parallelization, source parallelization is also necessary, and the new program is source parallelized. As a result, the program may be frequency and source parallelized at the same time. The FEM matrix decomposition scheme and the effective communication setup between processors in the iterative solver are applied to the array antenna problem in order to solve large array feeding networks. Currently, 1 million unknowns in the array feeding network section can be solved using the new FEM routine.

- MoM_FEM_ARRAY_MAT.P
The memory saving scheme for the FEM matrix is adopted in the new program for the array antenna problem. Using the symmetry property of the FEM matrix, we store only half of the FEM matrix when the feeding network of the array antenna is solved by FEM. Considering the large size of typical problems in this case, this new technique saves a significant amount of memory. The FEM matrix for 1 million meshes of the array feeding network can be generated using the FEM matrix decomposition scheme; the 1.5 million case is under test now.
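The communication step described under MoM_FEM_MAT_SOLVE.P can be sketched as follows: each processor owns a row block of the matrix, computes its slice of the product, and the slices are then exchanged so that every processor holds the full vector for the next iteration. This is a minimal illustration using MPI_Allgatherv; the production solver's decomposition and communication pattern may differ in detail.

    #include <mpi.h>
    #include <cstdio>
    #include <vector>

    // Each processor owns a contiguous block of rows of A (row-block
    // decomposition). Given the full vector x, replicated on every processor,
    // compute the local slice of y = A*x, then allgather the slices so that
    // every processor again holds the full y needed by the next iteration.
    void distributedMatvec(const std::vector<std::vector<double>>& Alocal,
                           const std::vector<double>& xFull,
                           std::vector<double>& yFull,
                           const std::vector<int>& counts,
                           const std::vector<int>& displs,
                           MPI_Comm comm) {
        int rank;
        MPI_Comm_rank(comm, &rank);
        std::vector<double> yLocal(counts[rank], 0.0);
        for (int i = 0; i < counts[rank]; ++i)
            for (std::size_t j = 0; j < xFull.size(); ++j)
                yLocal[i] += Alocal[i][j] * xFull[j];
        // The communication step: its cost grows with problem size and CPU
        // count, which is why the solver is tuned for least communication.
        MPI_Allgatherv(yLocal.data(), counts[rank], MPI_DOUBLE,
                       yFull.data(), counts.data(), displs.data(), MPI_DOUBLE,
                       comm);
    }

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int n = 8 * size;          // global dimension, divisible by size
        std::vector<int> counts(size, 8), displs(size);
        for (int r = 0; r < size; ++r) displs[r] = 8 * r;

        // Local row block of a simple test matrix A(i,j) = (i == j) ? 2 : 0.
        std::vector<std::vector<double>> Alocal(8, std::vector<double>(n, 0.0));
        for (int i = 0; i < 8; ++i) Alocal[i][displs[rank] + i] = 2.0;

        std::vector<double> x(n, 1.0), y(n, 0.0);
        distributedMatvec(Alocal, x, y, counts, displs, MPI_COMM_WORLD);
        if (rank == 0) std::printf("y[0] = %g (expect 2)\n", y[0]);
        MPI_Finalize();
        return 0;
    }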

- MoM_FEM_ARRAY_POST
The post-processing file for the field distribution of the feeding network in the array antenna system has been customized for easy and convenient use by third parties. The field distribution of the feeding network can be rotated, flipped, smoothed, and magnified.

- FEM_LUMPED
In this program, we have implemented the combination of a full-wave electromagnetic simulator (FEM) with active circuit elements. This is an extension of the previous work on the inclusion of passive lumped elements in the FEM tool. As an example, a single-diode mixer in a stripline configuration has been designed and solved using the program. This enhancement of the code's capability allows the user to simulate any type of two-port active circuit inside the full-wave electromagnetic solver. We are currently working on generalizing this code to simulate multi-port active circuits.
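A schematic sketch of how such a lumped element enters the assembled FEM system is given below: the element's small-signal admittance contributes a frequency-dependent term to the matrix entry of the mesh edge spanning the port. The scale constant kFormulation is deliberately left as a placeholder, since its exact value depends on the edge-element formulation used; only the bookkeeping is illustrated, not the report's actual stamp.

    #include <complex>
    #include <vector>

    using cplx = std::complex<double>;

    // Small-signal admittance of a diode junction: resistance Rj in parallel
    // with capacitance Cj (the kind of two-port element used in the
    // single-diode mixer example).
    cplx diodeAdmittance(double Rj, double Cj, double omega) {
        return cplx(1.0 / Rj, omega * Cj);
    }

    // Schematic stamp of a lumped admittance onto the assembled FEM system:
    // the port voltage is the field integrated along one mesh edge, so the
    // lumped current I = Y*V adds a frequency-dependent term to the diagonal
    // entry of that edge unknown. kFormulation is a PLACEHOLDER for the
    // constant fixed by the particular edge-element formulation.
    void stampLumpedAdmittance(std::vector<std::vector<cplx>>& A,
                               int portEdge, cplx Y,
                               double edgeLength, double omega,
                               cplx kFormulation) {
        A[portEdge][portEdge] += kFormulation * cplx(0.0, omega)
                                 * edgeLength * edgeLength * Y;
    }

    int main() {
        const double pi = 3.141592653589793, omega = 2 * pi * 30e9;
        std::vector<std::vector<cplx>> A(4, std::vector<cplx>(4, cplx(0.0, 0.0)));
        cplx Y = diodeAdmittance(50.0, 1e-13, omega);   // illustrative values
        stampLumpedAdmittance(A, 2, Y, 1e-4, omega, cplx(1.0, 0.0));
        return 0;
    }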

Requirements Traceability Matrix (Module vs. Requirement)

Requirement categories: Analyzer, Partitioner, Simulation, User Interface, Parallelizing.
[The column alignment of the original matrix did not survive reproduction; the X marks for each module are retained below in their original order, but their assignment to specific requirement columns could not be recovered.]

MODULE                    | Marks
IMP_SORT                  | X X
IMP_MAT.P                 | X X X
FWT.P                     | X X X
MAT_SOLVE.P               | X X
MoM_POST                  | X X
MoM_LIB                   |
FEM_MESH                  | X X
FEM_MAT.P                 | X X X
FEM_MAT_ABS.P             | X X X
FEM_SOLVE_CG.P            | X X X
FEM_SOLVE_BiCG.P          | X X X
IMP_LOAD.P                | X X X
FEM_POST_FIELD            | X X
FEM_POST_SCAT             | X X
IMP_SORT_SLOT             | X
IMP_MAT_SLOT.P            | X X X
MoM_FEM_MAT.P             | X X X
MoM_FEM_MAT_ABS.P         | X X X
MoM_FEM_MAT_SOLVE.P       | X X X
MoM_FEM_LUMPED.P          | X X X
MoM_FEM_HARM.P            | X X X
MoM_FEM_POST_FIELD        | X X
MoM_FEM_POST_SCAT         | X X
MoM_FEM_ARRAY.P           | X X X
MoM_FEM_ARRAY_SORT        | X
MoM_FEM_ARRAY_MAT.P       | X X X
MoM_FEM_ARRAY_MAT_SOLVE.P | X X X
MoM_FEM_ARRAY_HARM.P      | X X X
MoM_FEM_ARRAY_POST        | X

Fault Profiling

Fault profiles will be provided for each module after the module design has officially stabilized. For a module to be considered stabilized, the schedule-of-events table must show the module as complete. Any problem found in a module after its completion will therefore be considered a fault. Additional information may be provided on the resolution of significant coding errors if this information is considered significant by the primary researcher; however, these errors will not be considered module faults. In summary, "errors" are associated with modules before module completion, and "faults" are associated with modules after module completion.

No faults or errors to report for this period.