The University of Michigan
College of Engineering
Technical Report

Operator Approximation and Its Application to Root Finding and Minimization Problems

by
G. Scott Dixon, Jr.

Project Director: Elmer G. Gilbert
ORA Project 064507

Supported by
United States Air Force, Air Force Office of Scientific Research
Grant No. AFOSR-69-1767
Arlington, Virginia

June 1973

ACKNOWLEDGMENTS

The work represented by this dissertation has been made possible by the cooperation and guidance of a number of individuals. The author wishes to express special appreciation to Professor Elmer Gilbert who, as Chairman of the Doctoral Committee, provided expert advice and critical review throughout the period of research. The author is also grateful to the other members of the Doctoral Committee for their interest in the work and for their helpful comments. The computer programs used in the numerical experiments were written by Alan Dohner. His expert help is appreciated. The manuscript itself was prepared by Chris Angell. It would be hard to imagine a more cooperative and efficient typist. This research was supported by an NDEA Title IV Fellowship from the University of Michigan, by the United States Air Force under Grant AFOSR-69-1767 and by the Reliance Electric Company. Recognition is due to Dr. Edward Gilbert of the Reliance Electric Company for his patience and cooperation. Finally, a special measure of gratitude is due to the author's wife, who not only offered encouragement and understanding, but also aided in the preparation of the manuscript.

TABLE OF CONTENTS

ACKNOWLEDGMENTS .................................................... ii
TABLE OF ILLUSTRATIONS ............................................. v

CHAPTER 1  INTRODUCTION ............................................ 1
    1.1  Background and Motivation
    1.2  Organization

CHAPTER 2  EMULATION AND APPROXIMATION ............................. 5
    2.1  Notation and Definitions
    2.2  Approximations
    2.3  Emulations
    2.4  Sequential Characterization of Approximation and Emulation
    2.5  Summary

CHAPTER 3  ROOT FINDING AND MINIMIZATION ALGORITHMS ................ 34
    3.1  Introduction and Notation
    3.2  Newton's Method
    3.3  Secant Methods
    3.4  Minimization Methods
    3.5  Recursive Formulas for Approximating Sequences
    3.6  Summary

CHAPTER 4  FUNCTION MINIMIZATION USING THE PSEUDOINVERSE ........... 66
    4.1  Introduction, Definitions and Preliminary Lemmas
    4.2  Fundamental Definitions and Lemmas
    4.3  General Minimization Algorithm
    4.4  Convergence and Rate of Convergence
    4.5  Conclusion

CHAPTER 5  NUMERICAL EXPERIMENTS ................................... 99
    5.1  Introduction
    5.2  Davidon's Algorithm
    5.3  The New Algorithms
    5.4  Rosenbrock's Banana Valley
    5.5  Wood's Function
    5.6  Powell's Function
    5.7  A Six Dimensional Function
    5.8  Summary

CHAPTER 6  CONCLUDING REMARKS ...................................... 143

APPENDIX I  DERIVATIVES OF TRACE FUNCTIONS ......................... 146

TABLE OF CONTENTS (continued)

APPENDIX II   FORTRAN PROGRAM FOR DAVIDON'S ALGORITHM .............. 150
APPENDIX III  FORTRAN PROGRAM FOR ALGORITHM 5.3.1 .................. 157

BIBLIOGRAPHY ....................................................... 165

TABLE OF ILLUSTRATIONS

FIGURE 5.1   Abnormal conditions in linear search subalgorithm ............ 102
FIGURE 5.2   Log10 |f(x)| vs. iteration number for Davidon's algorithm applied to Rosenbrock's function ............ 109
FIGURE 5.3   Log10 |f(x)| vs. iteration number for algorithm 5.3.2 applied to Rosenbrock's function with α = β = 10^-4 ............ 110
FIGURE 5.4   Log10 |f(x)| vs. iteration number for algorithm 5.3.1 applied to Rosenbrock's function starting from (-1,-1) ............ 112
FIGURE 5.5   Log10 |f(x)| vs. iteration number for algorithm 5.3.1 applied to Rosenbrock's function starting from (1,-1) ............ 113
FIGURE 5.6   Age of the oldest column of U_k and V_k vs. iteration number for algorithm 5.3.1 applied to Rosenbrock's function starting from (-1,-1) ............ 114
FIGURE 5.7   Age of the oldest column of U_k and V_k vs. iteration number for algorithm 5.3.1 applied to Rosenbrock's function starting from (1,-1) ............ 115
FIGURE 5.8   Log10 |f(x)| vs. iteration number for Davidon's algorithm and algorithm 5.3.1 applied to Wood's function starting from (-3,0,-3,-1) ............ 118
FIGURE 5.9   Log10 |f(x)| vs. iteration number for algorithm 5.3.1 applied to Wood's function starting at (-3,-1,-3,-1) with α = 10^-1 ............ 119
FIGURE 5.10  Log10 |f(x)| vs. iteration number for algorithm 5.3.1 applied to Wood's function starting from (-3,-1,-3,-1) with α = 10^-2 ............ 120
FIGURE 5.11  Log10 |f(x)| vs. iteration number for algorithm 5.3.1 applied to Wood's function starting from (-1,-3,-1,-3) with α = 10^-4 ............ 121
FIGURE 5.12  Age of the oldest column of U_k and V_k and rank of U_k vs. iteration number for algorithm 5.3.1 applied to Wood's function starting at (-1,-3,-1,-3) with α = 10^-1 ............ 122

TABLE OF ILLUSTRATIONS (continued)

FIGURE 5.13  Age of the oldest column of U_k and V_k and rank of U_k vs. iteration number for algorithm 5.3.1 applied to Wood's function starting at (-1,-3,-1,-3) with α = 10^-2 ............ 123
FIGURE 5.14  Age of the oldest column of U_k and V_k and rank of U_k vs. iteration number for algorithm 5.3.1 applied to Wood's function starting at (-1,-3,-1,-3) with α = 10^-4 ............ 124
FIGURE 5.15  Log10 |f(x)| vs. function (gradient) evaluations for Davidon's algorithm and algorithm 5.3.1 applied to Wood's function starting at (-3,0,-3,-1) ............ 126
FIGURE 5.16  Log10 |f(x)| vs. function (gradient) evaluations for Davidon's algorithm and algorithm 5.3.1 applied to Wood's function starting at (-3,-1,-3,-1) ............ 127
FIGURE 5.17  Log10 |f(x)| vs. iteration number for Davidon's algorithm applied to Powell's function ............ 129
FIGURE 5.18  Log10 |f(x)| vs. iteration number for algorithm 5.3.1 applied to Powell's function with α = β = 10^-4 ............ 130
FIGURE 5.19  Age of the oldest column of U_k and V_k and rank of U_k vs. iteration number for algorithm 5.3.1 applied to Powell's function with α = β = 10^-4 ............ 131
FIGURE 5.20  Log10 |f(x)| vs. iteration number for algorithm 5.3.1 applied to Powell's function with α = β = 10^-8 ............ 133
FIGURE 5.21  Age of the oldest column of U_k and V_k and rank of U_k vs. iteration number for algorithm 5.3.1 applied to Powell's function with α = β = 10^-8 ............ 134
FIGURE 5.22  Log10 |f(x)| vs. iteration number for algorithm 5.3.1 applied to Powell's function with α = β = 10^-8 and with the age of the oldest column of U_k and V_k bounded less than 2n = 8 ............ 135

TABLE OF ILLUSTRATIONS (continued)

FIGURE 5.23  Age of the oldest column of U_k and V_k and rank of U_k vs. iteration number for algorithm 5.3.1 applied to Powell's function with α = β = 10^-8 and with the age of the oldest column of U_k and V_k bounded less than 2n = 8 ............ 136
FIGURE 5.24  Log10 |f(x)| vs. iteration number for Davidon's algorithm and algorithm 5.3.1 applied to the six dimensional function with α = β = 10^-4 starting at (1,-1,-3,-1,-3,-1) ............ 138
FIGURE 5.25  Log10 |f(x)| vs. iteration number for Davidon's algorithm and algorithm 5.3.1 applied to the six dimensional function with α = β = 10^-4 starting at (0,0,0,0,0,0) ............ 139
FIGURE 5.26  Log10 |f(x)| vs. iteration number for Davidon's algorithm and algorithm 5.3.1 applied to the six dimensional function with α = β = 10^-4 starting at (.9,.9,.9,.9,.9,.9) ............ 140

CHAPTER 1
INTRODUCTION

1.1 Background and Motivation

The research presented in this dissertation is concerned with two closely related problems which are encountered in computational mathematics. They are the determination of the unconstrained minimum of a function f(x): D ⊂ R^n → R^1 and the determination of a root of a function F(x): D ⊂ R^n → R^n; that is, an x ∈ D such that F(x) = 0. As is usual, R^n denotes the space of n-tuples of real numbers. Although both of these problems have been extensively investigated over the years they deserve further research for at least two reasons. First, because many practical computations reduce to these problems, even a modest improvement in solution techniques can result in a large reduction in computational cost. Second, there exist similarities in certain current algorithms which suggest that a unified theory can be developed for these algorithms. Such a theory, besides clarifying existing algorithms, can be expected to yield insight into the development of new algorithms.

The functions treated will be assumed to have a Taylor series expansion. For instance, in the root finding problem

F(x) = F(x_0) + J(x_0)(x - x_0) + h.o.t.   (1.1.1)

where J(x_0) is the derivative of F at x_0 and h.o.t. represents the higher order terms. A similar expansion for the gradient of f(x) pertains in the minimization problem. In all the algorithms to be considered it is implicitly assumed that, when the higher order terms are neglected, the Taylor series expansion gives rise to an estimate of the root of F(x) given by

x̂ = x_0 - J^{-1}(x_0)F(x_0).   (1.1.2)

Rather than directly evaluating and inverting the matrix J(x_0) as in the Newton method, these algorithms estimate J^{-1}(x_0) by determining a linear transformation which, under appropriate conditions, transforms a set of vectors u_i = F(x_i) - F(y_i) into the set of vectors v_i = x_i - y_i, i = 1,...,m, with m ≤ n. From equation 1.1.1 we see that if the higher order terms are negligible, if m = n, and if J(x_0) is nonsingular then this linear transformation is J^{-1}(x_0). Notice that the existence and uniqueness of such a transformation is not guaranteed. This leads to convergence difficulties in many algorithms. In certain minimization algorithms it is necessary for the objective function to be quadratic, that is for the higher order terms to be identically zero, in order for the algorithm to determine such a linear transformation.

The set of vectors u_i and v_i, i = 1,...,m, can be viewed as forming a data set containing information about the local character of the function F(x). The algorithms which have been developed to date consider data sets which are formed either by refreshing the entire data set at each stage or by generating one data pair (u_i,v_i) at each stage and using data generated in the previous m-1 stages of the algorithm. Those algorithms which refresh the data set at each stage, such as certain secant methods for root finding, demonstrate nice theoretical terminal rate of convergence properties (ORT 70, section 11.3) but may not converge globally since at some stage a matrix may not exist which transforms the vectors u_i into the vectors v_i. Also these algorithms require a large amount of computation at each stage since the function must be evaluated m = n times. The sequential methods which generate only a single pair u_i and

v_i at each stage require far less computation. However, in many cases these methods cannot be shown to converge either globally or locally (ORT 70, theorem 11.3.5). Davidon's minimization algorithm is one sequential algorithm which can be shown to converge superlinearly if the objective function is twice differentiable and if the eigenvalues of H(x), the second derivative function or Hessian of f(x), are greater than some positive constant (POW 71). However, Davidon's algorithm computes a matrix which transforms the vectors u_i into the vectors v_i only when the objective function is quadratic. Several other sequential minimization algorithms have been proposed (MUR 70, FLE 70, PEA 69, BRO 70, 70a) which, when the objective function is quadratic, compute a matrix which transforms the u_i into the v_i. No extensive theoretical results have been presented for these algorithms when the objective function is not quadratic. Further, many of them exhibit a tendency to have convergence difficulties in practical problems.

The fundamental objectives of this work are: 1) the theoretical unification of certain algorithms which utilize sequences of data sets to estimate the linear transformation J^{-1}(x_0) or H^{-1}(x_0) and 2) the investigation of new algorithms for generating the sequence of data sets. Some of the material concerning arbitrary data sets may be of value in other problems which involve the estimation of a linear transformation.

1.2 Organization

In Chapter 2 data sets whose elements are pairs of vectors (u_i,v_i) are considered. These data sets are assumed to approximately characterize an unknown linear transformation. Although the data sets may arise in the root finding or minimization problem this is not necessary to the development. For a particular data set the family of approximations to

this unknown linear transformation is characterized in a useful way. When two data sets have common elements, results are obtained which characterize the family of approximations based on one data set in terms of the family of approximations based on the other data set. This leads to a result which allows the Davidon algorithm and other minimization algorithms to be viewed in a general framework.

In Chapter 3 the general results of Chapter 2 are applied to existing algorithms for root finding and minimization. In particular, the relationship between the secant algorithms for root finding and the material of Chapter 2 is considered. Also, using the results of Chapter 2, the properties of a class of minimization algorithms on a quadratic surface are investigated. This class contains new algorithms in addition to many existing algorithms.

In Chapter 4 an entirely new class of minimization algorithms is proposed. This class is constructed to have desirable convergence and rate of convergence properties while allowing considerable latitude in the choice of a specific algorithm within the general class. Theoretical results are presented concerning the behavior of these algorithms on nonquadratic functions.

In order to verify the theoretical results of Chapter 4 and to further establish the properties of the new class of algorithms, two particular algorithms were tested on a variety of objective functions. A version of Davidon's algorithm is used as a standard of comparison. The results of these computer runs are presented in Chapter 5. Chapter 6 summarizes the significant results of the dissertation and relates the work to results obtained by other researchers.
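To make the estimate of equations 1.1.1 and 1.1.2 concrete, the short sketch below (in Python with numpy; the test function, the starting point and the difference steps are hypothetical choices made only for this illustration and are not part of the algorithms developed later) forms difference pairs u_i = F(x_i) - F(y_i) and v_i = x_i - y_i, builds the linear transformation that takes the u_i into the v_i, and uses it in place of J^{-1}(x_0) in equation 1.1.2.

import numpy as np

def F(x):
    # hypothetical test function F: R^2 -> R^2 with a root at (1, 2)
    return np.array([x[0]**2 - 1.0 + x[1] - 2.0,
                     3.0 * (x[1] - 2.0)])

x0 = np.array([1.5, 2.5])
steps = [np.array([1e-3, 0.0]), np.array([0.0, 1e-3])]
U = np.column_stack([F(x0 + s) - F(x0) for s in steps])   # columns u_i = F(x_i) - F(y_i)
V = np.column_stack(steps)                                # columns v_i = x_i - y_i
A = V @ np.linalg.inv(U)        # takes each u_i into v_i; an estimate of J^{-1}(x0)
x1 = x0 - A @ F(x0)             # the step of equation 1.1.2 with A in place of J^{-1}(x0)
print(x1)                       # noticeably closer to the root (1, 2) than x0

With m = n independent pairs the transformation is simply VU^{-1}; the chapters that follow treat the general case in which such a transformation may not exist or may not be unique.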

CHAPTER 2
EMULATION AND APPROXIMATION

2.1 Notation and Definitions

In this chapter a general structure is presented which unifies many recent algorithms for function minimization and root finding. Although the unification of these algorithms has been the primary motivation for this work, some of the material may be of value in the solution of other problems.

We will be concerned with a sequence of sets {S_i}, i ∈ I_p, where the index set I_p is some subset of I, the non-negative integers. The elements of the sets S_i are ordered, and consist of data obtained from some process such as the evaluation of a function or the gradient of a function. More specifically, each set S_i contains a finite number k(i) ≥ 0 of elements from the product space R^m x R^n, n,m > 0. Each set S_i will be viewed as defining a relation on some elements of R^m and R^n. That is, u ∈ R^m is related to v ∈ R^n under the relation S_i if and only if (u,v) ∈ S_i. When not empty the set S_i will also be viewed as a pair of matrices (U_i,V_i) whose j th columns u_j ∈ R^m and v_j ∈ R^n respectively make up the j th element (u_j,v_j) of S_i. Clearly, if k(i) > 0, U_i is m x k(i) and V_i is n x k(i). In most cases the ordering of the set S_i is not important. If the ordering of S_i is significant it will be specified. If S_i^x is any subset of a set S_i then the elements of S_i^x will retain the relative ordering assigned in S_i. This subset will also be viewed as a relation on R^m x R^n or as a pair of matrices (U_i^x,V_i^x) formed as just described for S_i. If S_i^x is the empty set then the relation on R^m x R^n as well as the matrices (U_i^x,V_i^x) will be undefined.

The linear space of all linear transformations from R^m into R^n will be denoted by T. U_k and V_k will denote the linear spaces of all linear transformations from R^k into R^m and from R^k into R^n respectively. Of course, if elements of R^m, R^k and R^n are viewed as column vectors, elements of T, U_k and V_k have as their concrete representation matrices of the appropriate size, and conversely any matrix of the appropriate size represents a linear transformation in one of these spaces. In some instances a matrix will be interpreted as an ordered set of elements in some vector space where the i th column of the matrix is the i th element of the set.

By the notation R(A) we mean the range of the linear transformation A. By the notation R^⊥(A) we mean the orthogonal complement of R(A); that is, the set of x ∈ R^n such that <x,y> = Σ_{i=1}^n x^i y^i = 0 for any y ∈ R(A), where x^i and y^i are the i th components of x and y respectively. The notation N(A) will denote the null space of A and the notation N^⊥(A) will denote the orthogonal complement of N(A). By A' or x' we will mean the transpose of the matrix A or the vector x.

In the discussion of Chapter 1, section 1.1, it was seen that if the data set S_i = (U_i,V_i) was derived from a root finding or minimization problem then a linear transformation that takes the columns of U_i into the columns of V_i is an estimate of the inverse Hessian or Jacobian matrix. The following definitions formalize this notion in a slightly more general context. It is assumed that {S_i}, i ∈ I_p, is a sequence of data sets with k(i) elements in R^m x R^n, that V_k(i) is equipped with some norm denoted by || ||_k(i), and that A is some subset of T.

Definition 2.1.1  A_i ∈ A approximates the data set or relation S_i = (U_i,V_i) over A iff

||A_iU_i - V_i||_k(i) = min_{A ∈ A} ||AU_i - V_i||_k(i).   (2.1.1)

If S_i is empty then any A_i ∈ A approximates S_i over A.

Definition 2.1.2  A_i ∈ A emulates the data set or relation S_i = (U_i,V_i) over A iff

||A_iU_i - V_i||_k(i) = 0.   (2.1.2)

If S_i is empty then any A_i ∈ A emulates S_i over A.

Definition 2.1.3  The sequence {A_i} ∈ A approximates (emulates) the sequence of data sets {S_i} over A iff A_i approximates (emulates) S_i over A for all i ∈ I_p.

Definitions 2.1.1 and 2.1.2 can be given a geometric interpretation. The product space R^m x R^n is itself a linear space of dimension m + n, and a linear transformation A_i ∈ T defines a subspace of R^m x R^n with dimension m whose elements are all the pairs (u,v) ∈ R^m x R^n where v ∈ R^n is the image under A_i of u ∈ R^m and where u is arbitrary. If u_1,...,u_m is any basis for R^m then the pairs (u_1,A_iu_1),...,(u_m,A_iu_m) are a basis for this subspace. A transformation A_i approximates S_i over A if the "vertical distance" between the elements of S_i and the subspace defined by A_i is minimized over all elements of A. If all elements of S_i lie in the subspace defined by A_i then A_i emulates S_i. By vertical distance we mean the set distance induced by the norm || ||_k(i) which, of course, weights all elements of S_i. If the matrix norm || ||_k(i) is invariant with respect to the ordering of the columns of the matrix then the ordering of the elements of S_i is of no consequence. However, if the norm is dependent on column ordering then this vertical distance, and the family of approximations to S_i based on it, will vary with the ordering of S_i. In all cases treated herein the norm || ||_k(i) is invariant with respect to column

ordering.

It may be that the data sets S_i are better approximated by an affine transformation than by a linear transformation. That is, there may exist a pair (A_i,b_i) ∈ T x R^n such that

||A_iU_i + b_ie - V_i||_k(i) ≤ ||AU_i - V_i||_k(i)   (2.1.3)

for all A ∈ T, where e is the row vector (1,...,1). However, this approximation is only superficially different from the approximation being considered since

A_iU_i + b_ie = [A_i,b_i] [U_i]
                          [ e ].   (2.1.4)

Here and in the rest of the text the square brackets denote a partitioning of a matrix. Thus the problem of determining an affine transformation from R^m into R^n can be viewed as the problem of determining a linear transformation from R^{m+1} into R^n where the matrix U_i is augmented by the row vector e and A_i is augmented by the column vector b_i. In the particular case of the minimization and root finding problems, if the elements of S_i are obtained by differencing then it is known, a priori, that -s is an element of S_i whenever s is an element of S_i. In this situation a linear transformation is the best approximation to S_i over T.

The existence of an A_i which approximates S_i over A is not guaranteed by the definition. However, the sets A considered herein will always contain an approximating A_i. Since the A_i which approximates S_i over A may not be unique, reference will be made to the set of A_i which approximate S_i over A. An A_i which emulates S_i over A may

not exist. If one does exist it may not be unique, in which case reference will be made to the set of A_i which emulate S_i over A. Clearly, since an A_i which approximates (emulates) S_i over A may not be unique, neither is a sequence {A_i} which approximates (emulates) the sequence {S_i} over A necessarily unique.

Notice that if the norm || ||_k(i) is chosen to be independent of column ordering then the set of A_i which approximate S_i over T is invariant over all orderings of the set S_i. The set of A_i which emulate S_i over T is invariant over all orderings of S_i for any norm || ||_k(i).

Extensive use will be made of generalized inverses of a matrix. At this point algebraic definitions of a generalized inverse and the pseudoinverse are given.

Definition 2.1.4  Let M be an arbitrary n x m matrix. A generalized inverse (g-inverse) of M (denoted by M^-) is any m x n matrix such that MM^-M = M.

Definition 2.1.5  Let M be an arbitrary n x m matrix. The pseudoinverse (p-inverse) of M (denoted by M^+) is the m x n matrix such that MM^+M = M, M^+MM^+ = M^+, (M^+M)' = M^+M and (MM^+)' = MM^+.

Some elementary properties of g-inverses are worthy of note.

1. If M^- is any g-inverse of M then for all x ∈ R(M), y = M^-x satisfies My = x.

2. If M^- is any g-inverse of M the operator MM^- projects vectors onto R(M) and the operator (I - MM^-) projects vectors onto R^⊥(M). Therefore MM^- is the identity operator when restricted to R(M) and (I - MM^-) is the identity operator when restricted to R^⊥(M).

3. If M^- is a g-inverse of M such that (M^-M)' = M^-M then M^-M is the identity operator when restricted to N^⊥(M).

4. If M^- is a g-inverse of M such that (MM^-)' = MM^- then MM^- projects vectors orthogonally onto R(M) and (I - MM^-) projects vectors orthogonally onto R^⊥(M).

5. If M^- is a g-inverse of M such that M^-MM^- = M^- then M^-x = 0 for all x ∈ R^⊥(M).

6. The p-inverse M^+ of M is unique, y = M^+x minimizes (x - My)'(x - My) and, among all vectors which minimize (x - My)'(x - My), y = M^+x is the unique vector which minimizes y'y.

7. The following identities hold for the p-inverse:

M'MM^+ = M'
(M^+)^+ = M
(M')^+ = (M^+)'
M'M'^+M^+ = M^+
(M'M)^+ = M^+M'^+

A more complete treatment of generalized inverses is given in (PEN 55), (PEN 54), (RAO 71), (DES 63) and (ALB 72).

2.2 Approximations

All of the concrete results which follow involve a particular norm on V_k(i) and on T which will be called a P-trace norm. If M is an arbitrary element of V_k(i) or T then the P-trace norm of M is

given by

||M||_P = (trace (M'PM))^{1/2}   (2.2.1)

where P is square, positive definite and symmetric. It is easily verified that this norm is independent of column ordering and therefore that the results are invariant over all orderings of the elements of S_i. The theorems are stated only for a particular element of the sequence {S_i}. The results can easily be restated for the entire sequence since by definition a result holds for the sequence {S_i} iff it holds for each element of the sequence.

Theorem 2.2.1  If the norm on V_k(i) is a P-trace norm and the data set S_i = (U_i,V_i) is not empty then the set of approximations to S_i over T is the set of A_i which satisfy the matrix equation

A_iU_iU_i' = V_iU_i'.   (2.2.2)

Proof: Under this norm

||AU_i - V_i||_P = (trace (AU_i - V_i)'P(AU_i - V_i))^{1/2}.   (2.2.3)

Since this function is non-negative, its minima are unchanged if we deal with ||AU_i - V_i||_P^2. Clearly ||AU_i - V_i||_P^2 is quadratic in the elements of A and differentiable with respect to the elements of A. Therefore A_i ∈ T is an approximation to S_i over T iff

d/dA ||AU_i - V_i||_P^2 |_{A = A_i} = d/dA trace (AU_i - V_i)'P(AU_i - V_i) |_{A = A_i} = 0   (2.2.4)

where d/dA f(A) denotes the matrix whose ij th component is the partial derivative of f with respect to the ij th element of the matrix A. From the rules for differentiation of the trace function in Appendix I we have

d/dA ||AU_i - V_i||_P^2 = 2PAU_iU_i' - 2PV_iU_i'.   (2.2.5)

Since P is nonsingular, A_i ∈ T approximates S_i over T iff

A_iU_iU_i' - V_iU_i' = 0.   (2.2.6)

Q.E.D.

Although it seems most reasonable to seek an approximation to S_i over the whole space T, the structure of the problem may lead one to restrict the class of admissible transformations. In the minimization problem the Hessian matrix is symmetric everywhere and is positive definite and symmetric at any isolated minimum. The following theorem deals with the approximation problem on the subspace T_s of symmetric matrices and on the subspace T_ss of skew-symmetric matrices.

Theorem 2.2.2  If the norm on V_k(i) is a P-trace norm and the data set S_i = (U_i,V_i) is not empty then the set of approximations to S_i over T_s is the set of symmetric A_i which satisfy the matrix equation

PA_iU_iU_i' + U_iU_i'A_iP = PV_iU_i' + U_iV_i'P   (2.2.7)

and the set of approximations to S_i over T_ss is the set of skew-symmetric A_i which satisfy the matrix equation

PA_iU_iU_i' + U_iU_i'A_iP = PV_iU_i' - U_iV_i'P.   (2.2.8)

Proof: Again we may deal with the function ||AU_i - V_i||_P^2 without affecting the location of any minima. Since ||AU_i - V_i||_P^2 is quadratic in the components of A and differentiable with respect to the elements of A, and since T_s and T_ss are subspaces of T, at any

minimum A_i of ||AU_i - V_i||_P^2 restricted to T_s (or T_ss) it must be that

trace ( (d/dA ||AU_i - V_i||_P^2 |_{A = A_i})' Ā ) = 0   (2.2.9)

for all Ā ∈ T_s (or T_ss). This follows since at a minimum the gradient must be orthogonal to the subspace T_s (or T_ss). Using the rules for differentiation of trace functions, equation 2.2.9 is equivalent to requiring that

trace (A_iU_iU_i' - V_iU_i')'P Ā = 0   (2.2.10)

for all Ā ∈ T_s (or T_ss). Because the trace is a linear operation, equation 2.2.10 is equivalent to requiring that

trace (A_iU_iU_i' - V_iU_i')'P Ā_kj = 0   (2.2.11)

for all Ā_kj ∈ {Ā_kj}, where {Ā_kj} is a basis for T_s (or T_ss). Let {A_kj} be the set of matrices whose kj th element equals one and all other elements equal zero. Then a basis for T_s is {A_kj + A_jk} and a basis for T_ss is {A_kj - A_jk}. Using the basis for T_s we have that at a minimum A_i of ||AU_i - V_i||_P^2

trace (A_iU_iU_i' - V_iU_i')'P (A_kj + A_jk) = 0   (2.2.12)

or

trace (A_iU_iU_i' - V_iU_i')'P A_kj + trace (A_iU_iU_i' - V_iU_i')'P A_jk = 0.   (2.2.13)

This is equivalent to requiring that P(A_iU_iU_i' - V_iU_i') be skew-symmetric, or that

P(A_iU_iU_i' - V_iU_i') + (A_iU_iU_i' - V_iU_i')'P = 0.   (2.2.14)

A similar argument for T_ss completes the proof. Q.E.D.

It is not easy to solve equations 2.2.7 or 2.2.8 either explicitly or computationally. In minimization algorithms perhaps an even more

useful set over which to seek an approximation is the set of positive definite or positive definite and symmetric matrices. If the matrices are constrained to be positive definite then it is guaranteed that the search direction will be downhill. A result parallel to theorems 2.2.1 and 2.2.2 has not been obtained for these classes of transformations. The solution of the approximation problem on other subsets of T is a topic which is worthy of further research. In this dissertation we pursue in depth only the approximations to S_i over T.

In order to explicitly characterize the family of approximations to S_i over T it is convenient to apply a result from the theory of g-inverses.

Theorem 2.2.3  The matrix equation BXC = D has a solution for X iff

BB^-DC^-C = D   (2.2.15)

for any g-inverses B^- and C^- of B and C respectively. If a solution exists then the general solution is given by

X = B^-DC^- + Y - B^-BYCC^-   (2.2.16)

where Y is arbitrary and B^- and C^- are any g-inverses. A proof of this result may be found in (RAO 72).

The following theorem is a consequence of theorem 2.2.3.

Theorem 2.2.4  If the norm on V_k(i) is a P-trace norm and if the data set S_i = (U_i,V_i) is not empty then the set of approximations to S_i over T is a non-empty linear manifold in T of dimension n x (m - r) where r is the rank of U_i. This manifold is given by the formula

A_i = V_iU_i^+ + Y_i(I - U_iU_i^+)   (2.2.17)

where Y_i is n x m and arbitrary. Further,

||V_iU_i^+||_{P_1} ≤ ||V_iU_i^+ + Y_i(I - U_iU_i^+)||_{P_1}   (2.2.18)

with equality only when Y_i(I - U_iU_i^+) = 0, for any P_1-trace norm on T.

Proof: Since U_i^+ is uniquely defined for any matrix, equation 2.2.17 yields a set of A_i which is not empty. The matrix (I - U_iU_i^+) operating on row vectors in R^m projects elements of R^m orthogonally onto R^⊥(U_i), which is a subspace of R^m with dimension (m - r). Viewing Y_i as n arbitrary row elements of R^m, the dimension of the subspace of T formed by Y_i(I - U_iU_i^+), Y_i arbitrary, is n x (m - r), which establishes that equation 2.2.17 defines a manifold of dimension n x (m - r).

From theorem 2.2.1, A_i approximates S_i over T iff

A_iU_iU_i' = V_iU_i'.   (2.2.19)

Applying theorem 2.2.3, with the p-inverse as the specific g-inverse, there exists a solution of equation 2.2.19 for A_i iff

V_iU_i'(U_iU_i')^+(U_iU_i') = V_iU_i'.   (2.2.20)

However, by property 7 of a p-inverse

V_iU_i'(U_iU_i')^+(U_iU_i') = V_iU_i'(U_i'^+U_i^+)(U_iU_i')   (2.2.21)
                            = V_iU_i^+U_iU_i'
                            = V_iU_i'.

Therefore there always exists an approximation to S_i over T. Also by theorem 2.2.3 the set of approximations to S_i over T is given by

A_i = V_iU_i'(U_iU_i')^+ + Y_i(I - U_iU_i'(U_iU_i')^+)   (2.2.22)
    = V_iU_i'U_i'^+U_i^+ + Y_i(I - U_iU_i'U_i'^+U_i^+)
    = V_iU_i^+ + Y_i(I - U_iU_i^+)

where Y_i is arbitrary. It remains to be shown that V_iU_i^+ is the approximation to S_i over T of minimum norm under any P_1-trace norm on T.

We consider the square of the norm, noting that the minima are unchanged by the squaring operation. Substituting, and noting that the cross terms vanish since U_i^+(I - U_iU_i^+) = 0,

||V_iU_i^+ + Y_i(I - U_iU_i^+)||_{P_1}^2
  = trace (V_iU_i^+ + Y_i(I - U_iU_i^+))'P_1(V_iU_i^+ + Y_i(I - U_iU_i^+))   (2.2.23)
  = trace P_1^{1/2}(V_iU_i^+ + Y_i(I - U_iU_i^+))(V_iU_i^+ + Y_i(I - U_iU_i^+))'P_1^{1/2}
  = trace P_1^{1/2}(V_iU_i^+)(V_iU_i^+)'P_1^{1/2} + trace P_1^{1/2}(Y_i(I - U_iU_i^+))(Y_i(I - U_iU_i^+))'P_1^{1/2}
  = ||V_iU_i^+||_{P_1}^2 + ||Y_i(I - U_iU_i^+)||_{P_1}^2

which proves the final assertion. Q.E.D.

Notice that direct application of theorem 2.2.3 yields the following representation for the family of approximations to S_i over T:

A_i = V_iU_i'(U_iU_i')^- + Y_i(I - U_iU_i'(U_iU_i')^-)   (2.2.24)

where Y_i is arbitrary and (U_iU_i')^- is any g-inverse of U_iU_i'. The simpler form of theorem 2.2.4 only holds if the g-inverse chosen is the p-inverse.

The following two theorems characterize the set of approximations to S_i over T in terms of a specific approximation Ā_i.
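Before stating them, the characterization already obtained is easy to exercise numerically. In the sketch below (Python with numpy, with P_1 = I so that the trace norms reduce to the Frobenius norm; the random data are arbitrary and serve only as an illustration) the matrix V_iU_i^+ satisfies the normal equation 2.2.2, every member of the manifold of equation 2.2.17 produces the same residual A_iU_i - V_i, and the Y_i = 0 member has the smallest norm, as in equation 2.2.18.

import numpy as np

rng = np.random.default_rng(3)
n, m, k = 3, 5, 3
U = rng.standard_normal((m, k))               # U_i is m x k(i)
U[:, 2] = U[:, 0] + U[:, 1]                   # dependent columns, so an emulation need not exist
V = rng.standard_normal((n, k))               # V_i is n x k(i)
Up = np.linalg.pinv(U)
A0 = V @ Up                                   # the Y_i = 0 member of equation 2.2.17

print(np.allclose(A0 @ U @ U.T, V @ U.T))     # the normal equation 2.2.2 holds

for _ in range(3):
    Y = rng.standard_normal((n, m))
    A = V @ Up + Y @ (np.eye(m) - U @ Up)     # an arbitrary member of the manifold
    print(np.allclose(A @ U, A0 @ U),                          # identical residual
          np.linalg.norm(A0) <= np.linalg.norm(A) + 1e-12)     # V_i U_i^+ has least norm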

Theorem 2.2.5  If S_i = (U_i,V_i) is not empty and if Ā_i approximates S_i over T then the set of all approximations to S_i over T is given by

A_i = Ā_i - W_i(I - U_iU_i^+)   (2.2.25)

where W_i is n x m and arbitrary.

Proof: If the solutions to equation 2.2.17 are viewed as any particular solution plus the set of solutions to the homogeneous equation, the result follows. We have from theorem 2.2.4

Ā_i = V_iU_i^+ + Y_i(I - U_iU_i^+)   (2.2.26)

for some matrix Y_i. Substituting this into the proposed solution for an arbitrary A_i gives

A_i = V_iU_i^+ + Y_i(I - U_iU_i^+) - W_i(I - U_iU_i^+)   (2.2.27)
    = V_iU_i^+ + (Y_i - W_i)(I - U_iU_i^+)

where W_i is arbitrary. This is the set of all approximations to S_i over T. Q.E.D.

Theorem 2.2.6  If S_i = (U_i,V_i) is not empty and if Ā_i approximates S_i over T then the set of all Y_i in equation 2.2.17 which yield Ā_i is a linear manifold in T of dimension n x r where r is the rank of U_i. This manifold is given by

Y_i = Ā_i + X_iU_iU_i^+   (2.2.28)

where X_i is n x m and arbitrary.

Proof: We seek the set of Y_i which are solutions to the equation

Y_i(I - U_iU_i^+) = Ā_i - V_iU_i^+.   (2.2.29)

By theorem 2.2.3 a solution for Y_i exists iff

(Ā_i - V_iU_i^+)(I - U_iU_i^+)^+(I - U_iU_i^+) = (Ā_i - V_iU_i^+).   (2.2.30)

Because (I - U_iU_i^+) is a projection, (I - U_iU_i^+)^+ = (I - U_iU_i^+), see (PEN 54), and therefore this is equivalent to requiring that

Ā_iU_iU_i^+ = V_iU_i^+.   (2.2.31)

Since Ā_i is an approximation to S_i over T,

Ā_iU_iU_i' = V_iU_i'.   (2.2.32)

Postmultiplying by U_i'^+U_i^+ and using property 7 of a p-inverse we have that

Ā_iU_iU_i^+ = V_iU_i^+   (2.2.33)

and therefore a solution for Y_i must exist. This simply verifies the result of theorem 2.2.4. Using theorem 2.2.3 the family of solutions for Y_i is given by

Y_i = (Ā_i - V_iU_i^+)(I - U_iU_i^+)^+ + X_i(I - (I - U_iU_i^+)(I - U_iU_i^+)^+)   (2.2.34)

where X_i is arbitrary. Because (I - U_iU_i^+) is a projection operator this reduces to Y_i = Ā_i - V_iU_i^+ + X_iU_iU_i^+, and since V_iU_i^+ = V_iU_i^+U_iU_i^+ the term V_iU_i^+ may be absorbed into the arbitrary term, so that

Y_i = Ā_i + X_iU_iU_i^+   (2.2.35)

where X_i is arbitrary. The dimension of this manifold is clearly n x r since U_iU_i^+ projects the n rows of X_i onto R(U_i). Q.E.D.

2.3 Emulations

If an A_i exists which emulates S_i over T then any emulating matrix is an approximating matrix and any approximating matrix is an emulating matrix. Although any emulation is also an approximation, the family of emulations can be analyzed further. The set of A_i which emulate S_i over T is the set of solutions to the matrix equation

A_iU_i = V_i.   (2.3.1)

These solutions are characterized by the following theorem.

Theorem 2.3.1  If S_i = (U_i,V_i) is not empty the set of A_i which emulate S_i over T is not empty iff the following equivalent conditions are satisfied:

V_iU_i^-U_i = V_i   (2.3.2)

Rank U_i = Rank [U_i]
                [V_i]   (2.3.3)

Further, if not empty, the set of A_i which emulate S_i over T is given by

A_i = V_iU_i^- + Y_i(I - U_iU_i^-)   (2.3.4)

where U_i^- is any g-inverse of U_i and Y_i is n x m and arbitrary.

Proof: Direct application of theorem 2.2.3 yields the condition 2.3.2 as well as the formula which characterizes the set of A_i which emulate S_i over T. Equation 2.3.3 is simply a statement that a solution exists for A_i iff a solution exists for each row of A_i. This rank condition for vector equations is well known. Q.E.D.

Notice that since any emulation is also an approximation to S_i over T, all of the results of section 2.2 remain valid for emulations. A particular condition for the existence of an A_i which emulates S_i over T is that U_i have independent columns.

The conditions of theorems 2.2.5 and 2.2.6 can be relaxed if one is dealing with a set S_i for which an emulation exists over T.

Theorem 2.3.2  If S_i = (U_i,V_i) is not empty and if Ā_i emulates S_i over T then the set of all A_i which emulate S_i over T is given by

A_i = Ā_i - W_i(I - U_iU_i^-)   (2.3.5)

where W_i is n x m and arbitrary and where U_i^- is any g-inverse of U_i.

Theorem 2.3.3  If S_i = (U_i,V_i) is not empty and if Ā_i emulates S_i over T then the set of all Y_i in equation 2.3.4 which yield Ā_i is a linear manifold in T of dimension n x r where r is the rank of U_i. This manifold is given by

Y_i = Ā_i + X_iU_iU_i^-   (2.3.6)

where X_i is n x m and arbitrary and U_i^- is any g-inverse of U_i.
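As a small numerical illustration of theorem 2.3.1 (a sketch in Python with numpy; the p-inverse is used as the particular g-inverse, and the data set is deliberately constructed so that an emulation is guaranteed to exist), condition 2.3.2 is checked and A_iU_i = V_i is verified for several members of the family in equation 2.3.4.

import numpy as np

rng = np.random.default_rng(4)
n, m, k = 3, 5, 4
U = rng.standard_normal((m, k))
U[:, 3] = U[:, 0] + U[:, 1]          # a dependent column, so U_i does not have independent columns
B = rng.standard_normal((n, m))
V = B @ U                            # forces Rank U_i = Rank [U_i; V_i], so an emulation exists
Ug = np.linalg.pinv(U)               # the p-inverse is one admissible g-inverse

print(np.allclose(V @ Ug @ U, V))    # condition 2.3.2
for _ in range(3):
    Y = rng.standard_normal((n, m))
    A = V @ Ug + Y @ (np.eye(m) - U @ Ug)   # a member of the family in equation 2.3.4
    print(np.allclose(A @ U, V))            # each member emulates S_i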

The proofs of both these results are entirely parallel to the proofs of theorems 2.2.5 and 2.2.6.

2.4 Sequential Characterization of Approximation and Emulation

We have thus far been concerned with the determination of an approximation to a particular set S_i, a member of a sequence of data sets. If two data sets S_i and S_j have no common elements, then the problems of determining an emulation or approximation to S_i and S_j must be treated independently. However, if S_i and S_j are not disjoint, considerable simplification may be possible. It is assumed here that S_i and S_j are any two data sets. In the applications of Chapter 3, S_i will be the successor to S_j in some sequence of data sets. We will denote the elements of S_i also contained in S_j by S_ij^p and the elements of S_i not contained in S_j by S_ij^q. Elements of the set S_i will be assumed to be ordered so that all elements of S_i contained in S_ij^p precede all elements of S_i contained in S_ij^q. The elements of the sets S_ij^p and S_ij^q will retain the relative ordering assigned them in S_i. The ordering of S_j is of no consequence. Associated with the sets S_ij^p and S_ij^q are the pairs of matrices (U_ij^p,V_ij^p) and (U_ij^q,V_ij^q) which will be viewed as their concrete representation.

Before proceeding, two theorems from the theory of generalized inverses are presented which are useful in the computation of an emulation or approximation to S_i over T.

Theorem 2.4.1  Let the n x m matrix A and the vector a ∈ R^n be given. If a ∉ R(A), a g-inverse of [A,a] is given by

[A,a]^- = [A^-(I - ab')]
          [     b'     ]   (2.4.1)

where A^- is any g-inverse of A and where b' = a'(I - AA^-)'(I - AA^-)(a'(I - AA^-)'(I - AA^-)a)^{-1}. If a ∈ R(A), a g-inverse of [A,a] is

given by

[A,a]^- = [A^-(I - ab')]
          [     b'     ]   (2.4.2)

where A^- is any g-inverse of A and where b is arbitrary. Further, if a ∉ R(A), the p-inverse of [A,a] is given by

[A,a]^+ = [A^+(I - ab')]
          [     b'     ]   (2.4.3)

where A^+ is the p-inverse of A and where b' = a'(I - AA^+)(a'(I - AA^+)a)^{-1}, and if a ∈ R(A), the p-inverse of [A,a] is given by

[A,a]^+ = [A^+(I - ab')]
          [     b'     ]   (2.4.4)

where b' = a'A^+'A^+(1 + a'A^+'A^+a)^{-1}.

Theorem 2.4.2  Let the n x m matrix A and the vector a ∈ R^n be given, and let

[B ]
[b']

be any g-inverse of [A,a], where B is m x n and b ∈ R^n. If a ∈ R(A) and a'b ≠ 1 then B(I + a(1 - a'b)^{-1}b') is a g-inverse of A. Further, suppose

[B ]
[b']

is the p-inverse of [A,a]. If a ∉ R(A) then B(I - b(b'b)^{-1}b') is the p-inverse of A. If a ∈ R(A) and a'b ≠ 1 then B(I + a(1 - a'b)^{-1}b') is the p-inverse of A.

The results of both these theorems are presented in (RAO 71, section 3.6). Although no formal proof is given, the author states that the results follow by straightforward computation. The parts of theorems 2.4.1 and 2.4.2 dealing with the p-inverse have also been published in (GRE 60) and (CLI 64) respectively. The results of theorem 2.4.2 can be formulated for the case where the vector a is a matrix. This is done by Cline (CLI 64). Such generalizations are of questionable computational value.

The usefulness of both these theorems is easily demonstrated. Theorem 2.4.1 yields a direct procedure for obtaining a g-inverse, or the

p-inverse, of any matrix. For instance, consider the n x m matrix A = [a_1,...,a_m] where m and n are arbitrary. It can be directly verified that the p-inverse of the single column matrix a_1 is given by a_1'(a_1'a_1)^{-1}. Denoting the matrix [a_1,...,a_{i-1}] by A_{i-1}, if some g-inverse of A_{i-1} is known, in order to apply theorem 2.4.1 to the matrix [A_{i-1},a_i] it is only necessary to determine if a_i ∈ R(A_{i-1}). The matrix A_{i-1}A_{i-1}^- can be used to check if a_i ∈ R(A_{i-1}) since this operator projects vectors onto R(A_{i-1}). Specifically, a_i ∈ R(A_{i-1}) iff (I - A_{i-1}A_{i-1}^-)a_i = 0.

To determine the family of approximations to a data set S_i = (U_i,V_i) over T it is necessary to compute the p-inverse of U_i. If no information about the set S_i is available, repeated application of the procedure just outlined will yield the p-inverse of U_i. However, if the p-inverse of a matrix U_j is known, where U_j arises from a data set S_j = (U_j,V_j) and S_ij^p is not empty, then the p-inverse of U_j can be utilized to simplify the computation of the p-inverse of U_i. To accomplish this, notice that if P is any permutation matrix and U_j^+ is the p-inverse of U_j then the p-inverse of U_jP is given by P'U_j^+. Therefore, given the p-inverse of U_j under some ordering of S_j, the p-inverse of Ū_j can easily be found, where Ū_j arises from a data set with the same elements as S_j but with a different ordering. If this ordering is chosen such that the elements of S_j also contained in S_ij^p precede all other elements, then, having computed the p-inverse of Ū_j, the computation specified by theorem 2.4.2 can be performed repeatedly until the p-inverse of U_ij^p is formed. Having determined the p-inverse of U_ij^p,

the p-inverse of U_i can be computed by repeated application of theorem 2.4.1. The same process is valid for g-inverses of U_i other than the p-inverse. This procedure is discussed in greater detail in section 3.3 and is employed in the computations of Chapter 5.

The procedures discussed up to this point for calculating the p-inverse of U_i from the p-inverse of U_j are useful in determining the family of approximations to S_i over T whether or not these approximations also emulate S_i. If it is known a priori that an A_i exists which emulates S_i over T and an A_j exists which emulates S_j over T then further results can be obtained.

Theorem 2.4.3  Suppose S_i = (U_i,V_i) and S_j = (U_j,V_j) are not empty and that A_j emulates S_j over T. If S_ij^p and S_ij^q are not empty then A_i emulates S_i over T iff

(A_i - A_j)U_ij^p = 0   (2.4.5)

and

A_iU_ij^q = V_ij^q.   (2.4.6)

Further, if S_ij^q is empty then A_i emulates S_i over T iff

(A_i - A_j)U_ij^p = 0.   (2.4.7)

Proof: By definition, if S_ij^p and S_ij^q are not empty, A_i emulates S_i over T iff

A_iU_ij^p = V_ij^p   (2.4.8)

and

A_iU_ij^q = V_ij^q   (2.4.9)

and if S_ij^q is empty, A_i emulates S_i over T provided

A_iU_ij^p = V_ij^p.   (2.4.10)

Since A_j emulates S_j over T and S_ij^p is contained in S_j,

A_jU_ij^p = V_ij^p,   (2.4.11)

from which the result of the theorem follows immediately. Q.E.D.

The case where S_ij^p and S_ij^q are both empty is excluded since this implies S_i is empty. The case S_ij^p empty produces no simplification since if S_ij^p is empty then S_i and S_j are disjoint.

Because of the desirability of determining an emulation or approximation to S_i given an emulation or approximation to S_j, a particular choice for the matrix Y_i in equation 2.2.17 seems logical. If A_j approximates S_j over T then the choice Y_i = A_j yields the following particular approximation to S_i over T:

A_i = A_j + V_iU_i^+ - A_jU_iU_i^+.   (2.4.12)

If it is known that emulations to both S_i and S_j over T exist, a different but similar formula can be obtained.

Theorem 2.4.4  Suppose S_i = (U_i,V_i) and S_j = (U_j,V_j) are not empty and that A_j emulates S_j over T. Suppose an A_i exists which emulates S_i over T and that S_ij^p and S_ij^q are not empty. If the elements of S_i

are ordered such that all elements of S_i contained in S_ij^p precede all elements of S_i contained in S_ij^q and if

U_i^- = [U_i^1]
        [U_i^2]   (2.4.13)

is any g-inverse of U_i = [U_ij^p,U_ij^q], where U_i^1 has as many rows as U_ij^p has columns and U_i^2 has as many rows as U_ij^q has columns, then

A_i = A_j + V_ij^qU_i^2 - A_jU_ij^qU_i^2   (2.4.14)

emulates S_i over T. Further, if S_ij^q is empty then A_i = A_j emulates S_i over T.

Proof: We will verify directly that the equation A_iU_i = V_i is satisfied. We have

A_iU_i = A_jU_i + V_ij^qU_i^2U_i - A_jU_ij^qU_i^2U_i   (2.4.15)
       = A_jU_i + ĀU_ij^qU_i^2U_i - A_jU_ij^qU_i^2U_i

where Ā is some matrix which emulates S_i over T. It follows from theorem 2.4.3 that (Ā - A_j)U_ij^p = 0, and therefore

A_iU_i = A_jU_i + (Ā - A_j)U_ij^qU_i^2U_i   (2.4.16)
       = A_jU_i + (Ā - A_j)U_ij^pU_i^1U_i + (Ā - A_j)U_ij^qU_i^2U_i
       = A_jU_i + (Ā - A_j)U_iU_i^-U_i
       = A_jU_i + (Ā - A_j)U_i
       = ĀU_i
       = V_i.

The second result of the theorem is evident since if S_ij^q is empty then S_i ⊂ S_j and if A_j emulates S_j over T, A_j must also emulate S_i over T. Q.E.D.
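The particular choice Y_i = A_j leading to equation 2.4.12 already has the flavor of the recursive updates taken up in Chapter 3. The sketch below (Python with numpy; the matrices are random stand-ins for a previously computed A_j and a new data set, chosen only for illustration) checks that the matrix of equation 2.4.12 always satisfies the normal equation 2.2.2, and that it emulates S_i whenever an emulation exists.

import numpy as np

rng = np.random.default_rng(5)
n = m = 4
A_j = rng.standard_normal((n, m))            # a previously obtained approximation to S_j
U_i = rng.standard_normal((m, 2))            # new data set S_i = (U_i, V_i)
V_i = rng.standard_normal((n, 2))
Up = np.linalg.pinv(U_i)

A_i = A_j + V_i @ Up - A_j @ U_i @ Up        # equation 2.4.12
print(np.allclose(A_i @ U_i @ U_i.T, V_i @ U_i.T))   # A_i approximates S_i (equation 2.2.2)

# when V_i is the image of U_i under a single transformation, an emulation exists and is attained
V_em = rng.standard_normal((n, m)) @ U_i
A_em = A_j + V_em @ Up - A_j @ U_i @ Up
print(np.allclose(A_em @ U_i, V_em))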

Notice that theorem 2.3.2 can be applied to obtain the entire family of A_i which emulate S_i over T from the particular A_i given by either equation 2.4.12 or equation 2.4.14. Similarly theorem 2.3.3 can be applied to obtain the family of Y_i in equation 2.3.4 which yield the particular A_i of equation 2.4.12 or 2.4.14.

Theorem 2.4.4 is interesting because equation 2.4.14 is quite similar in form to some of the recursive formulas used in minimization algorithms and because equation 2.4.14 promises to reduce the number of calculations required to determine an A_i which emulates S_i over T. However, the problem of determining U_i^2 without first determining a g-inverse of U_i has not been solved. Another formula for a specific emulation which circumvents this difficulty will now be obtained. Although different from the formula of theorem 2.4.4 the result is very similar. As will be shown in Chapter 3, specific choices for the matrices Z_i^1 and Z_i^2 in the following theorem yield generalizations of the recursive formulas used in several minimization algorithms.

Theorem 2.4.5  Suppose S_i = (U_i,V_i) and S_j = (U_j,V_j) are not empty and that A_j emulates S_j over T. Suppose S_ij^p and S_ij^q are not empty and that the elements of S_i are ordered such that all elements of S_i contained in S_ij^p precede all elements of S_i contained in S_ij^q. If there exist matrices Z_i^1 (r x m) and Z_i^2 (k x m), for some r,k > 0, which satisfy

Z_i^1U_ij^p = 0   (2.4.17)

Z_i^2U_ij^p = 0   (2.4.18)

V_ij^q(Z_i^1U_ij^q)^-(Z_i^1U_ij^q) = V_ij^q   (2.4.19)

A_jU_ij^q(Z_i^2U_ij^q)^-(Z_i^2U_ij^q) = A_jU_ij^q   (2.4.20)

where (Z_i^1U_ij^q)^- is any g-inverse of Z_i^1U_ij^q and (Z_i^2U_ij^q)^- is any g-inverse of Z_i^2U_ij^q, then

A_i = A_j + V_ij^q(Z_i^1U_ij^q)^-Z_i^1 - A_jU_ij^q(Z_i^2U_ij^q)^-Z_i^2   (2.4.21)

emulates S_i over T. Further, if S_ij^p is empty and there exist matrices Z_i^1 and Z_i^2 such that equations 2.4.19 and 2.4.20 are satisfied then A_i given by equation 2.4.21 emulates S_i over T.

Proof: The proof verifies directly that the equation A_iU_i = V_i is satisfied. If S_ij^p is not empty,

A_iU_i = A_jU_i + V_ij^q(Z_i^1U_ij^q)^-Z_i^1[U_ij^p,U_ij^q]   (2.4.22)
         - A_jU_ij^q(Z_i^2U_ij^q)^-Z_i^2[U_ij^p,U_ij^q]
       = [A_jU_ij^p,A_jU_ij^q] + [0,V_ij^q] - [0,A_jU_ij^q]
       = [V_ij^p,V_ij^q]
       = V_i.

If S_ij^p is empty the proof follows the same argument. Q.E.D.

It is interesting to consider when certain of the four conditions of theorem 2.4.5 are necessary. Suppose S_i, S_j, S_ij^p, and S_ij^q are not empty and that A_j emulates S_j over T. If either equation 2.4.17 or equation 2.4.18 is satisfied then it is necessary that the other condition be satisfied if A_i given by equation 2.4.21 is to emulate S_i over T. Similarly, if either one of equations 2.4.19 and 2.4.20 is satisfied

then it is necessary that the other condition be satisfied if A_i given by equation 2.4.21 is to emulate S_i over T. Also of interest is the case where Z_i^1 = Z_i^2. If Z_i^1 = Z_i^2 then equations 2.4.17 and 2.4.18 are equivalent and equations 2.4.19 and 2.4.20 may be restated as follows:

(V_ij^q - A_jU_ij^q)(Z_i^1U_ij^q)^-(Z_i^1U_ij^q) = (V_ij^q - A_jU_ij^q).   (2.4.23)

Also in this case it is necessary that equation 2.4.17 (or 2.4.18) and equation 2.4.23 be satisfied if A_i is to emulate S_i over T. It is possible to state equations 2.4.19 and 2.4.20 in a slightly different manner as the following lemma shows.

Lemma 2.4.6  Let C^- be any g-inverse of the n x m matrix C. The equation DC^-C = D is satisfied iff there exists a solution for X of the matrix equation XC = D, or equivalently iff

Rank C = Rank [C]
              [D].

The proof of lemma 2.4.6 follows immediately from theorem 2.2.3 when B is chosen to be the identity. Equations 2.4.19 and 2.4.20 therefore are equivalent to the existence of a solution for X of the equations

XZ_i^1U_ij^q = V_ij^q   (2.4.24)

and

XZ_i^2U_ij^q = A_jU_ij^q   (2.4.25)

respectively. Using the rank condition, equations 2.4.19 and 2.4.20 are equivalent to requiring that

Rank (Z_i^1U_ij^q) = Rank [Z_i^1U_ij^q]
                          [  V_ij^q  ]   (2.4.26)

and

Rank (Z_i^2U_ij^q) = Rank [Z_i^2U_ij^q]
                          [ A_jU_ij^q ].   (2.4.27)

Clearly, these rank conditions will always be satisfied provided Z_i^1U_ij^q

and Z_i^2U_ij^q have independent columns.

The question of when matrices Z_i^1 and Z_i^2 exist which satisfy the conditions of theorem 2.4.5 is partially answered by the following theorem.

Theorem 2.4.7  Suppose S_i = (U_i,V_i) and S_j = (U_j,V_j) are not empty, that A_j emulates S_j over T and that there exists an A_i which emulates S_i over T. Suppose S_ij^p and S_ij^q are not empty and that the elements of S_i are ordered such that all elements of S_i contained in S_ij^p precede all elements of S_i contained in S_ij^q. If R(U_ij^p) and R(U_ij^q) are disjoint then there exist matrices Z_i^1 and Z_i^2, given by

Z_i^1 = [0,U_ij^q]U_i^- + Y_i^1(I - U_iU_i^-)   (2.4.28)

Z_i^2 = [0,U_ij^q]U_i^- + Y_i^2(I - U_iU_i^-)   (2.4.29)

where U_i^- is any g-inverse of U_i and Y_i^1 and Y_i^2 are arbitrary, which satisfy equations 2.4.17, 2.4.18, 2.4.19, and 2.4.20. Further, if S_ij^p is empty then there always exist matrices Z_i^1 and Z_i^2, given by

Z_i^1 = U_iU_i^- + Y_i^1(I - U_iU_i^-)   (2.4.30)

Z_i^2 = U_iU_i^- + Y_i^2(I - U_iU_i^-)   (2.4.31)

where U_i^- is any g-inverse of U_i and Y_i^1 and Y_i^2 are arbitrary, which satisfy equations 2.4.19 and 2.4.20.

Proof: If S_ij^p is not empty and R(U_ij^p) and R(U_ij^q) are disjoint then any dependent column of U_i = [U_ij^p,U_ij^q] which is a column of U_ij^q can be written as a linear combination of columns of U_ij^q, and any dependent column of [U_ij^p,U_ij^q] which is a column of U_ij^p can be written as a linear combination of columns of U_ij^p. Therefore

Rank [U_ij^p,U_ij^q] = Rank [U_ij^p  U_ij^q]
                            [  0     U_ij^q]   (2.4.32)

since dependent columns of the augmented matrix

have the same representation as dependent columns of the matrix [U_ij^p,U_ij^q]. This rank condition is satisfied iff there exists a solution for Z of the matrix equation

Z[U_ij^p,U_ij^q] = [0,U_ij^q].   (2.4.33)

From theorem 2.3.1 the family of all such solutions is given by

Z = [0,U_ij^q]U_i^- + Y(I - U_iU_i^-)   (2.4.34)

where U_i^- is any g-inverse of U_i and Y is arbitrary. Choosing Z_i^1 and Z_i^2 as any such solution, it can be directly verified that conditions 2.4.17, 2.4.18, and 2.4.20 are satisfied. Because it was assumed that an A_i exists which emulates S_i over T, there exists a matrix Ā such that ĀU_ij^q = V_ij^q. Substitution of this equation into condition 2.4.19 verifies that this condition must also be satisfied for such a choice of Z_i^1 and Z_i^2. Q.E.D.

Notice that it is not necessary that the matrices Z_i^1 and Z_i^2 satisfy the matrix equation 2.4.33 in order to satisfy equations 2.4.17 through 2.4.20.

Because of the similarity between equation 2.4.14 of theorem 2.4.4 and equation 2.4.21 of theorem 2.4.5 it is natural to ask how the two theorems are related. Observe that equation 2.4.14 admits only a single g-inverse U_i^- = [U_i^1; U_i^2] of U_i. That is to say, the proof of theorem 2.4.4 does not go through if the U_i^2 postmultiplying V_ij^q and the U_i^2 postmultiplying A_jU_ij^q are the bottom halves of different g-inverses of U_i. One would expect that under certain conditions the matrices (Z_i^1U_ij^q)^-Z_i^1 and (Z_i^2U_ij^q)^-Z_i^2 of equation 2.4.21 are the bottom halves of some g-inverse of U_i and that (Z_i^1U_ij^q)^-Z_i^1 = (Z_i^2U_ij^q)^-Z_i^2. If this is the case

then equation 2.4.21 is of the same form as equation 2.4.14. Clearly, if Z_i^1 = Z_i^2 then (Z_i^1U_ij^q)^-Z_i^1 = (Z_i^2U_ij^q)^-Z_i^2. Thus far no more general condition has been obtained under which this equality holds. The question of when the matrix (ZU_ij^q)^-Z is the bottom half of some g-inverse of U_i is answered by the following theorem.

Theorem 2.4.8  If U_i = [U_ij^p,U_ij^q], R(U_ij^p) and R(U_ij^q) are disjoint and the matrix Z satisfies

ZU_ij^p = 0   (2.4.35)

U_ij^q(ZU_ij^q)^-(ZU_ij^q) = U_ij^q   (2.4.36)

then there exists a family of g-inverses of U_i of the form

U_i^- = [     X      ]
        [(ZU_ij^q)^-Z]   (2.4.37)

with X given by

X = (U_ij^p)^-[U_ij^p,0]U_i^- + Y - (U_ij^p)^-U_ij^pYU_iU_i^-   (2.4.38)

where (U_ij^p)^- and U_i^- are any g-inverses of U_ij^p and U_i respectively and where Y is arbitrary.

Proof: From the definition of a g-inverse we seek matrices X which satisfy

[U_ij^p,U_ij^q] [     X      ] [U_ij^p,U_ij^q] = [U_ij^p,U_ij^q].   (2.4.39)
                [(ZU_ij^q)^-Z]

Expanding this expression gives

U_ij^pXU_ij^p + U_ij^q(ZU_ij^q)^-ZU_ij^p = U_ij^p   (2.4.40)

and

U_ij^pXU_ij^q + U_ij^q(ZU_ij^q)^-ZU_ij^q = U_ij^q.   (2.4.41)

Because of equations 2.4.35 and 2.4.36 this reduces to

U_ij^pXU_ij^p = U_ij^p   (2.4.42)

U_ij^pXU_ij^q = 0   (2.4.43)

or

U_ij^pXU_i = [U_ij^p,0].   (2.4.44)

From theorem 2.2.3 there exists a solution for X iff

U_ij^p(U_ij^p)^-[U_ij^p,0]U_i^-U_i = [U_ij^p,0]   (2.4.45)

or

[U_ij^p,0]U_i^-U_i = [U_ij^p,0].   (2.4.46)

By lemma 2.4.6 this is equivalent to requiring that

Rank [U_ij^p,U_ij^q] = Rank [U_ij^p  U_ij^q]
                            [U_ij^p    0   ]   (2.4.47)

Because R(U_ij^p) and R(U_ij^q) are disjoint this rank condition must be satisfied by the same argument that was employed in the proof of theorem 2.4.7. Again, employing theorem 2.2.3, the family of solutions for X is given by equation 2.4.38. Q.E.D.

2.5 Summary

In this chapter we have considered in detail the problem of approximating a set of data pairs by a linear transformation. In theorem 2.2.4 the family of approximations to a data set S_i = (U_i,V_i) over T was characterized using the p-inverse. If an approximation A_i to a data set S_i transforms U_i into V_i then we say that A_i emulates S_i. In this case a less restrictive characterization of the family of emulations of a data set S_i is possible. This characterization, involving an arbitrary g-inverse, is presented in theorem 2.3.1.

If two data sets S_i and S_j have common elements it is possible to obtain further results about emulations or approximations to S_i over T when an emulation or approximation to S_j over T is given. In the first part of section 2.4 we consider how this can be accomplished using the results of theorems 2.4.1 and 2.4.2, which are standard results from the

theory of g-inverses. If an A_j is given which emulates S_j over T and if it is known that an A_i exists which emulates S_i over T then it is possible to state several formulas which determine a matrix A_i that emulates S_i over T. These are presented in theorems 2.4.3, 2.4.4, and 2.4.5. The conditions of theorem 2.4.5 are particularly important since they can be utilized to derive new recursion formulas for minimization algorithms as well as most of the currently known formulas. The remainder of Chapter 2 is concerned with the existence of matrices which satisfy the conditions of theorem 2.4.5 and with the relationship between theorem 2.4.5 and theorem 2.4.4.

The material of Chapter 2 suggests two other directions for research which have not been pursued in depth. In the minimization problem it would be advantageous to consider approximations over the set of positive definite symmetric linear transformations. A preliminary result toward that end was presented in theorem 2.2.2. An analysis of different norms on V_k(i), such as the operator norm defined for A ∈ V_k(i) by ||A||_1 = max_{1≤j≤k(i)} Σ_{i=1}^n |a_ij|, may lead to computationally useful results.
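The column-by-column construction described after theorems 2.4.1 and 2.4.2 is also easy to sketch. In the fragment below (Python with numpy; the helper function name and the tolerance used to decide whether a ∈ R(A) are implementation details of this illustration, not part of the theorems) the p-inverse of a matrix U is built one column at a time with the updating formulas of theorem 2.4.1 and compared with a direct computation.

import numpy as np

def pinv_append_column(A, A_pinv, a, tol=1e-10):
    # p-inverse of [A, a] from the p-inverse of A (the two cases of theorem 2.4.1)
    d = A_pinv @ a                     # d = A^+ a
    c = a - A @ d                      # c = (I - A A^+) a
    if np.linalg.norm(c) > tol:        # a not in R(A)
        b = c / (c @ c)
    else:                              # a in R(A)
        b = (A_pinv.T @ d) / (1.0 + d @ d)
    return np.vstack([A_pinv - np.outer(d, b), b])   # [A^+(I - ab'); b']

rng = np.random.default_rng(6)
U = rng.standard_normal((5, 3))
U[:, 2] = U[:, 0] + U[:, 1]                          # a dependent column
Up = np.atleast_2d(U[:, 0] / (U[:, 0] @ U[:, 0]))    # p-inverse of the single column a_1
for j in range(1, U.shape[1]):
    Up = pinv_append_column(U[:, :j], Up, U[:, j])
print(np.allclose(Up, np.linalg.pinv(U)))            # agrees with a direct computation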

CHAPTER 3
ROOT FINDING AND MINIMIZATION ALGORITHMS

3.1 Introduction and Notation

We are concerned in this chapter with the relationship between the emulation and approximation problems considered in Chapter 2 and some of the algorithms which have been proposed for function minimization and root finding. In the minimization problem an objective function f(x): D ⊂ R^n → R^1 is to be minimized over the interior of its domain, denoted by D. It will be assumed that f(x) has a derivative g(x): D ⊂ R^n → R^n and a second derivative (Hessian) H(x): D ⊂ R^n → R^{n x n}. In the root finding problem an x ∈ D is sought such that F(x) = 0, where F(x): D ⊂ R^n → R^m. It will be assumed that F(x) has a derivative on D denoted by J(x): D ⊂ R^n → R^{m x n}.

Although these two problems are distinct they are closely related. If f(x) is strictly convex on D then the problem of determining the unique minimum of f(x) on D is equivalent to the problem of determining the unique root of g(x) on D. Even if f(x) is not strictly convex on D, the minimization of f(x) is often approached by seeking roots of g(x) on D since it must be that, at a minimum of f(x) on D, g(x) = 0. If a minimum of f(x) is being sought by solving for roots of g(x), then additional structure is present in the root finding problem since H(x), the derivative of g(x), is square and symmetric on D. Further, at a minimum of f(x) on D, H(x) is positive semi-definite. On the other hand, the problem of finding a root of F(x) on D can be transformed into a minimization problem by considering a function h(y): R^m → [0,∞) such that h(0) = 0 and h(y) > 0 for all y ≠ 0. The problem of determining a root of F(x) on D is equivalent to the

problem of determining a minimum of f(x) = h(F(x)) on D. The choice h(y) = y'y is particularly appealing since the derivative of F(x)'F(x) is 2J(x)'F(x). The roots of F(x) may not be unique, and if m > n, a root of F(x) will usually not exist. In any case the minimization of F(x)'F(x) yields a point in R^n at which F(x) has minimum Euclidean norm.

3.2 Newton's Method

It is of interest to state what constitutes an algorithm for the solution of the problems under consideration. An algorithm is simply a rule which utilizes computable expressions to generate a sequence of points {x_k} ⊂ D which converge to a solution of the problem. In most algorithms of practical interest, given an initial point x_0 ∈ D, the rule can be expressed as a recursion of the form

x_{k+1} = G_k(x_k, x_{k-1},..., x_0)   (3.2.1)

where {G_k} is some sequence of operators whose k th element G_k is derived from the points x_0,...,x_k and maps the points x_0,...,x_k into the point x_{k+1}.

Newton's method is the foundation for all the algorithms considered in this chapter. It is based on the truncated power series expansion

F(x) = F(x_k) + J(x_k)(x - x_k).   (3.2.2)

An approximation to a root of F(x) is arrived at by assuming that the approximate equation 3.2.2 is valid in a neighborhood containing a root of F(x). If J(x_k) is nonsingular this leads to the iteration

x_{k+1} = x_k - J^{-1}(x_k)F(x_k).   (3.2.3)

If a minimum of a function f(x) is sought, Newton's algorithm can be applied to g(x), the derivative of f(x), in which case the iteration becomes

There are several drawbacks to the Newton procedure. For even simple functions F(x) (or f(x)) there is no guarantee that the Jacobian (or Hessian) is nonsingular on D. Hence, the sequence {x_k} may not be defined. If the sequence {x_k} is defined and remains in D there is no guarantee that, starting at an arbitrary x_0 ∈ D, the sequence {x_k} will converge to a solution. Fortunately, certain convergence results can be obtained. Let x̂ ∈ D be a solution point and assume that J(x̂) (or H(x̂)) is nonsingular. Then for x_0 ∈ D and ‖x̂ - x_0‖ sufficiently small, the sequence {x_k} generated by Newton's method converges to x̂ and furthermore the rate of convergence is quadratic (ORT 70, theorem 10.2.2). That is,

lim_{k→∞} ‖x_{k+1} - x̂‖ / ‖x_k - x̂‖^2 = C < ∞.        (3.2.5)

In terms of the general recursion 3.2.1, the operator G_k of Newton's method is derived only from the point x_k and operates only on the point x_k to generate the point x_{k+1}. While this makes the recursion simple in form, the calculation of J (or H) and its inverse may pose great difficulties in practical problems. Thus a large number of algorithms have been developed which utilize a set of function (or gradient) evaluations to estimate the inverse of J (or H) at x_k. It is these algorithms with which we are concerned in this chapter. Newton's method itself cannot be put into the context of Chapter 2.

3.3 Secant Methods

A large class of algorithms, known as secant methods, has been developed for determining the root of a function F(x): D ⊂ R^n → R^n. In order to describe the secant methods it is useful to introduce the notion of points in general position.

Definition 3.3.1 The n + 1 points x_0,...,x_n ∈ R^n are in general position if the vectors x_i - x_0, i = 1,...,n, are linearly independent.

The notion of points in general position leads to the following result.

Theorem 3.3.2 Let x_0,...,x_n ∈ R^n and y_0,...,y_n ∈ R^n be given. Then there exists a unique affine transformation L(x) = Ax + a such that L(x_i) = y_i, i = 0,...,n, iff x_0,...,x_n are in general position. Further, A is nonsingular iff y_0,...,y_n are in general position.

The proof of this theorem is found in (ORT 70, p. 192). If the points y_0,...,y_n are taken to be evaluations of the function F(x) at the points x_0,...,x_n, then the basis for all secant algorithms is contained in the following definition.

Definition 3.3.3 If x_0,...,x_n ∈ D ⊂ R^n and F(x_0),...,F(x_n) ∈ R^n are in general position and if A and a satisfy

a + A x_i = F(x_i),  i = 0,...,n,        (3.3.1)

then the point

x_s = -A^{-1} a        (3.3.2)

is the "basic secant approximation" with respect to x_0,...,x_n.

Theorem 3.3.2 insures that the basic secant approximation is defined and unique provided x_0,...,x_n and F(x_0),...,F(x_n) are in general position. In all secant algorithms a secant step (a basic secant approximation) is computed by determining an affine function that has the same value as F(x) at the points x_0,...,x_n and choosing the root of the affine function as the estimate of the root of F(x).

If the problem of determining a secant step is attacked directly it is necessary first to obtain the affine transformation Ax + a and then to solve the equation Ax + a = 0. The following result, known as the Newton formulation of the secant step, simplifies the calculation of a secant step and at the same time makes apparent the relationship between the basic secant approximation and the material of Chapter 2.

Theorem 3.3.4 If x_0,...,x_n and F(x_0),...,F(x_n) are in general position then the basic secant approximation with respect to x_0,...,x_n is given by

x_s = x_0 - V U^{-1} F(x_0)        (3.3.3)

where

V = [x_1 - x_0, ..., x_n - x_0]

and

U = [F(x_1) - F(x_0), ..., F(x_n) - F(x_0)].

Proof: For any column v_i of V,

A v_i = A x_i - A x_0 = F(x_i) - a - F(x_0) + a = F(x_i) - F(x_0);

it follows that AV = U. The basic secant approximation must satisfy the equation

A x_s + a = 0.        (3.3.4)

Substituting yields

A(x_0 - V U^{-1} F(x_0)) + a = F(x_0) - A V U^{-1} F(x_0)        (3.3.5)
                             = F(x_0) - F(x_0) = 0.   Q.E.D.

Many, although not all, of the secant methods make use of the Newton formulation.
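Equation 3.3.3 translates directly into code. The NumPy sketch below assumes the difference matrix U is nonsingular; the affine test function and the sample points are arbitrary placeholders, chosen so that the secant step lands exactly on the root.

import numpy as np

def basic_secant_step(F, points):
    # points = [x_0, x_1, ..., x_n]; returns x_s = x_0 - V U^{-1} F(x_0)
    # with V = [x_i - x_0] and U = [F(x_i) - F(x_0)], i = 1,...,n  (eq. 3.3.3).
    x0 = points[0]
    F0 = F(x0)
    V = np.column_stack([xi - x0 for xi in points[1:]])
    U = np.column_stack([F(xi) - F0 for xi in points[1:]])
    return x0 - V @ np.linalg.solve(U, F0)

# For an affine F the basic secant approximation is the exact root.
A = np.array([[2.0, 1.0], [0.0, 3.0]])
a = np.array([-1.0, -3.0])
F = lambda x: A @ x + a
pts = [np.zeros(2), np.array([1.0, 0.0]), np.array([0.0, 1.0])]
print(basic_secant_step(F, pts), np.linalg.solve(A, -a))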

In order to completely specify a secant algorithm it is necessary to define a procedure for the generation of the points x_{k0},...,x_{kn} at each stage k of the algorithm. A formulation of one secant method, with a particular procedure for generating these points, is as follows.

Algorithm 3.3.5 Assume two initial points x_{-1}, x_0 ∈ D are given and set k = 0.
1. If F(x_k) = 0 stop.
2. Set x_{kj} = x_k + (x_{k-1}^j - x_k^j) e_j, j = 1,...,n, and set x_{k0} = x_k, where x_k^j is the j-th component of x_k and where e_j is the vector whose components are zero except for the j-th component, which equals one.
3. Set x_{k+1} = x_k - V_k U_k^{-1} F(x_k) where
   V_k = [x_{k1} - x_{k0}, ..., x_{kn} - x_{k0}]
   and
   U_k = [F(x_{k1}) - F(x_{k0}), ..., F(x_{kn}) - F(x_{k0})].
4. Set k = k + 1 and return to step 1.
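The NumPy sketch below is one reading of algorithm 3.3.5: each auxiliary point x_{kj} agrees with x_k except in coordinate j, whose value is taken from the previous iterate. The test system and the initial points are arbitrary placeholders, a small fallback perturbation (not part of the algorithm as stated) guards against a zero coordinate difference, and no remedy is attempted if the matrix U_k becomes singular.

import numpy as np

def newton_secant(F, x_prev, x0, tol=1e-8, max_iter=50):
    x_old, x = np.asarray(x_prev, dtype=float), np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        Fx = F(x)
        if np.linalg.norm(Fx) < tol:
            return x
        n = x.size
        V, U = np.zeros((n, n)), np.zeros((n, n))
        for j in range(n):
            h = x_old[j] - x[j]
            if h == 0.0:
                h = 1e-7 * max(1.0, abs(x[j]))   # fallback perturbation (safeguard)
            xkj = x.copy()
            xkj[j] += h                          # x_kj differs from x_k in coordinate j only
            V[:, j] = xkj - x
            U[:, j] = F(xkj) - Fx
        x_old, x = x, x - V @ np.linalg.solve(U, Fx)   # secant step of eq. 3.3.3
    return x

# Arbitrary smooth test system with root (1, 1).
F = lambda x: np.array([x[0]**2 - x[1], x[1]**3 - 1.0])
print(newton_secant(F, np.array([1.3, 1.4]), np.array([1.2, 1.2])))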

Many other specification procedures have been proposed for generating the auxiliary points x_{k0},...,x_{kn} (ORT 70, pp. 195-200). Almost all of these procedures either completely refresh the data set at each stage k or, as in the sequential secant algorithm, utilize the points x_{k-n},...,x_k generated during the last n stages of the algorithm. In Chapter 4 procedures will be explored for allowing a more complex choice of auxiliary points in the context of the minimization problem.

Notice that if a secant algorithm is applied to determine the root of a function F(x): D ⊂ R^n → R^n there is no guarantee that the sequence {x_k} generated by the algorithm will be defined or will remain in D. Unless some auxiliary procedure is employed the secant algorithms become stuck if it ever happens that the points F(x_{k0}),...,F(x_{kn}) are not in general position. Even if the sequence {x_k} is defined there is no guarantee that convergence to a root of F(x) can be obtained from an arbitrary set of initial points in D. What can be shown (ORT 70, sections 11.2 and 11.3, theorem 11.3.4) is that if the auxiliary points are chosen in an appropriate fashion, if the initial points are close enough to a root x̂ ∈ D of F(x) and if J(x̂) is nonsingular, then the sequence {x_k} is defined and converges to x̂. Moreover, the rate of convergence is superlinear. That is,

lim_{k→∞} ‖x_{k+1} - x̂‖ / ‖x_k - x̂‖ = 0.        (3.3.6)

In Chapter 2 results were presented concerning a sequence of data sets {S_k}. Each data set S_k was represented by a pair of matrices (U_k,V_k). In the secant methods which utilize the Newton formulation of theorem 3.3.4 the data sets S_k contain n elements and these elements are of the form

s_j = (u_j, v_j) = (F(x_{kj}) - F(x_{k0}), x_{kj} - x_{k0}),  j = 1,...,n.        (3.3.7)

The matrices (U_k,V_k) which arise in the Newton secant algorithms represent a data set S_k of the form discussed in Chapter 2. In most Newton secant algorithms the elements of S_k are constrained so that the matrix V_k is nonsingular by choosing the auxiliary points x_{k0},...,x_{kn} in general position. If a Newton secant algorithm is to be defined for all k ≥ 0 then the matrices U_k must also be nonsingular or, what is equivalent, the points F(x_{k0}),...,F(x_{kn}) must be in general position. If the matrix U_k is nonsingular then by theorems 2.2.4 and 2.3.1 the matrix V_k U_k^{-1} is the unique matrix which emulates and approximates S_k over T. Under most of the selection rules proposed for the auxiliary points x_{k0},...,x_{kn} in Newton secant algorithms the data set S_{k+1} is disjoint from the data set S_k, so that no simplification of the calculation of V_{k+1} U_{k+1}^{-1} is possible using information obtained in the calculation of V_k U_k^{-1}.

It is worthwhile to consider another secant method, the sequential secant method, which utilizes the points x_{k-n},...,x_k. The following theorem forms the basis for the sequential secant method.

Theorem 3.3.6 If x_0,...,x_n and F(x_0),...,F(x_n) are in general position then the basic secant approximation is given by

x_s = x_0 - Ṽ Ũ^{-1} F(x_0)        (3.3.8)

where

Ṽ = [x_1 - x_0, x_2 - x_1, ..., x_n - x_{n-1}]

and

Ũ = [F(x_1) - F(x_0), F(x_2) - F(x_1), ..., F(x_n) - F(x_{n-1})].

Proof: Let U and V denote the matrices given in theorem 3.3.4. We have Ũ = UP and Ṽ = VP where

    [ 1  -1           ]
    [     1  -1        ]
P = [        .    .    ]        (3.3.9)
    [            1  -1 ]
    [                1 ]

(all unwritten entries are zero). Since P is nonsingular, Ũ and Ṽ are nonsingular iff U and V are nonsingular. From theorem 3.3.4,

x_s = x_0 - V U^{-1} F(x_0)

is the basic secant approximation with respect to x_0,...,x_n. However,

x_s = x_0 - V P P^{-1} U^{-1} F(x_0)        (3.3.10)
    = x_0 - (VP)(UP)^{-1} F(x_0)
    = x_0 - Ṽ Ũ^{-1} F(x_0).   Q.E.D.

In the sequential secant algorithm n + 1 initial points x_{-n}, x_{-n+1},...,x_0 are given, and at the k-th stage the points x_{k-n},...,x_k are taken as the auxiliary points for the calculation of the secant step. The sequential secant algorithm can be stated as follows.

Algorithm 3.3.7 Assume initial points x_{-n},...,x_0 are given and set k = 0.
1. If F(x_k) = 0 stop.
2. Set x_{kj} = x_{k-j}, j = 0,...,n.
3. Set x_{k+1} = x_k - V_k U_k^{-1} F(x_k) where
   V_k = [x_{k(n-1)} - x_{kn}, ..., x_{k0} - x_{k1}]
   and
   U_k = [F(x_{k(n-1)}) - F(x_{kn}), ..., F(x_{k0}) - F(x_{k1})].
4. Set k = k + 1 and return to step 1.

In addition to the obvious computational advantage that only one function evaluation need be made at each stage k, there is also a savings in the calculation of U_{k+1}^{-1} since U_{k+1} and U_k have n - 1 columns in common. In particular, if U_k = [u_{k-n+1},...,u_k] then the inverse of the matrix [u_{k-n+2},...,u_k,u_{k-n+1}] is given by P'U_k^{-1}, where P is the permutation matrix such that U_k P = [u_{k-n+2},...,u_k,u_{k-n+1}]. Application of theorem 2.4.2 followed by application of theorem 2.4.1 then yields U_{k+1}^{-1}, where U_{k+1} = [u_{k-n+2},...,u_k,u_{k+1}]. A more direct procedure for calculating U_{k+1}^{-1} is obtained by noting that

U_{k+1} = U_k P + (u_{k+1} - u_{k-n+1}) e_n'        (3.3.11)

where P is the permutation matrix just described and e_n' = (0,...,0,1). Since it is assumed that U_k and U_{k+1} are nonsingular, the Sherman-Morrison formula (ORT 70, p. 50) can be applied to equation 3.3.11 to obtain a closed formula for U_{k+1}^{-1}.
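The rank-one update of equation 3.3.11 can be written out explicitly. The NumPy sketch below updates U_k^{-1} by first cycling out the oldest column and then applying the Sherman-Morrison formula; the random difference vectors are placeholders, and general position is simply assumed.

import numpy as np

def shift_and_update_inverse(U_inv, u_old, u_new):
    # Given U_k^{-1}, form U_{k+1}^{-1} where (eq. 3.3.11)
    #   U_{k+1} = U_k P + (u_new - u_old) e_n',
    # u_old is the oldest column of U_k and u_new the newly generated column.
    # (U_k P)^{-1} = P' U_k^{-1}: cycle the first row of U_k^{-1} to the bottom.
    B_inv = np.vstack([U_inv[1:, :], U_inv[:1, :]])
    c = u_new - u_old
    Bc = B_inv @ c
    denom = 1.0 + Bc[-1]                     # 1 + e_n' (U_k P)^{-1} c
    return B_inv - np.outer(Bc, B_inv[-1, :]) / denom

# Check against a direct inverse on random placeholder data.
rng = np.random.default_rng(0)
n = 4
U = rng.standard_normal((n, n))
u_new = rng.standard_normal(n)
U_next = np.column_stack([U[:, 1:], u_new])
print(np.allclose(shift_and_update_inverse(np.linalg.inv(U), U[:, 0], u_new),
                  np.linalg.inv(U_next)))    # True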

Although the sequential secant algorithm affords some computational advantages, it is more prone to undesirable behavior than the Newton secant algorithms. Since at stage k there is no guarantee that the points x_{k-n},...,x_k are in general position, it may happen that the matrix V_k is singular. If this occurs and if the vector U_k^{-1} F(x_k) is in N(V_k) then the algorithm becomes stuck, since x_{k+1} = x_k. For this reason superlinear convergence cannot be proven even if the initial points are close to a solution x̂ where J(x̂) is nonsingular (ORT 70, theorem 11.3.5). In addition, the theoretical difficulties mentioned earlier with the Newton secant methods are still present in the sequential secant algorithm.

As with the Newton secant algorithms, if U_k is nonsingular then the matrix V_k U_k^{-1} is the unique matrix which emulates and approximates the data set S_k over T. Because the data set S_{k+1} is generated by deleting one element of S_k, the oldest element, and adding a new element, the calculation of V_{k+1} U_{k+1}^{-1} is simplified. If a different permutation matrix P is chosen, any element of the set S_k can be deleted without computational burden. Such a procedure could be used to counteract the tendency of the matrices U_k and V_k to become singular or ill-conditioned. In Chapters 4 and 5 minimization algorithms will be proposed which make use of this procedure as well as more general replacement procedures.

In any of the secant algorithms difficulty arises if at some stage k the matrix U_k is singular since, if this occurs, the matrix V_k U_k^{-1} will be undefined. When V_k U_k^{-1} does not exist it seems logical to choose some approximation to S_k over T. Recall from Chapter 2 that the family of approximations to S_k = (U_k,V_k) over T is given by

A_k = V_k U_k^+ + Y_k (I - U_k U_k^+)        (3.3.12)

where Y_k is arbitrary. It is not obvious what choice to make for the matrix Y_k when R(U_k) is not all of R^n.
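The family 3.3.12 is straightforward to realize with the Moore-Penrose pseudoinverse. The NumPy sketch below builds the member obtained with Y_k = 0 and checks that it maps each u_j onto the corresponding v_j; the data are random placeholders.

import numpy as np

def approximation_family(U, V, Y=None):
    # Family of approximations to S_k = (U_k, V_k) over T (eq. 3.3.12):
    #   A_k = V_k U_k^+ + Y_k (I - U_k U_k^+),  Y_k arbitrary.
    n = U.shape[0]
    U_pinv = np.linalg.pinv(U)
    if Y is None:
        Y = np.zeros((n, n))                 # Y_k = 0
    return V @ U_pinv + Y @ (np.eye(n) - U @ U_pinv)

# Random placeholder data: two difference pairs in R^4.
rng = np.random.default_rng(1)
U = rng.standard_normal((4, 2))
V = rng.standard_normal((4, 2))
A = approximation_family(U, V)
print(np.allclose(A @ U, V))                 # True: A maps each u_j onto v_j
# A secant-like step built from this matrix would be x_{k+1} = x_k - A F(x_k).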

In fact, as the following lemma shows, under certain conditions the set of points x_{k+1} which can be obtained by using an approximation to S_k over T in a secant algorithm from the point x_k is all of R^n.

Lemma 3.3.8 Suppose the data set S_k = (U_k,V_k) and F(x_k) ∈ R^n are given. If F(x_k) ∉ R(U_k), then for any vector w ∈ R^n there exists a solution for Y_k of the equation

w = x_k - V_k U_k^+ F(x_k) - Y_k (I - U_k U_k^+) F(x_k).        (3.3.13)

Proof: If F(x_k) ∉ R(U_k) then (I - U_k U_k^+) F(x_k) ≠ 0. The result follows immediately since there exists a Y_k which transforms any nonzero vector in R^n into any vector in R^n. Q.E.D.

If F(x_k) ∈ R(U_k) then (I - U_k U_k^+) F(x_k) = 0 and the choice of Y_k has no effect on x_{k+1}. In this case the secant step should be replaced by

x_{k+1} = x_k - V_k U_k^+ F(x_k).        (3.3.14)

A clever choice for Y_k when F(x_k) ∉ R(U_k) may yield interesting algorithms. In the minimization problem additional structure is present which indicates a particular choice for Y_k when F(x_k) ∉ R(U_k). This will be explored in Chapter 4.

Since an approximation to a data set S_k over T exists regardless of the number of elements in S_k, secant-like algorithms could be constructed which utilize more or fewer than n auxiliary points at each stage. Also, since elements of the data sets S_k can be pairs (u,v) ∈ R^m × R^n, m ≠ n, secant-like algorithms could be constructed for finding the roots of a function F(x): D ⊂ R^n → R^m. Neither of these possibilities will be explored in this dissertation.

3.4 Minimization Methods

In 1963 a paper by Fletcher and Powell (FLE 63) stimulated research on a class of minimization algorithms which have proven to be very effective.

The algorithm considered by Fletcher and Powell was originally proposed by Davidon (DAV 59). Since 1963 Broyden (BRO 70, 70a), Pearson (PEA 69), and Murtagh and Sargent (MUR 70), among others, have proposed algorithms which bear a similarity to the Davidon algorithm but which are distinct from it. In this section we will demonstrate the underlying commonality of these algorithms and suggest modifications and generalizations of them.

Using the notation introduced at the beginning of this chapter for the objective function f(x): D ⊂ R^n → R^1, a general minimization algorithm can be stated which includes the above mentioned class of algorithms.

Algorithm 3.4.1 Assume x_0 ∈ D and an initial n × n matrix A_0 are given and set k = 0.
1. If g_k = g(x_k) = 0 stop.
2. Set x_{k+1} = x_k - λ_k A_k g_k, where λ_k is chosen so that
   f(x_k - λ_k A_k g_k) = min_{λ ∈ R^1} f(x_k - λ A_k g_k).
3. Compute A_{k+1} by a recursion relation.
4. Set k = k + 1 and return to step 1.

Notice that step 2 of the algorithm involves a subalgorithm which minimizes the objective function on a line. This "linear search" algorithm is of no interest in itself here. However, as will be seen later, the properties of specific algorithms will be changed when the linear minimization is not carried out exactly. If A_k g_k ≠ 0, exact solution of the linear minimization causes g_{k+1}' A_k g_k to be zero.
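A compact NumPy sketch of algorithm 3.4.1 follows, with the recursion of step 3 left as a pluggable argument. The quadratic test function is an arbitrary placeholder for which the exact linear search has the closed form λ = g'd / d'Hd; the update shown here simply leaves A_k fixed, so the loop reduces to steepest descent with exact line searches.

import numpy as np

def general_min(grad, x0, A0, update, line_search, tol=1e-10, max_iter=100):
    # Algorithm 3.4.1: x_{k+1} = x_k - lambda_k A_k g_k, A_{k+1} from a recursion.
    x, A = np.asarray(x0, dtype=float), np.asarray(A0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        d = A @ g                       # search direction A_k g_k
        lam = line_search(x, d)         # step 2: "linear search" subalgorithm
        x_new = x - lam * d
        v, u = x_new - x, grad(x_new) - g
        A = update(A, u, v)             # step 3: recursion relation
        x = x_new
    return x

# Illustration on a quadratic f(x) = (1/2)x'Hx - b'x, where the exact search
# along -d starting at x gives lambda = g'd / d'Hd.
H = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
grad = lambda x: H @ x - b
exact_ls = lambda x, d: (grad(x) @ d) / (d @ H @ d)
x_star = general_min(grad, np.zeros(2), np.eye(2),
                     update=lambda A, u, v: A, line_search=exact_ls)
print(x_star, np.linalg.solve(H, b))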

In order to complete the specification of the algorithm it is necessary to define the recursion of step 3. The following recursions have been proposed by various researchers.

A_{k+1} = A_k + v_k (v_k' u_k)^{-1} v_k' - A_k u_k (u_k' A_k u_k)^{-1} u_k' A_k        Davidon        (3.4.1)

A_{k+1} = A_k + e_k (e_k' u_k)^{-1} e_k'        Murtagh and Sargent        (3.4.2)

A_{k+1} = A_k + e_k (v_k' u_k)^{-1} v_k'        Pearson 1        (3.4.3)

A_{k+1} = A_k + e_k (u_k' A_k u_k)^{-1} u_k' A_k        Pearson 2        (3.4.4)

A_{k+1} = A_k + (1 + β_k u_k' A_k u_k) v_k (v_k' u_k)^{-1} v_k'
          - β_k (A_k u_k v_k' + v_k u_k' A_k)
          - (1 - β_k v_k' u_k) A_k u_k (u_k' A_k u_k)^{-1} u_k' A_k,   β_k ≥ 0        Broyden        (3.4.5)

In these formulas v_k = x_{k+1} - x_k, u_k = g_{k+1} - g_k and e_k = v_k - A_k u_k. The names associated with the recursions are not necessarily those of their originators but rather of individuals who have analyzed them extensively.

Because it is difficult to obtain convergence results for an arbitrary twice differentiable objective function, algorithms have historically been constructed to have good performance when applied to a quadratic function, that is, a function whose Hessian matrix is a constant function of x on D. Since in some small neighborhood of the minimum of f(x) the Hessian is "almost constant", it is reasoned that an algorithm so constructed will perform well on any f(x) in the terminal stage. Following this logic we temporarily restrict attention to a quadratic objective function.
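These five recursions can be transcribed directly; the NumPy sketch below makes no test for vanishing denominators (which, as noted later in this section, can occur for 3.4.2 through 3.4.4). Any of these functions can be passed as the update argument of the loop sketched after algorithm 3.4.1; the final lines check that each update satisfies A_{k+1} u_k = v_k on arbitrary data.

import numpy as np

def davidon(A, u, v):
    # Eq. 3.4.1 (Davidon).
    Au, uA = A @ u, A.T @ u
    return A + np.outer(v, v) / (v @ u) - np.outer(Au, uA) / (u @ Au)

def murtagh_sargent(A, u, v):
    # Eq. 3.4.2 (Murtagh and Sargent, symmetric rank-one correction).
    e = v - A @ u
    return A + np.outer(e, e) / (e @ u)

def pearson1(A, u, v):
    # Eq. 3.4.3 (Pearson 1).
    return A + np.outer(v - A @ u, v) / (v @ u)

def pearson2(A, u, v):
    # Eq. 3.4.4 (Pearson 2).
    Au, uA = A @ u, A.T @ u
    return A + np.outer(v - Au, uA) / (u @ Au)

def broyden(A, u, v, beta=0.0):
    # Eq. 3.4.5 (one-parameter family, beta_k >= 0); beta = 0 reduces to 3.4.1.
    Au, uA = A @ u, A.T @ u
    return (A
            + (1.0 + beta * (u @ Au)) * np.outer(v, v) / (v @ u)
            - beta * (np.outer(Au, v) + np.outer(v, uA))
            - (1.0 - beta * (v @ u)) * np.outer(Au, uA) / (u @ Au))

# Each recursion maps u_k onto v_k, i.e. A_{k+1} u_k = v_k.
rng = np.random.default_rng(2)
A0, u, v = np.eye(3), rng.standard_normal(3), rng.standard_normal(3)
for update in (davidon, murtagh_sargent, pearson1, pearson2, broyden):
    print(np.allclose(update(A0, u, v) @ u, v))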

In algorithm 3.4.1 it is desirable for the sequence {A_k} to be an emulating sequence. The following theorem demonstrates why.

Theorem 3.4.2 Suppose an initial point x_0 ∈ R^n is given. Suppose the matrices A_0, A_1, ..., A_k of algorithm 3.4.1 are chosen such that g_k' A_k g_k ≠ 0 when g_k ≠ 0 and such that A_k emulates S_k over T, where S_0 = ∅, S_k = (U_k,V_k) = ([u_0,...,u_{k-1}], [v_0,...,v_{k-1}]), k = 1,...,n, v_i = x_{i+1} - x_i and u_i = g_{i+1} - g_i, i = 0,...,k-1. If algorithm 3.4.1 is applied to an objective function f(x): R^n → R^1 with a constant positive definite Hessian matrix H, then either g_k = 0 for some k < n or

U_k' v_k = V_k' u_k = 0,  k = 1,...,n,        (3.4.6)

A_n = H^{-1}        (3.4.7)

and

g_n = 0.        (3.4.8)

Proof: Since g_k' A_k g_k ≠ 0 when g_k ≠ 0, the search direction A_k g_k is nonzero unless g_k = 0. Because H is positive definite, f(x) is strictly convex and therefore a solution must exist to the linear search problem. It follows that the algorithm is determined for n steps unless g_k = 0 for some k < n. It must be that λ_k ≠ 0 for k = 0,...,n-1, since if λ_k = 0 the minimum has been achieved along the line x_k - λ A_k g_k, which implies that g_k' A_k g_k = 0. Because of the linear search,

g_{k+1}' v_k = 0,  k = 0,...,n-1.        (3.4.9)

With k = 1 equation 3.4.9 gives

g_1' V_1 = 0.        (3.4.10)

We will now show that g_k' V_k = 0 implies that g_{k+1}' V_{k+1} = 0 and hence establish by induction that

g_k' V_k = 0,  k = 1,...,n.        (3.4.11)

Since g_k = H x_k + c, we have

48 -= -1' gkAkUk gkVk = 0. T t Since Ak is nonzero and finite it must be that UkVk = kUk =0. From this it follows that gk k+l = [(Uk + gk) Vk gk+lV ] (3.4.13) = [0 + 0,0] which completes the induction. Next, since vkUk = and Uk = HVk, k = 1,...,n it must be that vkHV k ukH Uk= 0, k = i,...,n. (3.4.12) Therefore, because H is positive definite and Xk 0, the vectors (vO,...,v 1) span Rn and there exists a unique matrix H1 which emulates S over T, that is H is the unique matrix such that H Un = V. Since A was chosen to emulate Sn over T, An H-. Finally, since g V = 0 and since the vectors (vO,...,v 1) span R, g = 0. Q.E.D. t t However, if Ak emulates (Uk,Vk) and Uk = HVk where H is positive definite then A cannot be negative definite since this would imply tKha k However, if Ak emulates (U,, and Uk) = HV where H. is positive

49 If the search direction is chosen to be Akwk where wk is any nonzero vector such that kV 0 and gkAkk 0 rather than Akgk the proof of theorem 3.4.2 is unchanged. This suggests the following more general minimization procedure for quadratic functions. Algorithm 3.4.3 Assume x0 e D and an initial n x m matrix A0 are given and set k = 0. I. If gk = 0 stop. 2. If k = 0 set xk+l =xk - XkAkk where wk is an arbitrary nonzero vector and where Xk is chosen such that f(xk - kAk) = min f(xk- XAkwk) X R 3. If k > 0 set Xk+l = k XkAkWk where ak is any nonzero vector such that wkVk = 0, X is chosen as in step 2 and k k k Sk (Uk,Vk) is defined in theorem 3.4.2. 4. Compute Ak+l by a recursion relation. 5. Set k = k + 1 and return to step 1. For this algorithm a theorem similar to theorem 3.4.2 can be stated. Theorem 3.4.4 Suppose an initial point x0 e R is given. Suppose the matrices A0,A1,...,Ak of algorithm 3.4.3 are chosen such that gkAkxk ~0 when gk / 0 and such that Ak emulates Sk overT where Sk is as defined in theorem 3.4.2. If algorithm 3.4.3 is applied to an objective function f(x): Rn + R1 with a constant positive definite Hessian matrix H, then either gk= for some k < n or Ukvk - 0 k = l,...,n, (3.4.14)

50 A = H1 (3.4.15) n and gn = 0. (3.4.16) The proof of theorem 3.4.4 is identical to the proof of theorem 3.4.2 except that in equation 3.4.12 wk replaces gk. Since in theorem t 3.4.2 it was shown that gkVk = 0 for k = 1,...,n, theorem 3.4.2 is a special case of theorem 3.4.4. Algorithm 3.4.3 is not entirely sensible on a non-quadratic surface for more than n steps since after n steps Vk will span Rn To apply algorithm 3.4.3 past n steps some method must be used to delete old columns from Uk and Vk. One such procedure would be to reset every n steps. In Chapter 4 other deletion procedures are considered in conjunction with a different algorithm. If the linear search in algorithm 3.4.3 is not performed exactly, then the proof of theorem 3.4.4 breaks down. However, if the matrix A = H 1 in algorithm 3.4.3, then the step Xn+l n nn (3.4.17) makes gn+= 0. In the case of one particular recursion, the Murtagh and Sargent formula 3.4.2, under certain conditions A = H even in the n absence of an exact linear search. This recursion will be considered further later in the section. We now turn our attention to the relationship between the specific recursion formulas 3.4.1 through 3.4.5 and theorem 2.4.5. Our main objective will be to show that the matrices Ak generated by these formulas emulate the data sets S of theorem 3.4.2. Recall from Chapter 2 that if S. = (U,V1) and S. = (U.,V.) are two data sets, Si. is the vi',i ]Vj subset of S. whose elements are also contained in S. and Sq. is the 1 ] 11

51 subset of S whose elements are not contained in S.. The ordering of i. Si is assumed to be such that the elements of SPj precede the elements q of Sij in Si. If in theorem 2.4.5 m = n and the particular choice 1 1 q' 2 Q' Z = aiVij + aiUijAj (3.4.18) 2 3 q' 4 q' Z = aiVj +.UA (3.4.19) 1 i ij] 1 4 is made where ai,...,ai are constants then a formula arises which contains the recursions 3.4.1 through 3.4.5. 1 2 The choice of Z and Zi in equations 3.4.18 and 3.4.19 was made after consideration of the existing recursion formulas in an attempt to relate these formulas to the material of Chapter 2. Other choices of 1 2 Zi and Zi in theorem 2.4.5 may lead to still other recursion formulas for emulating sequences. Direct substitution of equations 3.4.18 and 3.4.19 in theorem 2.4.5 yields: Corollary 3.4.5 Suppose Si = (Ui,Vi) and S. = (U,V.) are non] I and S empty data sets and that Aj emulates Sj over T. If S and are not empty and if 1 q'uP 2 q'P a. U. + aU1.Vp. 0, (3.4.20) 1 ij 1] 1 lj i' a1.v.aUV. =0,13 aV.UPi + aU V. = 0 (3.4.21) q 1 q 2 q' q 1 q'q 2' q Vi (aivi.Ui + aijAjUi ) (a.. + aU A.U. (.4.) ij i7j ij ij'i 1 1 1 ] 1I q = Vi., UA,. 3 q,uq. q, uq - v q3V9Uq. 4 q q. A.uq.(a.V.. + a.U.) (a.. + a.A.) (3.4.23) ] 13 1 i3 i3 1], ] i 1 13 13 1 1 3 13 =A.Uj where the g-inverses are arbitrary, then

52 q 1 q q 2 qt q 1 q 2 qt Ai - Aj + V (aiVi Uii + a.iU.A U (aV + aU.Aj) (3.4.24) - (ai ijUij + aiuijAUj ) (avq. aU ijA ) AU {3 q. 4 3v? 4q.? ~A.U jaV- - + a.A U )~ (a V - + a^ ^. emulates Si over T. Further, if Sp. is empty and if conditions 3.4.22 and 3.4.23 are satisfied then Ai given by equation 3.4.24 emulates Si over T. Corollary 3.4.5 allows the sets S. and Sj to differ in an arbitrary manner. In order to arrive at the recursion formulas 3.4.1 through 3.4.5 it is necessary to consider a particular sequence of data sets. Corollary 3.4.6 Consider the sequence of data sets SO = 0, k= (Uk'Vk) = ([uo,... uk-l [vO.. vk-1), k, =...,n. If Ak emulates Sk over T and if 1ak 2k akkk(3.4.25) akvkUk + akukVk = 0, k > 0, avkUk + akukV = 0, k > 0, (3.4.26) kkk kkk Vk(akvkuk + akUkAkUk) (akvkuk + akukAkuk) = Vk, k > 0, (3.4.27) 3? 4k k k 4 k A u (akv'u + auA u) (a3v u + auA ) u AkU k > 0, (3.4.28) k k k k k u k k k k k k k k k k kuk then 1, 2, - I' 2 Ak+ = A + vk(a kvuk + akkAkk) (akvk akk (3.4.29) 3, 41 - 3, 4? - Akuk(akvu auA + au) (av akukAk) k > 1, kk kkk kkkk kk kkk' emulates Sk+l over T. Proof: The proof follows immediately from corollary 3.4.5 by taking S: = S S, - and S. = {(uvk)}. Notice that by

53 definition any matrix A emulates SO and that S 0. Conditions 3.4.27 and 3.4.28 of corollary 3.4.6 are satisfied iff akvu uk+ akukiAkuk O (3.4.30) and 3, 4, akvku + a ukAkuk' 0 (3.4.31) respectively, since the g-inverse of a nonzero scalar is its reciprocal and a g-inverse of the zero scalar is zero. Conditions 3.4.25 and 3.4.26 of corollary 3.4.6 are satisfied if, but not only if, the orthogonality conditions vkUk = (3.4.32) and ukVk = O (3.4.33) are satisfied. Before proceeding it should be pointed out that two of the coefficients in equation 3.4.29 are redundant. If inequalities 3.4.30 and 3.4.31 hold, then equation 3.4.29 can be written as k+l = A k k+ V k (Vkk + bk kkAk) (3.4.34) A ku k(kuk + bkukAkuk) ( + bk k k k kk kkkk k k k k 1 2 1 2 4 3 1 3 where b = a/a and b= ak/ak, provided ak and ak are nonzero. k kk If ak or ak is zero, similar formulas can be written involving ak/ak and ak/ak. Therefore equation 3.4.29 depends only on the ratios 2 1 4 3 ak/ak and ak/ak. We are now in a position to state a result about the family of recursions 3.4.29 when applied in minimization algorithm 3.4.3. Theorem 3.4.7 Suppose an initial point x0 e Rn is given. Suppose the matrix A0 and the vector 0 of algorithm 3.4.3 are chosen so that

54!? g0A0w0 / 0 and the matrices Ak, k > 0, are chosen by the recursion A Ak + v (a vuk +.a2uAk u) (akk + a2kA 3 41+, 3 4? Akuk(akvkuk + akukAkuk) (a + auA).k ^k kkVk + akk k k k a uk) where vk = Xk+l xk and uk = gk+l gk Suppose the coefficients 1 4 a, ak,...,ak are chosen so that g AA k 0, k > 0, and so that 1, 2? k k k k k k k(3.4.36) akVkUk akUkAkUk 3, 4? a3u k + a kukAkuk 0 (3.4.37) for k > 0. Suppose algorithm 3.4.3 is applied to an objective function f(x): R + R with constant positive definite Hessian H. Then either g = 0 for some k < n or A emulates S over T, where SO = 0 and Sk = (Uk,Vk) = ([u0,.o,u k_], v0,..,Vk-l]) k =,...,n. Further, Ukvk Vkuk =, k l,...,n, (3.4.38) An = H (3.4.39) and gn = 0. (3.4.40) Proof: Assume gk / 0 for k < n. A0 emulates SO over T by definition. Therefore, by corollary 3.4.6 and equations 3.4.36 and 3.4.37, A emulates S1 over T and by theorem 3.4.4! I UlV1 = 0 and V1U1 = 0. (3.4.41) Assume then that Ak emulates Sk over T and that Ukvk = 0 and Vkuk = 0. (3.4.42) By corollary 3.4.6, equations 3.4.36, 3.4.37, and 3.4.42, Ak+ emulates

55 Sk+l over T and by theorem 3.4.4!! U v = 0 and V u = 0. (3.4.43) k+l k+l k+l k+l By induction then Ak emulates S over T for k = 0,...,n. Equations k k 3.4.39 and 3.4.40 follow from theorem 3.4.4. Q.E.D. The algorithms of theorem 3.4.7 contain all of the algorithms specified by general algorithm 3.4.1 and the recursion formulas 3.4.1 through 3.4.5. To obtain these formulas the following choices are made 1 4 for the coefficients ak,..,ak. ak 0, a = 0a = a = 0, a 0 Davidon (3.4.44) 1 2 3 4 (345) ak =-ak O0, ak =-ak f 0 Murtagh and Sargent (3.4.45) 1 2 3 4 ak O, ak 0, ak = 0, ak = 0 Pearson 1 (3.4.46) 1 2 3 4 a = 0, ak O, ak = 0, ak ~ 0 Pearson 2 (3.4.47) 1 +2 kk 3 ak=+ kUkUk ak = ku-kkUk, ak kv kUkkAkUk 4 ak =1 - Bkvkuk k k Broyden (3.4.48) The algorithms of theorem 3.4.7 are more general in four respects than these known algorithms. First, general algorithm 3.4.3 allows a far more general choice of a search direction for a given matrix Ak. Second, in algorithm 3.4.3 we recognize that it is only necessary to choose the matrix Ak and the vector ak such that gkAk in order to insure that the algorithm does not become stuck at any stage. Third, the family of recursion formulas given by equation 3.4.29 includes all of the recursion formulas 3.4.1 through 3.4.5 as well as a large class of new recursions. Finally, theorem 3.4.7 does not restrict recursion formulas to be the

56 same at each stage of the algorithm. With the wide latitude afforded by the algorithms of theorem 3.4.7 it should be possible to improve upon the performance of existing algorithms. It may be that the more general recursion formula of corollary 3.4.5 can be utilized in minimization algorithms constructed to generate more than one new element for the data set S at each stage. k Also the formula of corollary 3.4.5 could be useful in problems other than the minimization problem. These aspects are not pursued here in depth. In order to fulfill the hypothesis of theorem 3.4.7 it is necessary to choose the coefficients ak,...,a so that gkAkk f 0 and so that equations 3.4.36 and 3.4.37 are satisfied. In the known I 4 algorithms wk is chosen to be gk. Two choices for ak,...ak have been shown, in this case, to guarantee that gkAkgk Z 0 and that equations 3.4.36 and 3.4.37 are satisfied. The first of these is the choice which yields the Davidon formula, equation 3.4.1. In (FLE 63) it is shown that if A0 is chosen to be positive definite and the Davidon formula is applied to any nonzero sequence of vectors (uk,vk), k = 1,2,.., then A is positive definite for all k. In (POW 71) it is k further shown that if the Davidon algorithm is applied to an arbitrary twice differentiable function and if an exact linear search is employed then vkuk X 0 and ukAkuk ~ 0 unless gk+l = 0. The same results have been shown to hold (BRO 70, 70a) for the choice which yields the Broyden formula given by equation 3.4.5. No such results can be obtained for equations 3.4.2, 3.4.3, and 3.4.4. The Murtagh and Sargent formula given by equation 3.4.2 has a further interesting property. If the Murtagh and Sargent recursion is applied to

57 a sequence of data sets SO = 0, Sk = (kVk) = (Uo-.,Ukl], [vo,...,vk1l]), k = 1,...,n, where u. = Hvi i = 0,...,n-, for some symmetric matrix H, then equations 3.4.25 and 3.4.26 of corollary 3.4.6 are automatically satisfied since a1'U + a2u V a(v'HV-''V ) = 0 (3.4.49) kkk kk k kk kk k k and 3' 4' 3' - - V'f akvkUk + akukVk = a(v HV - v ). (3.4.50) This property leads to the consideration of the following algorithm for minimizing f(x):DCRRn R1 on D which does not involve an exact linear search at each stage. Algorithm 3.4.8 Assume x0 e D and an initial matrix A are given 0 0 and set k = 0. 1. If gk = g(xk) = 0 stop. 2. If k < n set Xk+l = xk + vk where vk is any vector not in R(Vk) and where Vk = [vO,..,vk_i]. 3. If k > n set xk+l = Xk - Akgk and set vk = xk+l - xk. 4. Set Ak+l = Ak + ek(ekuk)lek where ek = vk - Akuk and Uk= gk+l gk' 5. Set k = k+l and go to step 1. As with algorithm 3.4.3 some modification of this procedure may \ be desirable when the objective function is not quadratic. If the objective function f(x) is quadratic and if its Hessian matrix is nonsingular then the following theorem holds. Theorem 3.4.9 Suppose an initial point x0 e Rn and an initial matrix A0 are given. Suppose algorithm 3.4.8 is applied to an objective function with constant nonsingular Hessian H. If ekuk 1 0 for

58 k = 0,...,n-l then either gk = 0 for some k < n+l or Ak = H (3.4.51) and gk+l = 0. (3.4.52) Proof: First notice that algorithm 3.4.8 is determined for k = 0,...,n provided gk $ 0 and ekuk # 0 for k = 0,...,n. Because of equations 3.4.49 and 3.4.50, conditions 3.4.25 and 3.4.26 of corollary 3.4.6 are satisfied. Conditions 3.4.27 and 3.4.28 of corollary 3.4.6 are satisfied iff ekuk X 0. Therefore Ak emulates Sk over T for k = 0,...,n where Sk = (Uk,Vk) = ([u0,...Ukl], [vO,... vk-l]) Because vk % R(Vk), k = l,...,n-l, the vectors (vo,...,vk1) are a basis for Rn and therefore there exists a unique matrix An = H which emulates Sn over T. Finally, since on a quadratic surface g(x) = g(y) - H(x - y) (3.4.53) we have gn+l n + H(Xn - Hgn Xn) = n HH gn = 0. Q.E.D. It is necessary to require that H be nonsingular since if H is singular and the data set S is such that Cv0,...,v 1) spans Rn no matrix A exists which emulates S over T. It does not seem to be n n possible without a priori knowledge of H to choose the vectors vk so that ekuk 0. One would hope that the results of theorems 3.4.7 and 3.4.9 could be extended at least in part when algorithms 3.4.3 and 3.4.8 are applied to an arbitrary twice differentiable function f(x):DCRn + R1.

59 Unfortunately this is not the case. The proofs of theorems 3.4.6 and 3.4.9 are predicated on the existence of a nonsingular symmetric matrix H such that Uk = HVk, k = l,...,n. If f(x) is not quadratic it may happen that no matrix exists which transforms Vk into Uk. More importantly there may exist no matrix which emulates Sk = (Vk,Uk) over T. Even if an Ak exists which emulates Sk over T it may not be symmetric. In the absence of a matrix which emulates Sk over T it would seem desirable to obtain a matrix which approximates Sk over T. As will be shown in section 3.5 it is unlikely that any of the formulas of this section generate a sequence which approximates Sk over T when f(x) is not quadratic. 3.5 Recursive Formulas for Approximating Sequences Recall from theorem 2.2.4 that the family of approximations to a data set Sk = (Ukvk) over T is given by = VkUk + YkI (3.5.1) where Yk is arbitrary. If a sequence of data sets Sk = (Uk3Vk) = ([u0,...,U k ], [vO,...,vk l]), k = 1,2,..., is being considered, theorem 2.5.1 can be used to compute Uk from Uk- and arrive at the family of approximation to Sk over T. Although this procedure is quite efficient it is worthwhile to derive from theorem 2.4.1 an explicit recursive formula for the family of approximations to Sk over T. We begin by developing an expression for the matrix [X,x][A,a]+ Direct application of theorem 2.4.1 yields [X,x][A,a]+ = XA+ + (x- XA+a)(a'(I - AA+)a)-l (3.5.2) a (I - AA+), a ~ R(A), [X,x][A,a]+ = XA+ + (x - XA+a)(l + a A+ A+a)- (3.5.3)

60'+ + a A -A, a G R(A). Setting [X,x] = [A,a], an expression for the operator I - [A,a][A,a]+ is obtained. I - [A,a][A,a = (I- AA) + (I - AA)a (a (I- AA )a)1 (3.5.4) a (I - AA+), a g R(A)'IA+1' -1 I - [A,a][A,a] = (I - AA+) + (I - AA+)a(l + a A A a) (3.5.5) +' + a A A a e R(A). Next, applying theorem 2.4.1 to the matrix [A,a]+ [A,a]+ yields [A,a]' [A,a]+ = (I -ab')'A+ A+(I - ab') + bb' (3.5.6) where b = (I - AA+)a(a(I - AA)a) 1, a % R(A) (3.5.7) b = A+A+a(l + a'A+A+a), a e R(A). (3.5.8) Substituting equations 3.5.7 and 3.5.8 into equation 3.5.6 gives [A,a]'[Aa] = A A - A A a(a'(I - AA+)a)la(I - M ) (3.5.9) -(I - AA+)a(a(I - AA+)a)^ a'A A ~ + -la A + + (I- AA+)a(a (I - M)a) a A A+a + (a'(I - AA)a)a (I - AA+) + (a(I - AA+)a)-l (I- AA+)a (a'(I - AA+)- a'Cl - AA+), a 0 RCA), [A,a]+'[A, a]+ = AA - 2A +'A + aAA a'A'A+ (3.5.10) Aa [trA, = AA - 2A+tA+a(2 + a AA+AAa) (A5.)

61 +1 + A +a + a A+4' -la +' -la 1A+A+ A+ A A'a(l + a A Aa) a A A a(l + a A A a)a A A 1 + aA+, 1-+ ~a + 1 -A+ + + 1 + a A A) A+a(1 + aA a) a A A, a e R(A). Notice that equations 3.5.4, 3.5.5, 3.5.9, and 3.5.10 are coupled in the sense that t.expressions for I - [Aa][Aa] involve A A and the + + lb + expressions for A A involve I - AA Assume that a sequence of data sets S = 0, Sk = (Uk,V) = ([uO.Ukl], [v0...,vk- ]) are given for k = 1,2,.... From: equations 3.5.4, 3.5.5, 3.5.9, and 3.5.10 we may write a recursive + + + formula for the matrices M = (I UkUk) and N =U U for k= 1,2,.. M0 - I (3.5.11) 0 -it r i Mk+l k - Mkuk(UkkM'kuk) UkMk uk ^R(Uk), (3.5.12) = - Mkkl + M kukN, -uk e R(UK) (3.5.13) NO = 0 (3.5.14), -1lk k -, 1 K= N- NkUUk(uMkuk) kk Mk(kk) uN (3.5.15)! + (1 + uNkUk)M k(U;kUk Ukk uk u R(Uk), Nk+l Nk - Nkuk( + UkNkk) Ukk, u R(Uk) (3.5.16) Since the operator Mk is an orthogonal projection operator, uk 6 R(Uk) iff Mkuk = 0 or equivalently iff ukMkMkuk ukM Uk 0. Therefore,

62 it is a straightforward matter to determine whether uk e R(Uk) and consequently which formulas to use. The matrix Nk is of interest here because it allows the calculation of Mk in the event uk e R(Uk). If uk 0 R(Uk) for all k < n then Mk is given by equations 3.5.11 and 3.5.12. Before continuing it is worthwhile to observe that equations 3.5.11 and 3.5.12 define a well known conjugate gradient recursion formula (FLE 69). If the vectors uk, k = 0,...,n-l, are independent then this conjugate gradient formula can be interpreted as the recursive calculation of the operator (I - UkUk). In the conjugate gradient algorithm the search direction is taken to be vk = (I - UkUk)gk (3.5.17) + - Because (I - UkUk) projects vectors onto R (Uk), vk e R (Uk). If the objective function has a constant Hessian H then vk is H orthogonal to the previous directions Vk = [v0,...,vkl] since VkHVk = VkUk = 0. (3.5.18) This, of course, is the fundamental property of the conjugate gradient algorithms. Using the matrices Mk and Nk, the matrix Pk = VkU can be calculated recursively. From equations 3.5.2 and 3.5.3 we have P = 0 (3.5.19)? -1-. Pk+l = k - Pk ) (u kMkuk) UkMk' Uk u R(Uk), (3.5.20) t -1' k+l =Pk + (k - Pkuk)(l + kNkk) UkNk k e R(Uk). (3.5.21) The matrix Pk is the approximation to Sk over T obtained by choosing Yk = 0 in equation 3.5.1. The entire family of approximations to Sk

63 over T is therefore given by Ak = Pk YkMk (3.5.22) where Yk is arbitrary. If the matrix Y = Y for k = 1,2,... then a particular sequence of approximations to Sk over T results. A =Y (3.5.23) (3.5.24) k+l k+l k+l Ak+l Pk + (k - Pkuk)(ukMkuk) ukMk + YMk (3.5.25) 1'1 -- YMkUk (UkMkUk) UkMk A ^WW^^- I =Ak + (Vk - Akuk)(UkMkuk) UkMk uk - R(Uk) Ak+l k + (k - Pkuk)(l + kNkuk) UkNk + YMk (3.5.26) -i YMkuk(l + UkN uk UMk Ak+ (vk Akuk)(1 + UkNkuk) UkNk e R(Uk) Formulas 3.5.20, 3.5.21, 3.5.22, 3.5.25, and 3.5.26 are distinct from the formulas of corollary 3.4.6. Even in the case where the vectors uk, k = 0,...,n-l, are independent these recursion formulas cannot be arrived at by a particular choice of the coefficients akl,...,ak in corollary 3.4.6. Application of the recursions 3.5.23, 3.5.24, and 3.5.25 in minimization algorithm 3.4.3 might yield a minimization algorithm with nice properties. However, the matrix Ak of the recursion approximates

64 Sk over T for all k. In minimization and root finding problems old data becomes invalid as the local character of the function changes. Therefore, it probably is desirable to have a procedure for removing elements from the data sets as the elements become old. In principle it should be possible to obtain recursive formulas for approximations to a data set generated by deleting an element from its predecessor. This would be an interesting area for further investigation. In a process where an approximation to a data set with a large number of elements, relative to n, is to be recursively calculated the formulas 3.5.23, 3.5.25, and 3.5.26 are very useful since they only require the storage of three n x n matrices, two of which are symmetric. Further, if it is known that for all k > k the vectors u0,..,Ukl of the data set Sk span Rn, then uk 6 R(Uk) for all k >, and only equations 3.5.16 and 3.5.26 are needed for the calculation of an approximation to Sk over T. In this case then it is unnecessary to compute the matrix Mk. If the data sets contain n or fewer elements it seems more efficient to apply theorems 2.5.1 and 2.5.2 directly. 3.6 Summary In this chapter the material of Chapter 2 has been utilized to unify a large class of root finding and minimization algorithms. Many other algorithms not discussed specifically herein, such as the Pshenichnii algorithm for minimization (PSH 69) and the Fletcher algorithm for root finding (FLE 68) can also be interpreted in the context of Chapters 2 and 3. In section 3.3 the secant methods for root finding were shown under appropriate conditions to generate a sequence of matrices which emulate a sequence of data sets. It was pointed out that by considering

generalized secant methods involving approximations to the sequence of data sets it should be possible to get stronger convergence and rate of convergence results for the secant methods. Also, since the theory of Chapter 2 allows data sets in R^m × R^n, m ≠ n, it is possible to formulate secant-like algorithms for determining the root of F(x): R^n → R^m or for minimizing the norm of F(x).

In section 3.4 the relationship between theorem 2.4.5 and a number of proven minimization algorithms was considered. This led to a large new class of algorithms which, when applied to a quadratic function with positive definite Hessian H, generate H-conjugate directions, yield the matrix H^{-1} and minimize the function in n steps. Theorem 3.4.7 establishes the properties of these algorithms on a quadratic surface. Also in section 3.4 the special properties of the Murtagh and Sargent algorithm were discussed. It was shown that, because equations 3.4.25 and 3.4.26 of corollary 3.4.6 are automatically satisfied when the Murtagh and Sargent algorithm is applied to a quadratic surface, an exact linear minimization is not required at each stage.

The general choice of Z_i^1 and Z_i^2 given in corollary 3.4.5 allows the calculation of an emulating sequence, if one exists, when the sequence of data sets is arbitrary. This result may be of value in algorithms where the data sets differ by more than a single element. In section 3.5 a recursive expression was developed for the calculation of an approximating sequence when the data sets are generated by adjoining a single element at each stage. Such a formula could be useful when the number of elements in the data sets becomes large.
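As a concrete illustration of the recursive expression summarized in the last paragraph, the NumPy sketch below carries the projector M_k = I - U_k U_k^+ and the approximation P_k = V_k U_k^+ forward one element at a time, under the assumption that each new u_k is independent of the previous columns (equations 3.5.11, 3.5.12 and 3.5.20); the data are random placeholders and the result is checked against a direct pseudoinverse.

import numpy as np

def append_pair(M, P, u, v):
    # One step of the recursion for M_k = I - U_k U_k^+ and P_k = V_k U_k^+
    # when the new u is independent of the previous columns (u not in R(U_k)):
    #   M_{k+1} = M_k - M_k u (u'M_k u)^{-1} u'M_k          (eq. 3.5.12)
    #   P_{k+1} = P_k + (v - P_k u)(u'M_k u)^{-1} u'M_k      (eq. 3.5.20)
    Mu = M @ u
    d = u @ Mu
    return M - np.outer(Mu, Mu) / d, P + np.outer(v - P @ u, Mu) / d

# Placeholder data: build up three pairs in R^5 and compare with a direct pinv.
rng = np.random.default_rng(3)
n, m = 5, 3
U = rng.standard_normal((n, m))
V = rng.standard_normal((n, m))
M, P = np.eye(n), np.zeros((n, n))
for k in range(m):
    M, P = append_pair(M, P, U[:, k], V[:, k])
print(np.allclose(M, np.eye(n) - U @ np.linalg.pinv(U)),
      np.allclose(P, V @ np.linalg.pinv(U)))    # True True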

CHAPTER 4

FUNCTION MINIMIZATION USING THE PSEUDOINVERSE

4.1 Introduction, Definitions and Preliminary Lemmas

In this chapter a general algorithm will be developed for determining the minimum of a function f(x): D ⊂ R^n → R^1. As in Chapter 3, g(x) and H(x) will denote the first and second derivatives of f(x) where they are defined. Under fairly weak hypotheses it will be shown that the general algorithm generates a sequence {x_k} which converges to the minimum of f(x). Under somewhat stronger conditions the terminal rate of convergence will be shown to be superlinear. In developing these results three norms on R^n will be utilized:

‖x‖_1 = Σ_{i=1}^{n} |x^i|        (4.1.1)

‖x‖_2 = ( Σ_{i=1}^{n} (x^i)^2 )^{1/2}        (4.1.2)

‖x‖_∞ = max_{i=1,...,n} |x^i|        (4.1.3)

where x^i is the i-th component of x. When an arbitrary norm will suffice the subscript will be omitted. Associated with each of these norms will be the matrix norm on the space T of all n × n matrices, defined for A ∈ T by

‖A‖ = sup_{x ∈ R^n, x ≠ 0} ‖Ax‖ / ‖x‖.        (4.1.4)

Matrix norms will be subscripted to correspond to the vector norm which generates them. We will make use of two results concerning norms. First, the matrix norm associated with the vector norm ‖x‖_∞ is given by

‖A‖_∞ = max_{k=1,...,n} Σ_{j=1}^{n} |a_{kj}|.        (4.1.5)

Second, for any two vector or matrix norms ‖·‖_a and ‖·‖_b there exist constants C_1 > 0 and C_2 > 0 such that

C_1 ‖x‖_a ≤ ‖x‖_b ≤ C_2 ‖x‖_a        (4.1.6)

for all x ∈ R^n, or

C_1 ‖A‖_a ≤ ‖A‖_b ≤ C_2 ‖A‖_a        (4.1.7)

for all n × n matrices A. A proof of these results may be found in (ORT 70, section 2.2). The following standard definitions will be needed in the development.

Definition 4.1.1 The open sphere of radius r > 0 centered at x̂ ∈ R^n, denoted by S(x̂,r), is the set of all x ∈ R^n such that ‖x - x̂‖_2 < r.

Definition 4.1.2 The interior points of a set A ⊂ R^n, denoted by Å, are all points of A contained in some open sphere contained in A.

Definition 4.1.3 A set A ⊂ R^n is closed if lim_{k→∞} x_k = x and {x_k} ⊂ A imply that x ∈ A.

Definition 4.1.4 A set A ⊂ R^n is bounded if A ⊂ S(0,r) for some r > 0.

Definition 4.1.5 A set A ⊂ R^n is compact if it is closed and bounded.

Definition 4.1.6 A function F(x): D ⊂ R^n → R^m is continuous at x̂ ∈ D if given ε > 0 there exists a δ > 0 such that F(x) ∈ S(F(x̂),ε) whenever x ∈ S(x̂,δ).

Definition 4.1.7 A function F(x): D ⊂ R^n → R^m is differentiable at x̂ ∈ D if given ε > 0 there exist a linear transformation F'(x̂): R^n → R^m and a δ > 0 such that

‖F(x) - F(x̂) - F'(x̂)(x - x̂)‖ ≤ ε ‖x - x̂‖        (4.1.8)

whenever x ∈ S(x̂,δ). The linear transformation F'(x̂) is the derivative of F(x) at x̂.

68 Definition 4.1.8 The derivative of F(x):cR' Rm at x e D is strong if given ~ > 0 a 6 > 0 9 IF(X)- F(y) -F'()(x - y)l < C ix y y (4.1.9) for all x,y e S(x,6)CD. Definition 4.1.9 A function F(x):DcRn + Rm is continuously differentiable at x I D if 3 a 6 > 0 9 F(x) is differentiable at all points ^ ^ n xm i of S(x,6)cD and if F'(x): S(x,6) - R is continuous at x. To require a strong derivative at a point is less restrictive than to require a continuous derivative at a point since functions exist which have a strong derivative at a point x but which are not differentiable at all points of any open sphere containing x. However, in many practical situations the concepts are equivalent as the following lemma shows. Lemma 4.1.10 Suppose F(x):DcRn + Rm is differentiable at each point of S(x,6)cD for some 6 > 0. Then F'(x) is strong iff F'(x) is continuous at x. A proof of this result may be found in (ORT 70, lemma 3.2.10). In proving the rate of convergence results it is essential that the derivative of f(x) have certain properties in a neighborhood of a minimum. The following lemma establishes conditions for these properties. Lemma 4.1.11 Suppose f(x):DCR +R is twice differentiable at x e D and that H(x) is positive definite and strong. Then there ^ exist constants C1 > 0 and C2 > 0 and an open sphere S(x,6) such that 1 ix < gx)- g(ll < C - (4.1.10) U x,y C SCx,6)cD. Proof: By the definition of a strong derivative, given any

69 > 0 3 a 6 > 0 9 I[gCx) -gCy} - H(x( -)Cx y)lll < lx - yll C4.1.11) e x,y 6 S(x,6)CD. This implies that tlH(x)(x - y)) - Ilx -y l <lg(x) -g(y) < (4.1.12) -H(x)(x- y)l+ + x - t x,y 6 S(x,6). Further, because H(x) is positive definite (C1 - E) lix- yl g(x) - g(y)li (02 + C) Ix - yl (4.1.13) where C > 0 and C2 > 0 If is chosen less than C then the choices C =1- and C = C2 + yield the result of the lemma. Q.E.D. 4.2 Fundamental Definitions and Lemmas The results of this section form the foundation'for the construction of the general minimization algorithm. In what follows the matrices U and V can be viewed as representing a data set S as discussed in chapter 2. We begin with a lemma. n 1 Lemma 4.2.1 Suppose f(x):DCR + R is twice differentiable at x 6 D and that H(x) is positive definite and strong. Let U be any n x m matrix, m > 0. For a G (0,o) sufficiently small 4 a 6 > 03 if x,y 6 S(x,6) and if v= x- y R (U) then l(I - UU+)ul > luiu (4.2.1) where u = g(x) - g(y). Proof: If v = 0 the lemma is clearly satisfied. As.tumez then t-ht v / 0. First we have IlCI- UU+)ul = Ilvl IC( - UU)uIlv lv-: C4.2.2) I C1v'(I - UU )u 1lv 11l

70 where ttke constant C1 > 0 depends on the norm. ||. Since v 6 R (U) it follows that IIC- UU )ui > Cv u Iv II 1 (4.2.3) Letting e = u - H(x)v IICI - Uuull > C vH(xv Vvll -1 + C1ve lv I 1 (4.2.4) Because H(x) is positive definite V(I -UU+)uJI C2 ilvjl + Clv'e Jv i (4.2.5) for some C2 > 0. By lemma 4.1.11 3 a C3 > 0 dependent on H(x) and 2 3 a 61 > 09 1 Ilvll C3 Ilull (4.2.6) g x,y 6 SCx,61). Because g(x) has a strong derivative at x, for any ~> 0 3 a 2 > 09 Hell = lu -H(x)vll < E lv (4.2.7) U x,y 6 S(x,62). Further, by lemma 4.1.11 3a 63 > 0 and a C4 > 0 9 lvll < c4 t uI (4.2.8) tl x,y 6 S(x,3). It follows that for all x,y e S(x,4), 6= min. {^1 62 63} 11(i- UU+)ull C2C3 Hu -CI v'el Iv1 l1 (4.2.9) > cCull -Cll ivi lie II lvii> c23 Iull - C1 le -1 (CC- C1C4 C) U 1 If s is chosen less than C2C3(C1C4) the choice 6 = 6 and 0 a< (C2C3 C1C4 e) yields the result of the lemma. Q.E.D. Remark 4.2.2 A review of the proof of lemma 4.2.11 will verify that the ~ " ~ ~ ~' ")

71 maximum value of a depends, only on the norm chosen and on the matrix 11(x). Remark 4.2.3 When the norm of lemma 4.2.1 is||.I 2' 11( -UU),U | U2 u2 for any u and therefore only a.6 (o,1] need be considered. This follows because the operator - UU+ is the projection operator whose range is R -U). Next a family of data sets will be defined which has properties useful in proving the rate of convergence results. The general minimization algorithm will be constructed to choose only data sets S from this family of data sets. Definition 4.2.4 Suppose f:DCRn + R is differentiable on D. Let S(x,6a), x 6 D, 6 > 0, a 6 (0,1], be the family of data sets S5 (U,V) with the following properties. 1. Each pair (U,V) consists of a matrix U with n rows and m columns and a matrix V with n rows and m columns where m may have any value in {l...,n}. 2. The columns v., i = l,...,m, of V are nonzero and of the form 1 x. - yi where x.,yi 6 S(x,6) D. 3. The columns u., i = 1,...,m, of U are nonzero and of the form g(xi) - g(Yi) where x. - y. is the column v. of V. 4. Th columns u, i = 2,... m, of U are such that 11( - Ui_ U )Ui 2 a Ui 2 (42.10) where U. is the n x(i-l) matrix whose columns are the first i-l columns of U with the same ordering as the columns of U. As remark'.2.3 indicates, it is enough to consider a 6 (0,1] since if a > 1 condition 4.2.10 is never satisfied. We next verify that, under appropriate conditions on fCx), the family S(x,S,a) is

72 not empty. Lemma 4.2.5 Suppose f(xl:DcRn R is differentiable on D, twice differentiable at x 6 D and that H(x) is positive definite and strong. Then for a 6 CO,1] sufficiently small and for any 6 > 0, the family SCx,,S1i is not empty and in fact contains data sets with m elements for m = 1,..,n. Proof: By lemma 4.1.11 3 a 6 > 0 9 u = g(x) -g(x) g 0 A ^ whenever v:= x - x ~ 0 and x e S(x,l6). Therefore the set S(x,6,a) is non-empty for any 6 > 0 since any pair of one column matrices (u,v) chosen so that x 6 S(x,6) n S(x,61) is an element of S(x,6,a). If (Ui,Vi) is a pair in S(x,6,a) with i < n columns and if a is sufficiently small,a pair (Ui+,Vi+l) 6 S(x,,a) with i+l columns can be constructed as follows. Let V.i = [Vi,v.] and 1+1 = C1 A ~ A Ui+l= [U.,ui] where v. = x. -x e R (Ui), x x, x. S(x,6) u = g(x) - g(x) 0, and 11(I - UUU)uiI 2 I 2 Tha such a choice can be made follows from lemmas 4.1.11 and 4.2.1. Since this construction can be accomplished for i = l,...,n-l, the family S(x,6,a) contains data sets with m elements, m = l,...,n. Q.E.D. The following lemma establishes a bound on a matrix generated from elements of S(x,6,a) which will be needed in the rate of convergence proof. Lemma 4.2.6 Assume f(x):D Rn + R is twice differentiable at x 6 D and that H(x) is positive definite and strong. Given any a 6 (0,1] for every E > 0 3 a 6 = 6(e) > 0 S e data sets (U,V) 6 S(x,6,a) I(U - HCx)V)U C(a) (4.2.11) where the constant C depends only on a.

73 Proof: Because of the definition of SC(x,,a if CU,V) 6 SC x,,a) the columns of U are linearly independent. Therefore, by theorem 2.5.1 U. = i lu]i U (lI - u(zzi) (i C4.2.12) 3-C' ) - 1 I i-I'I i= i (Z ) Zi. where u. is the i th column of U, Ui (U 1) is the matrix whose columns are the first i (i-l) columns of U with the same ordering as the columns of U and z. = (I - U.U)ui (4.2.13) Using the explicit representation given in equation 4.1.5 for the matrix norm II UI we have that if (U,V) 6 S(x,,a) and ui is the i th column of U k-l!1-Uizzi 1 t II III - u(ziziY Z -1 k j l(z 1)1 (4.2.14) 1II-.i'~ =zi~z Z 00 u izi 1 j=1 i, k 1 +< max u u Cz.z) liii ~ ^~1 11 n., jl + zli { uiz.~zlzi>2 k for some component u. of u.depends on Thus i ad i. i oo | i max 11u 2zlz iz) + 2I n IliU l I-zill Z + 1 i''' --' 2~~~~~00

74 thberefore i - uizz. zM z C2 a 1 (4.2.16) for some constant C2. Further, fbrm equation 4.1.7 II - u.(Z. z )z < C (C + 1) (4.2.17) 1 1 ii i -- 3 2' where C3 depends on the matrix norm chosen. Next, let u. denote the 1 i th row of the matrix U+ and notice that repeated applications of theorem 2.5.1 yield m ji1 = 1T zi(I - u(z z.) z.)(z.z.) i (4.2.18) j=i+l j Jlugl (zzm) zll (4.2.19) where m is the number of columns of U. From equation 4.2.17 it can be concluded that uI <C4 (C3(C2 a ( + 1))m1 Z (4.2.20) < C(C3(C2 ao + l)) l C a 1 -4 3 2 ~Z ~1)) u.1, where the constants C, and C depend on the choice of norm and where C3 is chosen greater than 1. It follows by lemma 4.1.11 that for some 61 > 0 all columns u. and v. of any pair (U,V) 6 S(x,61,a) are such that llvi < C6 uI and therefore u. Cm-1 - -1'U < C4(C3(C2 a1 + l)M C5C 0 v. ~ (4.2.21) 1 — s 1 Because H(x) is strong, given C < 0 3 a 62 = 62(c) > 0 3 all columns u. and v. of any pair (U,V) 6 S(x,62,c) are such that 1 1 lu. - H(x)vi. < e lvil. (4.2.22) Finally, since

75 ll(U -,H(x)V)U = Z (u. - H(x)v.)u. (4.2.23) -'' i~ll. _-v i ~i"'11' we have that for any e > 0 3 a 6(0) = min {61,62(E)} ICU - HCx)V)U+ < iE llv ( C(C3C 2 l +C1)) (4.2.24) i=l C C -1 C a1-1 < emc4(C3(C2 a + 1)a CC a Q.E.D. The next lemma shows that if 6 is sufficiently small and if the vector w is nearly in R(U), then the vector p* - VU w nearly equals H (x)w when (U,V) 6 S(x,,a). Lemma 4.2.7 Assume f:DCR + R is twice differentiable at x 6 D and that H(x) is positive definite and strong. Let a 6 CQ,1] and > 0 be given and consider the vectors p (I - UU +) (4.2.25) p* - VU + (4.2.26) h - p* - H Cx)W (4.2.27) where (U,V) e S(x,6,a) and w 6 Rn. For every 6 > 0 ] a 6G) > Q 3 j (U,V) 6 S(x,6,a) and any G 6 Rn if IPI <1 310)1 (4.2.28) then lIhj < U + C E) jlHjHCx)l 1 Eg (4.2.29) where the constant C depends only on a. Proof: If w= 0 the lemma is clearly satisfied, so assume U) 0. We have

76 lhil = f(VU+- -1H(x)),w (4.2.30) = IH-(x)(H(x)VU+ -I )l < 1 H 1(X)11 (H(X)V-l ) )1 I llH- (x)l I(H(x)V - U)Ut- (I -UU+)w11 < UH- (x)l ( IIH(Hx)V- U)U+11 I11 + I(U- U+)Wl ). By lemma 4.2.6, for every -e > o0 a 6 > 0 9tj pairs (u,v) e S(x,6,a) Q.E.D.. Remark 4.2.8 If (U,V) e S(x,6,a) and U and V have n columns then U is nonsingular and p = 0 for every w e Rn. In this case B = 0 satisfies equation 4.2.21 for any w e R and equation 4.2.31 reduces to lihl < C e IH-1(X)II j wj. (4.2.32) Also, if the norm of equation 4.2.28 is U 2 then by remark 4.2.3 8 may be chosen in [0,1] without loss of generality. Finally we show that for 6 sufficiently small if the vector w is "nearly in" R(U) then the angle between pa of the previous lemma and w is bounded less than 90~ when (U,V) is any pair in S(x,6,a). Lemma 4.2.9 Assume f:DCRn -+ R is twice-,differentiable at x e D and that H(x) is positive definite and strong. Let a e (0,1] be given and consider the vectors p (I - UU+)w (4.2.33) and p* - VU+W (4.2.34)

77 where (V,U) e S(x,6,a) and eG R, O. For > 0 sufficiently small a 6 > 0 9 if PFU< 81elHI (4.2.35) then ~,p* 1lwlRp*"> B. (4.2.36) Proof: Notice first that with h = pi - H l(x)W w l'H-(X UP,-l(XG - w * H11,X)II II11 -l H =(x)w p 1HpwH (x)w = - Hx). I1~11 ||H- 1(X)~1 ll l~ ll IIP* ||iU IIH-Z(:)Jll 1 u 1 II H-i(x) |11 ||H (x)a || lIw'h +1+w'p*ll- I[H'(x)wll lp*lf-l < ilhll 1 IIP - lIH-1(X.?IIIl ~IHl(x)w"( ] I lHl(x)l )g Utilizing the result of lemma 4.2.7 and the fact that x - yl< c implies that I Ix - aYII < c, we have that given aoe (0,1] for every $ > O a 6(c) > 3 J (U,V) e S(,,) IIP-II l < l 11|, o, e R implies w H l(X IP* <.2(- + C ) H- (:)t! 11 Ow|l IH-1(X^lI 11 1llp, 11 11 l X)ai < C2( + C1 e) (4.2.38) where ~ > O, C0 > 0 depends only on a and C2 > 0 depends only on HCx). Because HCx) is positive definite 3 a constant C3 > a 9 8U ) o 0

78 > C (4.2.39) II1 a II Cx)l iTherefore, given a C (0,1] for every t > 0:a 6 > 0 9 (U,V) e S(x,6,a) and any non-zero 0 e Rn, - I11 <3 II ll implies ~ P* > C - C 2C + C s). (4.2.40) 11 11 II * 3 2 Choosing B less than C3(1 + C2)- any 6 corresponding to an e less than (C3 - (1 + C2))(CC) yields > c - c2( + C(C - (1 + C2)6)(C )) Util.,p I 1 3 2 3 2 2 1 II lp" > C3 - C2~ - 03 + (1 + C2)B (4.2.41) > 3. Notice that B must be less than C3(1 + C ) in order for there to 2 exist a positive e less than (C3 - (1+ C2) )(C2C1)-1 Q.E.D. Remark 4.2.10 If the norm is 1 I 2 then the inequality IIP 112 < 1 n2 (4.2.42) is equivalent to the inequality _t n< 8. (4.2.43) This may be seen as follows. If IIP12 < W 11 2 then IIPII2 <1 11W112 IIP 112. (4.2.44)

79 It follows that Clo (I- UUU I ( uut W <. (4.2.45) Ilw ii 2 lIP 12 However w CI - UU+) (I - UU+ ) = a CI - UU+)w (4.2.46) tp which leads to the conclusion that X p _ < A. (4.2.47) 11 112 P 112 The argument is reversible. 4.3 General Minimization Algorithm We now propose a general algorithm for the minimization of a function f(x):DCRn + R. The algorithm is divided into three stages. In Stage I a direction is chosen for a linear search based on a data set Sk and the gradient g(xk) and an approximate linear minimization is performed starting at xk. In Stage II a tentative new data set Sk0 is chosen which lies in an appropriate family S(x,5,a). In Stage III, if it is necessary, the data set Sk is augmented in stages kO Sj j = 1,2,..., to form the data set S+l by choosing a sequence of auxiliary steps which insure that for k sufficiently large the data set S contains n elements. k+l The general algorithm is constructed to allow as much latitude as possible in the choice of a sequence of data sets S. There is also, under certain circumstances, latitude in the choice of a direction for the linear search. Convergence and rate of convergence results will be presented for the general algorithm. In Chapter 5 tiese results are supplemented by numerical experiments with algorithms constructed within

80 the framework of the general minimization algorithm. Before proceeding a definition is needed. Definition 4.3.1 A function o(t): [0,) + [o0,) ~s a forcing function if Limit (t ) = 0 implies that Limit t = 0 k+. k k + K k. to We now state the general minimization algorithm. Algorithm 4.3.2 Suppose the function fCx):DCRn + R is to be minimized on D. Let x 6 D be given and set k = 0. Let K be any finite subset of Ip, the positive integers, and let K2 by any subset of Ip Let a e (0,11, 3 6 (0,1] and Z 6 I be given. Also, let the forcing functions A' ), a 2.) and a C.) be given. Stage I 1. If gk gx) =0 stop. 2. If Uk and V are not defined, go to step 5 of Stage I. 3. If Pk C(I - UkUk) gk 0 and if k satisfies k gk(f Ikll 2 llgklI 22- >' set pk = Pk and go to step 7, Stage I. 4. If p VkUk and if p* satisfies Pk k gk k a k P* gk( IIPkl 2 ligk 1 2 > 8, set k= p* and go to step 6, 1 k P 2 9k 2k P Stage I. 5. Set pk equal to any nonzero vector such that pgk ( IIPk112 ljgk 2I 2) > B and go to step 7, Stage I. 6. If: f(x) - f - k) > (llgkU2), set xki = - p A = Xk+l - Xk Agk = gk+l gk and go to Stage II. 7. Set Xk+l = Xk - XkPk, where k is chosen by a steplength algorithm such that f(xk+i) - f(x) > a2(pkgk >Pk 2 ) Axk = Xk+1 - X gk = gk+l and go to Stage II. Stage lI 1. Set j = 0 and let Vkj,Ukj be any two n x m, m 6 {l,...,n}, kj'3kj

matrices with the following properties.
   a. The columns of Vkj are a non-empty subset of the vectors Δxk̄ and (when defined) Δxk̄j', k - ℓ ≤ k̄ ≤ k, j' ≥ 0. The columns of Ukj are nonzero and are the corresponding vectors Δgk̄ and (or) Δgk̄j' with the same ordering as the columns of Vkj. (The specific ordering of the columns of Vkj and Ukj is of no consequence. However, it is important that the correspondence between the columns of Ukj and Vkj be preserved.)
   b. The i th column u_i of Ukj is such that

         ||(I - Ukj^{i-1}(Ukj^{i-1})+)u_i||_2 > α ||u_i||_2,   i = 2,...,m        (4.3.1)

      where Ukj^{i-1} is the matrix consisting of the first i-1 columns of Ukj with the same ordering as the columns of Ukj.
2. If k ∈ K1, either set Uk+1 = Ukj, Vk+1 = Vkj, k = k+1 and return to Stage I, step 1, or leave Uk+1 and Vk+1 undefined, set k = k+1 and return to Stage I, step 1. If k ∉ K1, go to Stage III.

Stage III
1. If Uk and Vk are undefined, or if Ukj has n columns, or if Ukj has more columns than Uk, or if Ukj has as many columns as Uk and k ∈ K2, set Uk+1 = Ukj, Vk+1 = Vkj, k = k+1 and return to Stage I, step 1.
2. Set pkj equal to any nonzero vector in R⊥(Ukj) and set xkj = xk - λkjpkj, where λkj > 0 is chosen so that σ3(||xkj - xk||) ≤ ||gk||; set Δxkj = xkj - xk and Δgkj = gkj - gk.

3. If Δgkj ≠ 0 and ||(I - UkjUkj+)Δgkj||_2 > α||Δgkj||_2, set Uk(j+1) = [Ukj, Δgkj], Vk(j+1) = [Vkj, Δxkj], j = j+1 and go to step 1, Stage III.
4. Set Uk+1 = Ukj, Vk+1 = Vkj, k = k+1 and return to Stage I, step 1.

Remark 4.3.3 Stage III of algorithm 4.3.2 can be formulated in a slightly different manner to make better use of the auxiliary points xkj. Step 1 of Stage III remains unchanged. Steps 2, 3 and 4 of Stage III are replaced by the following procedure.
2. If p̄kj = (I - UkjUkj+)gk ≠ 0 and p̄kj'gk (||p̄kj||_2 ||gk||_2)^{-1} > β, set xkj = xk - λkjp̄kj, where λkj is chosen by a steplength algorithm such that f(xk) - f(xkj) > σ2(p̄kj'gk ||p̄kj||_2^{-1}). Set Δxkj = xkj - xk, Δgkj = gkj - gk, xk = xkj and go to step 4.
3. Set xkj = xk - λkjpkj where pkj is any nonzero vector in R⊥(Ukj) and λkj > 0 is such that σ3(||xkj - xk||) ≤ ||gk||. Set Δxkj = xkj - xk and Δgkj = gkj - gk.
4. If Δgkj ≠ 0 and ||(I - UkjUkj+)Δgkj||_2 > α||Δgkj||_2, set Uk(j+1) = [Ukj, Δgkj], Vk(j+1) = [Vkj, Δxkj], j = j+1 and go to step 1, Stage III.
5. Set Uk+1 = Ukj, Vk+1 = Vkj, k = k+1 and return to Stage I, step 1.

Notice that because I - UkjUkj+ is a projection operator, if p̄kj'gk (||p̄kj||_2 ||gk||_2)^{-1} ≤ β then w'gk (||w||_2 ||gk||_2)^{-1} ≤ β for any nonzero vector w in R⊥(Ukj). Therefore, if the test of step 2 fails, it is futile to search for other vectors in R⊥(Ukj) bounded away from being orthogonal to gk by β. Vectors in R⊥(Ukj) are chosen because, as will be seen, such a choice results eventually in the satisfaction of the condition of step 4 of remark 4.3.3. The convergence and rate of

convergence results of the next section are given only for algorithm 4.3.2. The modifications to the proofs of these results when the procedure of remark 4.3.3 is employed are straightforward.

4.4 Convergence and Rate of Convergence

Before presenting the convergence and rate of convergence results several preliminary definitions and lemmas are needed.

Definition 4.4.1 Let f(x): D ⊂ R^n → R^1 be any function. A level set of f, denoted by L(c), is the set of all x ∈ D such that f(x) ≤ c.

Definition 4.4.2 Let f(x): D ⊂ R^n → R^1 be any function. The path connected component of the level set L(c) containing x̄ ∈ D, denoted by L_x̄(c), is the set of all x ∈ L(c) such that there exists a continuous function p(t): [0,1] → L(c) with p(0) = x̄ and p(1) = x.

Definition 4.4.3 Let f(x): D ⊂ R^n → R^1 be any function differentiable on D and let {xk} be any sequence contained in D. A sequence of non-zero vectors {pk} ⊂ R^n is gradient related to {xk} if there exists a forcing function σ(·) such that |g(xk)'pk| ||pk||^{-1} ≥ σ(||g(xk)||) for all k.

Definition 4.4.4 Let f(x): D ⊂ R^n → R^1 be any function differentiable on D. A critical point of f(x) is any point x ∈ D such that g(x) = 0.

Definition 4.4.5 Let f(x): D ⊂ R^n → R^1 be continuously differentiable on D and assume that for D0 ⊂ D

    ā = sup {||g(x) - g(y)||: x,y ∈ D0} > 0.                         (4.4.1)

The mapping δ(t): [0,∞) → [0,∞), defined by

    δ(t) = inf {||x - y||: x,y ∈ D0, ||g(x) - g(y)|| ≥ t}   for t ∈ [0, ā],
         = δ(ā)                                             for t ∈ (ā, ∞),      (4.4.2)

is the reverse modulus of continuity of g(x) on D0.

The following two lemmas are necessary to prove that, under appropriate conditions, the sequence {xk} generated by algorithm 4.3.2 is determined.

Lemma 4.4.6 Assume that f(x): D ⊂ R^n → R^1 has a continuous derivative on D, that D0 ⊂ D is compact and that ā of equation 4.4.1 is positive. Then δ(t), the reverse modulus of continuity of g(x) on D0, is positive for all t > 0 and δ(t) is a forcing function.

A proof of this result may be found in (ORT 70, lemma 14.2.6).

Lemma 4.4.7 Assume that f(x): D ⊂ R^n → R^1 has a continuous derivative on D, that for x ∈ D, L_x(f(x)) is compact and that for p ∈ R^n, g(x)'p > 0. There exists a solution for λ of the equation

    λ = min {λ̄ > 0: p'g(x - λ̄p) = μ p'g(x)}                          (4.4.3)

with (x - λp) ∈ L_x(f(x)) and μ ∈ [0,1). Further

    f(x) - f(x - λp) ≥ σ(g'(x)p ||p||^{-1})                           (4.4.4)

where σ(t) = μ t δ([1 - μ]t), t ≥ 0, and where δ(t) is the reverse modulus of continuity of g(x) on L_x(f(x)).

Proof: A proof of this result is contained in the proof of lemma 14.2.7 of (ORT 70). In order to apply lemma 14.2.7 and its proof, it is necessary to note that any connected component of a compact set is compact.

It will now be shown that if σ2(·) is properly chosen, algorithm 4.3.2 generates a well-defined sequence {xk}.

Theorem 4.4.8 Suppose f(x): D ⊂ R^n → R^1 is continuously differentiable on D and that for x0 ∈ D, L(f(x0)) is compact. Suppose the forcing function σ2(t) of algorithm 4.3.2 is such that

    σ2(t) ≤ μ t δ([1 - μ]t)                                          (4.4.5)

for all t > 0, where δ(t) is the reverse modulus of continuity of g(x) on L(f(x0)) and μ is a fixed constant in (0,1). If algorithm 4.3.2 is applied to f(x) starting at x0 then the sequence {xk} is contained in L(f(x0)) and is determined for all k ≥ 0 unless for some k ≥ 0, gk = 0.

Proof: Notice first that Stage II of algorithm 4.3.2 can always be completed since a pair Δxk̄, Δgk̄ always exists for k - ℓ ≤ k̄ ≤ k and since a one column matrix always satisfies condition b of Stage II, step 1. Also, Stage III of algorithm 4.3.2 can always be completed since a return to Stage I is generated if Ukj has n columns or if step 2, Stage III fails to generate a column with which to form Uk(j+1) and Vk(j+1). It remains to be shown that Stage I can always be completed. Since for k = 0, xk ∈ L(f(x0)) is given and gk ≠ 0, unless the theorem is trivially satisfied, assume that at stage k, xk ∈ L(f(x0)) and gk ≠ 0. In order to specify xk+1 it is necessary that pk ≠ 0 and λk be determined. Either pk will be set to p̄k ≠ 0 or pk will be set to pk* ≠ 0 or step 5 of Stage I will be executed. If step 5 of Stage I is executed then the choice pk = gk ≠ 0 is always satisfactory, so that pk can always be determined. Now λk will either be chosen equal to one in step 6 of Stage I or step 7 of Stage I will be executed. By lemma 4.4.7, since f(x) is continuously differentiable on D, there always exists a λk such that

    f(xk) - f(xk - λkpk) ≥ σ2(gk'pk ||pk||^{-1})                      (4.4.6)

so that if step 7 of Stage I is executed λk is determined and therefore xk+1 is determined. Further, if either step 6 or step 7 of

Stage I is executed, xk+1 ∈ L(f(x0)). At stage k+1 then either gk+1 = 0, in which case the sequence {xk} ⊂ L(f(x0)) is determined up to stage k+1, or gk+1 ≠ 0, in which case an inductive cycle is complete. Q.E.D.

Before proceeding with the convergence results one further lemma is required.

Lemma 4.4.9 Suppose f(x): D ⊂ R^n → R^1 is continuously differentiable on D, that D0 ⊂ D is compact and that {xk} ⊂ D0 is any sequence such that Limit g(xk) = 0 as k → ∞. Then the set Ω of all critical points of f(x) in D0 is not empty and

    Limit  {inf ||xk - x̄||: x̄ ∈ Ω} = 0.                              (4.4.7)
    k → ∞

Proof: Because D0 is compact, {xk} has a convergent subsequence {xj} and if Limit xj = x̂ then, because g(x) is continuous on D, g(x̂) = 0 so that x̂ ∈ Ω and Ω is not empty. Next assume that

    Limit  {inf ||xk - x̄||: x̄ ∈ Ω} ≠ 0.                              (4.4.8)
    k → ∞

Again, because D0 is compact there must exist a convergent subsequence {xi} of {xk} such that

    Limit  {inf ||xi - x̄||: x̄ ∈ Ω} = δ̄ > 0.                          (4.4.9)
    i → ∞

But because for any convergent subsequence {xj}, Limit xj = x̂ and x̂ ∈ Ω,

    Limit  {inf ||xj - x̄||: x̄ ∈ Ω} ≤ Limit ||xj - x̂|| = 0             (4.4.10)
    j → ∞                              j → ∞

which is a contradiction. Q.E.D.

We next demonstrate that under appropriate hypotheses the sequence {xk} generated by algorithm 4.3.2 converges in the sense of equation

4.4.7 to the set of critical points of f(x).

Theorem 4.4.10 Suppose f(x): D ⊂ R^n → R^1 is continuously differentiable on D and that for x0 ∈ D, L(f(x0)) ⊂ D is compact. Then the set Ω(x0) of critical points of f(x) in L(f(x0)) is not empty and if algorithm 4.3.2 is determined for all k ≥ 0 then

    Limit  {inf ||xk - x̄||: x̄ ∈ Ω(x0)} = 0.                           (4.4.11)
    k → ∞

Proof: First, because f(x) is continuous on D and because L(f(x0)) is compact, f(x) is minimized on L(f(x0)) at some x̂ ∈ L(f(x0)). Since g(x) is continuous on D it must be that g(x̂) = 0 and therefore Ω(x0) is not empty. Because of steps 6 and 7 of Stage I, {xk} ⊂ L(f(x0)). In order to establish the result of the theorem it is necessary to show that Limit gk = 0 as k → ∞ and to apply lemma 4.4.9. Notice that the sequence {pk} generated by the algorithm is gradient related to the sequence {xk} since, because of steps 3, 4 and 5 of Stage I,

    |gk'pk| ||pk||^{-1} > β ||gk|| = σ(||gk||)                        (4.4.12)

for all k, where σ(t) = βt. Consider the subsequence {xj} of {xk} consisting of all elements of {xk} generated by step 7 of Stage I of the algorithm. Because f(xk+1) ≤ f(xk) for all k and because of step 7, it must be that

    f(xj) - f(xj+1) > σ2(pj'gj ||pj||^{-1}).                          (4.4.13)

The function f(x) is continuous and bounded below on L(f(x0)) and therefore

    Limit  f(xj) - f(xj+1) = 0.                                       (4.4.14)
    j → ∞

It follows from the definition of a forcing function that Limit pj'gj ||pj||^{-1} = 0 as j → ∞. Since {pj} is gradient related to {xj},

Limit σ(||gj||) = 0 as j → ∞. This in turn implies that

    Limit  ||gj|| = 0.                                                (4.4.15)
    j → ∞

Consider next the subsequence {xi} of {xk} consisting of all elements of {xk} generated by step 6 of Stage I. The subsequence {xi} contains all elements of {xk} not contained in {xj} since for all k either step 6 or step 7 of Stage I is executed. Because f(xk+1) ≤ f(xk) for all k and because of step 6

    f(xi) - f(xi+1) > σ1(||gi||).                                     (4.4.16)

Again, because f(x) is continuous and bounded below on L(f(x0)),

    Limit  f(xi) - f(xi+1) = 0                                        (4.4.17)
    i → ∞

and therefore

    Limit  ||gi|| = 0.                                                (4.4.18)
    i → ∞

Since {xi} and {xj} together contain all elements of {xk}, Limit ||gk|| = 0 as k → ∞. Application of lemma 4.4.9 completes the proof. Q.E.D.

Theorem 4.4.10 demonstrates that the sequence {xk} generated by algorithm 4.3.2 gets arbitrarily close to the set Ω(x0) of critical points of f(x) in L(f(x0)). What has not been shown is that

    Limit  ||xk+1 - xk|| = 0.                                         (4.4.19)
    k → ∞

It may in fact be possible for the sequence {xk} to "hop around" in the neighborhood of different elements of Ω(x0). It is evident that if Ω(x0) contains only one element x̄ then

    Limit  xk = x̄                                                    (4.4.20)
    k → ∞

and

    Limit  ||xk+1 - xk|| = 0.                                         (4.4.21)
    k → ∞

If Ω(x0) contains only a finite number of elements, then the following result holds.

Theorem 4.4.11 Suppose f(x): D ⊂ R^n → R^1 is continuously differentiable on D and that for x0 ∈ D, L(f(x0)) ⊂ D is compact. Suppose the set Ω(x0) of critical points of f(x) in L(f(x0)) has a finite number of elements. If algorithm 4.3.2 is applied to f(x) starting at x0, if the sequence {xk} is determined for all k ≥ 0 and if {xi} and {xj} are any two convergent subsequences of the sequence {xk} with limit points x̃ and x̂ respectively, then f(x̃) = f(x̂).

Proof: By theorem 4.4.10

    Limit  {inf ||xk - x̄||: x̄ ∈ Ω(x0)} = 0.                           (4.4.22)
    k → ∞

Assume f(x̂) < f(x̃). Because the subsequence {xj} is convergent, given δ > 0 there exists a j1 such that for all j > j1, ||xj - x̂|| < δ. Because f is continuous on D, given ε > 0 there exists a δ > 0 such that ||f(x) - f(x̂)|| < ε whenever ||x - x̂|| < δ. Therefore, given ε > 0 there exists a j1 such that for all j > j1, ||f(xj) - f(x̂)|| < ε. Because the sequence {f(xi)} is non-increasing, f(xi) ≥ f(x̃) for all i. If ε is chosen less than f(x̃) - f(x̂), then for all j > j1 and for all i, f(xi) > f(xj). Because both sequences {xi} and {xj} are not finite it must be that the sequence

{f(xk)} is not non-increasing, which is a contradiction. Q.E.D.

Finally, it will be demonstrated that under appropriate conditions the sequence {xk} of algorithm 4.3.2 for large enough k is generated by the recursion

    xk+1 = xk - VkUk+gk                                               (4.4.23)

where the matrices Vk and Uk have n columns and are elements of S(x̄,δ,α) for arbitrarily small δ, and that the rate of convergence of the sequence {xk} is superlinear.

Theorem 4.4.12 Suppose f(x): D ⊂ R^n → R^1 is continuously differentiable on D and that for x0 ∈ D, L(f(x0)) ⊂ D is compact. Suppose algorithm 4.3.2 is applied to f(x) starting at x0 and that

    Limit  xk = x̄                                                    (4.4.24)
    k → ∞

where x̄ ∈ Ω(x0), the set of critical points of f in L(f(x0)). Suppose f(x) is twice differentiable at x̄ and that H(x̄) is positive definite and strong. If the constants α and β of algorithm 4.3.2 are chosen sufficiently small, if the forcing function σ1(·) is such that

    σ1(t) ≤ C [ min    w'H(x̄)w (||w||_2 ||H(x̄)w||_2)^{-1} ] (2||H(x̄)||_2)^{-1} t²        (4.4.25)
               w ∈ R^n

for all t sufficiently small and for some C ∈ (0,1), and if the set K3 = Ip - (K1 ∪ K2) contains n elements, each sufficiently large, then there exists a k* such that for all k > k*

    xk+1 = xk - pk*.                                                  (4.4.26)

Further

    Limit  ||xk+1 - x̄|| / ||xk - x̄|| = 0.                             (4.4.27)
    k → ∞

Proof: Since Limit xk = x̄, for any δ1 > 0 there is a k1 such that for all k > k1, xk ∈ S(x̄,δ1). Since Limit gk = 0, for any δ2 > 0 there is a k2 such that for all k > k2 and all j ≥ 0 such that xkj is defined, xkj ∈ S(xk,δ2). Choosing δ1 and δ2 such that δ1 = δ2 = (1/2)δ3 and k3 = max {k1,k2}, for any δ3 > 0 there is a k3 such that for all k > k3 and j ≥ 0, xkj, xk ∈ S(x̄,δ3). Because of the age bound ℓ in step 1 of Stage II, for any δ3 > 0 there is a k3 such that for k > k3 + ℓ and j ≥ 0, (Uk,Vk) and (Ukj,Vkj) are in S(x̄,δ3,α) where α is the constant of the algorithm. By lemma 4.2.1 and lemma 4.1.11, choosing δ3 above sufficiently small, we have that for all k > k3 + ℓ and α sufficiently small, Δgkj ≠ 0 and

    ||(I - UkjUkj+)Δgkj||_2 > α ||Δgkj||_2                            (4.4.28)

in Stage III, step 3. It must be then that with α chosen sufficiently small, for k > k3 + ℓ step 4 of Stage III is never executed. Therefore, for k > k3 + ℓ the only returns to Stage I are generated in Stage II, step 2 or in Stage III, step 1. Since K1 is finite it has a maximal element k4 and for k > k5 = max {k3 + ℓ, k4} the only returns to Stage I are generated in Stage III, step 1. Because Uk and Vk can only be left undefined by a return to Stage I from Stage II, step 2, for all k > k5, Uk and Vk are defined. Next, by lemma 4.2.9, remark 4.2.10 and lemma 4.1.11, choosing δ3 above sufficiently small, we have that in Stage I for β > 0 sufficiently small

    gk'p̄k (||gk||_2 ||p̄k||_2)^{-1} ≤ β                               (4.4.29)

and

    gk'pk* (||gk||_2 ||pk*||_2)^{-1} ≤ β                              (4.4.30)

cannot occur together for k > k5 in Stage I. Therefore, for k > k5, pk = p̄k or pk = pk*.

For k > k5, Stage III, step 1 generates a return to Stage I with Uk+1 having n columns, or with Uk+1 having more columns than Uk, or with Uk+1 having as many columns as Uk but with k ∈ K2. For k > k5 the number of columns of Uk and Vk cannot decrease and further, since K3 was assumed to have at least n elements, each sufficiently large, there exists a k6 > k5 such that for all k > k6, Uk and Vk have exactly n columns. By the definition of S(x̄,δ,α), Uk has full rank for all k and therefore for k > k6, Uk is nonsingular, p̄k = (I - UkUk+)gk = 0 and pk = pk*. Because f is twice differentiable at x̄ and because Limit xk = x̄, for every ε > 0 there is a k7 such that for k > k7

    f(xk) = f(x̄) + (1/2)(xk - x̄)'H(x̄)(xk - x̄) + rk,   ||rk|| ≤ ε ||xk - x̄||².        (4.4.31)

A verification of this result appears in (ORT 70, lemma 3.3.12). Further, because H(x̄) is strong and because Limit xk = x̄, for every ε > 0 there is a k8 such that for all k > k8

    g(xk) = H(x̄)(xk - x̄) + ek,   ||ek|| ≤ ε ||xk - x̄||.                              (4.4.32)

By remark 4.2.8 and because p̄k = 0 for k > k6, it follows that for every ε > 0 there is a k9 such that for k > k9

    ||hk||_2 = ||pk* - H^{-1}(x̄)gk||_2 ≤ C1 ε ||H^{-1}(x̄)||_2 ||gk||_2.               (4.4.33)

By lemma 4.1.11 there is a k10 such that for k > k10

    ||gk||_2 ≤ C2 ||xk - x̄||.                                                         (4.4.34)

Next, we have that

    xk - pk* = xk - H^{-1}(x̄)gk - hk                                            (4.4.35)
             = xk - H^{-1}(x̄)(H(x̄)(xk - x̄) + ek) - hk
             = x̄ - H^{-1}(x̄)ek - hk.

Therefore, because f is twice differentiable at x̄, there is a k11 > k10 such that for k > k11

    f(xk - pk*) = f(x̄) + (1/2)(H^{-1}(x̄)ek + hk)'H(x̄)(H^{-1}(x̄)ek + hk) + r(xk - pk*)     (4.4.36)

where ||r(xk - pk*)|| ≤ ε ||H^{-1}(x̄)ek + hk||². It follows that for every ε > 0 there is a k12 = max {k7,k8,k9,k10,k11} such that for all k > k12

    f(xk) - f(xk - pk*) = (1/2)(xk - x̄)'H(x̄)(xk - x̄)                             (4.4.37)
                          - (1/2)(H^{-1}(x̄)ek + hk)'H(x̄)(H^{-1}(x̄)ek + hk) + rk - r(xk - pk*)

where

    ||rk|| ≤ ε ||xk - x̄||²                                                       (4.4.38)
    ||ek|| ≤ ε ||xk - x̄||                                                        (4.4.39)
    ||hk|| ≤ C1C2 ε ||H^{-1}(x̄)||_2 ||xk - x̄||                                   (4.4.40)
    ||r(xk - pk*)|| ≤ ε c ||xk - x̄||².                                           (4.4.41)

Because H(x̄) is strong, for any ε > 0 there is a δ > 0 such that x ∈ S(x̄,δ) implies

    ||g(x) - g(x̄) - H(x̄)(x - x̄)|| ≤ ε ||x - x̄||                                  (4.4.42)

or

    ||g(xk)|| ≤ ||g(xk) - g(x̄) - H(x̄)(xk - x̄)|| + ||H(x̄)(xk - x̄)||               (4.4.43)
              ≤ (ε + ||H(x̄)||) ||xk - x̄||.

It follows that for every ε > 0 there is a k13 such that for k > k13

    (xk - x̄)'H(x̄)(xk - x̄) ≥ [ min    w'H(x̄)w (||w||_2 ||H(x̄)w||_2)^{-1} ] (ε + ||H(x̄)||_2)^{-2} ||gk||_2².     (4.4.44)
                              w ∈ R^n

Finally, if ε is taken sufficiently small, then there exists a k̄ = max {k12,k13} such that for all k > k̄

    f(xk) - f(xk - pk*) ≥ C [ min    w'H(x̄)w (||w||_2 ||H(x̄)w||_2)^{-1} ] (2||H(x̄)||_2)^{-1} ||gk||_2²          (4.4.45)
                             w ∈ R^n
                        ≥ σ1(||gk||_2)

for C ∈ (0,1). Therefore, for all k > k̄

    pk = pk*                                                                      (4.4.46)

and

    xk+1 = xk - pk*.                                                              (4.4.47)

The final result of the theorem follows immediately since for every ε > 0 there is a k̄ such that for all k > k̄

    ||xk+1 - x̄|| = ||H^{-1}(x̄)ek + hk|| ≤ ||H^{-1}(x̄)|| ||ek|| + ||hk|| ≤ ε (1 + C1C2) ||H^{-1}(x̄)|| ||xk - x̄||.

Q.E.D.

Remark 4.4.13 The coefficients α and β in theorem 4.4.12 are restricted to be less than some positive constants ᾱ and β̄, respectively. Specifically, α must be chosen so that inequality 4.4.28 is eventually satisfied and β must be chosen so that the inequalities 4.4.29 and 4.4.30 eventually are not satisfied together. These conditions arise from lemmas 4.2.1, 4.2.7, and 4.1.11 and remark 4.2.8. Referring to these lemmas we see that α must be chosen less than C2C3 of lemma 4.2.1 and that β must be chosen less than C3(1 + C2)^{-1} of lemma 4.2.7. Upon checking the origin of these constants it can be established that an adequate choice for ᾱ is

    ᾱ = min    v'H(x̄)v (||v||_2 ||H(x̄)v||_2)^{-1}                                 (4.4.49)
        v ∈ R^n

and that an adequate choice for β̄ is

    β̄ = min    u'H^{-1}(x̄)u (||u||_2 ||H^{-1}(x̄)u||_2)^{-1} [1 + ||H^{-1}(x̄)||_2 ||u||_2 (||H^{-1}(x̄)u||_2)^{-1}]^{-1}.     (4.4.50)
        u ∈ R^n

4.5 Conclusion

In theorems 4.4.8 and 4.4.12 it was necessary to restrict the admissible forcing functions σ1(t) and σ2(t) and the admissible constants α and β in a way which depends upon the objective function f(x). Specifically, in theorem 4.4.8, to insure that the sequence {xk} is determined, σ2(t) is bounded by a forcing function which depends on the reverse modulus of continuity of g(x) on L(f(x0)). In theorem 4.4.12, to insure that the terminal rate of convergence to x̄ is superlinear, the constants α and β are chosen less than constants related to H(x̄) and, for small t, σ1(t) is chosen less than a forcing function related to H(x̄).
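To make the role of these constants concrete, the following fragment estimates the ratio in (4.4.49) for a given Hessian by random sampling. It is an illustrative Python/NumPy sketch only (not part of the original FORTRAN programs); the function name and the Monte-Carlo approach are assumptions made here for illustration. A small value of the estimate indicates that α must be chosen small.

    import numpy as np

    def alpha_bar_estimate(H, samples=100000, rng=None):
        """Monte-Carlo estimate of  min_v  v'Hv / (||v||_2 ||Hv||_2)
        for a symmetric positive definite H (equation (4.4.49))."""
        rng = np.random.default_rng(rng)
        n = H.shape[0]
        V = rng.standard_normal((samples, n))
        HV = V @ H                                   # row i is (H v_i)' since H is symmetric
        ratios = np.einsum('ij,ij->i', V, HV) / (
            np.linalg.norm(V, axis=1) * np.linalg.norm(HV, axis=1))
        return ratios.min()

    # Example: a badly conditioned 2x2 Hessian gives a small ratio,
    # so the admissible alpha is correspondingly small.
    H = np.diag([1.0, 1.0e4])
    print(alpha_bar_estimate(H, rng=0))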

In practice it is unlikely that specific information will be available about the reverse modulus of continuity of g(x) on L(f(x0)) or about H(x̄). However, the choice of the forcing function σ2(t) does not cause difficulties since, as is seen in (ORT 70, section 14.2), steplength algorithms have been developed which insure that condition 4.4.5 is satisfied provided only that f(x) is continuously differentiable on L(f(x0)). The constants α and β depend upon the conditioning of the matrix H(x̄) and can be initially chosen small enough to cause superlinear convergence on functions with reasonably well conditioned minima. If, during the execution of the algorithm, it is determined that the α and β conditions are not being satisfied because the Hessian is not well conditioned at the minimum, these constants can be reduced. A similar procedure can be used with the forcing function σ1(t). Initially σ1(t) can be chosen as γt² for some positive constant γ which is small enough to insure superlinear convergence provided the Hessian at the solution is reasonably well conditioned. The constant γ can then be revised if necessary.

Suppose in algorithm 4.3.2, Stage III is eliminated and a return to Stage I is always generated from Stage II. If it happens that for all k greater than some k̄ matrices Uk and Vk can be chosen in Stage II having n columns and satisfying the age bound of Stage II, step 1a, and the angle condition 4.3.1, then the results of theorems 4.4.8 through 4.4.12 will still hold. Further, if there exists a subsequence of the iteration sequence where matrices Uk+1 and Vk+1 can be chosen with n columns that satisfy the age bound and the angle condition, then convergence on this subsequence will be superlinear. A

review of the proofs of theorems 4.4.8 through 4.4.12 should convince the reader that this is the case. The specific algorithms discussed in Chapter 5 do not implement Stage III of algorithm 4.3.2, as it was felt that for k sufficiently large it will nearly always be possible to choose matrices Uk+1 and Vk+1 with n columns while satisfying the conditions of Stage II.

In developing algorithm 4.3.2 the principal reasons for choosing the data sets Sk = (Uk,Vk) from the family S(x,δ,α) were:
1. To insure the existence of a matrix which emulates Sk over T for all k.
2. To insure that the norm of Uk+ remains appropriately bounded, so that for small δ, H^{-1}(x̄)UkUk+ - VkUk+ becomes small and so that pk* = VkUk+gk nearly equals H^{-1}(x̄)gk when gk is nearly in R(Uk).

These properties of S(x,δ,α) are fundamental and should be of value in developing other minimization procedures. It may be possible to obtain strong rate of convergence results even if the data sets are not restricted to some family S(x,δ,α). For instance, if it happens that Δgk is nearly in R(Uk), one might ignore the component of Δgk not in R(Uk) and consider the data set (Uk+1,Vk+1) = ([Uk,Δḡk],[Vk,Δxk]), where Δḡk is the component of Δgk contained in R(Uk). If this is done it may happen that no emulation exists for (Uk+1,Vk+1). However, the norm of Uk+1+ should remain appropriately bounded and, for small δ, pk* = VkUk+gk should approximate H^{-1}(x̄)gk.

It is conjectured that if H(x) is required to be Lipschitz continuous in a neighborhood of x̄ then it can be proved for algorithm

4.3.2 that

    Limit  ||xk+1 - x̄|| / ||xk - x̄||^{(ℓ+1)/ℓ}  ≤ C < ∞
    k → ∞

where ℓ is the age bound of the algorithm. The key to proving this result lies in obtaining a stronger bound on ||hk||_2 in equation 4.4.33. To do this requires that lemma 4.2.5 be modified so as to obtain the appropriate bound in lemma 4.2.6.
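Throughout this chapter the family S(x,δ,α) is maintained by the angle condition (4.3.1), which is what keeps ||Uk+|| appropriately bounded as columns are appended. A minimal illustrative sketch of that acceptance test follows (Python/NumPy; the function name and the convention of returning the data set unchanged on rejection are assumptions made here, not part of the original programs).

    import numpy as np

    def append_pair(U, V, u, v, alpha):
        """Append (u, v) as new columns of (U, V) only if u is bounded away
        from R(U):  ||(I - U U+) u||_2 > alpha ||u||_2  (condition (4.3.1))."""
        if U is None:                      # a one-column matrix always qualifies
            return u[:, None], v[:, None]
        resid = u - U @ (np.linalg.pinv(U) @ u)
        if np.linalg.norm(resid) > alpha * np.linalg.norm(u):
            return np.column_stack([U, u]), np.column_stack([V, v])
        return U, V                        # reject the pair; data set unchanged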

CHAPTER 5
NUMERICAL EXPERIMENTS

5.1 Introduction

Minimization algorithm 4.3.2 is constructed so that convergence is insured and so that the terminal rate of convergence is superlinear for a large class of objective functions. At the same time there is considerable latitude in the choice of a specific algorithm within the general framework of algorithm 4.3.2. This latitude is intended so as to allow an algorithm to be developed which performs well away from the minimum and which retains the desirable convergence and terminal rate of convergence properties. Because it is very difficult to obtain theoretical results about the performance of an algorithm away from the minimum, some experimental work is usually necessary to evaluate the overall performance of a specific algorithm. To verify that it is feasible to construct an effective algorithm within the framework of algorithm 4.3.2, two algorithms were programmed in FORTRAN. These algorithms were applied to a variety of test functions and the results were compared to the results obtained using the version of Davidon's algorithm available in the I.B.M. 360 Scientific Subroutine Package. The results of these numerical experiments are presented in this chapter. We begin by describing the version of Davidon's algorithm which was used as the standard of comparison.

5.2 Davidon's Algorithm

The Scientific Subroutine Package version of Davidon's algorithm was chosen because it is widely circulated and accepted as an efficient program for minimizing functions. Certain drawbacks in the algorithm, as a standard of comparison, have been accepted in order to obtain a

program which is readily available to other investigators. The algorithm uses the Davidon recursion formula to generate a sequence of matrices {Hk} beginning with the matrix H0 = I. It differs from the algorithm proposed by Fletcher and Powell (FLE 64) in two important respects. First, when certain rather pathological conditions occur the matrix Hk is reset to the identity matrix. Second, the linear search subalgorithm does not perform an exact linear minimization. Except for these variations the Scientific Subroutine Package Davidon is the Fletcher-Powell-Davidon algorithm which is specified by algorithm 3.4.1 and by the Davidon recursion formula 3.4.1 with H0 = I. So that there is no confusion as to how the program operates we will specify the conditions under which a reset can occur and the linear search subalgorithm. The following conditions at stage k cause Hk+1 to be taken as the identity.
1. The inner product of the linear search direction -Hkgk with the gradient gk is positive or zero.
2. The ratio ||Hkgk|| ||gk||^{-1} is not greater than ε, where ε is a user supplied constant.
3. The function value does not decrease by more than ε in one iteration, where ε is the same user supplied constant as in step 2.
4. The argument of a square root operation involved in the cubic interpolation of the linear search is negative. This occurs if the cubic interpolation polynomial (see 5.2.1 below) has no minimum on the interpolation interval. Because of the procedure for choosing the interpolation interval this should not happen.
5. The product (uk'vk)(uk'Hkuk) = 0, where uk = gk+1 - gk and vk = xk+1 - xk. If this occurs then the Davidon recursion formula involves division by zero.
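For reference, the Davidon recursion that generates {Hk} is usually written in the standard Fletcher-Powell-Davidon form below. The report's own statement is formula 3.4.1 in Chapter 3; the fragment here is an illustrative Python/NumPy rendering of the textbook form and assumes (without verifying against the SSP FORTRAN) that the package uses it, apart from the reset logic above.

    import numpy as np

    def dfp_update(H, v, u):
        """Fletcher-Powell-Davidon update of the inverse-Hessian estimate H
        from the step v = x_{k+1} - x_k and gradient change u = g_{k+1} - g_k."""
        Hu = H @ u
        return H + np.outer(v, v) / (v @ u) - np.outer(Hu, Hu) / (u @ Hu)

The two denominators v'u and u'Hu are exactly the quantities whose vanishing product triggers reset condition 5.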

In practice the reset was encountered very infrequently. If a reset occurred during a Davidon run, the iteration at which it occurred will be noted in the data.

The linear search subalgorithm of the Davidon algorithm performs an approximate linear minimization. Assuming that at stage k an initial point xk and a search direction pk are given, the linear search algorithm is as follows.

Algorithm 5.2.1
1. If 0 < 2(f(xk) - f*)(gk'Hkgk)^{-1} < 1, where f* is a user supplied estimate of the minimum function value, choose as an initial step xk + λpk = xk + 2(f(xk) - f*)(gk'Hkgk)^{-1}pk. Otherwise choose λ = 1.
2. Compute f(xk + λpk) and g(xk + λpk). If pk'g(xk + λpk) > 0 or if f(xk + λpk) > f(xk), set x = xk, y = xk + λpk and continue at step 3. If pk'g(xk + λpk) = 0, set xk+1 = xk + λpk and exit from the linear search. If pk'g(xk + λpk) < 0, set λ = 2λ and repeat step 2.
3. Interpolate cubically on the line segment (x,y) using the directional derivatives in the direction pk, d(x) ≡ pk'g(x) ||pk||^{-1} and d(y) ≡ pk'g(y) ||pk||^{-1}, and compute the point x̃ which minimizes the interpolation polynomial on (x,y). If f(x̃) ≤ f(x) and f(x̃) ≤ f(y), set xk+1 = x̃ and exit from the linear search.
4. If d(x̃) > 0, or if both d(x̃) < 0 and f(x̃) > f(x), set y = x̃ and repeat step 3 unless d(y) = d(x̃) and f(y) = f(x̃). In this last case set xk+1 = x̃ and exit from the linear search.
5. If d(x̃) < 0 and f(x̃) ≤ f(x), set x = x̃ and repeat step 3

unless d(x̃) = d(x) and f(x̃) = f(x). In this last case set xk+1 = x̃ and exit from the linear search.

This particular linear search algorithm has certain disadvantages. First, in step 1 of the algorithm the initial step size is determined from a user supplied estimate of the minimum function value. A good estimate of the minimum function value may not be available. Second, step 3 of the algorithm determines the minimum of a cubic polynomial which agrees with f and its directional derivative at the points x and y. A quadratic interpolation using three function evaluations, or the directional derivative at xk and two function evaluations, would greatly reduce the total number of gradient evaluations and still provide a good estimate of the minimum. Third, in step 3 a return from the linear search is generated whenever the function value at the interpolation point is no greater than the function value at either end point. This linear search procedure, therefore, cannot guarantee that the decrease in function value at each stage is bounded greater than an appropriate forcing function. In fact, the deletion procedure in steps 4 and 5 is such that it is possible for the function value to increase, as Figure 5.1 demonstrates.

FIGURE 5.1  ABNORMAL CONDITIONS IN LINEAR SEARCH SUBALGORITHM
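The cubic interpolation of step 3 is commonly carried out with the classical Fletcher-Powell formulas. The sketch below (Python; illustrative only — the implementation actually used is the FORTRAN of Appendix II, which may differ in detail) shows one standard form of the computation and makes explicit the square-root argument whose negativity triggers reset condition 4 of section 5.2.

    import numpy as np

    def cubic_step(a, b, fa, fb, da, db):
        """Minimizer on [a, b] of the cubic matching the function values
        (fa, fb) and directional derivatives (da, db) at the end points."""
        z = 3.0 * (fa - fb) / (b - a) + da + db
        disc = z * z - da * db        # negative disc: no interior minimum (reset case)
        if disc < 0.0:
            raise ValueError("cubic has no minimum on the interval")
        w = np.sqrt(disc)
        return b - (b - a) * (db + w - z) / (db - da + 2.0 * w)

    # Example: f(t) = (t - 1)^2 bracketed on [0, 2] gives the exact minimizer 1.0.
    print(cubic_step(0.0, 2.0, 1.0, 1.0, -2.0, 2.0))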

The curves of Figure 5.1 represent the function along the direction of the linear search. In all three cases shown the point xk+1 will be chosen to be x̃ since f(y) = f(x̃) and d(y) = d(x̃). Although these conditions are exceptional, they can occur. Of these disadvantages the one of most concern is that the linear search subalgorithm does not guarantee a "sufficient decrease" in the function value at each stage.

A complete listing of the Scientific Subroutine Package Davidon program is given in Appendix II. Some statements have been added to display pertinent data at each stage of the algorithm. These, however, do not alter the operation of the algorithm in any way. Several tests for abnormal conditions in the algorithm cause program execution to terminate. None of these conditions were encountered in practice.

5.3 The New Algorithms

The new minimization algorithms which were programmed differ in three respects from the structure proposed in general algorithm 4.3.2. First, the linear search procedure 5.2.1 was used in both new algorithms without modification. The primary objective of the numerical study was to compare the major iteration algorithms rather than the linear search subalgorithm. For such a comparison to be valid the linear search procedure should be the same in all algorithms. Since the Davidon algorithm was chosen as the standard, its linear search procedure was used. As was noted earlier, the bounds imposed by the forcing functions σ1(t) and σ2(t) in algorithm 4.3.2 will not necessarily be satisfied when linear search algorithm 5.2.1 is employed. Second, algorithm 5.2.1 does not choose a unit step size (λk = 1) when this step size in the direction pk* of algorithm 4.3.2 yields a sufficient function decrease. In practice these deviations from the structure of algorithm 4.3.2 did not cause

difficulty. Algorithm 5.2.1 performed adequately in all cases tested except when the Hessian matrix was very badly conditioned. Further, it was noted that for large k the step size approached unity when the search direction was pk*.

The third respect in which the new algorithms differ from algorithm 4.3.2 involves Stage III of algorithm 4.3.2. As was noted in section 4.5, it may happen that for all k greater than some k* a pair of matrices (Uk,Vk) having n columns can be formed in Stage II of algorithm 4.3.2 which satisfy both the age bound and the angle condition of Stage II. If this occurs then Stage III of the algorithm is unnecessary. It was conjectured that this would be the usual rather than the exceptional situation. Consequently, Stage III of algorithm 4.3.2 was not implemented in either of the new algorithms. The experimental results verified that this was indeed the case. It was further conjectured that the age bound on the columns of Uk and Vk could be implicitly satisfied by a proper choice of the constants α and β. Therefore, initially no explicit age bound was placed on the columns of Uk and Vk. This procedure will be seen to work well except in the case where the Hessian matrix is very badly conditioned.

Except for the deviations just discussed, the two new algorithms are contained within the framework of general algorithm 4.3.2. Assuming an initial point x0 ∈ R^n is given and that k is initialized to zero, the algorithms can be stated as follows.

Algorithm 5.3.1
1. If gk = g(xk) = 0 stop.
2. If Uk and Vk are not defined, set pk = gk and go to step 6.

3. If p̄k = (I - UkUk+)gk ≠ 0 and p̄k'gk (||p̄k||_2 ||gk||_2)^{-1} > β, set pk = p̄k and go to step 6.
4. If pk* = VkUk+gk ≠ 0 and pk*'gk (||pk*||_2 ||gk||_2)^{-1} > β, set pk = pk* and go to step 6.
5. If Uk and Vk have one column, make Uk and Vk undefined and return to step 2. Otherwise delete the first (oldest) column of Uk and Vk and return to step 2.
6. Set xk+1 = xk - λkpk where λk is chosen by the steplength algorithm 5.2.1.
7. Set vk = xk+1 - xk, uk = gk+1 - gk and i = 1. If ||(I - UkUk+)uk||_2 > α||uk||_2, set Uk+1 = [Uk,uk], Vk+1 = [Vk,vk], k = k+1 and return to step 1.
8. If ||(I - ŪkŪk+)uk||_2 > α||uk||_2, where Ūk and V̄k are the matrices Uk and Vk with the i th column deleted, set Uk+1 = [Ūk,uk], Vk+1 = [V̄k,vk], k = k+1 and return to step 1.
9. If i < m, where m is the number of columns in Uk, set i = i+1 and return to step 8.
10. If i = m, set Uk+1 = Uk, Vk+1 = Vk, k = k+1 and return to step 1.

Algorithm 5.3.2 Algorithm 5.3.2 is identical to algorithm 5.3.1 except that in step 5 pk is always chosen equal to gk and the algorithm continues at step 6. In steps 7 through 10 the vectors uk and vk are adjoined to the matrices Uk and Vk to form Uk+1 and Vk+1 provided uk is bounded away from R(Uk). Failing this, uk and vk are added to Uk and Vk to form Uk+1 and Vk+1 if deleting the i th column of Uk and Vk causes uk to be bounded away from R(Ūk). An attempt is made

to delete the oldest data first. If this also fails, Uk+1 and Vk+1 are taken as Uk and Vk.

Algorithms 5.3.1 and 5.3.2 contain two unspecified constants α and β. Remark 4.4.13 indicates that the choice of these constants depends on the conditioning of the Hessian matrix at the minimum. In particular, to insure that ultimately pk either equals p̄k or pk*, β should be chosen less than

    [ min    u'H^{-1}(x̄)u (||u||_2 ||H^{-1}(x̄)u||_2)^{-1} ] [ 1 + max    ||H^{-1}(x̄)||_2 ||u||_2 (||H^{-1}(x̄)u||_2)^{-1} ]^{-1}      (5.3.1)
      u ∈ R^n                                                     u ∈ R^n

and to insure that ultimately when pk = pk* the condition of step 7 will be satisfied, α should be chosen less than

    min    v'H(x̄)v (||v||_2 ||H(x̄)v||_2)^{-1}.                       (5.3.2)
    v ∈ R^n

Since these conditions cannot be checked beforehand, the reasonable thing to do is to choose α and β small enough so that conditions 5.3.1 and 5.3.2 are satisfied unless the objective function is very poorly conditioned at the minimum. If the constants α and β are chosen too small, numerical difficulties may arise either because the search direction pk is nearly orthogonal to the gradient or because the norm of Uk+ becomes large. On some functions it may be necessary to adjust the constants α and β during the runs. In the numerical experiments particular attention was paid to the effect of α and β on convergence.

An intuitive justification can be given for the choice of search directions in steps 3, 4, and 5. In step 3, if p̄k, the projection of gk onto R⊥(Uk), is bounded away from being orthogonal to the gradient, then p̄k is chosen as a search direction. On a quadratic surface with

Hessian H the direction p̄k is H conjugate to the directions comprising the matrix Vk since

    p̄k'Uk = p̄k'HVk = 0.                                              (5.3.3)

Step 4 is entered if gk is essentially in R(Uk); i.e., UkUk+gk ≈ gk. In this case pk* = VkUk+gk should be a good "Newton-like" step, for if the function is quadratic

    xk+1 = xk - VkUk+gk = xk - H^{-1}UkUk+gk = xk - H^{-1}gk.

If both steps 3 and 4 fail, step 5 chooses some other downhill direction, as in this case the function is not "locally quadratic". In algorithm 5.3.1 the oldest data, the first columns of Uk and Vk, are deleted and steps 3 and 4 are repeated in the expectation that without the oldest data either a p̄k or a pk* step will satisfy the direction requirement. At most, step 5 of algorithm 5.3.1 reduces the matrices Uk and Vk until a gradient step is taken. In algorithm 5.3.2, step 5 simply forces a gradient step.

Computation of the matrix Uk+1+ from Uk+ and uk is performed by straightforward application of theorems 2.4.1 and 2.4.2. Notice that the columns of Uk are linearly independent for all k. Therefore, the pertinent formulas in theorems 2.4.1 and 2.4.2 are always the formulas for linearly independent vectors. Appendix III contains a listing of the program for algorithm 5.3.1. A simple modification to this program gives algorithm 5.3.2. The modification is not shown.
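The "Newton-like" character of the pk* step on a quadratic can be checked numerically. The fragment below (Python/NumPy, purely illustrative and not part of the FORTRAN programs of the appendices; the random test data and variable names are assumptions) builds U and V from secant pairs of a quadratic with Hessian H and verifies that xk - VkUk+gk is the exact minimizer once U has n columns.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 4
    A = rng.standard_normal((n, n))
    H = A @ A.T + n * np.eye(n)                 # positive definite Hessian
    b = rng.standard_normal(n)                  # f(x) = 0.5 x'Hx - b'x, g(x) = Hx - b

    xs = [rng.standard_normal(n) for _ in range(n + 1)]
    V = np.column_stack([xs[i + 1] - xs[i] for i in range(n)])   # columns vk = dx
    U = H @ V                                                    # columns uk = dg = H dx

    x = xs[-1]
    g = H @ x - b
    p_star = V @ (np.linalg.pinv(U) @ g)        # the step-4 direction
    print(np.allclose(x - p_star, np.linalg.solve(H, b)))        # True: the Newton point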

5.4 Rosenbrock's Banana Valley

The algorithms were tested first on the two dimensional Rosenbrock function, which is given by

    f(x) = 100(x2 - x1²)² + (1 - x1)²                                 (5.4.1)

where x1 and x2 are the components of x (a short illustrative sketch of this function and its gradient is given following Figure 5.3). The starting points used were (-1,-1) and (1,-1). Figure 5.2 is a plot of Log10|f(x)| versus iteration number for Davidon's algorithm. The single reset which occurred on the third iteration of the run starting at (1,-1) did not significantly affect the performance of the algorithm. Starting at (-1,-1) Davidon reduced the function value to about 10^{-20} in 16 iterations and 65 function (gradient) evaluations. Note that because of the linear search procedure a gradient evaluation is performed with each function evaluation. From (1,-1) the Davidon algorithm reduced the function value to about 10^{-20} in 17 iterations and 50 function evaluations.

Algorithm 5.3.2 was run on Rosenbrock's function from the same starting points with α = 10^{-4} and β = 10^{-4}. Figure 5.3 is a plot of Log10|f(x)| versus iteration number for these runs. From both starting points the age of the oldest column of Uk and Vk reached two on the second iteration and remained at two for the entire run. By the age of the oldest column we mean the current iteration number minus the iteration number at which the first column was added to Uk or Vk. This indicates that at each iteration after the second, algorithm 5.3.2 was able to form Uk+1 by deleting the oldest column of Uk and appending the vector uk while still satisfying the conditions of step 8 of algorithm 5.3.2. The procedure for generating the matrices Uk and Vk in algorithm 5.3.2

FIGURE 5.2  Log10|f(x)| vs. iteration number for Davidon's algorithm applied to Rosenbrock's function, starting from a) (-1,-1), b) (1,-1); * denotes a reset.

FIGURE 5.3  Log10|f(x)| vs. iteration number for algorithm 5.3.2 applied to Rosenbrock's function with α = β = 10^{-4}, starting from a) (-1,-1), b) (1,-1).
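As referenced in section 5.4, a minimal sketch of Rosenbrock's function (5.4.1) and its gradient (Python/NumPy; illustrative only — the experiments themselves were run from the FORTRAN programs of the appendices):

    import numpy as np

    def rosenbrock(x):
        """Rosenbrock's banana-valley function (5.4.1) and its gradient."""
        x1, x2 = x
        f = 100.0 * (x2 - x1**2)**2 + (1.0 - x1)**2
        g = np.array([-400.0 * x1 * (x2 - x1**2) - 2.0 * (1.0 - x1),
                      200.0 * (x2 - x1**2)])
        return f, g

    print(rosenbrock(np.array([-1.0, -1.0]))[0])   # f at the starting point (-1,-1) is 404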

is such that the rank of Uk cannot decrease. For these two runs Uk became nonsingular on the second iteration and remained nonsingular. Starting at (-1,-1) algorithm 5.3.2 reduced the function value to about 10^{-20} in 22 iterations and 75 function evaluations. From (1,-1) algorithm 5.3.2 reduced the function value to about 10^{-20} in 24 iterations and 80 function evaluations. In terms of both function evaluations and total number of iterations, algorithm 5.3.2 was slightly inferior to Davidon on the Rosenbrock function from these two starting points. However, the superlinear terminal rate of convergence of algorithm 5.3.2 is verified.

In applying algorithm 5.3.1 to Rosenbrock's function, runs were made with several values of α and β. Again the starting points (-1,-1) and (1,-1) were used. Figures 5.4 and 5.5 present Log10|f(x)| versus iteration number. Figures 5.6 and 5.7 depict the age of the oldest column of Uk and Vk as a function of iteration number for the same runs. From either starting point with α = β = 10^{-4}, the age of the oldest column was two after the second iteration. Also, with α = β = 10^{-4}, after the second iteration Uk was nonsingular and the pk* step was satisfactory at each stage. On Rosenbrock's function, then, the sequence of matrices VkUk+ generated by algorithms 5.3.1 and 5.3.2 with α = β = 10^{-4} is the same as the sequence which would be generated by the sequential secant recursion formula if it were applied to the function minimization problem. In the particular case of Rosenbrock's function no difficulty was encountered with this formula and the algorithm converged superlinearly for reasonably small α and β. On more difficult functions, difficulties will arise because of the problems

FIGURE 5.4  Log10|f(x)| vs. iteration number for algorithm 5.3.1 applied to Rosenbrock's function starting from (-1,-1), with a) α = β = 10^{-4}, b) α = β = 10^{-3}, c) α = β = 10^{-2}, d) α = β = 10^{-5}.

FIGURE 5.5  Log10|f(x)| vs. iteration number for algorithm 5.3.1 applied to Rosenbrock's function starting from (1,-1), with a) α = β = 10^{-4}, b) α = β = 10^{-3}, c) α = β = 10^{-2}, d) α = β = 10^{-5}.

FIGURE 5.6  Age of the oldest column of Uk and Vk vs. iteration number for algorithm 5.3.1 applied to Rosenbrock's function starting from (-1,-1), with a) α = β = 10^{-4}, b) α = β = 10^{-3}, c) α = β = 10^{-2}, d) α = β = 10^{-5}. (Each increment on the vertical axis denotes one iteration of age; on each run the age of the oldest column at iteration zero is zero.)

FIGURE 5.7  Age of the oldest column of Uk and Vk vs. iteration number for algorithm 5.3.1 applied to Rosenbrock's function starting from (1,-1), with a) α = β = 10^{-4}, b) α = β = 10^{-3}, c) α = β = 10^{-2}, d) α = β = 10^{-5}. (See note on Figure 5.6.)

discussed in Chapter 3 with the sequential secant algorithm.

One might expect to find a correlation between the coefficients α and β and the performance of algorithms 5.3.1 and 5.3.2. Figures 5.4 and 5.5 do not demonstrate such a correlation. Starting at (-1,-1), quite different choices of α = β yield convergence in a comparable number of iterations. Figures 5.6 and 5.7 indicate better correlation between performance and the age of the oldest column of Uk and Vk. In all cases the performance of algorithm 5.3.1 was inferior to that of Davidon on the Rosenbrock function using the starting points (-1,-1) and (1,-1). However, when α and β are chosen small, the difference in performance is not significant. In all cases the superlinear rate of convergence of algorithms 5.3.1 and 5.3.2 is verified.

5.5 Wood's Function

The second test function investigated was Wood's function of four variables, which is given by

    f(x) = 100(x2 - x1²)² + (1 - x1)² + 90(x4 - x3²)²                 (5.5.1)
           + (1 - x3)² + 10.1(x2 - 1)² + 10.1(x4 - 1)²
           + 19.9(x2 - 1)(x4 - 1)

(an illustrative sketch of this function and its gradient is given following Figure 5.8). The first runs made were with algorithm 5.3.2 starting at (-3,-1,-3,-1) and (-3,0,-3,-1). It was observed that, from both starting points with various values of α and β, algorithm 5.3.2 had a strong tendency to take repeated gradient steps. In all runs the matrix Uk became nonsingular on the 4th or 5th iteration. Following that, the direction pk* was uphill, which caused the gradient direction to be chosen as the search direction. As there is no procedure for reducing the rank of Uk in

algorithm 5.3.2, all steps past the fifth iteration were either pk* steps or gradient steps. Unfortunately, choosing a gradient step and updating the matrices Uk and Vk did not tend to make the pk* direction downhill. As a result the algorithm degenerated to a gradient search procedure. Although the theory of Chapter 4 predicts that (once in a small neighborhood of the minimum) the direction pk* will be downhill, convergence to such a neighborhood is extremely slow in algorithm 5.3.2 if the algorithm initially performs a large number of gradient direction searches. Because of this deficiency, which was quite clearly brought out on Wood's function, algorithm 5.3.2 was not investigated further. Instead, attention was restricted to algorithm 5.3.1.

In applying algorithm 5.3.1 to Wood's function a further effort was made to determine the relationship between the performance of the algorithm and the coefficients α and β. Starting from the point (-3,0,-3,-1) two runs were made using algorithm 5.3.1. Coefficients α and β were chosen small in one run and relatively large in the other. In Figure 5.8, Log10|f(x)| is plotted versus iteration number for these two runs and for the Davidon run. Notice that in this case the larger choice of α and β caused algorithm 5.3.1 to converge poorly. From the starting point (-3,-1,-3,-1) a large number of runs were made with values of α and β varying between 10^{-4} and 10^{-1}. Figures 5.9, 5.10, and 5.11 display Log10|f(x)| versus iteration number for α = 10^{-1}, 10^{-2}, and 10^{-4} respectively, with several values of β in each case. The corresponding Davidon run is shown in Figure 5.11. In Figures 5.12, 5.13, and 5.14 the rank of Uk and the age of the oldest column of Uk and Vk are plotted as functions of iteration number for the same family of runs. The number of iterations required for convergence generally seemed

FIGURE 5.8  Log10|f(x)| vs. iteration number for a) Davidon's algorithm applied to Wood's function starting from (-3,0,-3,-1); b) algorithm 5.3.1 applied to Wood's function starting from (-3,0,-3,-1) with α = 10^{-4} and β = 10^{-3}; c) same as b) except with relatively large α and β.
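As referenced in section 5.5, a minimal sketch of Wood's function (5.5.1) and its gradient, with the coefficients as printed there (Python/NumPy; illustrative only, not the FORTRAN used for the runs):

    import numpy as np

    def wood(x):
        """Wood's function (5.5.1) and its gradient."""
        x1, x2, x3, x4 = x
        f = (100.0 * (x2 - x1**2)**2 + (1.0 - x1)**2
             + 90.0 * (x4 - x3**2)**2 + (1.0 - x3)**2
             + 10.1 * (x2 - 1.0)**2 + 10.1 * (x4 - 1.0)**2
             + 19.9 * (x2 - 1.0) * (x4 - 1.0))
        g = np.array([
            -400.0 * x1 * (x2 - x1**2) - 2.0 * (1.0 - x1),
            200.0 * (x2 - x1**2) + 20.2 * (x2 - 1.0) + 19.9 * (x4 - 1.0),
            -360.0 * x3 * (x4 - x3**2) - 2.0 * (1.0 - x3),
            180.0 * (x4 - x3**2) + 20.2 * (x4 - 1.0) + 19.9 * (x2 - 1.0)])
        return f, g

    print(wood(np.ones(4)))   # the minimum: f = 0 and g = 0 at (1,1,1,1)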

FIGURE 5.9  Log10|f(x)| vs. iteration number for algorithm 5.3.1 applied to Wood's function starting at (-3,-1,-3,-1) with α = 10^{-1} and several values of β.

FIGURE 5.10  Log10|f(x)| vs. iteration number for algorithm 5.3.1 applied to Wood's function starting from (-3,-1,-3,-1) with α = 10^{-2} and several values of β.

FIGURE 5.11  Log10|f(x)| vs. iteration number for algorithm 5.3.1 applied to Wood's function starting from (-1,-3,-1,-3) with α = 10^{-4} and several values of β, together with d) Davidon's algorithm applied to Wood's function starting from (-1,-3,-1,-3).

FIGURE 5.12  Age of the oldest column of Uk and Vk and rank of Uk vs. iteration number for algorithm 5.3.1 applied to Wood's function starting at (-1,-3,-1,-3) with α = 10^{-1} and with a) β = 10^{-4}, b) β = 10^{-3}, c) β = .4 x 10^{-2}, d) β = 10^{-1}. (Continuous curves are age and discontinuous curves are rank; each vertical increment indicates one iteration of age or one rank change; age and rank are zero at iteration zero.)

FIGURE 5.13  Age of the oldest column of Uk and Vk and rank of Uk vs. iteration number for algorithm 5.3.1 applied to Wood's function starting at (-1,-3,-1,-3) with α = 10^{-2} and several values of β. (See note on Figure 5.12.)

FIGURE 5.14  Age of the oldest column of Uk and Vk and rank of Uk vs. iteration number for algorithm 5.3.1 applied to Wood's function starting at (-1,-3,-1,-3) with α = 10^{-4} and several values of β. (See note on Figure 5.12.)

to decrease with decreasing α and β. There are, however, some notable exceptions. The run in Figure 5.9 with α = 10^{-1} and β = .4 x 10^{-2}, for instance, converged in the fewest number of iterations, yet α = 10^{-1} was the largest value of α tested. Similarly, the relationship between the age of the oldest column of Uk and Vk and performance of the algorithm is not entirely consistent. Although the average age of the oldest column does decrease with decreasing α, in some cases where good performance was noted very old data was retained. Also, in some runs where the algorithm performed poorly there was no accumulation of old data. Because the constant β determines when the rank of Uk is to be decreased, one would expect that the rank of Uk would tend to increase on the average as β was chosen smaller. This does seem to be the case, although the relationship is not very pronounced.

In many applications the amount of computation required to evaluate the function and its gradient is large when compared with the computations which occur in the algorithm. If this is the case then a more valid measure of performance of the algorithm is the number of function (gradient) evaluations required to reach the minimum. In Figures 5.15 and 5.16, Log10|f(x)| is plotted versus the number of function (gradient) evaluations starting from (-3,0,-3,-1) and (-3,-1,-3,-1), respectively, for algorithm 5.3.1 and for Davidon. All the curves correspond to runs presented earlier in Figures 5.9, 5.10, and 5.11. In all cases except one, algorithm 5.3.1 performs better than Davidon using this measure of performance. In particular, notice that from both starting points for small α and β (α = 10^{-4}, β = 10^{-3} and 10^{-4}), algorithm 5.3.1 requires fewer function (gradient) evaluations than Davidon. In these cases it appears as though algorithm 5.3.1 provides a slightly better

FIGURE 5.15  Log10|f(x)| vs. number of function (gradient) evaluations for a) Davidon applied to Wood's function starting at (-3,0,-3,-1); b) algorithm 5.3.1 applied to Wood's function starting at (-3,0,-3,-1) with α = 10^{-4}, β = 10^{-3}; c) algorithm 5.3.1 applied to Wood's function starting at (-3,0,-3,-1) with α = 10^{-2}, β = 10^{-2}.

FIGURE 5.16  Log10|f(x)| vs. number of function (gradient) evaluations for a) Davidon's algorithm applied to Wood's function starting at (-3,-1,-3,-1); b) algorithm 5.3.1 applied to Wood's function starting at (-3,-1,-3,-1) with α = 10^{-4}, β = .4 x 10^{-4}; c) algorithm 5.3.1 with α = 10^{-2}, β = 10^{-4}; d) algorithm 5.3.1 with α = 10^{-1}, β = 10^{-4}.

estimate of the distance to the minimum along the search direction, so that the linear search subalgorithm makes fewer function (gradient) evaluations. This may be due to the fact that algorithm 4.3.2 does not depend upon an exact linear minimization for its theoretical properties.

5.6 Powell's Function

The four dimensional function given by

    f(x) = (x1 + 10x2)² + 5(x3 - x4)² + (x2 - 2x3)⁴ + 10(x1 - x4)⁴    (5.6.1)

is particularly interesting because the Hessian matrix at the unique minimum of f(x) is singular (an illustrative sketch of this function and its gradient is given following Figure 5.17). The results of Chapter 4 do not apply to a function with a singular Hessian at the minimum. Further, no theoretical results have been obtained for the Davidon algorithm in this case. Functions with singular or ill-conditioned Hessians do arise in applications.

Runs were made from four different starting points with Davidon's algorithm and algorithm 5.3.1. In Figures 5.17 and 5.18, Log10|f(x)| is plotted as a function of iteration number for Davidon's algorithm and algorithm 5.3.1, respectively. Figure 5.19 is a plot of the rank of Uk and the age of the oldest column of Uk and Vk as a function of iteration number for the four runs of Figure 5.18. In these four runs of algorithm 5.3.1, α = β = 10^{-4}. In all four cases Davidon's algorithm reduced the function value to 10^{-20} within 50 iterations. It is interesting that the rate of convergence seems to be linear in the Davidon runs rather than superlinear. This suggests that it may be possible to obtain some convergence and rate of convergence results for Davidon's algorithm when the Hessian is singular at the minimum.

With α = β = 10^{-4}, algorithm 5.3.1 behaved much like Davidon's

FIGURE 5.17  Log10|f(x)| vs. iteration number for Davidon's algorithm applied to Powell's function starting from a) (-3,-1,0,1), b) (1,1,1,1), c) (-3,0,-3,-1), d) (-3,-1,-3,-1).
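As referenced in section 5.6, a minimal sketch of Powell's function (5.6.1) and its gradient (Python/NumPy; illustrative only). The Hessian at the minimizer, the origin, is singular, which is the property of interest here.

    import numpy as np

    def powell(x):
        """Powell's singular function (5.6.1) and its gradient."""
        x1, x2, x3, x4 = x
        f = ((x1 + 10.0 * x2)**2 + 5.0 * (x3 - x4)**2
             + (x2 - 2.0 * x3)**4 + 10.0 * (x1 - x4)**4)
        g = np.array([
            2.0 * (x1 + 10.0 * x2) + 40.0 * (x1 - x4)**3,
            20.0 * (x1 + 10.0 * x2) + 4.0 * (x2 - 2.0 * x3)**3,
            10.0 * (x3 - x4) - 8.0 * (x2 - 2.0 * x3)**3,
            -10.0 * (x3 - x4) - 40.0 * (x1 - x4)**3])
        return f, g

    print(powell(np.zeros(4)))   # the unique minimum: f = 0 and g = 0 at the origin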

FIGURE 5.18  Log10|f(x)| vs. iteration number for algorithm 5.3.1 applied to Powell's function with α = β = 10^{-4}, starting from a) (-3,-1,0,1), b) (1,1,1,1), c) (-3,0,-3,-1), d) (-3,-1,-3,-1).

FIGURE 5.19  Age of the oldest column of Uk and Vk and rank of Uk vs. iteration number for algorithm 5.3.1 applied to Powell's function with α = β = 10^{-4}, starting from a) (-3,-1,0,1), b) (1,1,1,1), c) (-3,0,-3,-1), d) (-3,-1,-3,-1). (See note on Figure 5.12.)

algorithm for the first 20 iterations. After 20 iterations algorithm 5.3.1 ran into difficulty. In three of the four runs it became impossible to delete any single column of the matrices Uk and Vk and to add the new columns (uk,vk) while satisfying the angle condition of step 8 of the algorithm. As a result the age of the columns of Uk and Vk grew quite large. In the one case where good convergence was obtained the age of the columns did not become large. However, in the final iterations the rank of Uk remained at two.

In an attempt to improve this performance the coefficients α and β were reduced to 10^{-8} and the runs were repeated. Figure 5.20 depicts Log10|f(x)| versus iteration number for these runs. In Figure 5.21 the age of the oldest column and the rank of Uk are plotted versus iteration number for the same runs. Although the tendency of the algorithm to become stuck is somewhat less for these runs, in each case the age of all columns of Uk and Vk grew without bound. Again this occurred because it was impossible to satisfy the angle condition by deleting a single column of Uk and Vk.

Because of these runs it was decided to impose an explicit age bound on the columns of Uk and Vk. Figures 5.22 and 5.23 depict the runs made from the same starting points with α = β = 10^{-8} and with the age of the oldest column of Uk and Vk constrained to be less than 2n (an illustrative sketch of such an age bound is given following Figure 5.20). This modification improved the performance of algorithm 5.3.1 on Powell's function considerably. From two of the starting points convergence is obtained in considerably fewer iterations than with Davidon's algorithm. From the other two, algorithm 5.3.1 requires a few more iterations to converge.

The linear search algorithm 5.2.1 did not perform well on Powell's

FIGURE 5.20  Log10|f(x)| vs. iteration number for algorithm 5.3.1 applied to Powell's function with α = β = 10^{-8}, starting from a) (-3,-1,0,1), b) (1,1,1,1), c) (-3,0,-3,-1), d) (-3,-1,-3,-1).
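The explicit age bound referred to above can be expressed very simply. The fragment below (Python/NumPy; an illustrative sketch, not the modification actually made to the FORTRAN program) drops any column pair of (Uk, Vk) whose age exceeds a bound such as 2n before new data are appended.

    import numpy as np

    def enforce_age_bound(U, V, ages, k, max_age):
        """Keep only the column pairs of (U, V) added within the last max_age
        iterations; ages[i] is the iteration at which column i was added and
        k is the current iteration number."""
        ages = np.asarray(ages)
        keep = (k - ages) < max_age
        return U[:, keep], V[:, keep], ages[keep]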

FIGURE 5.21  Age of the oldest column of Uk and Vk and rank of Uk vs. iteration number for algorithm 5.3.1 applied to Powell's function with α = β = 10^{-8}, starting from a) (-3,-1,0,1), b) (1,1,1,1), c) (-3,0,-3,-1), d) (-3,-1,-3,-1). (See note on Figure 5.12.)

FIGURE 5.22  Log10|f(x)| vs. iteration number for algorithm 5.3.1 applied to Powell's function with α = β = 10^{-8} and with the age of the oldest column of Uk and Vk bounded less than 2n = 8, starting from a) (-3,-1,0,1), b) (1,1,1,1), c) (-3,0,-3,-1), d) (-3,-1,-3,-1).

FIGURE 5.23  Age of the oldest column of Uk and Vk and rank of Uk vs. iteration number for algorithm 5.3.1 applied to Powell's function with α = β = 10^{-8} and with the age of the oldest column of Uk and Vk bounded less than 2n = 8, starting from a) (-3,-1,0,1), b) (1,1,1,1), c) (-3,0,-3,-1), d) (-3,-1,-3,-1). (See note on Figure 5.12.)

function either in Davidon's algorithm or in algorithm 5.3.1. Although this subalgorithm did not become stuck, it did require a large number of function (gradient) evaluations to obtain a function decrease. A different linear search procedure would improve the performance of both algorithms when the Hessian is very poorly conditioned. The performances of algorithm 5.3.1 and Davidon's algorithm were comparable in terms of function (gradient) evaluations on Powell's function using search algorithm 5.2.1.

5.7 A Six Dimensional Function

The final test function was a six dimensional function constructed from Wood's four dimensional function and the two dimensional Rosenbrock function. This function is given by

    f(x) = (g²(x) + c h²(x))^{1/2}                                    (5.7.1)

where

    g(x) = 100(x2 - x1²)² + (1 - x1)²                                 (5.7.2)

and

    h(x) = 100(x4 - x3²)² + (1 - x3)² + 90(x6 - x5²)²                 (5.7.3)
           + (1 - x5)² + 10.1(x4 - 1)² + 10.1(x6 - 1)²
           + 19.9(x4 - 1)(x6 - 1).

The value of c was chosen so that at the point (1,-1,-3,-1,-3,-1), g²(x) = c h²(x) (an illustrative sketch of this function is given following Figure 5.24). Algorithm 5.3.1 and the Davidon algorithm were run from the three starting points (0,0,0,0,0,0), (.9,.9,.9,.9,.9,.9) and (1,-1,-3,-1,-3,-1). In all three runs of algorithm 5.3.1, α and β were chosen to be 10^{-4}. Figures 5.24, 5.25, and 5.26 depict Log10|f(x)| versus iteration number for Davidon's algorithm and for algorithm 5.3.1 starting from these three starting points. In one case Davidon's

FIGURE 5.24  Log10|f(x)| vs. iteration number for a) algorithm 5.3.1 applied to the six dimensional function with α = β = 10^{-4}, starting at (1,-1,-3,-1,-3,-1); b) Davidon's algorithm applied to the six dimensional function starting at (1,-1,-3,-1,-3,-1).
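As referenced in section 5.7, a minimal sketch of the six dimensional test function (5.7.1)-(5.7.3) (Python/NumPy; illustrative only). Since c is defined by the balancing condition g²(x̂) = c h²(x̂) at x̂ = (1,-1,-3,-1,-3,-1), the sketch computes it directly from that condition rather than hard-coding a value.

    import numpy as np

    def g2d(x1, x2):
        """The Rosenbrock part g(x) of (5.7.2)."""
        return 100.0 * (x2 - x1**2)**2 + (1.0 - x1)**2

    def h4d(x3, x4, x5, x6):
        """The Wood part h(x) of (5.7.3)."""
        return (100.0 * (x4 - x3**2)**2 + (1.0 - x3)**2
                + 90.0 * (x6 - x5**2)**2 + (1.0 - x5)**2
                + 10.1 * (x4 - 1.0)**2 + 10.1 * (x6 - 1.0)**2
                + 19.9 * (x4 - 1.0) * (x6 - 1.0))

    xhat = np.array([1.0, -1.0, -3.0, -1.0, -3.0, -1.0])
    c = g2d(xhat[0], xhat[1])**2 / h4d(*xhat[2:])**2   # balancing: g^2 = c h^2 at xhat

    def f6(x):
        """The six dimensional test function (5.7.1)."""
        return np.sqrt(g2d(x[0], x[1])**2 + c * h4d(x[2], x[3], x[4], x[5])**2)

    print(f6(np.ones(6)))   # 0 at the common minimizer (1,...,1)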

FIGURE 5.25  Log10|f(x)| vs. iteration number for a) algorithm 5.3.1 applied to the six dimensional function with α = β = 10^{-4}, starting at (0,0,0,0,0,0); b) Davidon's algorithm applied to the six dimensional function starting at (0,0,0,0,0,0).

FIGURE 5.26  Log10|f(x)| vs. iteration number for a) algorithm 5.3.1 applied to the six dimensional function with α = β = 10^{-4}, starting at (.9,.9,.9,.9,.9,.9); b) Davidon's algorithm applied to the six dimensional function starting at (.9,.9,.9,.9,.9,.9).

In one case Davidon's algorithm becomes stuck while algorithm 5.3.1 converges in less than 100 iterations. In two of the three cases algorithm 5.3.1 performs better than Davidon's in terms of iterations required to reduce f(x) to 10^-20. If the measure of performance used is the total number of function (gradient) evaluations, algorithm 5.3.1 performs significantly better than Davidon's on the six dimensional function from all these starting points.

5.8 Summary

The numerical experiments verified the convergence and rate of convergence results obtained in Chapter 4. On the two dimensional Rosenbrock function both new algorithms performed adequately although not quite as well as Davidon's. In testing the new algorithm on Wood's function it was discovered that algorithm 5.3.2 had a strong tendency to take repeated gradient steps and therefore converged slowly. As a result of this problem, algorithm 5.3.2 was not considered further. On Wood's function, algorithm 5.3.1 demonstrated that if the coefficients α and β are chosen sufficiently small, convergence is reliable and rapid. In terms of total number of function (gradient) evaluations algorithm 5.3.1 performed slightly better than Davidon's algorithm on Wood's function. The four dimensional Powell function with a singular Hessian caused both Davidon's algorithm and algorithm 5.3.1 to converge linearly rather than superlinearly. It was found to be necessary to impose an explicit age bound on the columns of Uk to obtain good performance from algorithm 5.3.1. On Powell's function, Davidon's algorithm and algorithm 5.3.1 performed equally well. On the six dimensional function, algorithm 5.3.1 demonstrated that

it was less prone to becoming stuck than Davidon's and further it performed better in nearly all cases. Overall, the numerical experiments do not show a sharp distinction between the performance of the new algorithm and Davidon's. However, for higher dimensional functions there seems to be a tendency for algorithm 5.3.1 to perform better. This tendency may be accentuated on functions with dimension greater than six. To date such functions have not been tested. It is perhaps unfortunate that algorithms 5.3.1 and 5.3.2 were constructed using the linear search algorithm 5.2.1. Since the theoretical results of Chapter 4 do not require an exact linear search and since a unit step size is indicated when the search direction is pk, it seems reasonable to expect that a linear search algorithm tailored to the theoretical results of Chapter 4 would be more successful. The performance of the new algorithms may also be considerably improved by applying other procedures for updating the data sets and for choosing the search direction when both the p̄k and p*k directions are unsatisfactory. The numerical experiments performed so far indicate that the new algorithms are comparable in performance to Davidon's algorithm. With further refinement they can be expected to yield significantly superior performance.

CHAPTER 6

CONCLUDING REMARKS

The material presented in this dissertation can be divided into two major parts. In the first part, Chapters 2 and 3, a large number of established algorithms for root finding and minimization are considered in a general framework. This general framework unifies these algorithms and leads to the statement of a broad class of minimization algorithms which contains the established algorithms. Several modifications to the secant algorithms for root finding are also suggested. The research involving existing algorithms led the author to consider an entirely new class of algorithms for function minimization. The second part of the dissertation, Chapters 4 and 5, deals with this new class of algorithms. Theoretical convergence and rate of convergence properties are proved for the new algorithms and numerical experiments were conducted to verify these results and obtain further information concerning the performance of the algorithms. The material presented in Chapter 2 deals with the problem of determining a linear transformation which approximates the relationship defined by a data set. Of particular significance is the material of section 2.4 concerning the characterization of the approximations to a data set when an approximation to a second data set, having elements in common with the first, is known. These results are important since they simplify the calculation of a sequence of approximations to a sequence of data sets when the data sets are not disjoint. One result in particular, theorem 2.4.5, yields a general formula for an emulation of a data set, when one exists, in terms of an emulation of a second data set when the data sets have common elements. This result is entirely new. It is shown in

Chapter 3 that the recursion formulas used in most of the best known minimization algorithms, along with some new recursion formulas, are special cases of the general formula of theorem 2.4.5. Although the concept of an emulation or approximation of a data set is not fundamentally new, the organization of these ideas in terms of a sequence of data sets, the notion of approximations using different norms and the restriction of the approximations to a certain set of linear transformations as presented in sections 2.1 through 2.3 is original. Chapter 3 relates the material of Chapter 2 to the secant root finding algorithms and to a class of minimization algorithms which includes Davidon's algorithm. In section 3.3 it is suggested that generalized secant algorithms could be investigated using the pseudoinverse of the matrix Uk when the inverse does not exist or when a root of F(x): Rn → Rm, n ≠ m, is sought. In considering the minimization algorithms in section 3.4 attention is restricted to quadratic functions. A class of minimization algorithms is presented in theorem 3.4.7, the members of which generate conjugate directions, form the inverse Hessian matrix and determine the minimum of a quadratic function in n or fewer steps. This class contains established algorithms such as Davidon's algorithm but is more general in four respects than these algorithms. First, new recursion formulas are admitted. Second, search directions other than -Akg are allowed. Third, the conditions on Ak required to obtain convergence in n steps on a quadratic surface are relaxed. Fourth, it is recognized that different recursion formulas may be used at each stage. In section 3.5, new recursion formulas are developed for a sequence of approximations to a sequence of data sets generated by appending a single element to the data set at each stage. Such recursions may be of value in developing new algorithms

for root finding and minimization or in other problems where data sets occur which contain a large number of elements relative to n. In Chapter 4 a new general minimization algorithm is presented. This algorithm, algorithm 4.3.2, is constructed so as to allow latitude in the choice of a linear search direction, a linear search procedure and the elements in the data set at each stage. In the algorithm provision is made for taking auxiliary steps to insure that a data set with the necessary properties can be chosen at each stage. It is shown in theorem 4.4.10 that if f(x): D ⊂ Rn → R1 is continuously differentiable on D and if L(f(x0)) is compact then algorithm 4.3.2 converges to the set of critical points of f(x). It is further shown in theorem 4.4.12 that if algorithm 4.3.4 converges to a critical point x, if f(x) is twice differentiable at x and if the second derivative H(x) is positive definite and strong at x, then the rate of convergence is superlinear. These results are considerably stronger than results obtained by other researchers for established minimization algorithms. For instance, Powell (POW 71) has shown that Davidon's algorithm converges superlinearly if f(x): D ⊂ Rn → R1 is twice differentiable on D and if the eigenvalues of H(x) are all greater than some positive constant for all x ∈ D. The numerical experiments in Chapter 5 verify that specific algorithms can be formulated from algorithm 4.3.2 which do not involve significantly more computation than established algorithms and which perform as well as the established algorithms. The author believes that with further refinement these new algorithms will offer distinct advantages over existing algorithms.

APPENDIX I

DERIVATIVES OF TRACE FUNCTIONS

In this appendix formulas are derived for the derivatives of the trace of a product of matrices. If f(A): Rmn → R1 is a differentiable function of the m x n matrix A, by the notation d/dA f(A) we mean the m x n matrix which has as its ij-th component ∂f(A)/∂a_ij, where a_ij is the element in the i-th row and j-th column of the matrix A. The element in the i-th row and j-th column of A', the transpose of A, is denoted by a'_ij, and the element in the i-th row and j-th column of a product of matrices such as AB'C is denoted by (AB'C)_ij. Summations written with Σ run over all values of the indicated indices. Since the trace is only defined for square matrices it is assumed that all matrix products are defined and that each product appearing inside a trace is square.

1. d/dA trace(BAC) = B'C'

Proof: trace(BAC) = Σ_i Σ_k Σ_l b_ik a_kl c_li, so that ∂/∂a_kl trace(BAC) = Σ_i b_ik c_li = (B'C')_kl. Therefore d/dA trace(BAC) = B'C'.

2. d/dA trace(BA'C) = CB

Proof: trace(BA'C) = Σ_i Σ_k Σ_l b_ik a_lk c_li, so that ∂/∂a_lk trace(BA'C) = Σ_i b_ik c_li = (CB)_lk. Therefore d/dA trace(BA'C) = (B'C')' = CB.

3. d/dA trace(BADAC) = (CBAD + DACB)'

Proof: The matrix A appears twice in the product. Differentiating with respect to a_kl and treating each occurrence separately, ∂/∂a_kl trace(BADAC) = Σ_i (BAD)_ik c_li + Σ_i b_ik (DAC)_li = ((CBAD)')_kl + ((DACB)')_kl. Therefore d/dA trace(BADAC) = (CBAD + DACB)'.

4. d/dA trace(BA'DA'C) = CBA'D + DA'CB

Proof: ∂/∂a_kl trace(BA'DA'C) = Σ_i b_il (DA'C)_ki + Σ_i (BA'D)_il c_ki = (DA'CB)_kl + (CBA'D)_kl. Therefore d/dA trace(BA'DA'C) = CBA'D + DA'CB.

5. d/dA trace(BA'DAC) = (CBA'D)' + DACB

Proof: ∂/∂a_kl trace(BA'DAC) = Σ_i b_il (DAC)_ki + Σ_i (BA'D)_ik c_li = (DACB)_kl + ((CBA'D)')_kl. Therefore d/dA trace(BA'DAC) = (CBA'D)' + DACB.

6. d/dA trace(BADA'C) = CBAD + (DA'CB)'

Proof: ∂/∂a_kl trace(BADA'C) = Σ_i b_ik (DA'C)_li + Σ_i (BAD)_il c_ki = ((DA'CB)')_kl + (CBAD)_kl. Therefore d/dA trace(BADA'C) = CBAD + (DA'CB)'.
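As a quick consistency check, added here for illustration and not part of the original report, identity 1 can be verified numerically by comparing B'C' with a central finite difference approximation of d/dA trace(BAC). The matrix dimensions and random test data below are arbitrary assumptions.

! Minimal sketch: finite difference check of d/dA trace(BAC) = B'C'.
program trace_check
  implicit none
  integer, parameter :: m = 3, n = 4, k = 2
  double precision :: a(m,n), b(k,m), c(n,k), grad(m,n), fd(m,n)
  double precision :: h, fp, fm
  integer :: i, j
  call random_number(a); call random_number(b); call random_number(c)
  grad = matmul(transpose(b), transpose(c))   ! analytic derivative B'C'
  h = 1.0d-6
  do i = 1, m
     do j = 1, n
        a(i,j) = a(i,j) + h;        fp = tr_bac(a, b, c)
        a(i,j) = a(i,j) - 2.0d0*h;  fm = tr_bac(a, b, c)
        a(i,j) = a(i,j) + h                     ! restore the entry
        fd(i,j) = (fp - fm) / (2.0d0*h)         ! central difference
     end do
  end do
  write(*,*) 'max abs error =', maxval(abs(grad - fd))
contains
  double precision function tr_bac(a, b, c)
    double precision, intent(in) :: a(m,n), b(k,m), c(n,k)
    double precision :: p(k,k)
    integer :: i
    p = matmul(b, matmul(a, c))
    tr_bac = 0.0d0
    do i = 1, k
       tr_bac = tr_bac + p(i,i)                 ! accumulate the trace
    end do
  end function tr_bac
end program trace_check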

APPENDIX II FORTRAN PROGRAM DAVIDON'S ALGORITHM PAGE ~0001 C C to~~.~~.e. l.~ * 0........... a.........1S.....e 1e t.g..........,... C C SUBROUTINE OFMFP C C PURPOSE, C TO FIND A LOCAL MINIMUM CF A FUNCTION OF SEVERAL VARIAeLES C BY THE METHOO OF FLETCHER AND POWELL C C USAGE C CALL DFMFP(FUNCT,NX,FGEST,EPS,LIMITIER,H) C DESCRIPTION OF PARAMETERS C FUNCT - USER-WRITTEN SUBROUTINE CONCERNING THE FUNCTION TO C BE MINIMIZED, IT MUST BE OF THE FORM C SUBROUTINE FUNCT(NARG,VAL,GRAD) C ANO MUST SERVE THE FOLLOWING PURPOSE C FOR EACH N-DIMENSIONAL ARGUMENT VECTOR ARG, C FUNCTION VALUE AND GRADIENT VECTOR MUST HE COMPUTED C ANO, ON RETURN, STORED IN VAL ANO GRAD RESPECTIVELY C ARG,VAL AND GRAn MUST BE OF OdUBLE PRECISION. C N w NUMeER OF VAR.IAbLES C X - VECTOR OF 61MENSION N CONTAINING THE INITIAL C ARGUMENT WHE.RE THE ITERATION STARTS, ON RETURN, C X HOLDS THF ARGUMENT CORRESPONDING TO THE C COMPUTED MINIMUM FUNCTION VALUE C OOU8LE PRECISION VECTOR, C F * SINGLE VARIABLE CONTAINING THE MINIMUM FUNCTION C VALUE ON RETURN, I.E. F=F(X). C DOUBLE PRECISION VARIABLE. C G * vECTOR OF OIMFENSION N CONTAINING THE GRAnIENT C VECTOR CCRRESPONDING TO THE MINIMUM ON RETURN, C I.E, GsG(X), C DOUBLE PRECISION VECTOR. C EST - IS AN ESTIMATE OF THE MINIMUM FUNCTION VALUE. C SINGLE PRECISION VARIABLE. C EPS - TESTVALUE REPRESENTING THE EXPECTED ABSOLUTE ERROR, C A REASONABLE CHOICE IS 1l**(-16), IE, C SOMEWHAT GREATER THAN 1r**(-D), WHERE D IS THE C NUMBER OF SIGNIFICANT DIGITS IN FLOATING POINT C REPRESENTATION. C SINGLE PRECISION VARIABLE. C LIMIT - MAXIMUM NUF3FR OF ITERATIONS, C IER w ERROR PARAMETER C IER m 0 MEANS CONVERGENCE WAS OBTAINED C IER a I MEANS NO CONVERGENCE IN LIMIT ITERATIONS C IER -I1 MEANS ERRORS IN GRADIENT CALCULATION C IER a 2 MEANS LINFAR SEARCH TECHNtSLUF INDICATES C IT IS LIKELY THAT THERE EXISTS NO MINIMUM, C H - WORKING STORAGE OF DIMENSION N*(N+7)/2, C DOUBLE PRECISION ARRAY, C C REMARKS C 1) ThE SUBROUTINE NAME REPLACING THE DUMMY ARGUMENT FUNCT C MUST ME DECLARED AS EXTERNAL IN THE CALLING PROGRAM, C II) IER IS SET TO 2 IF, STEPPING IN ONE OF THE COMPUTED 150

151 PAGE 00,02 C DIRECTIONS, THE FUNCTION WILL NEVER INCREASE WITHIN C A TOLfRASLE RANGE OF ARGUh'FNT, C IEQR 2 MAY OCCUR ALSO IF THE INTERVAL WHFRE F C INCREASES IS SMALL AND THE INITIAL ARGUMENT WAS C RELATIVELY FAR AWAY FROM THE MINIMUM SUCH THAT THE C MINIMUM WAS OVERLEAPED, THIS S DUE TO THE SEARCH C TECHNIQUE WHICH DOUBLES THE STEPSIZE UNTIL A POINT C IS FOUND WHERE THE FUNCTION INCREASES. C C SUBROUTINES AND FUNCTION SUBPROGRAMS REQUIRED C FUNCT C C METHOD C THE METHOD IS DESCRIBED IN THE FOLLOWING ARTICLE C R. FLETCHER AND M.J.D. POWELL, A RAPIO DESCENT METHOD FOR C MINIMIZATION, C COMPUTER JOURNAL VOLb, ISS. 2, 1963, PP.l63s168. C C C SU6ROUTINE P FMFP(FUNCT,N,, XF, GEST,EPS, LIMIT, IER,H) C C DIMENSIONED UUMMY VARIABLES COMMON/ANS/YSTAR,FSTAR, ICNT DIMENSION H(l),X(1),GCt) A,XSTAR(1CO DOUPLE PRECISION X,,FX,FY,OLDFHNPMGNRM,HGXOYALFAOALFA, 1AMaDA,T,ZW A,XMAG, XLOG, XSTARGMAGGLOGFMAGFLOG FSTAR C C COMPUTE FUNCTION VALUE AND GRADIENT VECTOR FOR INITIAL ARGUMENT CALL FUNCT(NX,F,G) C C RESET ITERATION COUNTER AND GENERATE IDENT.ITY MATRIX ICNTx0 lERco KOUNT=O N2 N+Nl N3:N2+N N31':N3+1 I KXN31 WRITE (6,50) 508 FOPMAT('RESET*) ICNT ICNT+l DO 4 Jt,N H(K)~, DO NJaN-J IFCF(:J)5,5, 2 DO 3 Lm.,NJ KLm<+L 3 H KL) O'D, d4 K"KL+1 C C START ITERATION LOOP 5 KOUNTRKOUNT +1 GMAG&0.DPC?0

152 PAGE p000 3 XMAGC0.D0 D0O1?0 I1I,N XMAGVxMAG. ( (I )XSTAR())**2 120 GMA MMAG+G(I)**2 XMAGDSORQ T (xMAG) GMAGaDSOT (GMAG) FMAGpF FSTAR IF(XMAG.GTe.,D0.AND.GMAG.GT..0. AN.DFMAG,GT.OD,0 )GO TO 121 WRITE (6,1'44) KOUNTXMA,,GMAGFMAG 104 FURMAT (///0'L^,5I3. "X'~^,01r,5, G"',0D12.,5,' "PF,0'D1?.5/) GO TO 122 121 XLOG=DLOGI0(XMAG) GLOG=DLOG1 CGMAG) FLOGsntLOG0CFMAG) WRITE (6,101) KUUNT,XMAG,XLOG,GMAGGLOCGFMAGFLOG 101 FOHMAT(///'Lu'13,''X"a*,D12.5,^ LOG"X- tOR, 5,' 0 ^1G i,D12.5, 1I "LOG"G"i', A Dl2.5,' "F*^,D12,5, LOG "F a' D12 5/) 122 WRITE (6,10R) CX(I),ItN) 102 FORMAT ('X:^, 1l12,4) WRITE (6,103) (G(I),a1,N) 103 FQ M AT (Gt,', 10-1 e. ) CC=KOUNT C C SAVE FUNCTION VALUE, ARGUMENT VECTOR AND GRADIENT VECTOR OL OF iF OLDFaF DO 9 JI1,N KN-+J H( K) G(J) K K+N H(K)r X(J) C DETERMINE DIRECTION VECTOR H K:J+N3 TBo.DO DO 8 LmlN TT-G (L) *H(K) IF (L-J)6, 77 6 K'K+N-L GO TO 8 7 KSK+ 1 8 CONTINUE 9 H(J)sT C CHECK WHETHER FUNCTION WILL DECREASE STEPPING ALONG H, DY xO DO HNQRM80.0 GNRMV. DO C CALCULATE DIRECTIONAL DERIVATIVE AND TESTVALUES FOR DIRECTION C VECTOR H ANO GRADIENT VECTOR G, DO 10 J=l N HNRMaHNRMOABS (H( J)) GNRM=GNRM+0A35 (G(J)) 10 DY.DY+H(J)*G(J) C

1 53 PAGE 00004 C REPEAT SFARCH IN DIRECTION OF STEEPEST DESCENT IF GIRECTIONAL C DERIVATIVE APPEARS TO PE POSITIVE OR ZERO, IF(DY) 11,51,51 C C REPEAT SEARCH IND IRECTION OF STEEPEST DESCENT IF DIRECTION C VECTOR H IS SMALLt: C.OMHPAREO T''GRADIENT VECTOR,G, 11 IF(HNPM/GNRM-EPS) 5 1 w l2 C C SEARCH MINIMUM ALONG DIRECTION H C C SEARCH ALONG H FOR POSITIVE DIRECTIONAL DERIVATIVE 12 FYmF ALFA.2,D0*(ESTF) /DY AMtHDA I 00 C C USE ESTIMATE FOR STEPSIZE ONLY IF IT IS POSITIVE AND LESS THAN C 1, OTHERWISE TAKE 1, AS STEPSIZE IF(ALFA) 15,15,13 13 IF(ALFA-AMBOA) 1415,15 14 AM3DAAALFA 15 ALFAs).D0 C - C SAVE FUNCTION AND DERIVATIVE VALUES FOR OLO ARGUMENT 16 FX=FY DX:DY C C STEP ARGUMFNT ALONG H DO 17 I1i,N 17 X(l)=X (I)+AMBA*H(I) C C COMPUTE FUNCTION VALUE AND GRADIENT FOR NEW ARGUMENT CALL FUNCT(N,X,F,G) FYzF C C COMPUTE DIRECTIONAL DERIVATIVE DY FOR NEW ARGUMENT, TERMINATE C SEARCH, IF DY IS PUSITIVE. IF DY IS ZERO THE MINIMUM IS FOUND DY~,DO DO 18 It1,N 18 DY~.DY,.G (I) *H (I) IF(DY) 19,36,22 C C TERMINATE SEARCH ALSO IF THE FUNCTION VALUE INDICATES THAT C A MINIMUM HAS BEEN PASSED 19 IF(FY-FX)20,22,22 C C REPEAT SEARCH AND DOUBLE STEPSIZE FOR FURTHER SEARCHES 2? AMBDABAMBDA+ALFA ALFAuA4B6DA C END OF SEARCH LOOP C C TERMINATE IF THE CHANGE IN ARGUMENT GETS VERY LARGE IF(HNRMHAMdDOA-lie1)lbl 6,21 C C LINEAR SEARCH TECHNIQUE INDICATES THAT NO MINIMUM EXISTS 21 IERa2 REFTU RN

154 PAGE 0'D05 C C INTERPOLATE CU1dCALLY TN THE INTERVAL'EFINED RY ThE SEARCH C ^ABOVE AND COMIPUTTF THE ARGUMENT X FOR WHICH THE INTEPPOLATION C POLYNOMIAL IS MINIMIZEC 2? Tmeo.D?3 IF(AMHDA)?4, 3,24 24 Zs3q. *(FX-FY) /AMHDA+DOXDY ALFASDOAX (OAHS(Z).DA8S (DX),ODAeS(Y)) DALFAaZ/ALFA DALFAuDALFA*OALFA.DX/ALFA*DY/ALFA IF(DALFA)51,25,25 25 WAALFA*DSQRT(DALFA) ALF ADY- OX+W W IF (ALFA) 250,251,250 250 ALFAu(O0Y-Z+)/ALFA GO TO 252 251 ALFA (Z+OY-W)/(Z+DX+ZODY) 252 ALFAzALFA*AMHDA DO 26 Il1,N 26 X(I)~X(I)+(T-ALFA)*H(I) C TERMINATE, IF THE VALUE OF THE ACTUAL FUNCTION AT X IS LESS C THAN THE FUNCTION VALUES AT THE INTEPVAL ENDS. OTHERWISE REDUCE C THE INTERVAL HY CHrJn ING ONE EN)-POTNT EQJUAL TO X AND PEPFAT C THE INTERPO)LATION. wHICH END-POINT IS CHOOSEN DEPENDS ON THE C VALUE OF THE FUNCTION ANn ITS GRADIENT AT X C CALL FUNCTCN,XX,FG) F (F-FX)2 2 7 7,?2 27 IF(F-FY)36,36,28: 28 UALFA:,DC0 DO 29 I1,N 29 OALFA:DALFA.G(I)*H(I) IF(UALFA)30,33,33 30 IF(F-FX)32,31,33 31 IF(DX-DALFA)32,36,32 32 FXsF DX=DALFA TxALFA AMBDAaALFA GO TO 23 33 IF(FY-F)35,34,35 34 IF(DY-OALFA)35,36,35 35 FY=F DY=DALFA AM0DA~AMdDA-ALFA GO TO 22 C TERMINATE, IF FUNCTION HAS NOT DECREASED DURING LAST ITERATION 36 IF (OLDF-F+EPS) 51,38,38 C COMPUTE DIFFERENCE VECTORS OF ARGUMENT AND GRADIENT FPOM C THO CONSECUTIVE ITERATIONS 38 DO 37 Jl1,N H(K)GC(J)-H(K)

155 PAGE 00006 K sN+K KINeK 37 H(K)X (J)-H(K) C C TEST LENGTH OF ARGUMENT DIFFERENCE VFCTOR AND OIRECTION VECTOR C IF AT LEAST N ITERATIONS HAVE BEEN EXECUTED. TERMINATE, IF C BOTH ARE LESS THAN EPS IERI0 IF (KOUNTN) 4/, 39,39 39 T0Oii) Z?0,D0 DO 40 J~1,N WCH(K) K'nK+N TWT+DABS(H(K)) 40 ZnZZW*H(K) IF (HNRM-EPS) 41, 41,42 41 IF(T-EPS)56,56,42 C C TERMINATE, IF NUMBER OF ITFRATIONS WOULD EXCEED LIMIT 42 IF(KOUNT-LIMIT)43, 5,5 C C PREPARE UPDATING OF MATRIX H 43 ALFAu(D,00 DO 47 Jl,N K~J+N3 00 06 Lm1I,N Kl.sN+L W~W+HfKL)*H K) IF(L-J)),14454S, 44 KsmKN-L GO TO 46 45 K=K+1 46 CONTINUE KcN+J ALFA~ALFA+W*H(K) 47 H(J) W C C REPEAT SEARCH IN DIRECTION OF STEEPEST DESCENT IF RESULTS C ARE NOT SATISFACTORY IF(Z*ALF A)48, 1,48 C C UPDATE MATRIX H 48 KsN31 DO 49 Lt1,N KL~N2*L DO 49 J=L,N NJ N2 +J H(K) ~H(K)+H(KL) H(NJ)/Z-H(L)*H(J)/ALFA 49 KvK+1 GO TO 5 C END OF ITERATION LOOP C C NO CONVERGENCE AFTER LIMIT ITERATIONS 5% IER 1i

156 PAGE 0007 RETURN C C RESTORE OLD VALUES OF FUNCTION AND ARGUMENTS 51 DO 52 J11,N KIN2+J 52 X(J) H (K) CALL FUNCT(NXF,G) C C REPEAT SEARCH IN DIRECTION OF STEEPEST DESCENT IF DERIVATIVE C FAILS TO SE SUFFICIENTLY SMALL F (GNPM-EPS) 55, 55,53 C C TEST FOR REPEATED FAILURE OF ITERATION 53 IF(E~R)56S,54,5 54 IER~-1 GOTO I 55 IER:0 56 RETURN

APPENDIX III FORTRAN PROGRAM FOR ALGORITHM 5.3.1 PAGE p0000 SULHPOUTINE OIXON(F(UNCT,N,X,FIGwESTEPSLIMIT) C CI. ECL AATIONS C IMPLICIT PEALe*8(A-H0. Z) CCIM' CN/A ^S/xST A (1 ),e STAr CO.MOh/tPIhl/lIC 1?,1e ),UDJ (1P, 10),UOM(10, I ),U1 10) U( UT(1),UV (10),UM( nI MEtS ION V C 1i, 1 ), VV ( 1), P(l1), X (10),XOLO(1),G(1 0) A GOLn ( 1), EPS (4), IA6E ( ) C CII. CHECK DIMENSION OF PFOPLEM C IF (N.LF. 10) GO TO 5 WkITE (h,10i) N 101 FORMAT ('-Nx',13,3X,'IS TOO LARGE FOR SUBROUTIN.E CHANGE DIMENSION ANT"//) STOP C CIII. INITTALIZE V ARIA^HL S C 5 IfK0 _[:'1 I.W"2:1 DO 1] Il,N IAGL(I)S= 10 tJo(l n I s T.r0,0 CAL.L F-LtF CT (, X,F,G) C CIV. STARTING ITERATION NLUMER "L" C 15 IF ( IAU().EQ.)GO 10 lb IF (L- I Af[ ( I ) -2 *N) f, 1 7, 1 7 17 CALL PINV?(N,IRK, 1) Ik.K TV.I K- I D)O 18 J-l,IRK DO 19 I:1,N u(I,J)mU(,IJ+I ) V (I,J) v (,J1) 19 J)D(J,T LUOM(J, I) 18 IAGE(J)=IAGE (J+) IAGE(IRK+1) 0 1: L L+ IF (L,r:T. LI"IT) RETURN IRK I IRK- 1 GMAGa~,.;:Dr XMAGsP C.D0 nO 2( It,N XMA.G= X tMAG. ( ( ) -XSTAR (l) )**2 20 r3AGr.Ar,.*Gf I)**2 XMPAG6lS'qT (XMAG) G MAG05 OSFRT C(GMAG) FMAGsF-FSTAp II (XMAG.GT. rE.O0,rANDGMAG.GT.0.DO0AND.FMAG.GT.0.DO)GO TO 121 WkITE (, 15) L,XMAGI,- tMG, AFMAG 105 FORMAT (C/I/'L,13,' Y'xs',D G"~,012125, G D12' F-F*a=',012.5/) 157

158 PAGE 0102 Gn TO 1?? 1 2 Xt OGG:LOGLC 1 ( XM AG) rLOL,sr)Ot O1(; 1GMAG) FLOG0L rG 1 (FMA() WRITE (t 101) LXMAGXLCG,GMAGGLOGFMAGFLOG 101 FORMAT(///'L. I',I3.' -vX~^,Di2.5, LOG-^X^-a, 1.5,'.G"~,012~5.' L A GIP.5, "F' l,012.b, LOG-F'-=,Dl2.5/) I?? WRIIF. (6,l 0?) (XCI), I1,N) 10? FORAT ('x.10e1I2.4) WRITE (^h103) (G(I),IuI, N) 103 FOR MAT ('r., Ilt?12.4) WHITE (0,104) (IAGE(I),ImlN) 104 FORMAT ('IAGE^.1016) CCIL C CIV, A. ITERATION TERMINATEO IF "'G < EPS4 C 24 IF ( GMAG GT. EPS(4)) GO TO?5 RE TIl t C. CIV, E. COMPUTE PSEJUDO INVERSE OF NEW tl FROM l)D & UV C 25 IF (ISw-.E,. 0) GO TO 26 )00?5s Is T,N' 254 Ul (1,I5) 13.0 IF (CFK,FQ. 0) GO TO 26 PC 255 l=1,IRK DO 256 J-1,N 256 JUV(J) ULJ,I?55 CALL PINV(N,I) C CIV, C. COMPUTE P9AR C 26 P ISW ^. 275 PMAGsl,( COS: 3, Jb U no 27 I:I,IRK UT (I) 1.DU DO?7 Jzl,N 27 UT(I) lIT CI)+UDI (IJ)*G(J) RI (IPK *FQ. N) GO TO 3, 0 29 I:1,N P(I)mG I) DO?8 J-,IRK 28 P(I)tP (I)-lJ t T) JT J) PMAGsP-MAG +P(I)**? 29 COS:COS+G CI ) *P I ) PMAGmOSRT (PMAG) COSCOQ/ (GMAG*PMAG) C CIV. 0. IF PBAR IS PERPENOICULAR TO G COMPUTE PSTAR C IF (COS,GT, EPS(1)) GO TO 40 30 COS.O0 PMAr.,.C ODP DO 32 I1l,N

159 PAGE P0oV3 P(I)(', J:(, * DOQ 31 J:I I K 31 P(I) zP (I) +V I, J) iiT (J PMAGF A.I P (I) * * 3?2 COSsCO*S G(I)*P ) PMA(t; —r)S:T (PMAG) COS-C(,S/(GMAG*PMAG) S w3l I C CIV, E, IF PSTAR IS PERPfNCICULAR TO G; DELETE THE OLDEST COL, OF U C IF (CCIS.GT, EPS(2)) GO TO 40 CALL PINV2(N,IRK,1) L'L-1 I SW? 2 IRK IFK-1 O0 3, J:,IlK onr 5s I-,N tl(I J)) tj (IJ+1 ) V(I,J):V(I,J+I) 35 un (J,I I =LU)M(J, I) 36 IAGE(J) I AGE(J ) l AG ( I PK 1 ) rub IC] 1I 40 DO 4111=,N GOLD(I)=G(I) u1 XOL}(I)= (I) C CIV., F, DETERMINE STEPLFNGTH C p- CALL STPLFN(FUNC',N,X,F,GP,EST) C CIV. G. CALCULATE OX & DG VECTORS C 44 I MAtr, G. 0 45 00 5';-I=I,N VV (I) (I)-OLn (-I) uV (I) G(I)-,OLOD I) 50 IJMAG.MAG+UV ( ) ** UMAGUSQWT (UMAG) C CIV, H, TEST NEW DG VICTOR C IF (IRK F.Q, N) GO TO 61 PO 515.I1,IRK UT (I) I. u DO 505 J1l,N 505 UTCl)ruTCI)UDO(I,J)*UV(J) CMAGp =Ci.Or DO 52 lt~,hN COMP~.0D0 00 51 Jul,IRK 51 COMP~CO;P+U I,J)*UT(J) 52 CMAGrCMAh lCO'MP** CrAG.MDS. RT (CAG)J IJMAG

160 PAGE 0P04 IF (CMAG,GT. (1t.L -EPS(3))) GC TO 61 C CIV. HI. PASSEDS UPOA1E U & V MATRICES C!RK= ItK+ 1 IAGE (TRP)aL DO (6t 2-I1,N J(I#lkK)mUV(I) 60 V(Il,IK)3VV(I) GO TO 15 CIV, H2. FAILED: TRY ELIMINATING ONE OF THE COLUMNS C 61 DO0 75 Ml,IFRK CALL PINV2(N,IRK,M) CMAGvO,n 00 67 I=1,N COMP 0,. JJ:0 DO 66 J:t,! RK IF (J.ELJ. P) GO TO i6 JJaJJ 1 CnMPCOMP+Uf (I J)* UT (JJ) 66 CONTINUE 67 CMAG:CMAG+COMP**2 CMAG=DSC( RT (CMAG) /JHAG TF (CMAr,GT, (ai)l-EPS(3))) GO TO 75 IRK1I IrNK- 1 DO 72 T l, IfK 00 72 J=t,N 72 UO(I, J) lDM (I, J) DO 73 J1,lIkK1 IAGE J) IAGE (J+1) DO 73 X-1,N U(IJ)=U(I1J+I) u (II J.):v ( I# J+ 1' 73 V(IJ)=VCIJ+l) IAGE(IRK)= L 00D 7lS ll,N U(I, I })suv (I 7 GO TO 5V)VV G6 TO 15 75 CONTINUE C CIV. H3. TEST FAILED AGAIN U, V & UD REMAIN UNCHANGED C ISW2?P GO TO 15 ENO

161 PAGE 00001 SUBROUTINE PINV(N,IRK) C MATRICES liSFO: U,Un,UOM -- VECTORS USED: UIUTUV C U: THE DELTA-G MATRIX SUPPLIED C UD: T'-E PSUEOO INVERSE OF THE "U" MATRIX SUPPLIED,UPDATED & RETURNED C UO: WOk'KING MATRIX C Jl: WORKING VECTOR; U a (I U U * U) * UV C lIT: WORKING vECTOR r UT a UD * UV C UVI NEW CCLLMN JUST ADUED TO THE "U" MATRIX IMPLICIT QEAL*8 (A-H,O-Z) COMMONM/FINI/U (1l4,10),U t 1?, 10),UiDM 10, IP,U1 Cl0),UT( 0), UV(10),UM(le) IF (IFK,LE, ) RETURN I KI IR U1MA G s, IF (IRKI.EC. 0) GO TO 211 OO 21 IIK1 RKI UT ( I):0.O000 DO 21 J1I,N 21 UT(I) =UT (I)+UD (IJ) UV(J) 211 D0 23 I1l,N U (I):UV(I) IF (IRK1.EQ,.) GO TO 23 DO 2??2 J1,IK1?2 U (l)Ul (I)-U('I,J)*UT(J) 23 UIMAGl:U2AGU+U (I)**2 00?b J:1,N 25 UO(IRKJ)U(JU(J)/U1MAG IF (I;KI.EC, ~ ) GO TO 2h DO 2' I:l,IRKI ti.cM 2I, J):.l) I, J 00 241 I.I,IRKi J (I,J) cU M (I, J) 00 241 K:s,N 241 LJUO(,J): UO(I,J)-UOM(I,K) *LV (K)*UI (J)/UIMAG 26 RETURN END

162 PA Ge Q0011 SUBROUTINE PINV. (N, fR,M) C MATRICES USFO: UD,UiDM -- VECTORS USED: UTLUV,UM C UD: PSUEOD INVERSE OF THE "U" MATRIX SUPPLIED C JUOM PSUEDO INVERSE OF TMh U MATRIX MINUS THE MTH COL, RFTURLNDE C UT: ThE PRODUCT OF "1JUM" TIMES "UV" RETURNED C JV: NEW CCLUMN TO BE ADOEC TO THE U MATRIX C UM: THE MTH ROW OF THE lD0 MATRIX IMPLICIT REAL*8(A-H,O*Z) COMMON/PINI/U( II, 1,) UD( l V, 10) UOM(It 1,).Ut (10), UT( IO), UV[i),UM(1Q) UDMAXGa.D0 OO0 62 II,N UT (I) so. O UM(I)sUODCM, ) 62 UDMAGaUDMAG+UM (I)*2 10 I )DO 65 llalIRK IF II,EQO M) GO TO 65 UT(I)n0,00 nO 64 JtI,N DUMUD (I I J) 00 63 K:,N 63 DUM:DL)M-UJi) (!I o ) *UM (J) *UM CK) /UDMAG UT(I) cUTl ) +DLM*UV J) 6 UOM (', J) -OUM 65 CONTINUE RETURN ENO

163 PAGE 001t1 SU6ROUTINE STPLEN (FUNCT, N,X, F,G, P EST) IMPLICIT kEAL*8 (4CA-HZ) DIMENSION X(1),GC(10),P(l~) OY=:.L~O 000 10 Il,N P(I) -PCtI) e D)YDY+r (CI)*P (I) IF (OY) 12,11,11 1i wRITE (6,10?) 100 FORMAT (^-THE DIRECTIONAL DERIVATIVE APPREARS TO BE > OR ~ 0,) STOP 12 FYVF ALFA=2.O0* (EST-F)/OY A M OA 1.O~ TI (ALFA) 15, 15,13 13 IF CALFA-AMBA) 14,15,15 14 AMBDA4ALFA 15 ALFAOeD0O 16 FXsFY OX~ODY 00 17 I11,N 17 X( I) X ()+AM8OA*P (I) CALL FUNCT(N,X,F,G) FYaF DO 18 I1,N 18 ODYa)Y+G(I)*P(I) IF (OY ) 19,32,.2 19 IF (FY-FX) 20,?2,?2 20 AMHDA=AMflPA+ALFA ALFA AM DA GO TO 16 22 TzO.C"Q 23 IF (AM8DA)?4,36,24 24 Z23.,D'*(FX-F Y)/AMHOA+lX+D)Y ALFA=DMAXI(DABS CZ) DAbS (X) DA4S (DY)) DALFA=Z/ALFA DALFA= DALFA*DALFA-DX/ALFA*OY/ALFA IF (OALFA) 51,25,25 51 STOP 4 25 W~ALFA*DSnRT(DALFA) ALFAaDY-DX+w+W IF (ALFA) 250,251,250 250 ALFA (OY-Z+W) /ALFA GO TO 252 251 ALFA~(Z+OY-w) / (Z+OX+ZDY) 252 ALFA ALF A*AM8DA 0n 2h Tl,N 26 X (I) (I) + T-ALFA)*P (I) CALL FUNCT(NX,F,lG) IF (FY-X) 27,27,28 27 IF (F-FY) 36,36,28 28 OALFA:0.D,0 DO 29 Is1,N 29 DALFAOALFA+G (I)*P(I) IF (OALFA) 38,33,33

164 PAGE 0~02 30 IF (F-FX)32,31,33 31 IF (nX-OALFA) 32,S6,32 32 FYsF DX=OALFA T2ALFA AMBDOAALFA. GO TO 23 33 IF (FY-F) 35,34,35 34 IF (OY-DALFA) 35,36,35 35 FY~F DY DALFA AMBDA=AMBDA-ALFA GO TO 22 36 RETURN

BIBLIOGRAPHY

AFR 70 Afriat, S.N., The Progressive Support Method for Convex Programming, SIAM Journal of Numerical Analysis, Vol. 7, No. 3, September 1970, pp. 447-457.
AKA 59 Akaike, Hirotugu, On a Successive Transformation of Probability Distribution and Its Application to the Analysis of the Optimum Gradient Method, Annals Institute of Statistical Mathematics, Tokyo, Vol. 11, 1959, pp. 1-16.
ALB 72 Albert, Arthur, Regression and the Moore-Penrose Pseudoinverse, Academic Press, New York, 1972.
ANT 62 Antosiewicz, Henry A., and Werner C. Rheinboldt, Numerical Analysis and Functional Analysis, Survey of Numerical Analysis, J. Todd, ed., McGraw-Hill, New York, 1962, pp. 485-517.
BEN 66 Ben-Israel, Adi, A Newton-Raphson Method for the Solution of Systems of Equations, Journal of Mathematical Analysis and Applications, Vol. 15, 1966, pp. 243-252.
BRO 65 Broyden, C.G., A Class of Methods for Solving Nonlinear Simultaneous Equations, Mathematics of Computation, Vol. 19, 1965, pp. 577-593.
BRO 67 Broyden, C.G., Quasi-Newton Methods and Their Application to Function Minimisation, Mathematics of Computation, Vol. 21, 1967, pp. 368-381.
BRO 69 Broyden, C.G., A New Method of Solving Nonlinear Simultaneous Equations, Computer Journal, Vol. 12, 1969, pp. 94-99.
BRO 70 Broyden, C.G., The Convergence of a Class of Double-Rank Minimization Algorithms, Journal of the Institute of Mathematics and Applications, Vol. 6, 1970, pp. 76-90.
BRO 70a Broyden, C.G., The Convergence of a Class of Double-Rank Minimization Algorithms, Journal of the Institute of Mathematics and Applications, Vol. 6, 1970, pp. 222-231.
CAN 66 Canon, M., C. Cullum, and E. Polak, Constrained Minimization Problems in Finite-Dimensional Spaces, SIAM Journal of Control, Vol. 4, No. 3, 1966, pp. 528-547.
CHI 64 Chipman, John S., On Least Squares with Insufficient Observations, American Statistical Association Journal, December 1964, pp. 1078-1110.
CLI 64 Cline, Randall E., Representations for the Generalized Inverse of a Partitioned Matrix, SIAM Journal, Vol. 12, No. 3, September 1964, pp. 588-600.

CLI 65 Cline, Randall E., Representation for the Generalized Inverse of Sums of Matrices, SIAM Journal of Numerical Analysis, Ser. B, Vol. 2, No. 1, 1965, pp. 99-114.
COH 72 Cohen, Arthur I., Rate of Convergence of Several Conjugate Gradient Algorithms, SIAM Journal of Numerical Analysis, Vol. 9, No. 2, June 1972, pp. 248-259.
DAV 67 Davidon, William C., Variance Algorithm for Minimization, Computer Journal, Vol. 10, No. 4, 1967, pp. 406-411.
DAV 59 Davidon, William C., Variable Metric Methods for Minimization, A.E.C. Report ANL-5990, Argonne National Laboratory.
DEI 67 Deist, F.H., and L. Sefor, Solution of Systems of Non-Linear Equations by Parameter Variation, Computer Journal, Vol. 10, 1967, pp. 78-82.
DES 63 Desoer, C.A., and B.H. Whalen, A Note on Pseudoinverses, SIAM Journal, Vol. 11, No. 2, June 1963, pp. 442-447.
FLE 63 Fletcher, R., and M.J.D. Powell, A Rapidly Convergent Descent Method for Minimization, Computer Journal, Vol. 6, No. 2, 1963, pp. 163-168.
FLE 64 Fletcher, R., and C.M. Reeves, Function Minimization by Conjugate Gradients, Computer Journal, Vol. 7, 1964, pp. 149-154.
FLE 65 Fletcher, R., Function Minimization Without Evaluating Derivatives - A Review, Computer Journal, Vol. 8, 1965, pp. 33-41.
FLE 68 Fletcher, R., Generalized Inverse Methods for the Best Least Squares Solution of Systems of Non-Linear Equations, Computer Journal, Vol. 10, 1968, pp. 392-399.
FLE 69 Fletcher, R., A Review of Methods for Unconstrained Optimization, Optimization, Proceedings of an International Conference on Optimization at Keele Univ., Academic Press, 1969, pp. 1-13.
FLE 70 Fletcher, R., A New Approach to Variable Metric Algorithms, The Computer Journal, Vol. 13, No. 3, August 1970, pp. 317-322.
GRE 59 Greville, T.N.E., The Pseudoinverse of a Rectangular or Singular Matrix and Its Application to the Solution of Systems of Linear Equations, SIAM Review, Vol. 1, No. 1, January 1959, pp. 38-43.
GRE 60 Greville, T.N.E., Some Applications of the Pseudoinverse of a Matrix, SIAM Review, Vol. 2, No. 1, January 1960, pp. 15-22.
HOR 68 Horwitz, Lawrence B., and Philip E. Sarachik, Davidon's Method in Hilbert Space, SIAM Journal of Applied Mathematics, Vol. 16, No. 4, July 1968, pp. 676-695.

JON 70 Jones, A., Spiral - A New Algorithm for Non-Linear Parameter Estimation Using Least Squares, The Computer Journal, Vol. 13, No. 3, August 1970, p. 301.
KEL 66 Kelley, H.J., W.F. Denham, I.L. Johnson, and P.O. Wheatley, An Accelerated Gradient Method for Parameter Optimization with Non-Linear Constraints, Journal of the Astronautical Sciences, Vol. XIII, No. 4, 1966, pp. 166-169.
LAN 70 Lancaster, Peter, Explicit Solutions of Linear Matrix Equations, SIAM Review, Vol. 12, No. 4, October 1970, pp. 544-566.
LAS 67 Lasdon, L.S., S.K. Mitter, and A.D. Waren, The Conjugate Gradient Method for Optimal Control Problems, IEEE Transactions on Automatic Control, Vol. AC-12, No. 2, April 1967, pp. 132-138.
LAS 70 Lasdon, Leon S., Conjugate Direction Methods for Optimal Control, IEEE Transactions on Automatic Control, April 1970, pp. 267-268.
LUE 70 Luenberger, David G., The Conjugate Residual Method for Constrained Minimization Problems, SIAM Journal of Numerical Analysis, Vol. 7, No. 3, September 1970, pp. 390-398.
MUR 70 Murtagh, B.A., and R.W.H. Sargent, Computational Experience with Quadratically Convergent Minimisation Methods, Computer Journal, Vol. 13, 1970, pp. 185-194.
MYE 68 Myers, Geraldine E., Properties of the Conjugate-Gradient and Davidon Methods, Journal of Optimization Theory and Applications, Vol. 2, No. 4, 1968, pp. 209-219.
NEL 65 Nelder, J.A., and R. Mead, A Simplex Method for Function Minimization, Computer Journal, Vol. 7, 1965, pp. 308-313.
ORT 70 Ortega, J.M., and W.C. Rheinboldt, Iterative Solution of Nonlinear Equations in Several Variables, Academic Press, New York, 1970.
ORT 72 Ortega, James M., and Werner C. Rheinboldt, A General Convergence Result for Unconstrained Minimization Methods, SIAM Journal of Numerical Analysis, Vol. 9, No. 1, March 1972, pp. 40-42.
PEA 69 Pearson, J.D., Variable Metric Methods of Minimization, Computer Journal, Vol. 12, No. 5, 1969, pp. 171-178.
PEN 54 Penrose, R., A Generalized Inverse for Matrices, Proceedings of the Cambridge Philosophical Society, 1954, pp. 406-413.
PEN 55 Penrose, R., On Best Approximate Solutions of Linear Matrix Equations, Proceedings of the Cambridge Philosophical Society, 1955, pp. 17-19.

PET 70 Peters, G., and J.H. Wilkinson, The Least Squares Problem and Pseudo-Inverses, The Computer Journal, Vol. 13, No. 3, August 1970, pp. 309-316.
PIE 70 Pierson, B.L., and S.G. Rajtora, Computational Experience with the Davidon Method Applied to Optimal Control Problems, IEEE Transactions on System Science and Cybernetics, July 1970, pp. 240-242.
POR 72 Powers, William F., A Crude-Search Davidon-Type Technique with Application to Shuttle Optimization, AIAA Astrodynamics Conference, Palo Alto, Cal., September 1972, Paper No. 72-907.
POW 64 Powell, M.J.D., An Efficient Method for Finding the Minimum of a Function of Several Variables without Calculating Derivatives, Computer Journal, Vol. 7, 1964, pp. 155-162.
POW 68 Powell, M.J.D., On the Calculation of Orthogonal Vectors, Computer Journal, Vol. 11, 1968, pp. 302-304.
POW 69 Powell, M.J.D., A Theorem on Rank One Modifications to a Matrix and Its Inverses, Computer Journal, Vol. 12, 1969, pp. 288-290.
POW 70 Powell, M.J.D., A New Algorithm for Unconstrained Minimization, Non-Linear Programming, Proceedings of a Symposium at the University of Wisconsin, Madison, Academic Press, 1970, pp. 31-65.
POW 70a Powell, M.J.D., A Survey of Numerical Methods for Unconstrained Optimization, SIAM Review, Vol. 12, 1970, pp. 79-97.
POW 71 Powell, M.J.D., On the Convergence of the Variable Metric Algorithm, Journal of the Institute of Mathematics and Applications, Vol. 7, 1971, pp. 21-36.
PSH 69 Pshenichnii, Boris N., On the Acceleration of Convergence of Algorithms for Solving Optimal Control, Computing Methods in Optimization Problems 2, Proceedings of a Conference, San Remo, 1968, pp. 331-352, Academic Press, 1969.
RAO 71 Rao, C. Radhakrishna, and Sujit Kumar Mitra, Generalized Inverse of Matrices and Its Application, John Wiley & Sons, Inc., New York, 1971.
SHA 64 Shah, B.V., R.J. Buehler, and O. Kempthorne, Some Algorithms for Minimizing a Function of Several Variables, SIAM Journal, Vol. 12, No. 1, March 1964, pp. 74-92.
SHN 70 Shanno, D.F., Parameter Selection for Modified Newton Methods for Function Minimization, SIAM Journal of Numerical Analysis, Vol. 7, No. 3, September 1970, pp. 366-372.

TRI 70 Tripathi, Shiva S., and Kumpati S. Narendra, Optimization Using Conjugate Gradient Methods, IEEE Transactions on Automatic Control, April 1970, pp. 268-270.
VET 70 Vetter, W.J., Derivative Operations on Matrices, IEEE Transactions on Automatic Control, April 1970, pp. 241-243.
WAR 71 Ward, J.F., T.L. Boullion, and T.O. Lewis, A Note on the Oblique Matrix Pseudoinverse, SIAM Journal of Applied Mathematics, Vol. 20, No. 2, March 1971, pp. 173-175.
ZAN 67 Zangwill, Willard I., Minimizing a Function Without Calculating Derivatives, Computer Journal, Vol. 10, 1967, pp. 293-296.
ZEL 68 Zeleznik, Frank J., Quasi-Newton Methods for Nonlinear Equations, Journal of the Association for Computing Machinery, Vol. 15, No. 2, April 1968, pp. 265-271.
ZOU 66 Zoutendijk, G., Nonlinear Programming: A Numerical Survey, SIAM Journal of Control, Vol. 4, No. 1, 1966, pp. 194-210.
