THE SECOND ORDER STEEPEST DESCENT METHOD

Aharon Ben-Tal*
Technion, Israel Institute of Technology
and Department of Industrial and Operations Engineering, The University of Michigan

Luc De Hulsters
Technion, Israel Institute of Technology

Technical Report #86-33
September 1986

* Research supported by NSF and the VPR Fund at the Technion.

ABSTRACT

The paper introduces a new algorithm for unconstrained minimization of an n-variable twice continuously differentiable function f. Unlike classical methods, which improve a current solution by moving along a straight line, the new method improves the solution by moving along a quadratic curve in R^n. The specific curve is determined by minimizing an appropriate approximate model of f. The algorithm thus obtained (SOSD) is a natural second order extension of the Steepest Descent method. It possesses a global convergence property combined with a quadratic rate of convergence, while using the same information and employing the same computational effort as the Newton method. Versions of SOSD with inexact linesearch and with no linesearch at all are studied as well; they retain the above desirable convergence properties, which are also demonstrated computationally.

Introduction

In this paper we introduce a new computational method for solving the unconstrained minimization problem

(A)   min { f(x) : x ∈ R^n }

for functions f which are twice continuously differentiable (f ∈ C^2). The classical methods for solving (A) are all based on the basic iteration step

x ← x + t d        (1)

where d ∈ R^n is a direction vector and t > 0 is the stepsize. The direction vector is obtained by minimizing an approximate model of f(x + td). Thus if a first order Taylor approximation is used,

f(x + td) ≈ f(x) + t d^T ∇f(x),

one obtains the Steepest Descent (SD) direction

d = -g/||g||,    (g = ∇f(x)).

If a second order Taylor expansion is used,

f(x + td) ≈ f(x) + t d^T ∇f(x) + (1/2) t^2 d^T ∇^2 f(x) d,

then the Newton direction is obtained,

d_N = -H^{-1} g,    (H = ∇^2 f(x)),

or (if a bound is imposed on the length of d) a direction as in a Trust Region method,

d = -(H + λI)^{-1} g,

emerges (see e.g. [5], [6]). In our method the basic iteration step is

x ← x + t d + (1/2) t^2 z        (2)

with two "direction vectors" d ∈ R^n, z ∈ R^n and a stepsize t > 0. Thus the improving step

is along a quadratic curve in R^n, rather than along a straight line. The idea of an improving step along a quadratic curve was used theoretically by the first author (see e.g. [2], [1], [3] and [4]) to obtain second order necessary optimality conditions for smooth and nonsmooth optimization problems. Here we study the computational implications of the basic iteration (2). The direction vectors d and z are obtained by minimizing an approximate model of f(x + td + (1/2)t^2 z), which is in fact a second order Taylor expansion of the latter (as a function of t):

f(x + td + (1/2)t^2 z) ≈ f(x) + t d^T ∇f(x) + (1/2)t^2 [ z^T ∇f(x) + d^T ∇^2 f(x) d ].        (3)

As in the derivation of the steepest descent or the trust region directions, certain constraints are imposed on d, z. More specifically we consider the following model (recall g = ∇f(x), H = ∇^2 f(x)):

(M)   min  Δf(t,d,z) := t g^T d + (1/2)t^2 [ z^T g + d^T H d ]
      subject to  ||z|| ≤ α,
                  g^T d ≤ -β ||g||.

The first constraint just normalizes the length of z. The second constraint forces the direction d to make a sharp angle with the steepest descent direction; this is a typical requirement in many convergent descent methods. The positive scalar β controls this angle and the positive scalar α controls the length of z. The objective function Δf in (M) is (omitting the constant f(x)) the right-hand side of (3). Although we minimize Δf also with respect to t, this is only for deriving the optimal solutions d̄ and z̄ of (M). In the algorithm, t will be chosen by minimizing the true objective f(x + td̄ + (1/2)t^2 z̄) and not its approximation Δf(t,d̄,z̄). The directions d̄ and z̄ are obtained in Section 1; they are

z̄ = -α g/||g||,    d̄ = -β (||g|| / (g^T H^{-1} g)) H^{-1} g.

Thus, z̄ is the steepest descent direction and d̄ is a "signed Newton direction", i.e. it is the Newton direction or its opposite according to the sign of g^T H^{-1} g being positive or negative. At any rate (in sharp contrast to the Newton direction d_N), d̄ is a direction of descent whenever g ≠ 0, H^{-1} exists and g^T H^{-1} g ≠ 0 (we call this the nonsingular case). Based on the directions d̄, z̄, the method studied in this paper iterates from a current iteration point x_k to the next point via

(SOSD)   x_{k+1} = x_k + t_k d_k + (1/2) t_k^2 z_k,    t_k = a suitable stepsize,

         d_k = -β_k (||g_k|| / (g_k^T H_k^{-1} g_k)) H_k^{-1} g_k,    z_k = -α_k g_k / ||g_k||,

where g_k = ∇f(x_k), H_k = ∇^2 f(x_k). Here {α_k} and {β_k} are sequences of positive scalars, bounded away from zero, which may be fixed, predetermined or chosen iteratively. The name SOSD is an abbreviation for Second Order Steepest Descent. Indeed the method (SOSD) is a natural extension of the steepest descent method (SD). This is demonstrated in Table 1 below. The quadratic rate of convergence of the SOSD method mentioned in Table 1 will be proved in Section 4 for a general C^2-function, following a proof for the special case where f is a convex quadratic function

f(x) = (1/2) x^T Q x - b^T x,    (Q symmetric positive definite).

These results are preceded by Section 2, in which the global convergence of the SOSD

Table 1: Comparison of the SD and SOSD methods

                          SD                         SOSD
data used                 first order (gradients)    second order (gradients and Hessians)
improving step along      1st degree polynomial      2nd degree polynomial
rate of convergence       linear                     quadratic

method is demonstrated. The above mentioned results are for exact linesearch:

t_k = arg min_{t>0} f(x_k + t d_k + (1/2) t^2 z_k).

Similar global convergence and rate of convergence results are obtained in Section 5 for versions of SOSD where the stepsize is chosen according to an Armijo-Goldstein type rule. In the quadratic case we avoid the need to compute the stepsize by fixing the stepsize t_k and then choosing α_k so as to make t_k an exact linesearch step. This version, which is also extended to the general C^2-case, is called "the α-method". It also possesses a quadratic rate of convergence, although not necessarily global convergence. Like the pure Newton method it does not require stepsize computations. For the single variable case it reduces exactly to Newton's method, but for n > 1 it is quite different, and in our computational tests it exhibited global convergence in all the examples for which Newton's method failed. At this point we offer an intuitive explanation as to why one should expect a method such as SOSD to be globally convergent at a quadratic rate. Note that in the iterative step

x_{k+1} = x_k + t_k d_k + (1/2) t_k^2 z_k,

when x_k is still far from optimality and a large stepsize t_k is expected, then z_k (an SD direction) will be dominant (forcing global convergence); while close to optimality, where H(x_k) is expected to be positive definite and t_k is close to zero, the vector d_k (a Newton direction) is the dominant one (forcing a quadratic rate of convergence). We would like to emphasize that the SOSD method uses the same data as the Newton method, and requires the same computational effort: evaluating the gradient and Hessian, and solving a linear system to obtain the direction d_k, plus a linesearch computation. But SOSD is globally and quadratically convergent. These properties are shared by Trust Region methods, which we believe are more complicated and less natural; while they try to "correct" the Newton method, we try to extend (to second order elements) the Steepest Descent method. Thus we put forward the idea that the SOSD method, and not Newton's method, should serve as the fundamental algorithm from which first order modifications (such as quasi-Newton methods) should be obtained, and extensions to constrained problems made. The limited computational experience, reported in Section 6, indeed supports the fact that the SOSD method performs as well as Newton's when the latter converges, succeeds in solving the test problems from all starting points from which the latter failed to converge, and still exhibits a quadratic rate of convergence.
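To make the scheme concrete, here is a minimal Python sketch of the SOSD iteration (2) with exact linesearch, as we read it from (SOSD) above; the bounded numerical scalar search, its interval, the tolerances and the zero-direction fallback in the singular case are our own illustrative assumptions and not part of the original algorithm.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def sosd_e(f, grad, hess, x0, alpha=1.0, beta=1.0, tol=1e-10, max_iter=100):
    """Sketch of SOSD with exact linesearch (SOSD-E).

    f, grad, hess : callables returning f(x), the gradient g and the Hessian H.
    alpha, beta   : the positive scalars controlling ||z|| and the angle of d.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        gnorm = np.linalg.norm(g)
        if gnorm < tol:
            break
        H = hess(x)
        Hinv_g = np.linalg.solve(H, g)            # H^{-1} g
        h = g @ Hinv_g                            # g^T H^{-1} g
        if abs(h) < 1e-14:
            d = np.zeros_like(x)                  # singular case: fall back to a plain SD step
        else:
            d = -beta * gnorm / h * Hinv_g        # signed Newton direction
        z = -alpha * g / gnorm                    # steepest descent direction
        # exact linesearch along the quadratic curve x + t d + (1/2) t^2 z
        phi = lambda t: f(x + t * d + 0.5 * t**2 * z)
        t = minimize_scalar(phi, bounds=(0.0, 1e3), method="bounded").x
        x = x + t * d + 0.5 * t**2 * z
    return x
```

Any other one-dimensional minimizer can be substituted for the bounded scalar search without changing the structure of the iteration.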

§1. DERIVATION OF THE DIRECTION VECTORS d̄ AND z̄

We solve the model problem (M) first under the nonsingularity assumption

g ≠ 0,  H nonsingular,  g^T H^{-1} g ≠ 0.        (4)

It is easy to see that for fixed but arbitrary d ∈ R^n and t > 0, the optimal z is z̄ = -α g/||g||, and we are left with the problem

(P)   min_{0 < t ≤ T}  min_d  Δf(t,d,z̄) = t g^T d + (1/2)t^2 [ d^T H d - α||g|| ]
      s.t.  g^T d + β||g|| ≤ 0.        (5)

The upper bound T imposed here will be dropped later. Let λ be the multiplier corresponding to the constraint (5). The Kuhn-Tucker conditions for the optimality of d, t are

(i)    g + t H d + λ g = 0
(ii)   g^T d + t ( d^T H d - α||g|| ) ≤ 0
(iii)  g^T d ≤ -β||g||
(iv)   0 < t ≤ T
(v)    t < T  ⟹  equality holds in (ii)
(vi)   λ [ g^T d + β||g|| ] = 0
(vii)  λ ≥ 0

We consider first the case

g^T H^{-1} g > 0        (6)

and define

h := g^T H^{-1} g.

If

h / (β||g||) < T        (7)

then the solution of (i)-(vii) is

d̄ = -(β||g|| / h) H^{-1} g,
t̄ = βh / (β^2||g|| - αh)   if  0 < βh / (β^2||g|| - αh) ≤ T,   and   t̄ = T otherwise,
λ = β||g|| t̄ / h - 1.

If

h / (β||g||) ≥ T        (8)

then the solution is

d̄ = -(1/T) H^{-1} g,   t̄ = T,   λ = 0.

We now drop the bound on the stepsize by letting T → ∞. Then only case (7) is relevant, giving the direction

d̄ = -β (||g|| / (g^T H^{-1} g)) H^{-1} g.        (9)

Next consider the case

g^T H^{-1} g < 0.        (10)

Multiplying (i) by the row vector g^T H^{-1} we get g^T H^{-1} g + t g^T d + λ g^T H^{-1} g = 0, implying t g^T d = -(1 + λ) g^T H^{-1} g > 0 by (10), contradicting t > 0 and (iii). Thus no

solution exists to the Kuhn-Tucker conditions under (10), indicating an unbounded solution of (P). Indeed, set

d_M := -M (||g|| / (g^T H^{-1} g)) H^{-1} g,    M > 0;

the value of the objective function in (P) is

Δf(t, d_M, z̄) = -t||g|| M + (1/2)t^2 (||g||^2 / (g^T H^{-1} g)) M^2 - (1/2)t^2 α||g||.

For all t > 0, the coefficients of both M and M^2 are negative, and therefore Δf can be made to approach -∞ by letting M approach +∞. Therefore {d_M, M → ∞} is an infimizing sequence. For any M > 0, d_M has the same direction as d̄ in (9). If a common bound on the length of both d̄ and d_M is imposed, then actually d_M = d̄. We conclude that whether g^T H^{-1} g is positive or negative, the directions obtained from the model problem (M) are

z̄ = -α g/||g||,    d̄ = -β (||g|| / (g^T H^{-1} g)) H^{-1} g.

An important property of the directions d̄, z̄ is that

z̄^T d̄ = αβ > 0,        (11)

hence the angle between them is smaller than 90°. This ensures that a movement from x along the curve x(t) = x + td̄ + (1/2)t^2 z̄, t > 0, will not bring us back to x. Moreover, (11) implies that ||x(t) - x|| is monotonely increasing for t > 0, an essential

property for many convergent algorithms; see also the convergence proof in the next section. The length of d̄ is bounded below by β; indeed

||d̄|| = β ||g|| ||H^{-1} g|| / |g^T H^{-1} g| ≥ β

by the Cauchy-Schwarz inequality. If H is positive definite, with eigenvalues 0 < λ_1 ≤ λ_2 ≤ ... ≤ λ_n, an upper bound can be obtained as follows: let H^{1/2} denote the root matrix of H and H^{-1/2} its inverse, and set x = H^{-1/2} g; then

||d̄||^2 = β^2 (x^T H x)(x^T H^{-1} x) / (x^T x)^2 ≤ β^2 (λ_1 + λ_n)^2 / (4 λ_1 λ_n)

by the Kantorovich inequality (see e.g. [7]). So, for H positive definite,

β ≤ ||d̄|| ≤ β (λ_1 + λ_n) / (2 √(λ_1 λ_n)),        (12)

and we see that β essentially controls the length of d̄. We close this section with a few remarks concerning the singular cases g = 0 or g^T H^{-1} g = 0. If g = 0 (but H is not positive semidefinite, for otherwise we are at a point satisfying the second order necessary conditions) then the model problem (M) reduces to minimizing d^T H d. Adding a bound on ||d||, the solution of min { d^T H d : ||d|| ≤ D } is a direction of negative curvature. In fact the optimal d is

d̄ = D v_1 / ||v_1||,

where v_1 is the eigenvector corresponding to the minimal (negative) eigenvalue of H. So, for g = 0, the basic iteration (2) becomes x ← x + t d̄. Consider now the case g ≠ 0 but g^T H^{-1} g = 0: the Kuhn-Tucker conditions for problem (P) are inconsistent for any β > 0. Thus for this case an appropriate model is

min_d { Δf(t, d, z̄) : g^T d ≤ 0 },

the optimal solution of which is d = -λ H^{-1} g, λ ≥ 0 arbitrary. We might as well take λ = 0, in which case d = 0 and the iteration step (2) becomes just the SD step. Indeed, if SD steps are taken in the algorithm SOSD whenever g^T H^{-1} g = 0 (or H is singular), then this will not affect either the global convergence or the rate of convergence for problems with local minima satisfying second order sufficient conditions. Henceforth in the paper we consider only the nonsingular cases.
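The following short sketch (our own illustration; the tolerance eps and the bound D are arbitrary choices) shows how the direction pair could be computed in practice, including the two singular fallbacks just discussed.

```python
import numpy as np

def sosd_directions(g, H, alpha, beta, D=1.0, eps=1e-12):
    """Direction pair (d, z) for one SOSD step, covering the singular cases."""
    gnorm = np.linalg.norm(g)
    if gnorm < eps:
        # g = 0: move along a direction of negative curvature, if one exists
        eigvals, eigvecs = np.linalg.eigh(H)      # eigenvalues in ascending order
        if eigvals[0] < 0:
            v1 = eigvecs[:, 0]
            return D * v1 / np.linalg.norm(v1), np.zeros_like(g)
        return np.zeros_like(g), np.zeros_like(g) # second order necessary conditions hold
    z = -alpha * g / gnorm
    try:
        Hinv_g = np.linalg.solve(H, g)
    except np.linalg.LinAlgError:
        return np.zeros_like(g), z                # H singular: plain SD step
    h = g @ Hinv_g
    if abs(h) < eps:
        return np.zeros_like(g), z                # g^T H^{-1} g = 0: plain SD step
    d = -beta * gnorm / h * Hinv_g                # signed Newton direction
    return d, z
```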

§2. GLOBAL CONVERGENCE OF THE SOSD METHOD (EXACT LINESEARCH)

In this section we study the convergence properties of the SOSD method with exact linesearch, i.e.

t_k = arg min_{t>0} f(x_k + t d_k + (1/2) t^2 z_k),

as applied to the minimization of a C^2-function f: R^n → R. This version will be referred to as the SOSD-E method. We will be using the framework of the Global Convergence Theorem as developed by Zangwill [10] and described in [7, Section 6.5]. The SOSD-E method generates a sequence {x_k}_0^∞ by the rule

x_{k+1} ∈ A(x_k)    (x_0 given initially),

where A is the composite point-to-set mapping from R^n into 2^{R^n}: A = SG. Here G is the function mapping R^n into R^n × R^n × R^n, G(x) = (x, z(x), d(x)), where

z(x) := -α g(x)/||g(x)||,    d(x) := -β (||g(x)|| / (g(x)^T H(x)^{-1} g(x))) H(x)^{-1} g(x),

and S is the point-to-set mapping S: R^n × R^n × R^n → 2^{R^n},

S(x,z,d) = { y = x + td + (1/2)t^2 z : t = arg min_{t>0} f(x + td + (1/2)t^2 z) }.

Let Γ denote the solution set Γ = { x ∈ R^n : g(x) = 0 }. For the function G to be well defined (outside Γ) we assume henceforth the nonsingularity condition:

H(x)^{-1} exists and g(x) ≠ 0  ⟹  g(x)^T H(x)^{-1} g(x) ≠ 0.        (2.1)

The implication in (2.1) holds if e.g. |y^T H^{-1} y| ≥ M||y||^p for some M > 0 and p > 0. The main result of this section follows.

Theorem 1 [Global Convergence of the SOSD-E Method]. Let f: R^n → R be a C^2-function, and x_0 ∈ R^n a given starting point for the sequence {x_k} generated by the SOSD-E algorithm. Assume that the nonsingularity condition (2.1) holds and that the level set L_0 = { x : f(x) ≤ f(x_0) } is compact. Then any limit point x* of a convergent subsequence of {x_k}_0^∞ is a solution, i.e. g(x*) = 0.

Proof. We refer to the Global Convergence Theorem (GCT) [7, p. 124]; accordingly, three conditions must hold:
(i) x_k ∈ some compact set S, for all k = 0,1,...
(ii) there exists a descent function Z: R^n → R (see [7, p. 122])
(iii) the mapping A is closed outside Γ.
Since both d(x) and z(x) are descent directions it follows that f(x_{k+1}) ≤ f(x_k), with strict inequality for x_k ∉ Γ. Thus (i) holds with S = L_0 and (ii) holds with Z(x) = f(x). To prove the validity of (iii), i.e. the closedness of A = SG, it suffices to show that G is continuous and S is closed on the range R(G) of G. The continuity of G follows from f ∈ C^2 and (2.1). To prove the closedness of S on R(G), note first that for any triple (x,z,d) ∈ R(G): z^T d > 0 (see Section 1). Fix a triple (x,z,d) ∈ R(G) and let (x_k, z_k, d_k) ∈ R(G), with x_k → x, z_k → z, d_k → d, and let y_k ∈ S(x_k, z_k, d_k), y_k → y; we need to verify y ∈ S(x,z,d). Now,

y_k = x_k + t_k d_k + (1/2) t_k^2 z_k,    t_k = arg min_{t≥0} f(x_k + t d_k + (1/2) t^2 z_k),        (2.3)

hence

||y_k - x_k||^2 = ||t_k d_k + (1/2) t_k^2 z_k||^2,

or

t_k^2 ||d_k||^2 + t_k^3 z_k^T d_k + (1/4) t_k^4 ||z_k||^2 - ||y_k - x_k||^2 = 0.        (2.4)

The function on the left-hand side of (2.4) is a polynomial in t_k which (for large k) has positive coefficients (here we use z_k^T d_k > 0), thus it is strictly monotonely increasing for t_k ≥ 0. Since it has a nonpositive value at t_k = 0, there is a unique nonnegative t_k = φ(x_k, y_k, z_k, d_k) solving (2.4), and φ is continuous. So, when k → ∞, t_k → φ(x,y,z,d) =: t̄ ≥ 0. Now, y = lim y_k = x + t̄ d + (1/2) t̄^2 z. For all t ≥ 0: f(y_k) ≤ f(x_k + t d_k + (1/2)t^2 z_k), by (2.3). Letting k → ∞,

f(y) ≤ f(x + td + (1/2)t^2 z)    for all t ≥ 0,

hence

f(y) ≤ min_{t≥0} f(x + td + (1/2)t^2 z) ≤ f(x + t̄ d + (1/2) t̄^2 z) = f(y),

showing that y ∈ S(x,z,d).

Remark 1. As explained at the end of Section 1, we can take a steepest descent step at any point x violating the nonsingularity condition. The modified algorithm will still converge globally by the Spacer Step Theorem (see Section 7.9 in [7]).

§3. THE SOSD-E ALGORITHM FOR A QUADRATIC FUNCTION

In this section we shall prove the quadratic rate of convergence of the SOSD-E algorithm for quadratic functions. In addition, we will propose a version of the method not requiring a linesearch, which will be called "the α-method". Consider the quadratic problem

min_{x∈R^n} (1/2) x^T Q x - b^T x,

where Q is a positive definite (P.D.) symmetric n × n matrix and b ∈ R^n. We denote the solution of the problem by x*, i.e. Qx* = b; it is clear then that the above problem is equivalent to

(QP)   min_{x∈R^n} f(x) = (1/2)(x - x*)^T Q (x - x*),

and it is this problem which we shall consider in the remainder of this section. Let the eigenvalues {λ_i} of Q be ordered as 0 < λ_1 ≤ λ_2 ≤ ... ≤ λ_n, and define r = λ_1/λ_n (the condition number of Q). Defining y = x - x*, we have for the gradient of f: g = Qy, and for the Hessian: H = Q. We recall that for any v ∈ R^n and p ∈ R:

0 < l ≤ v^T Q^p v / ||v||^2 ≤ u        (3.1)

for some real positive l and u, a fact we shall use frequently. Finally we define:

α̲ = lim inf α_k,    ᾱ = lim sup α_k,

and analogously β̲, β̄ for β_k. We begin by deriving expressions for the exact stepsize t_k (see eqs. (3.2) or (3.3) below) and for the quantity ||x_{k+1} - x*||^2 / ||x_k - x*||^4. A necessary condition for t_k to be the optimal stepsize is

0 = (d/dt) f(x_k + t d_k + (1/2)t^2 z_k) |_{t=t_k} = (d_k + t_k z_k)^T ∇f(x_k + t_k d_k + (1/2)t_k^2 z_k);

for our quadratic function this gives

(d_k + t_k z_k)^T Q (x_k + t_k d_k + (1/2)t_k^2 z_k - x*) = 0,

or

(1/2) z_k^T Q z_k t_k^3 + (3/2) d_k^T Q z_k t_k^2 + ( z_k^T Q (x_k - x*) + d_k^T Q d_k ) t_k + d_k^T Q (x_k - x*) = 0.

After substituting the expressions for d_k, z_k and using Qy = g, we obtain (omitting the index k):

(1/2) α^2 (g^T Q g / ||g||^2) t^3 + (3/2) αβ (||g||^2 / (g^T Q^{-1} g)) t^2 + ( β^2 ||g||^2 / (g^T Q^{-1} g) - α||g|| ) t - β||g|| = 0.        (3.2)

Defining

w_k = ||g_k||^2 / (g_k^T Q^{-1} g_k)    and    u_k = g_k^T Q g_k / ||g_k||^2,

note that by (3.1) the w_k's and u_k's are finite, positive and bounded away from zero. Eq. (3.2) can now be written as

(1/2) α_k^2 u_k t_k^3 + (3/2) α_k β_k w_k t_k^2 + ( β_k^2 w_k - α_k ||g_k|| ) t_k - β_k ||g_k|| = 0.        (3.3)

For the rate of convergence we need the following expression:

||x_{k+1} - x*||^2 / ||x_k - x*||^4 = ||x_k + t_k d_k + (1/2) t_k^2 z_k - x*||^2 / ||y_k||^4.

After substituting the values for d_k, z_k and omitting the index k, this becomes:

||x_{k+1} - x*||^2 / ||x_k - x*||^4
  = (α^2 t^4)/(4||y||^4) + (αβ t^3)/||y||^4 + β^2||g||^2 t^2/((g^T Q^{-1} g)^2 ||y||^2)
    - α (g^T Q^{-1} g) t^2/(||g|| ||y||^4) - 2β||g|| t/((g^T Q^{-1} g) ||y||^2) + 1/||y||^2.        (3.4)

This can be further written as:

||x_{k+1} - x*||^2 / ||x_k - x*||^4
  = (α^2 t^4)/(4||y||^4)
    + ( α (g^T Q^{-1} g) t^2/(||g|| ||y||^4) ) ( β||g|| t/(g^T Q^{-1} g) - 1 )
    + (1/||y||^2) ( β||g|| t/(g^T Q^{-1} g) - 1 )^2.        (3.5)

We are now ready to state the main theorem of this section.

Theorem 2 [Quadratic Convergence of SOSD-E for Quadratic Functions]. Let x_k → x*; then

lim sup_{k→∞} ||x_{k+1} - x*|| / ||x_k - x*||^2 ≤ (1/4) ᾱ/β̲^2.

Moreover, if 2√r/(1+r) ≥ √2/2, we have the sharper bound:

lim sup_{k→∞} ||x_{k+1} - x*|| / ||x_k - x*||^2 ≤ (1/2) (ᾱ/β̲^2) (2√r/(1+r)) ((1-r)/(1+r)).

The proof of the Theorem is preceded by Lemma 1 and a few calculations.

Lemma 1. If the sequence {x_k} converges to x*, then there exists k_0 > 0 such that for k > k_0:

t_k ≤ β_k ||g_k|| / ( β_k^2 (||g_k||^2 / (g_k^T Q^{-1} g_k)) - α_k ||g_k|| ),

where t_k is the unique positive root of eq. (3.2).

Proof. We have {x_k} → x* and therefore {g_k} → 0. This means that for k large enough, eq. (3.2) will be of the form

P(t) = At^3 + Bt^2 + Ct - D

with A, B, C, D > 0, corresponding to the coefficients in eq. (3.2). Now P(t) is a continuous increasing function, mapping [0, ∞) into [-D, ∞), hence there exists a unique positive root t_k. We shall denote by L(t) the linear part of P(t), i.e. L(t) = Ct - D. Since P(t) ≥ L(t) for t ≥ 0 and P(0) = L(0) = -D, we find that the root t_L of L(t) = 0 is larger than the root t_k of P(t) = 0; see Figure 2.

[Figure 2: the cubic P(t), its linear part L(t), and their roots t_k ≤ t_L.]

We obtain:

t_k ≤ t_L = D/C = β_k ||g_k|| / ( β_k^2 (||g_k||^2 / (g_k^T Q^{-1} g_k)) - α_k ||g_k|| ).

Note that since ||g_k||^2 / (g_k^T Q^{-1} g_k) is bounded, then

t_k = O(||g_k||),

the symbol O(a) meaning a quantity satisfying lim sup |O(a)/a| ≤ A < ∞. We further denote by t_k ~ γ_k the fact that t_k behaves asymptotically as γ_k, i.e. lim_{k→∞} (t_k/γ_k) = 1. Rearranging eq. (3.2) we have:

t_k = β_k||g_k|| / ( β_k^2 w_k - α_k||g_k|| + (3/2)α_k β_k w_k t_k + (1/2)α_k^2 u_k t_k^2 ).        (3.6)

Now, w_k and u_k are bounded and, assuming that x_k → x*, lim_{k→∞} ||g_k|| = lim_{k→∞} t_k = 0, and we conclude that

t_k ~ β_k||g_k|| / (β_k^2 w_k) = g_k^T Q^{-1} g_k / (β_k ||g_k||).        (3.7)

We shall use (3.7) in examining the following expression for large k, under the assumption that x_k → x*:

(1/||y_k||) ( β_k||g_k|| t_k / (g_k^T Q^{-1} g_k) - 1 ).        (3.8)

Using eq. (3.2), (3.8) can be written as:

- (α_k^2/(2β_k)) (g_k^T Q g_k / ||g_k||^3) (t_k^3/||y_k||) - (3/2) α_k (||g_k|| / (g_k^T Q^{-1} g_k)) (t_k^2/||y_k||) + (α_k/β_k)(t_k/||y_k||).        (3.9)

The first term in (3.9) will converge to zero for k → ∞, as t_k = O(||g_k||) and ||g_k|| = O(||y_k||). With the use of (3.7), the last two terms in (3.9) behave asymptotically as:

- (3/2) α_k (||g_k|| / (g_k^T Q^{-1} g_k)) (t_k^2/||y_k||) + (α_k/β_k)(t_k/||y_k||)
  ~ -(1/2) (α_k/β_k^2) (g_k^T Q^{-1} g_k) / (||g_k||·||y_k||).        (3.10)

The proof of Theorem 2 follows.

Proof of Theorem 2. Using in (3.5) the asymptotic behavior of t_k and of (1/||y_k||)(β_k||g_k|| t_k/(g_k^T Q^{-1} g_k) - 1), as given by (3.7) and (3.10) respectively, we obtain:

lim sup_{k→∞} ||x_{k+1} - x*||^2 / ||x_k - x*||^4
  = lim sup_{k→∞} { (α_k^2 t_k^4)/(4||y_k||^4)
      + ( α_k (g_k^T Q^{-1} g_k) t_k^2/(||g_k|| ||y_k||^4) ) ( β_k||g_k|| t_k/(g_k^T Q^{-1} g_k) - 1 )
      + (1/||y_k||^2) ( β_k||g_k|| t_k/(g_k^T Q^{-1} g_k) - 1 )^2 }
  = lim sup_{k→∞} (α_k^2/(4β_k^4)) { (g_k^T Q^{-1} g_k/(||g_k||·||y_k||))^2 - (g_k^T Q^{-1} g_k/(||g_k||·||y_k||))^4 }.

We define

s_k = g_k^T Q^{-1} g_k / (||g_k||·||y_k||).

Note that

s_k = y_k^T Q y_k / (||Q y_k||·||y_k||),    2√(λ_1 λ_n)/(λ_1 + λ_n) ≤ s_k ≤ 1.

Since max_s { v(s) := s^2 - s^4 } = 1/4 (see Figure 3), then

lim sup_{k→∞} ||x_{k+1} - x*|| / ||x_k - x*||^2 ≤ (1/4) ᾱ/β̲^2.

[Figure 3: the function v(s) = s^2 - s^4 on [0,1]; v attains its maximum 1/4 at s = √2/2 and is decreasing on [√2/2, 1].]

We see therefore that if 2√(λ_1λ_n)/(λ_1+λ_n) ≥ √2/2, then (see Figure 3)

v(s_k) ≤ v( 2√(λ_1λ_n)/(λ_1+λ_n) ) ≤ 1/4,

giving the sharper upper bound in Theorem 2.

We now present a version of the algorithm without linesearch, which will be called the α-method. Here we choose a suitable stepsize t_k and determine α_k so as to make t_k an exact stepsize. For convenience we use the parameter ρ_k = β_k/α_k. Substituting β_k = ρ_k α_k in (3.3) and dividing by α_k yields:

α_k = ( ||g_k|| t_k + ||g_k|| ρ_k ) / ( (1/2)u_k t_k^3 + (3/2)ρ_k w_k t_k^2 + ρ_k^2 w_k t_k ).        (3.11)

For α_k to be finite and bounded away from zero as x_k → x*, it is necessary that

lim inf_{k→∞} ||g_k||/t_k > 0,        (3.12)

lim sup_{k→∞} ||g_k||/t_k < ∞.        (3.13)

Possible choices for t_k satisfying (3.12), (3.13) include:

t_k = ||g_k||    or    t_k = g_k^T Q^{-1} g_k / (β_k ||g_k||),

the latter explained by (3.7). We can immediately state the following theorem.

Theorem 3 [Convergence Properties of the α-method]. The α-method, where α_k is determined by (3.11) and with t_k satisfying (3.12) and (3.13), converges to the solution of (QP) with a quadratic rate of convergence, as given by Theorem 2.

Proof. The proof is immediate from Theorem 1 and Theorem 2, since the α-method is a special case of the SOSD algorithm.
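In the quadratic case the exact stepsize need not be searched for numerically: by (3.3) it is the unique positive root of a cubic. The sketch below (our own illustration, with arbitrary parameter defaults) applies SOSD-E to (QP) in this way and monitors ||x_k - x*||, which by Theorem 2 should shrink quadratically.

```python
import numpy as np

def sosd_quadratic(Q, b, x0, alpha=1.0, beta=1.0, tol=1e-10, max_iter=50):
    """SOSD-E for f(x) = 1/2 x^T Q x - b^T x, Q symmetric positive definite.
    The exact stepsize is the unique positive root of the cubic (3.3)."""
    x = np.asarray(x0, dtype=float)
    x_star = np.linalg.solve(Q, b)                 # used only to monitor the error
    for k in range(max_iter):
        g = Q @ x - b
        gnorm = np.linalg.norm(g)
        if gnorm < tol:
            break
        Qinv_g = np.linalg.solve(Q, g)
        h = g @ Qinv_g                             # g^T Q^{-1} g > 0
        w = gnorm**2 / h                           # w_k
        u = (g @ (Q @ g)) / gnorm**2               # u_k
        d = -beta * gnorm / h * Qinv_g
        z = -alpha * g / gnorm
        # cubic (3.3): (1/2)a^2 u t^3 + (3/2)a b w t^2 + (b^2 w - a||g||) t - b||g|| = 0
        coeffs = [0.5 * alpha**2 * u,
                  1.5 * alpha * beta * w,
                  beta**2 * w - alpha * gnorm,
                  -beta * gnorm]
        roots = np.roots(coeffs)
        t = min(r.real for r in roots if abs(r.imag) < 1e-12 and r.real > 0)
        x = x + t * d + 0.5 * t**2 * z
        print(k, np.linalg.norm(x - x_star))       # error, expected to decrease quadratically
    return x
```

By Descartes' rule of signs the cubic always has exactly one positive root, so the root selection above is well defined.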

§4. THE SOSD-E ALGORITHM FOR A GENERAL C^2-FUNCTION

In this section we generalize the quadratic convergence result from the previous section to general C^2-functions. Throughout this section we consider the problem

(GP)   min_{x∈R^n} f(x).

Let x* be a strict local minimum of f, i.e. the second order sufficient optimality conditions

g* := ∇f(x*) = 0,    H* := ∇^2 f(x*) positive definite,

hold at x*. The generalization of Theorem 2 is obtained by showing that, for a C^2-function, the crucial expressions (such as the inequality in Lemma 1, or eq. (3.4)), which were obtained for a quadratic function, differ only by a term of magnitude O(||x_k - x*||). First we derive some auxiliary results.

Lemma 2: If
(1) f: D ⊂ R^n → R is C^2,
(2) there exists an open convex set D_0 in D, containing x*, such that

    ∃ R > 0:  ∀ x, y ∈ D_0:  ||H(x) - H(y)|| ≤ R ||x - y||,        (4.1)

    i.e. H is locally Lipschitz at x*,
(3) for all x ∈ D_0: H(x)^{-1} and H*^{-1} are bounded and positive definite,
then

|g - H*(x - x*)| = O(||x - x*||^2), i.e. g = H*(x - x*) + O(||x - x*||^2),        (4.2)
||H^{-1} - H*^{-1}|| = O(||x - x*||),        (4.3)
|g^T H g - g^T H* g| = O(||x - x*||^3),        (4.4)

|g^T H^{-1} g - g^T H*^{-1} g| = O(||x - x*||^3),        (4.5)
|g^T H^{-1} H* g - ||g||^2| = O(||x - x*||^3),        (4.6)
|g^T H^{-1} H* H^{-1} g - g^T H*^{-1} g| = O(||x - x*||^3),        (4.7)
|g^T H* y - ||g||^2| = O(||x - x*||^3),    y = x - x*.        (4.8)

Proof: We start with the Taylor expansion of g := ∇f(x): R^n → R^n around x*:

g = ∇f(x*) + H(x̃)(x - x*)    (x̃ = λx + (1 - λ)x* for some 0 ≤ λ ≤ 1).

Using ∇f(x*) = 0, this can be rewritten as g = H*(x - x*) + (H(x̃) - H*)(x - x*). Since ||x̃ - x*|| ≤ ||x - x*||, this yields

||g - H*(x - x*)|| ≤ ||H(x̃) - H*||·||x - x*|| ≤ R||x̃ - x*||·||x - x*|| ≤ R||x - x*||^2,

using (4.1). In addition we have, for any vector v ∈ R^n,

|v^T g - v^T H*(x - x*)| ≤ ||v|| R ||x - x*||^2,

and therefore

v^T g = v^T H*(x - x*) + O(||x - x*||^2).        (4.9)

To prove (4.3) let us look at:

||H^{-1} - H*^{-1}|| ≤ ||H^{-1}||·||H*^{-1}||·||H - H*|| ≤ ||H^{-1}||·||H*^{-1}||·R||x - x*||,

by (4.1). To prove (4.4) note that

|g^T H g - g^T H* g| = |g^T(H - H*)g| ≤ ||g||^2 R ||x - x*||.

But from (4.2): ||g|| ≤ ||H*||·||x - x*|| + O(||x - x*||^2), and therefore:

g^T H g - g^T H* g = O(||x - x*||^3).

The proof of expressions (4.5)-(4.8) is similar and hence omitted.

Lemma 3: Under the assumptions of Lemma 2, any sequence x_k → x* generated by a convergent descent method for (GP) has the following property:

∃ K_0 ≥ 0 such that for all k > K_0:  ||x_{k+1} - x*|| ≤ √(λ_n/λ_1) ||x_k - x*||,

where λ_1 and λ_n are the minimum and maximum eigenvalues of H*, respectively.

Proof: For any descent method we have f(x_{k+1}) ≤ f(x_k); using the Taylor expansion of f, the inequality can be rewritten as

f(x*) + g*^T(x_{k+1} - x*) + (1/2)(x_{k+1} - x*)^T H(y_1)(x_{k+1} - x*)
  ≤ f(x*) + g*^T(x_k - x*) + (1/2)(x_k - x*)^T H(y_2)(x_k - x*),

where y_1 is in the interval (x_{k+1}, x*) and y_2 is in the interval (x_k, x*). Recalling that g* = 0, this further reduces to

(x_{k+1} - x*)^T H(y_1)(x_{k+1} - x*) ≤ (x_k - x*)^T H(y_2)(x_k - x*).

Now, since x_k → x*, we have y_1 → x* and H(y_1) → H*, y_2 → x* and H(y_2) → H*. The proof is completed by recalling that for any vector v ∈ R^n:

λ_1 ||v||^2 ≤ v^T H* v ≤ λ_n ||v||^2,

in particular for v = x_{k+1} - x* and v = x_k - x*.

Let us now look at the necessary condition for the stepsize t_k to be exact for (GP):

(d_k + t z_k)^T ∇f(x_k + t d_k + (1/2)t^2 z_k) |_{t=t_k} = 0.

This yields, using the boundedness of d_k and z_k together with (4.9):

(d_k + t_k z_k)^T H*(x_k + t_k d_k + (1/2)t_k^2 z_k - x*) + t_k O(||x_{k+1} - x*||^2) + O(||x_{k+1} - x*||^2) = 0.

Using Lemma 3 we obtain

(d_k + t_k z_k)^T H*(x_k - x* + t_k d_k + (1/2)t_k^2 z_k) + t_k O(||x_k - x*||^2) + O(||x_k - x*||^2) = 0,

or

(1/2) z_k^T H* z_k t_k^3 + (3/2) d_k^T H* z_k t_k^2 + ( z_k^T H*(x_k - x*) + d_k^T H* d_k + O(||x_k - x*||^2) ) t_k + d_k^T H*(x_k - x*) + O(||x_k - x*||^2) = 0.        (4.10)

After substituting d_k and z_k we obtain, omitting the index k and denoting y = x - x*:

(1/2) α^2 (g^T H* g / ||g||^2) t^3 + (3/2) αβ (g^T H^{-1} H* g / (g^T H^{-1} g)) t^2
  + ( β^2 ||g||^2 (g^T H^{-1} H* H^{-1} g) / (g^T H^{-1} g)^2 - α (g^T H* y)/||g|| + O(||y||^2) ) t
  - β ||g|| (g^T H^{-1} H* y) / (g^T H^{-1} g) + O(||y||^2) = 0.

With the help of Lemma 2 each inner product involving H^{-1} above can be replaced by the corresponding one involving H*^{-1} at the cost of relative errors of order O(||y||), and the equation can finally be written as:

(1/2) α^2 (g^T H* g / ||g||^2) t^3 + (3/2) αβ (||g||^2 / (g^T H*^{-1} g)) (1 + O(||y||)) t^2
  + ( β^2 (||g||^2 / (g^T H*^{-1} g)) (1 + O(||y||)) - α||g|| (1 + O(||y||)) ) t
  - β||g|| (1 + O(||y||)) = 0.        (4.12)

Introducing the index k, we obtain for t_k, analogously to (3.6):

t_k = β_k||g_k|| (1 + O(||y_k||)) / ( β_k^2 (||g_k||^2/(g_k^T H*^{-1} g_k))(1 + O(||y_k||)) - α_k||g_k||(1 + O(||y_k||))
      + (3/2)α_k β_k (||g_k||^2/(g_k^T H*^{-1} g_k))(1 + O(||y_k||)) t_k + (1/2)α_k^2 (g_k^T H* g_k/||g_k||^2) t_k^2 ).        (4.13)

This shows that asymptotically (k → ∞) we may write:

t_k = τ_k (1 + O(||y_k||)),        (4.14)

where τ_k is the solution of eq. (3.2) with Q = H* and g = ∇f(x_k). This also means that, as in Section 3, t_k = O(||g_k||) = O(||y_k||). We now develop the expression for the rate of convergence, which will be the equivalent of (3.4) for a general C^2-function:

||x_{k+1} - x*||^2 / ||x_k - x*||^4
  = (1/4)(||z_k||^2/||y_k||^4) t_k^4 + (d_k^T z_k/||y_k||^4) t_k^3 + ((d_k^T d_k + y_k^T z_k)/||y_k||^4) t_k^2
    + (2 y_k^T d_k/||y_k||^4) t_k + 1/||y_k||^2.

Substituting d_k and z_k and using (4.4)-(4.8), this gives:

||x_{k+1} - x*||^2 / ||x_k - x*||^4
  = (α_k^2 t_k^4)/(4||y_k||^4) + (α_k β_k t_k^3)/||y_k||^4
    + ( β_k^2||g_k||^2 (1 + O(||y_k||)) / ((g_k^T H*^{-1} g_k)^2 ||y_k||^2) - α_k (g_k^T H*^{-1} g_k)(1 + O(||y_k||)) / (||g_k|| ||y_k||^4) ) t_k^2
    - 2β_k||g_k|| (1 + O(||y_k||)) t_k / ((g_k^T H*^{-1} g_k) ||y_k||^2) + 1/||y_k||^2.

Bearing in mind that 1/(1 + O(||y||)) = 1 + O(||y||), and omitting the index k, this can finally be written as:

||x_{k+1} - x*||^2 / ||x_k - x*||^4
  = ( (α^2 t^4)/(4||y||^4) + (αβ t^3)/||y||^4
      + ( β^2||g||^2/((g^T H*^{-1} g)^2||y||^2) - α(g^T H*^{-1} g)/(||g|| ||y||^4) ) t^2
      - 2β||g|| t/((g^T H*^{-1} g)||y||^2) + 1/||y||^2 ) (1 + O(||y||)).        (4.15)

We now formulate the main theorem of this section.

Theorem 4: [Quadratic Convergence Rate of SOSD-E for C^2-functions] Let f satisfy assumptions (1)-(3) of Lemma 2. Assume that the sequence {x_k}_0^∞ generated by the SOSD algorithm converges to a strict local minimum x* of (GP). Then:

lim sup_{k→∞} ||x_{k+1} - x*|| / ||x_k - x*||^2 ≤ (1/4) ᾱ/β̲^2.

Moreover, if 2√r*/(1+r*) ≥ √2/2, we have the sharper bound:

lim sup_{k→∞} ||x_{k+1} - x*|| / ||x_k - x*||^2 ≤ (1/2) (ᾱ/β̲^2) (2√r*/(1+r*)) ((1-r*)/(1+r*)).

Here r* = λ_1/λ_n is the condition number of H* (λ_1 and λ_n being the smallest and largest eigenvalues of H*).

Proof: For k large enough, x_k will be in a neighborhood D_0 of x* (see Lemma 2). Substituting t_k = τ_k(1 + O(||y_k||)) from (4.14) in (4.15) yields:

lim sup_{k→∞} ||x_{k+1} - x*||^2 / ||x_k - x*||^4
  = lim sup_{k→∞} { (α_k^2 τ_k^4)/(4||y_k||^4) + (α_k β_k τ_k^3)/||y_k||^4
      + ( β_k^2||g_k||^2/((g_k^T H*^{-1} g_k)^2||y_k||^2) - α_k(g_k^T H*^{-1} g_k)/(||g_k|| ||y_k||^4) ) τ_k^2
      - 2β_k||g_k|| τ_k/((g_k^T H*^{-1} g_k)||y_k||^2) + 1/||y_k||^2 } (1 + O(||y_k||)).

Since y_k → 0, 1 + O(||y_k||) → 1. Furthermore, the expression in the curly brackets is exactly as in the quadratic case with Q = H*, and the result of Theorem 2 applies.

Remark: An alternative way of proving Theorem 4 is to use (3.4) in [11] and to show that ||x_{k+1} - x_k + H_k^{-1} g_k|| / ||x_{k+1} - x_k|| → 0. This can be proven; however, we chose not to prove the theorem this way, since it does not provide us with the explicit bound in Theorem 4, in particular the effect of the condition number of H* on the rate of convergence.

§5. THE SOSD METHOD WITH INEXACT LINESEARCH

In this section we study the SOSD method with the stepsize computed via an Armijo-Goldstein type rule; this version is denoted by SOSD-I. We demonstrate that SOSD-I maintains the same convergence properties as the SOSD method with exact linesearch. We define (for g, d and z such that g^T d ≠ 0)

γ(t) = ( f(x + td + (1/2)t^2 z) - f(x) ) / ( t g^T d ).

At the k-th iteration, the SOSD-I method is then given by:

compute:   d_k = -β_k (||g_k|| / (g_k^T H_k^{-1} g_k)) H_k^{-1} g_k,    z_k = -α_k g_k / ||g_k||;

compute t_k such that

0 < σ ≤ γ(t_k) ≤ 1 - σ,        (5.1)

starting from the initial guess

t_k^0 = |g_k^T H_k^{-1} g_k| / (β_k ||g_k||);

update x_k:

x_{k+1} = x_k + t_k d_k + (1/2) t_k^2 z_k.

Here σ is a number satisfying 0 < σ < 1/2 (usually σ ≈ 10^{-4}). We now formulate and prove two theorems concerning SOSD-I:

Theorem 5 [Convergence of SOSD-I]. If f is a C^2-function and bounded below, then
(i) for all k such that g_k ≠ 0, there exists t_k satisfying (5.1);
(ii) if the sequence {x_k}_0^∞ generated by the SOSD-I algorithm converges, lim_{k→∞} x_k = x̄, then ∇f(x̄) = 0.

Proof: (i) (We omit the index k.) Suppose g^T d < 0 and consider the following Taylor expansion (0 ≤ θ ≤ 1):

f(x + td + (1/2)t^2 z) = f(x) + g^T(td + (1/2)t^2 z) + (1/2)t^2 (d + (1/2)tz)^T H( x + θ(td + (1/2)t^2 z) ) (d + (1/2)tz).

This gives:

γ(t) = 1 + t ( z^T g/(2g^T d) + (d + (1/2)tz)^T H( x + θ(td + (1/2)t^2 z) )(d + (1/2)tz)/(2g^T d) ).        (5.2)

Since d, z and H are bounded, it follows that γ(t) is continuous and

lim_{t→0} γ(t) = 1.        (5.3)

Now, since g^T d < 0 and f is bounded below, we can always find a t > 0 large enough such that f(x + td + (1/2)t^2 z) > f(x) + σ t g^T d. We can therefore always find a t such that γ(t) < σ. This, together with the continuity of γ and (5.3), proves (i).

(ii) To prove (ii), we distinguish between two cases.

Case (a): Suppose that the sequence {t_k} of stepsizes satisfies t_k ≥ t̄ for some t̄ > 0. In this case, the left hand side inequality in (5.1)

gives:

( f(x_{k+1}) - f(x_k) ) / ( t_k g_k^T d_k ) ≥ σ > 0,

or

-g_k^T d_k ≤ (1/(σ t̄)) ( f(x_k) - f(x_{k+1}) ).

Now, the sequence {f(x_k)} is descending and bounded below, and therefore f(x_k) - f(x_{k+1}) converges to zero; therefore so does -g_k^T d_k (= β_k||g_k||), which proves (ii) for Case (a).

Case (b): inf_k t_k = 0, in which case there is a subsequence {t_{ν_k}} of {t_k} converging to zero. The right hand inequality in (5.1) then yields (see (5.2)):

0 < -σ g_{ν_k}^T d_{ν_k} ≤ (t_{ν_k}/2) ( g_{ν_k}^T z_{ν_k} + (d_{ν_k} + (1/2)t_{ν_k} z_{ν_k})^T H( x_{ν_k} + θ(t_{ν_k} d_{ν_k} + (1/2)t_{ν_k}^2 z_{ν_k}) )(d_{ν_k} + (1/2)t_{ν_k} z_{ν_k}) ).

Now, as t_{ν_k} → 0, g_{ν_k}^T d_{ν_k} → 0, hence g_{ν_k} → 0.

Theorem 6: [Rate of Convergence of SOSD-I] If the sequence {x_k}, generated by the SOSD-I method, converges to a strict local minimum x* and if the function f satisfies the conditions of Lemma 2, then {x_k} converges to x* quadratically.

Proof: If we can prove that, for k large enough, the initial guess t_k^0 in the SOSD-I

algorithm will satisfy (5.1), then the proof follows; this is so because in §4 we showed that the exact stepsize t_k behaves asymptotically as t_k^0 (i.e. lim_{k→∞} t_k/t_k^0 = 1), and the proof of the quadratic rate of convergence was based on this fact. For k large enough, we also have that H_k will be positive definite, and we can therefore omit the absolute value sign in the expression for t_k^0. We now prove that t_k^0 satisfies (5.1) for large enough k. Indeed, substituting for z_k, d_k and t_k^0 in (5.2) and letting k → ∞, in which case t_k^0 → 0 and g_k → 0, yields

lim_{k→∞} γ(t_k^0) = 1/2,

and since 0 < σ < 1/2, condition (5.1) is satisfied for k large enough.
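One possible realization of the stepsize rule (5.1) is sketched below; the doubling/halving search is our own illustrative strategy, the paper only requires that the accepted t_k satisfy σ ≤ γ(t_k) ≤ 1 - σ starting from the initial guess t_k^0.

```python
import numpy as np

def gamma(f, x, d, z, t, fx, gTd):
    """gamma(t) = (f(x + t d + t^2 z / 2) - f(x)) / (t g^T d), cf. (5.1)-(5.2)."""
    return (f(x + t * d + 0.5 * t**2 * z) - fx) / (t * gTd)

def sosd_i_stepsize(f, x, g, d, z, t0, sigma=1e-4, max_trials=50):
    """Return t with sigma <= gamma(t) <= 1 - sigma, starting from the guess t0."""
    fx, gTd = f(x), g @ d          # g^T d < 0 for a descent direction d
    t = t0
    for _ in range(max_trials):
        gam = gamma(f, x, d, z, t, fx, gTd)
        if gam > 1.0 - sigma:      # step too small: gamma(t) -> 1 as t -> 0
            t *= 2.0
        elif gam < sigma:          # step too large: insufficient decrease in f
            t *= 0.5
        else:
            return t
    return t
```

In practice only a few trial values of t are needed; the computational experience in Section 6 reports fewer than two function evaluations per linesearch on average.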

The α-method of §3, which for a quadratic function is an exact linesearch version of SOSD, can be naturally extended to general C^2-functions; the result is a method free of linesearch:

α-method

choose:   sequences ρ_k and t_k (k = 0,1,2,...) such that
              ρ_k > 0,  lim_{k→∞} ρ_k > 0,
              t_k > 0,  lim_{k→∞} t_k = 0,
              lim_{k→∞} ||g_k||/t_k < ∞   (e.g. t_k = ||g_k||),
          and an initial point x_0 ∈ R^n.

compute:  u_k = g_k^T H_k g_k / ||g_k||^2,
          w_k = ||g_k||^2 / (g_k^T H_k^{-1} g_k),
          α_k = ||g_k|| (t_k + ρ_k) / ( (1/2)u_k t_k^3 + (3/2)ρ_k w_k t_k^2 + ρ_k^2 w_k t_k ),
          d_k = -ρ_k α_k (||g_k|| / (g_k^T H_k^{-1} g_k)) H_k^{-1} g_k,
          z_k = -α_k g_k / ||g_k||.

update:   x_{k+1} = x_k + t_k d_k + (1/2) t_k^2 z_k.
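A compact sketch of the α-method box above (our own rendering; the choices t_k = ||g_k|| and a fixed large ρ follow the examples given in the text):

```python
import numpy as np

def alpha_method(grad, hess, x0, rho=1e6, tol=1e-10, max_iter=100):
    """Linesearch-free 'alpha-method': t_k is fixed (here t_k = ||g_k||) and
    alpha_k is chosen so that t_k is an exact step for the quadratic model."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        gnorm = np.linalg.norm(g)
        if gnorm < tol:
            break
        H = hess(x)
        Hinv_g = np.linalg.solve(H, g)
        u = (g @ (H @ g)) / gnorm**2           # u_k = g^T H g / ||g||^2
        w = gnorm**2 / (g @ Hinv_g)            # w_k = ||g||^2 / (g^T H^{-1} g)
        t = gnorm                              # t_k = ||g_k||
        alpha = gnorm * (t + rho) / (0.5 * u * t**3 + 1.5 * rho * w * t**2 + rho**2 * w * t)
        d = -rho * alpha * gnorm / (g @ Hinv_g) * Hinv_g
        z = -alpha * g / gnorm
        x = x + t * d + 0.5 * t**2 * z
    return x
```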

Recall that α_k is chosen so as to make t_k the minimizer of the quadratic approximation to f at x_k. Thus it is not hard to verify that (4.14) holds. Also, by the choice of t_k, another crucial relation, t_k = O(||g_k||) = O(||y_k||), holds; thus it is possible to show that if x_k → x* then the convergence is at a quadratic rate. However, no global convergence is guaranteed for the α-method. It is interesting to note that for the single variable case (n = 1) the α-method is exactly the (pure) Newton method,

x_{k+1} = x_k - f'(x_k)/f''(x_k),

and this is true independently of the choice of the sequences ρ_k and t_k! The sequence ρ_k can be chosen fixed: ρ_k ≡ ρ. The parameter ρ then controls the "balance" between an SD step and a pure Newton step. To verify this, note that if ρ = 0 then d_k = 0, and hence we have an SD step. If ρ → ∞ then

β_k → ||g_k||/(w_k t_k),    t_k d_k → -H_k^{-1} g_k,

and we have a pure Newton step (regardless of the choice of t_k). In our numerical experiments it was found that a large ρ (≈ 10^6) gives good convergence results; even with such a large ρ, far from an optimal solution (where ||g_k|| is large) the effect of z_k is not negligible (since z_k is multiplied by t_k^2/2), and it may be sufficient to secure global convergence.

§6. COMPUTATIONAL RESULTS

In this section we report the computational results of the performance of the various SOSD algorithms as applied to the minimization of four classical test functions (see [8]). The results are then compared to the performance of three versions of Newton's method. The algorithms and test functions involved are:

Algorithms

N: The pure Newton method: x_{k+1} = x_k - H_k^{-1} g_k
N-E: The damped Newton method with exact linesearch
N-I: The damped Newton method with inexact linesearch according to the Goldstein rule as described in [9, Section 8.3.e], with σ = 10^{-4}
SOSD-E: The SOSD method with exact linesearch (Section 2)
SOSD-I: The SOSD method with inexact linesearch as described in Section 5, with σ = 10^{-4}
The α-method: See the last part of Section 5, with t_k = ||g_k||, ρ_k ≡ ρ

Test Functions

Rosenbrock's function (2 variables):

R(x) = 100(x_2 - x_1^2)^2 + (1 - x_1)^2

Wood's function (4 variables):

w(x_1,x_2,x_3,x_4) = 100(x_2 - x_1^2)^2 + (1 - x_1)^2 + 90(x_4 - x_3^2)^2 + (1 - x_3)^2
                     + 10.1((x_2 - 1)^2 + (x_4 - 1)^2) + 19.8(x_2 - 1)(x_4 - 1)

Extended Wood's function (20 variables):

W(x) = Σ_{i=1}^{5} w(x_{4i-3}, x_{4i-2}, x_{4i-1}, x_{4i})

Dixon's function (10 variables):

D_n(x) = (1 - x_1)^2 + (1 - x_n)^2 + Σ_{i=1}^{n-1} (x_i^2 - x_{i+1})^2    (n = 10)

The results are summarized in the following tables; each entry designates the number of iterations required to satisfy the stopping criterion

||x_k - x*|| ≤ 10^{-10}.

NC(j) means that no convergence was achieved after j iterations, whereas NC means that convergence was not obtained at all.
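For reference, the two smaller test functions and the stopping rule used in the tables can be coded as follows (a sketch; gradients and Hessians, which the algorithms also need, are omitted here and can be derived analytically):

```python
import numpy as np

def rosenbrock(x):
    """Rosenbrock's function, R(x) = 100(x2 - x1^2)^2 + (1 - x1)^2."""
    return 100.0 * (x[1] - x[0]**2)**2 + (1.0 - x[0])**2

def wood(x):
    """Wood's function of four variables."""
    return (100.0 * (x[1] - x[0]**2)**2 + (1.0 - x[0])**2
            + 90.0 * (x[3] - x[2]**2)**2 + (1.0 - x[2])**2
            + 10.1 * ((x[1] - 1.0)**2 + (x[3] - 1.0)**2)
            + 19.8 * (x[1] - 1.0) * (x[3] - 1.0))

def converged(x, x_star, tol=1e-10):
    """Stopping criterion ||x_k - x*|| <= 10^{-10} used in the tables."""
    return np.linalg.norm(np.asarray(x) - np.asarray(x_star)) <= tol
```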

Table 2 — FUNCTION: ROSENBROCK (entries are iteration counts; parameters used are shown alongside)

Starting points: (20,200), (-1.2,1), (10,10), (-25,50), (-25,-50).

SOSD-E:  α=1, β=1: 31;   α=1, β=1: 12;   α=2, β=4: 13;   α=1.7, β=2.89: 46;   α=1.5, β=2.25: 32
SOSD-I (α=1, β=1 throughout):  67;  21;  37;  56;  74
α-method:  ρ=10^6: 12;   ρ=10^6: 7;   ρ=1/2·10^6: 10;   ρ=1/2·10^6: 18;   ρ=1/2·10^6: 12
N, N-E, N-I (iteration counts):  8, 5, 6, 5, 5, 5, 45, 13, 27, 52, 52, 78, 21, 46, 85, 89

Table 3 — FUNCTION: WOOD

Starting points: (-3,-1,-3,-1), (0,2,0,2), (0.1,1.0,0.1,10), (200,-300,450,250), (-200,-300,-450,-250).

SOSD-E:  α=4, β=16;   α=1, β=1: 25;   α=5, β=25: 11;   α=10, β=100: 9;   α=9, β=81: 23
SOSD-I:  α=9, β=81: 32;   α=1, β=1: 19;   α=1, β=1: 10;   α=9, β=81: 45;   α=9, β=81
α-method (ρ=1/2·10^6 throughout):  30;  19;  18;  27;  32
N, N-E, N-I (iteration counts):  NC, NC, NC, NC, NC, NC, NC, NC, NC, 32, 25, 33, 17, 46, 38, 31, 43

Table 4 — FUNCTION: EXTENDED WOOD

Starting points: P1, P2, P3, P4, P5, where
P1 = (-3,-1,-3,-1,...), P2 = (-1,-2,-3,-4,...), P3 = (20,19,...,11,-11,-12,...,-20), P4 = (10,-20,30,-40,50,10,10,...,10,-50,40,...,-10).

SOSD-E:  α=5, β=25: 26;   α=5, β=50: 40;   α=10, β=100: 37;   α=5, β=25: 39;   α=5, β=50: 60
SOSD-I:  α=5, β=25: 37;   α=10, β=100: 17;   α=10, β=100: 16;   α=10, β=100: 25;   α=10, β=100: 36
α-method:  ρ=10^6: 28;   ρ=8·10^6: 53;   ρ=5·10^6: 39;   ρ=5·10^6: 16;   ρ=1/2·10^6: 20
N, N-E, N-I (iteration counts):  NC, NC, 49, 17, 24, NC, NC, NC, 16, NC, NC, NC, 44, 17, 26

Table 5 — FUNCTION: DIXON

Starting points: P1, P2, P3, P4, P5, where
P1 = (-3,-1,-3,...), P2 = (-1,-2,-3,-4,...), P3 = (-100,-100,1,1,-100,-100,1,1,...,-100,-100), P4 = (0,-10,0,-10,0,...), P5 = (100,200,300,400,-500,600,700,800,900,1000).

SOSD-E (α=10, β=100 throughout):  21;  21;  28;  22;  27
SOSD-I (α=10, β=100 throughout):  24;  25;  34;  27;  33
α-method:  ρ=5·10^6: 47;   ρ=5·10^6: 31;   ρ=1/2·10^6: 46;   ρ=1/2·10^6: 33;   ρ=1/2·10^6: 47
N, N-E, N-I (iteration counts):  218, 610, 418, NC(1000), 685, NC, NC, NC, NC, NC, NC, NC, NC, NC, NC

Remarks

(i) The parameters α and β = α^2 in SOSD-E and SOSD-I were chosen with α between 1 and 10. No attempt was made to choose the best values for the parameters, so that the results reflect the typical behavior of the methods.

(ii) In SOSD-I, fewer than two function evaluations were needed on average to perform the inexact linesearch.

Some general conclusions can be deduced from Tables 2-5. First, we point out that β should be chosen greater than one. This can be motivated by the coefficient ᾱ/β̲^2 in the rate of convergence result given in Theorem 4. Our experience so far suggests using β = α^2. As previously mentioned, for the α-method it was found that a large value of ρ, typically ρ = 10^6, gave good results. Here we remind the reader that the α-method does not require a linesearch and is therefore much more efficient in terms of function evaluations than the other methods. This was further corroborated by additional numerical experiments performed by J. Zowe and his students at the University of Bayreuth (private communication to the authors).

References

[1] Ben-Israel, A., Ben-Tal, A. and Zlobec, S., Optimality in Nonlinear Programming: A Feasible Directions Approach, Wiley-Interscience, New York, 1981.
[2] Ben-Tal, A., "Second Order and Related Extremality Conditions in Nonlinear Programming", J.O.T.A., 31 (1980).
[3] Ben-Tal, A. and Zowe, J., "Directional Derivatives in Nonsmooth Optimization", J.O.T.A., 47 (1985), 483-490.
[4] Ben-Tal, A. and Zowe, J., "A Unified Theory of First and Second Order Conditions for Extremum Problems in Topological Vector Spaces", Math. Prog. Study 19 (1982), 39-76.
[5] Dennis, J.E. and Schnabel, R.B., Numerical Methods for Unconstrained Optimization and Nonlinear Equations, Prentice-Hall, 1983.
[6] Fletcher, R., Practical Methods of Optimization, Vol. 1, Wiley, Chichester, 1980.
[7] Luenberger, D.G., Introduction to Linear and Nonlinear Programming, Addison-Wesley, Reading, Mass., 1973.
[8] Moré, J.J., Garbow, B.S. and Hillstrom, K.E., "Testing Unconstrained Optimization Software", ACM Trans. on Math. Software, 7 (1981), 17-41.
[9] Ortega, J.M. and Rheinboldt, W.C., Iterative Solution of Nonlinear Equations in Several Variables, Academic Press, New York, 1970.
[10] Zangwill, W.I., Nonlinear Programming: A Unified Approach, Prentice-Hall, Englewood Cliffs, N.J., 1969.
[11] Dennis, J.E. and Moré, J.J., "Quasi-Newton Methods, Motivation and Theory", SIAM Review, 19 (1977), 46-89.