
Some Methods for Solving Nonsmooth Convex Minimization Problems¹

J. R. BIRGE, Department of Industrial and Operations Engineering, University of Michigan, Ann Arbor, MI 48109-2117, USA
L. QI and Z. WEI, School of Mathematics, University of New South Wales, Sydney NSW 2052, Australia

January 30, 1996. Revised Technical Report 95-3.

Abstract. In this paper, we propose and analyze a class of methods for minimizing a proper lower semicontinuous extended-valued convex function f: R^n → R ∪ {+∞}. Instead of the original objective function f, we employ a convex approximation f_{k+1} at the k-th iteration. Some basic global convergence rate estimates are obtained according to the properties of the approximation sequence {f_k}_{k=0}^∞. We illustrate the applicability of our approach by proposing (i) a new family of proximal point algorithms which possesses the global convergence rate estimate f(x_k) − min_{x∈R^n} f(x) = O(1/(Σ_{j=0}^{k−1} √λ_j)²) even if the iteration points are calculated approximately, where {λ_k}_{k=0}^∞ are the proximal parameters, and (ii) a new bundle method which is globally convergent under some assumptions.

Key Words. Nonsmooth convex optimization, proximal point method, bundle algorithm, global convergence.

¹This material is based on work supported by the National Science Foundation under Award Number DDM-9215921 and the Australian Research Council.

1. Introduction

Consider the following optimization problem

    min{f(x): x ∈ R^n},                                                  (1.1)

where f: R^n → R ∪ {+∞} is a proper lower semicontinuous extended-valued convex function. The Moreau-Yosida approximation F_λ of f is defined by

    F_λ(x) = min{f(y) + (1/(2λ))‖y − x‖²: y ∈ R^n},

where λ is a real positive number. As proved by Moreau [33], F_λ is a differentiable convex function defined on the whole space R^n that possesses the same set of minimizers as the problem in (1.1). Using these properties, Martinet [31] presented a proximal point algorithm for solving (1.1): start from an initial point x_0 ∈ R^n and generate {x_k}_{k=0}^∞ by solving

    x_{k+1} = argmin{f(x) + (1/(2λ_k))‖x − x_k‖²: x ∈ R^n},              (1.2)

where {λ_k}_{k=0}^∞ is a sequence of positive numbers.
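To fix ideas, the following minimal sketch runs the iteration (1.2) for f(x) = ‖x‖₁, whose proximal step is the closed-form soft-thresholding operator; the test function, the parameters and the use of NumPy are illustrative assumptions, not part of the paper.

import numpy as np

def prox_l1(x, lam):
    # Proximal map of f(y) = ||y||_1: the unique minimizer of
    # ||y||_1 + ||y - x||^2 / (2 lam), i.e. soft-thresholding.
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def proximal_point(x0, lams):
    # Classical proximal point iteration (1.2).
    x = x0.copy()
    for lam in lams:
        x = prox_l1(x, lam)
    return x

print(proximal_point(np.array([3.0, -2.0, 0.5]), [1.0] * 10))  # tends to the minimizer 0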

Rockafellar made a major contribution to proximal point methods in [38] by proving, under some additional reasonable assumptions, the local superlinear convergence of the proximal point algorithm for finding a zero of an arbitrary maximal monotone operator even if the iteration points are calculated approximately, which is an important consideration in practice. As an application, Rockafellar applied the results to a lower semicontinuous proper convex function f. In this case, the two general criteria for generating x_{k+1} are defined by

    dist(0, S_k(x_{k+1})) ≤ σ_{1,k}/λ_k,   Σ_{k=0}^∞ σ_{1,k} < ∞          (1.3)

and

    dist(0, S_k(x_{k+1})) ≤ (σ_{2,k}/λ_k)‖x_{k+1} − x_k‖,   Σ_{k=0}^∞ σ_{2,k} < ∞,    (1.4)

where

    S_k(x) = ∂f(x) + (1/λ_k)(x − x_k).                                   (1.5)

For a survey of recent convergence results of the proximal point algorithm we refer the reader to [?,?,?,?,?,?,?,?,?].

Güler presented in [16] two different proximal point algorithms which use an idea introduced by Nesterov [35] for smooth convex minimization. The main difference from the classical proximal point algorithm (1.2)-(1.5) is that the methods given in [16] generate an additional sequence {y_k}_{k=0}^∞ of points in R^n and calculate x_{k+1} from

    x_{k+1} = argmin{f(x) + (1/(2λ_k))‖x − y_k‖²: x ∈ R^n}.              (1.6)

Güler also showed that the minimization in (1.6) can be performed inexactly by a modification of (1.3), i.e.,

    dist(0, ∂f(x_{k+1}) + (1/λ_k)(x_{k+1} − y_k)) ≤ σ_{3,k}/λ_k,          (1.7)

where σ_{3,k} = O(a^{−k}) for some a > 1.

Lemaréchal combined the proximal point method with the bundle method in his pioneering work [26] (see also [?], [?] and [?]). In Lemaréchal's algorithm, a sequence {x_k}_{k=0}^∞ is generated by a sequence of convex functions {f_k}_{k=0}^∞. More precisely,

    x_{k+1} = argmin{f_k(x) + (1/(2λ_k))‖x − x_k‖²: x ∈ R^n},            (1.8)

where f_k is a bundle linearization function of f. The power of the bundle methods has yielded many results (see [?,?,?,?,?,?,?,?,?]).

In this paper, we study procedures that permit the solution of (1.1) via the construction of a sequence of objective function approximations, {f_k}_{k=0}^∞. Such approximations are necessary in many optimization problems (for example, stochastic programming, see [?,?,?,?,?,?,?,?]) where the objective functions are too complex for exact evaluation. Within the context of stochastic programming, the objective function involves the expected value

    f(x) = E[F(x, ω)] = ∫_Ω F(x, ω) P(dω),                               (1.9)

where ω is a random vector defined on a probability space (Ω, A, P). Thus, the precise evaluation of f and its subgradients involves multidimensional integration.

To avoid the computational burden associated with this evaluation, it is customary to replace the objective function f with a sequence of approximations {f_k}_{k=0}^∞; see [?,?,?,?,?,?,?,?].

The remainder of the paper is organized as follows. In the next section, we describe the model algorithm and give some basic global convergence rate estimates. As the first important application, in Section 3, we present a family of proximal point algorithms which calculate x_{k+1} with u_{k+1} ∈ ∂f(x_{k+1}) by the following form

    ‖u_{k+1} + (1/λ_k)(x_{k+1} − y_k)‖ ≤ σ_{4,k}‖u_{k+1}‖ + (σ_{5,k}/λ_k)‖x_{k+1} − y_k‖,   (1.10)

where σ_{4,k} ∈ [0, 1] and σ_{5,k} ∈ [0, 1). In particular, we obtain, under the same conditions, the following global convergence rate estimate obtained by Güler [16] (with exact minimization (1.6)):

    f(x_k) − min_{x∈R^n} f(x) = O(1/(Σ_{j=0}^{k−1} √λ_j)²).              (1.11)

We note that (i) we obtain the same convergence results using the inexact minimization (1.10) (for example, for all k, σ_{4,k} = 0 and σ_{5,k} ∈ [0, √5 − 2], or σ_{4,k} ∈ [0, 1/6] and σ_{5,k} = 0), and (ii) the convergence rate (1.11) for algorithm (1.10) is higher than that obtained for (1.7) in [16]. Moreover, some applications of the results in stochastic programming are discussed in the same section.

As another application of the results in Section 2, in Section 4, we present a new bundle method which calculates x_{k+1} from

    x_{k+1} = argmin{f_{k+1}(x) + (1/(2λ_k))‖x − y_k‖²: x ∈ R^n}.        (1.12)

The proposed new bundle method may have a better convergence rate than the original bundle methods, since the convergence rate (1.11) obtained for algorithm (1.6) is higher than

    f(x_k) − min_{x∈R^n} f(x) = O(1/ Σ_{j=0}^{k−1} λ_j),                 (1.13)

which is obtained for (1.2) in [15]. Under some additional assumptions, we establish the global convergence of the proposed bundle method.

2. The model of the methods

We begin with a brief description of the main idea of the methods in [16], which is in turn suggested by Nesterov [35] for smooth convex minimization problems. The idea of the algorithms in [16] is to generate recursively a sequence {φ_k}_{k=0}^∞ of simple convex quadratic functions that approximate f in such a way that at step k ≥ 0, for all x ∈ R^n,

    φ_{k+1}(x) − f(x) ≤ (1 − α_k)(φ_k(x) − f(x)),                        (2.1)

where α_k is a number in the interval [0, 1). If (2.1) is satisfied for each k ≥ 0, then

    φ_k(x) − f(x) ≤ (∏_{i=0}^{k−1}(1 − α_i))(φ_0(x) − f(x)).             (2.2)

If, at step k, we have at hand a point x_k such that

    f(x_k) ≤ φ_k^* := min{φ_k(x): x ∈ R^n},                              (2.3)

then we obtain from (2.2) that

    f(x_k) − f(x) ≤ (∏_{i=0}^{k−1}(1 − α_i))(φ_0(x) − f(x)),             (2.4)

which implies that if ∏_{i=0}^{k−1}(1 − α_i) → 0, then {x_k}_{k=0}^∞ is a minimizing sequence for f.

The aim of this section is to extend the above idea to a sequence of proper lower semicontinuous extended-valued convex functions, {f_k}. Since f is too complex for exact evaluation in some applications, in each iteration we assume that there is a simple convex function f_k at hand, which is viewed as an approximation of f at step k. Denote

    X = {x: x ∈ R^n, f(x) < +∞}.

Our asymptotic analysis is based on the following assumption:

(A1): for any x ∈ X and all k ≥ 0, f_k(x) < +∞.

Let x_0 ∈ R^n, {x_k} ⊂ R^n, u_{k+1} ∈ ∂f_{k+1}(x_{k+1}) and a constant a > 0 be given. For given α_k ∈ [0, 1), we define

    φ_0(x) = f_0(x_0) + (a/2)‖x − x_0‖²,
    φ_{k+1}(x) = (1 − α_k)φ_k(x) + α_k[f_{k+1}(x_{k+1}) + u_{k+1}^T(x − x_{k+1})] − (1 − α_k)[f_k(x_k) − f_{k+1}(x_k)].

The following two lemmas can be viewed as slight extensions of the related results in [16]. The proofs follow closely those of [16].

Lemma 2.1 For all k, the quadratic functions φ_k(x) satisfy the following inequality:

    φ_{k+1}(x) − f_{k+1}(x) ≤ (1 − α_k)[φ_k(x) − f_k(x)] + (1 − α_k)δ_{k+1}(x),    (2.5)

where δ_{k+1}(x) = f_k(x) − f_{k+1}(x) + f_{k+1}(x_k) − f_k(x_k).

Proof: From the definition of φ_{k+1}, we have

    φ_{k+1}(x) − f_{k+1}(x)
      = (1 − α_k)[φ_k(x) − f_k(x)] + (1 − α_k)f_k(x) − f_{k+1}(x)
        + α_k[f_{k+1}(x_{k+1}) + u_{k+1}^T(x − x_{k+1})] − (1 − α_k)[f_k(x_k) − f_{k+1}(x_k)]
      = (1 − α_k)[φ_k(x) − f_k(x)] + (1 − α_k)[f_k(x) − f_k(x_k) − f_{k+1}(x) + f_{k+1}(x_k)]
        − α_k[f_{k+1}(x) − f_{k+1}(x_{k+1}) − u_{k+1}^T(x − x_{k+1})].

Using the convexity of f_{k+1}, the last bracket is nonnegative and the conclusion (2.5) follows. □

Denote δ_0 = 0, ε_0 = 0, α_{−1} = 0,

    δ_{k+1}(x) = f_k(x) − f_{k+1}(x) + f_{k+1}(x_k) − f_k(x_k),  k = 0, 1, 2, ...,   (2.6)

and

    ε_{k+1}(x) = (1 − α_{k−1})ε_k(x) + δ_{k+1}(x),  k = 0, 1, 2, ....     (2.7)

For all j: 1 ≤ j ≤ k, let A_{k,j} = ∏_{i=1}^{j}(1 − α_{k−i}).

It is easy to show that

    ε_{k+1}(x) = δ_{k+1}(x) + Σ_{j=1}^{k} A_{k,j} δ_{k+1−j}(x).          (2.8)

From Lemma 2.1, we have

    φ_{k+1}(x) − f_{k+1}(x) ≤ (∏_{i=0}^{k}(1 − α_i))[φ_0(x) − f_0(x)] + (1 − α_k)ε_{k+1}(x).   (2.9)

Writing φ_k(x) = φ_k^* + (a_k/2)‖x − v_k‖², we have

    a_{k+1} = (1 − α_k)a_k,                                              (2.10)
    v_{k+1} = v_k − (α_k/a_{k+1})u_{k+1},                                (2.11)

where a_0 = a and v_0 = x_0.

Lemma 2.2 If φ_k^* ≥ f_k(x_k), then

    φ_{k+1}^* ≥ f_{k+1}(x_{k+1}) + u_{k+1}^T[(1 − α_k)x_k + α_k v_k − x_{k+1} − (α_k²/(2a_{k+1}))u_{k+1}].   (2.12)

Proof: From the definition of φ_{k+1}, (2.10) and (2.11), we have

    φ_{k+1}^* = φ_{k+1}(v_{k+1})
      = (1 − α_k)φ_k(v_{k+1}) + α_k[f_{k+1}(x_{k+1}) + u_{k+1}^T(v_{k+1} − x_{k+1})] − (1 − α_k)[f_k(x_k) − f_{k+1}(x_k)]
      = (1 − α_k)φ_k^* + (1 − α_k)(a_k/2)‖v_{k+1} − v_k‖² − (1 − α_k)[f_k(x_k) − f_{k+1}(x_k)]
        + α_k f_{k+1}(x_{k+1}) + α_k u_{k+1}^T(v_{k+1} − x_{k+1})
      ≥ f_{k+1}(x_{k+1}) + (1 − α_k)[f_{k+1}(x_k) − f_{k+1}(x_{k+1})] + (α_k²/(2a_{k+1}))‖u_{k+1}‖²
        + α_k u_{k+1}^T(v_k − (α_k/a_{k+1})u_{k+1} − x_{k+1}),

by the assumption of the lemma. Using the convexity of f_{k+1}, we have

    φ_{k+1}^* ≥ f_{k+1}(x_{k+1}) + (1 − α_k)u_{k+1}^T(x_k − x_{k+1}) + α_k u_{k+1}^T(v_k − x_{k+1}) − (α_k²/(2a_{k+1}))‖u_{k+1}‖²
      = f_{k+1}(x_{k+1}) + u_{k+1}^T[(1 − α_k)x_k + α_k v_k − x_{k+1} − (α_k²/(2a_{k+1}))u_{k+1}].

So (2.12) follows. □
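For completeness, the identity behind the quadratic term above (a one-line check that the text leaves to the reader): by (2.11), v_{k+1} − v_k = −(α_k/a_{k+1})u_{k+1}, and by (2.10), (1 − α_k)a_k = a_{k+1}, so

    (1 − α_k)(a_k/2)‖v_{k+1} − v_k‖² = (a_{k+1}/2)(α_k/a_{k+1})²‖u_{k+1}‖² = (α_k²/(2a_{k+1}))‖u_{k+1}‖².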

Letting

    y_k = (1 − α_k)x_k + α_k v_k,                                        (2.13)

and choosing x_{k+1} with u_{k+1} ∈ ∂f_{k+1}(x_{k+1}) such that

    q_{k+1} = u_{k+1}^T[y_k − x_{k+1} − (α_k²/(2a_{k+1}))u_{k+1}] ≥ 0,    (2.14)

we have the following lemma:

Lemma 2.3 Suppose that (2.13) and (2.14) hold. Then for k = 0, 1, 2, ...,

    φ_k^* ≥ f_k(x_k).                                                    (2.15)

Furthermore, for k = 0, 1, 2, ...,

    φ_k^* ≥ f_k(x_k) + q_k.                                              (2.16)

Proof: We prove (2.15) by induction. From the definitions of φ_0 and x_0, (2.15) holds for k = 0. Suppose that (2.15) holds for k. From Lemma 2.2 and (2.14), we have that (2.15) holds for k + 1. Lemma 2.2 and (2.14) then imply that (2.16) holds. □

The Model Method (MM).
Step 0 (Initialization). Select an initial point x_0 ∈ X. Let v_0 = x_0, a_0 = a > 0, α_0 ∈ (0, 1), and k = 0.
Step 1. Set y_k = (1 − α_k)x_k + α_k v_k.
Step 2. Generate f_{k+1} satisfying (A1). Then compute x_{k+1} with u_{k+1} ∈ ∂f_{k+1}(x_{k+1}) such that (2.14) holds, that is,

    q_{k+1}^{MM} = u_{k+1}^T[y_k − x_{k+1} − (α_k²/(2(1 − α_k)a_k))u_{k+1}] ≥ 0.

Set a_{k+1} = (1 − α_k)a_k and

    v_{k+1} = v_k − (α_k/a_{k+1})u_{k+1}.

Choose α_{k+1} ∈ (0, 1).
Step 3. Increase k by 1 and go to Step 1.

It is worth noting that for any y_k ∈ R^n, any α_k ∈ (0, 1) and any a_k > 0, we can always find x_{k+1} and u_{k+1} ∈ ∂f_{k+1}(x_{k+1}) such that q_{k+1}^{MM} ≥ 0. In fact, since

    ∂f_{k+1}(x) + (a_k(1 − α_k)/α_k²)(x − y_k)

is a strongly monotone mapping, there is a unique solution x_{k+1} such that

    0 ∈ ∂f_{k+1}(x_{k+1}) + (a_k(1 − α_k)/α_k²)(x_{k+1} − y_k).

Let u_{k+1} ∈ ∂f_{k+1}(x_{k+1}) be such that

    0 = u_{k+1} + (a_k(1 − α_k)/α_k²)(x_{k+1} − y_k).

Then (x_{k+1}, u_{k+1}) is a desired solution.

From (2.16) and Lemma 2.3, we obtain the following basic convergence rate estimate.

Theorem 2.1 Suppose that {x_k}_{k=0}^∞ is generated by (MM). Then for all x ∈ X, k ≥ 1,

    f_k(x_k) − f_k(x) + q_k^{MM} ≤ β_k[φ_0(x) − f_0(x)] + (1 − α_{k−1})ε_k(x),

where β_k = ∏_{i=0}^{k−1}(1 − α_i).

Theorem 2.1 suggests many possibilities for obtaining convergent methods according to different (a) approximation sequences {f_k}_{k=0}^∞, (b) solution methods for q_k^{MM}, and (c) choices of α_k. In the remainder of this section, we give some results based on the properties of {f_k}_{k=0}^∞. We discuss (b) and (c) in the next section.

Denote δ_0^0 = 0, ε_0^0 = 0,

    δ_{k+1}^0 = f_k(x_k) − f_{k+1}(x_k),  k = 0, 1, 2, ...,              (2.17)

and

    ε_{k+1}^0 = (1 − α_{k−1})ε_k^0 + δ_{k+1}^0,  k = 0, 1, 2, ....       (2.18)

Then

    ε_{k+1}^0 = δ_{k+1}^0 + Σ_{j=1}^{k} A_{k,j} δ_{k+1−j}^0.             (2.19)

Corollary 2.1 Suppose that {x_k}_{k=0}^∞ is generated by (MM) with f_k(x) ≤ f_{k+1}(x) for x ∈ X and k ≥ 0. Suppose that f_k satisfies the following condition:

(A2) There is an index set K such that for all x ∈ X,

    limsup_{k∈K,k→∞} f_{k+1}(x) ≤ f(x).                                  (2.20)

If

    lim_{k∈K,k→∞} β_{k+1} = 0,                                           (2.21)

    lim_{k∈K,k→∞} (1 − α_k)ε_{k+1}^0 = 0                                 (2.22)

and

    limsup_{k∈K,k→∞} f_{k+1}(x_{k+1}) ≥ limsup_{k∈K,k→∞} f(x_k)
    (respectively, limsup_{k∈K,k→∞} f_{k+1}(x_{k+1}) ≥ limsup_{k∈K,k→∞} f(y_k)),     (2.23)

then {x_k}_{k∈K} (respectively, {y_k}_{k∈K}) is a minimizing sequence for f.

Proof: From the assumption that f_k(x) ≤ f_{k+1}(x) we have, for x ∈ X and k ≥ 0, that δ_k(x) ≤ −δ_k^0 and ε_k(x) ≤ −ε_k^0. From Theorem 2.1 and the assumptions of this corollary, we have for all x ∈ X,

    limsup_{k∈K,k→∞} f(x_k) ≤ f(x),

which implies that {x_k}_{k∈K} is a minimizing sequence for f. The same property can be obtained for {y_k}_{k∈K}. □
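As a concrete illustration of (MM) in the exact case f_k ≡ f, the sketch below takes f = ‖·‖₁ and solves the Step 2 subproblem by the construction described above, i.e. a proximal step at y_k with parameter α_k²/(a_k(1 − α_k)); the fixed α_k and the test data are assumptions of this sketch, not prescriptions of the paper.

import numpy as np

def prox_l1(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def model_method(x0, alpha=0.5, a=1.0, iters=20):
    # Sketch of (MM) with f_k = f = ||.||_1. Step 2 solves
    # 0 in df(x) + (a_k (1 - alpha)/alpha^2)(x - y_k), i.e. a proximal
    # step at y_k with parameter lam = alpha^2 / (a_k (1 - alpha)).
    x, v, a_k = x0.copy(), x0.copy(), a
    for _ in range(iters):
        y = (1 - alpha) * x + alpha * v                     # Step 1
        lam = alpha ** 2 / (a_k * (1 - alpha))
        x_new = prox_l1(y, lam)                             # Step 2 subproblem
        u = (y - x_new) / lam                               # u_{k+1} in df(x_{k+1})
        a_next = (1 - alpha) * a_k                          # (2.10)
        q = u @ (y - x_new - (alpha ** 2 / (2 * a_next)) * u)
        assert q >= -1e-10                                  # q^{MM} >= 0 by construction
        v = v - (alpha / a_next) * u                        # (2.11)
        x, a_k = x_new, a_next
    return x

print(model_method(np.array([3.0, -2.0, 0.5])))             # approaches the minimizer 0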

The following result indicates that, for any bounded sequence {δ_k^0}_{k=0}^∞, we can choose α_k ∈ (0, 1) such that (2.21) and (2.22) hold.

Lemma 2.4 Suppose that {|δ_k^0|}_{k=0}^∞ is bounded. (I) If there is d > 0 such that for all k ≥ 0, α_k ≥ d > 0, then (2.21) holds and {|ε_k^0|}_{k=0}^∞ is bounded. (II) If α_k → 1, then (2.22) holds.

Proof: From the boundedness of δ^0, there is M > 0 such that for all k ≥ 0, |δ_k^0| ≤ M. From the definition of A_{k,j} we have, for all j: 1 ≤ j ≤ k, that

    A_{k,j} = ∏_{i=1}^{j}(1 − α_{k−i}) ≤ (1 − d)^j.

This inequality combined with (2.19) yields

    |ε_{k+1}^0| ≤ M(1 + Σ_{j=1}^{k}(1 − d)^j),

which implies that {|ε_k^0|}_{k=0}^∞ is bounded. Since 0 < β_k = ∏_{j=0}^{k−1}(1 − α_j) ≤ (1 − d)^k, (2.21) holds. Since the condition in (II) implies the condition in (I), the conclusion of (II) follows from α_k → 1 and the boundedness of {|ε_k^0|}_{k=0}^∞. □

Denote

    f^* = inf{f(x): x ∈ R^n},   X^* = {x: x ∈ R^n, f(x) = f^*},

and f_0^* = inf{f_0(x): x ∈ X^*}.

Corollary 2.2 Suppose that {x_k}_{k=0}^∞ is generated by (MM) with f_k(x) ≤ f_{k+1}(x) for x ∈ X and k ≥ 0; f^* > −∞ and f_0^* > −∞. If f_k satisfies

(A2)* for all k ≥ 0, there is a constant b_k ≥ 0, dependent on k and f, such that for any x ∈ X,

    |f(x) − f_k(x)| ≤ b_k,                                               (2.24)

then the following results (I), (II), and (III) hold.

(I). For any x ∈ X,

    f(x_k) − f(x) + q_k^{MM} ≤ β_k[φ_0(x) − f_0(x)] − (1 − α_{k−1})ε_k^0 + 2b_k.     (2.25)

In particular, we have the convergence rate estimate

    f(x_k) − f^* + q_k^{MM} ≤ β_k[f_0(x_0) − f_0^* + (a/2)ρ(x_0, X^*)²] − (1 − α_{k−1})ε_k^0 + 2b_k.   (2.26)

(II). If

    lim_{k→∞} b_k = 0                                                    (2.27)

and (2.21) and (2.22) hold, then {x_k}_{k=0}^∞ is a minimizing sequence for f. In particular, if there is r ∈ (0, 1) such that for all k ≥ 0, α_k = 1 − r and b_k = r^k, then {x_k}_{k=0}^∞ is a minimizing sequence for f. Moreover,

    f(x_k) − f^* = O(k r^k).                                             (2.28)

(III). Suppose that (2.27), (2.21) and (2.22) hold. If X^* is a nonempty compact subset of R^n, then {x_k}_{k=0}^∞ is bounded and every accumulation point of {x_k}_{k=0}^∞ is a minimizer of f.

Proof: It is easy to prove the conclusions of (I) by Theorem 2.1. From (2.25), we have that {x_k}_{k=0}^∞ is a minimizing sequence for f if (2.27), (2.21) and (2.22) hold. We prove (2.28) now. Suppose that k ≥ 1. From the definitions of δ^0 and A_{k,j} and from (2.24), we have

    δ_k^0 ≥ −(r^{k−1} + r^k)   and   A_{k,j} = r^j.

The above two relations and (2.19) yield

    ε_k^0 ≥ −k(r^{k−1} + r^k).

This inequality combined with β_k = r^k and (2.26) yields

    f(x_k) − f^* + q_k^{MM} ≤ [f_0(x_0) − f_0^* + (a/2)ρ(x_0, X^*)² + (1 + r)k + 2]r^k,

which implies (2.28). The conclusion (III) follows from the facts that {x_k}_{k=0}^∞ is a minimizing sequence for f and that X^* is compact. □

If f_k = f for all k ≥ 0, then δ_k^0 = 0. In this case, we can choose b_k = 0. By noting Theorem 2.1, we have the following corollary.

Corollary 2.3 Suppose that {x_k}_{k=0}^∞ is generated by (MM). If for all k ≥ 0, f_k = f, then

    f(x_k) − f(x) + q_k^{MM} ≤ β_k[f(x_0) − f(x) + (a/2)‖x − x_0‖²].      (2.29)

Consequently,

    f(x_k) − f^* ≤ β_k[f(x_0) − f^* + (a/2)ρ(x_0, X^*)²],                (2.30)

where ρ(z, W) = min{‖z − w‖: w ∈ W}.

Remark 2.1 Noting the definitions of ε_k^0 and ε_k(x), and arguing as for Corollary 2.2, we can prove that results similar to those of Corollary 2.2 hold even if f_k(x) ≤ f_{k+1}(x) does not hold.

Remark 2.2 The results in Corollary 2.2 (or Remark 2.1) are useful for solving some complex convex problems. The following two examples can be viewed as two general models arising from stochastic programming (see [?,?,?,?,?,?]).

Example 1. Suppose f has the following structure

    f(x) = h_0(x) + Σ_{k=1}^∞ p_k h_k(x) + Θ_X(x),

where X ⊂ R^n is a nonempty compact convex subset of R^n and Θ_X is the indicator function of X, i.e.,

    Θ_X(x) = 0 if x ∈ X,   +∞ otherwise.

We assume the following:

b1) For j ≥ 0, h_j: R^n → R is a convex function.

b2) For j ≥ 1, h_j(x) ≥ 0 and there is a constant M_0 > 0 such that sup_{j≥1}{h_j(x): x ∈ X} ≤ M_0.

b3) For j ≥ 1, p_j ≥ 0 and

    Σ_{j=1}^∞ p_j < +∞.

Then we can let

    f_k(x) = h_0(x) + Σ_{i=1}^{k} p_i h_i(x) + Θ_X(x)

and solve the problem (1.1) by (MM). If h_j(x) ≥ 0 or p_j ≥ 0 is not true, we can also solve this example by noting Remark 2.1.

Example 2. Another type of these problems can be formulated as

    min{f(x) = ∫_Ω F(x, y)dy + Θ_X(x): x ∈ R^n},                         (2.31)

where Ω is a compact subset of R^m and F(·,·): R^n × R^m → R is of bounded variation on Ω in the sense of Hardy and Krause; F(·, y) is convex for any given y ∈ Ω. In this case, according to some integration rules (see [?,?,?,?] for details) we can choose, for j: 0 ≤ j ≤ k − 1, Ω_j^k ⊂ Ω and y_j^k ∈ Ω, such that

    f_k(x) = Σ_{j=0}^{k−1} F(x, y_j^k) μ(Ω_j^k) + Θ_X(x)

satisfies

c1) for all k ≥ 1, Ω_i^k ∩ Ω_j^k = ∅ for i ≠ j, and ∪_{j=0}^{k−1} Ω_j^k = Ω;

c2) lim_{k→∞} sup{|f(x) − f_k(x)|: x ∈ X} = 0.

From Remark 2.1, we can solve (2.31) by (MM).
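The following sketch instantiates Example 2 with F(x, y) = |x − y| on Ω = [0, 1] ⊂ R, Ω_j^k = [j/k, (j + 1)/k) and y_j^k the midpoint of Ω_j^k; these choices, the grid used to estimate the sup in c2), and the omission of the indicator term Θ_X are assumptions of the illustration, not choices made in the paper.

import numpy as np

def f_exact(x):
    # f(x) = int_0^1 |x - y| dy = x^2 - x + 1/2 for x in [0, 1] (closed form).
    return x * x - x + 0.5

def f_k(x, k):
    # f_k(x) = sum_j F(x, y_j^k) mu(Omega_j^k) with the midpoint rule:
    # y_j^k = (j + 1/2)/k and mu(Omega_j^k) = 1/k.
    y = (np.arange(k) + 0.5) / k
    return np.mean(np.abs(x - y))

xs = np.linspace(0.0, 1.0, 101)
for k in (2, 8, 32, 128):
    print(k, max(abs(f_k(x, k) - f_exact(x)) for x in xs))  # sup-error -> 0, as in c2)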

3. New proximal point algorithms

A new family of proximal point algorithms with four parameters (λ_k, α_k, σ_{4,k} and σ_{5,k}) is proposed in this section by devising a new solution method for (2.14). In fact, Güler [16] gave one method, (1.6), for solving (2.14). We use (1.10) here to solve (2.14). The method described below is based on one important property (Lemma 3.1) which claims that we can choose λ_k, σ_{4,k} and σ_{5,k} such that the solution set of (3.5) is contained in the solution set of (2.14).

Lemma 3.1 Let u, v and w be vectors in R^n, λ > 0, τ ∈ [0, 1] and t ∈ [0, 1). If

    ‖u + (1/λ)(v − w)‖ ≤ τ‖u‖ + (t/λ)‖v − w‖,                            (3.1)

then

    u^T(w − v) ≥ [((1 − τ)²(1 − t)² − τ(1 + τ)(1 + t)²)/((1 − t)(1 + t)²)] λ‖u‖².   (3.2)

Proof: The inequality

    (u + (1/λ)(v − w))^T(v − w) ≤ ‖u + (1/λ)(v − w)‖ ‖v − w‖

and (3.1) imply that

    u^T(w − v) ≥ ((1 − t)/λ)‖v − w‖² − τ‖u‖‖v − w‖.                      (3.3)

On the other hand, from (3.1), we have

    (λ(1 + τ)/(1 − t))‖u‖ ≥ ‖v − w‖ ≥ (λ(1 − τ)/(1 + t))‖u‖,

which, combined with (3.3), yields (3.2). □

Let

    Ψ(τ; t) = ((1 − τ)²(1 − t)² − τ(1 + τ)(1 + t)²)/((1 − t)(1 + t)²).

From Lemma 3.1, we may now state our proximal point algorithm.

The General Proximal Point Algorithm (GPPA).
Step 0 (Initialization). Select an initial point x_0 ∈ X. Let v_0 = x_0, a_0 = a > 0, and k = 0.
Step 1. Choose λ_k > 0, σ_{4,k} ∈ [0, 1], σ_{5,k} ∈ [0, 1) and calculate α_k > 0 such that

    α_k²/(2a_k(1 − α_k)) ≤ λ_k Ψ(σ_{4,k}; σ_{5,k}).                      (3.4)

Set y_k = (1 − α_k)x_k + α_k v_k.
Step 2. Generate f_{k+1} satisfying (A1). Then compute x_{k+1} with u_{k+1} ∈ ∂f_{k+1}(x_{k+1}) such that

    ‖u_{k+1} + (1/λ_k)(x_{k+1} − y_k)‖ ≤ σ_{4,k}‖u_{k+1}‖ + (σ_{5,k}/λ_k)‖x_{k+1} − y_k‖.   (3.5)

Set a_{k+1} = (1 − α_k)a_k and

    v_{k+1} = v_k − (α_k/a_{k+1})u_{k+1}.

Step 3. Increase k by 1 and go to Step 1.
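Before turning to the convergence properties, here is a minimal sketch of (GPPA) with σ_{4,k} = σ_{5,k} = 0, in which case (3.5) is solved exactly by a proximal step; f = ‖·‖₁, c = 2 and the data are illustrative assumptions.

import numpy as np

def prox_l1(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def gppa(x0, lams, a=1.0, c=2.0):
    # Sketch of (GPPA) with sigma_{4,k} = sigma_{5,k} = 0 and f_k = f = ||.||_1.
    x, v, a_k = x0.copy(), x0.copy(), a
    for lam in lams:
        b = c * a_k * lam
        alpha = (np.sqrt(b * b + 4 * b) - b) / 2.0    # alpha_k(c), see (3.9) below
        y = (1 - alpha) * x + alpha * v               # Step 1
        x_new = prox_l1(y, lam)                       # exact solution of (3.5)
        u = (y - x_new) / lam                         # u_{k+1} in df(x_{k+1})
        a_k = (1 - alpha) * a_k
        v = v - (alpha / a_k) * u
        x = x_new
    return x

print(gppa(np.array([3.0, -2.0, 0.5]), [1.0] * 15))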

Theorem 3.1 Suppose that σ_{4,k} ∈ [0, 1] and σ_{5,k} ∈ [0, 1). If (x_{k+1}, u_{k+1}) (with u_{k+1} ∈ ∂f_{k+1}(x_{k+1})) is a solution of (3.5), then

    u_{k+1}^T(y_k − x_{k+1}) ≥ Ψ(σ_{4,k}; σ_{5,k})λ_k‖u_{k+1}‖².         (3.6)

Therefore, (x_{k+1}, u_{k+1}) is a solution for (2.14). Moreover,

    q_{k+1}^{MM} ≥ q_{k+1}^{GPPA} := [Ψ(σ_{4,k}; σ_{5,k})λ_k − α_k²/(2(1 − α_k)a_k)]‖u_{k+1}‖² ≥ 0.   (3.7)

Proof: For any given k, let u = u_{k+1}, v = x_{k+1}, w = y_k, λ = λ_k, τ = σ_{4,k} and t = σ_{5,k}. From Lemma 3.1, we have (3.6). From the definition of q_{k+1}^{MM} and (3.6), if (3.4) holds, then q_{k+1}^{MM} ≥ 0, so that (x_{k+1}, u_{k+1}) is a solution for (2.14). (3.7) follows from the definitions of q_{k+1}^{MM} and q_{k+1}^{GPPA}. □

By the above conclusions, we may state global convergence results similar to Corollary 2.1 - Corollary 2.3, but we do not repeat these results here. We are more interested in finding σ_{4,k}, σ_{5,k} and α_k such that β_k tends to 0 as fast as possible for any given sequence {λ_k}_{k=0}^∞ (see [16]). From the definition of β_k, this is equivalent to having α_k as large as possible. To find such α_k, for any c > 0, set

    α_k²/(a_k(1 − α_k)) = cλ_k,                                          (3.8)

or

    α_k² + cα_k a_kλ_k − ca_kλ_k = 0.

Therefore,

    α_k(c) = (√((ca_kλ_k)² + 4ca_kλ_k) − ca_kλ_k)/2.                     (3.9)

Similarly to the proof of Lemma 2.2 in [16], we can prove the following lemma.

Lemma 3.2

    β_k(c) ≤ 1/(1 + (√(ca)/2) Σ_{j=0}^{k−1} √λ_j)².                      (3.10)

Let

    Σ(c) = {(τ, t): τ ∈ [0, 1], t ∈ [0, 1), Ψ(τ; t) ≥ c/2}.

Since for any c ∈ (0, 2], {(τ, 0): τ ∈ [0, (2 − c)/6]} ⊂ Σ(c), we have Σ(c) ≠ ∅. From Corollary 2.3, we have the following convergence rate result.

Theorem 3.2 Suppose that for all k, f_k = f, and α_k satisfies (3.8). If (σ_{4,k}, σ_{5,k}) ∈ Σ(c), then for any x ∈ X, (GPPA-(c)) has the global convergence rate estimate

    f(x_k) − f(x) + [Ψ(σ_{4,k−1}; σ_{5,k−1}) − c/2]λ_{k−1}‖u_k‖²
      ≤ β_k(c)[f(x_0) − f(x) + (a/2)‖x − x_0‖²]
      ≤ 4(f(x_0) − f(x) + (a/2)‖x − x_0‖²)/(ca(Σ_{j=0}^{k−1} √λ_j)²).    (3.11)

Therefore,

    f(x_k) − f^* ≤ 4(f(x_0) − f^* + (a/2)ρ(x_0, X^*)²)/(ca(Σ_{j=0}^{k−1} √λ_j)²).    (3.12)

The algorithm (GPPA-(c)) converges (f(x_k) → f^*) if

    Σ_{k=0}^∞ √λ_k = ∞.                                                  (3.13)

In particular, if for all k ≥ 0, λ_k ≥ λ > 0, then

    f(x_k) − f^* ≤ (4/(caλk²))(f(x_0) − f^* + (a/2)ρ(x_0, X^*)²).        (3.14)
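For completeness, here is the short verification (ours, not spelled out in the text) that Lemma 3.2 yields the second inequality in (3.11) and hence (3.12): since

    (1 + (√(ca)/2) Σ_{j=0}^{k−1} √λ_j)² ≥ (ca/4)(Σ_{j=0}^{k−1} √λ_j)²,

Lemma 3.2 gives β_k(c) ≤ 4/(ca(Σ_{j=0}^{k−1} √λ_j)²); taking x ∈ X^* in (3.11), dropping the term [Ψ − c/2]λ_{k−1}‖u_k‖², which is nonnegative because (σ_{4,k}, σ_{5,k}) ∈ Σ(c), and minimizing ‖x − x_0‖² over X^* then gives (3.12).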

Since dα_k(c)/dc > 0, which can be verified by differentiating (3.9), α_k(c) is an increasing function of c. On the other hand, since Ψ(0, 0) = 1 and for all τ ∈ (0, 1], t ∈ (0, 1), Ψ(τ; t) < 1, the value c = 2, i.e.,

    α_k(2) = √((a_kλ_k)² + 2a_kλ_k) − a_kλ_k,

is the best choice for making β_k → 0 as fast as possible for a given sequence {λ_k}_{k=0}^∞. From Lemma 3.2, we have

    β_k(2) ≤ 1/(1 + √(a/2) Σ_{j=0}^{k−1} √λ_j)²                          (3.15)

and the following result.

Corollary 3.1 Suppose that for all k, f_k = f, c = 2 and α_k satisfies (3.8). If σ_{4,k} = σ_{5,k} = 0, then for any x ∈ X, (GPPA-(2)) has the global convergence rate estimate

    f(x_k) − f(x) ≤ (f(x_0) − f(x) + (a/2)‖x − x_0‖²)/(1 + √(a/2) Σ_{j=0}^{k−1} √λ_j)².   (3.16)

Therefore,

    f(x_k) − f^* ≤ 2(f(x_0) − f^* + (a/2)ρ(x_0, X^*)²)/(a(Σ_{j=0}^{k−1} √λ_j)²).   (3.17)

In [16], Güler selected c = 1 and gave the corresponding convergence rate results with the calculation of x_{k+1} performed exactly by (1.6). Let

    Σ_1 = {(τ, t): τ = 0 and t ∈ [0, √5 − 2]; τ ∈ [0, 1/6] and t = 0}.

Since Ψ(0; √5 − 2) = Ψ(1/6; 0) = 1/2, and Ψ(0; t) = (1 − t)/(1 + t)² (also Ψ(τ; 0) = 1 − 3τ) is decreasing for t ∈ [0, 1) (τ ∈ [0, 1]), we have Σ_1 ⊂ Σ(1). Therefore, we have

Corollary 3.2 Suppose that for all k, f_k = f, c = 1 and α_k satisfies (3.8). If (σ_{4,k}, σ_{5,k}) ∈ Σ_1, then for any x ∈ X, (GPPA-Σ_1) has the global convergence rate estimates (3.11), (3.12) and (3.14) for the case c = 1.

From (3.15) and (3.10) with c = 1, we can deduce that the convergence rate obtained for (GPPA-(2)) is twice as fast as that obtained for (GPPA-Σ_1).

Remark 3.1 From Theorem 3.2 in this paper and Theorem 2.3 in [16], we obtain, for (GPPA-Σ_1), the same convergence rates as those obtained in [16] for the Proximal Point Algorithm (PPA), but our algorithm (GPPA-Σ_1) is more practical than (PPA) because (1) we calculate x_{k+1} by (3.5), and (2) the calculation of x_{k+1} by (1.6) can be almost as difficult as the original minimization problem (1.1). On the other hand, the convergence rate obtained for the algorithm with inexact minimization in [16] is lower than the convergence rate obtained for (GPPA-Σ_1) in this paper (see Theorem 3.2 in this paper and Theorem 3.3 in [16]); furthermore, we do not need σ_{4,k} and σ_{5,k} to tend to 0 (one cannot deduce that σ_{4,k}‖u_{k+1}‖ + (σ_{5,k}/λ_k)‖x_{k+1} − y_k‖ = O(a^{−k}) for some a > 1).
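To illustrate that the tolerances in (3.5) need not vanish, the sketch below runs a plain subgradient inner loop on the subproblem and stops at the first iterate accepted by (3.5) with the constant tolerances σ_{4,k} ≡ 1/6 and σ_{5,k} ≡ 0 (a point of Σ_1); the inner solver, its step sizes, the iteration cap and the data are assumptions of the sketch, not the paper's recommendation.

import numpy as np

def accepted(u, x, y, lam, s4, s5):
    # The acceptance test (3.5) for a trial pair (x, u), u in df(x).
    lhs = np.linalg.norm(u + (x - y) / lam)
    return lhs <= s4 * np.linalg.norm(u) + (s5 / lam) * np.linalg.norm(x - y)

def inexact_step(subgrad, y, lam, s4=1.0 / 6, s5=0.0, max_iter=500):
    # Subgradient iterations on min f(x) + ||x - y||^2 / (2 lam), stopping
    # as soon as (3.5) holds; diminishing steps are an arbitrary choice.
    x = y.copy()
    for i in range(max_iter):
        u = subgrad(x)
        if accepted(u, x, y, lam, s4, s5):
            return x, u
        x = x - (lam / (i + 2)) * (u + (x - y) / lam)
    return x, subgrad(x)

# f = ||.||_1, so df(x) = sign(x) away from the kinks; accepted after a few steps.
x, u = inexact_step(np.sign, np.array([2.0, -1.5]), 1.0)
print(x, u)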

Remark 3.2 There are many possibilities for obtaining convergent proximal point algorithms. In fact, the two conditions that guarantee these possibilities are q_k ≥ 0 and β_k → 0. By viewing Corollary 2.1 of [16], we can deduce that (GPPA) has the same possibilities for guaranteed convergence as (PPA).

In the following, we give another choice for α_k and prove that {x_k}_{k=0}^∞ is an asymptotically regular sequence (‖x_{k+1} − x_k‖ → 0) under the condition that X^* is a compact set. This result has not been discussed in [16] and does not appear obvious for algorithms of this type. We prove it only for a special choice of α_k.

Algorithm 1.
Step 0 (Initialization). Select an initial point x_0 ∈ X. Let v_0 = x_0, a_0 = a > 0, λ_0 > 0, and k = 0.
Step 1. For x_k, v_k, a_k, λ_k, set

    α_k = a_kλ_k/(1 + a_kλ_k),
    y_k = (1 − α_k)x_k + α_k v_k.

Step 2. Compute

    x_{k+1} = argmin{f(z) + (1/(2λ_k))‖z − y_k‖²: z ∈ R^n},
    v_{k+1} = v_k + (x_{k+1} − y_k),
    a_{k+1} = (1 − α_k)a_k,

and choose λ_{k+1} > 0.
Step 3. Increase k by 1 and go to Step 1.

It is not hard to show that Algorithm 1 is a special case of (GPPA). In fact, since α_k = a_kλ_k/(1 + a_kλ_k), we have

    λ_k = α_k/(a_k(1 − α_k)) ≥ α_k²/(2a_k(1 − α_k)),

which implies that (3.4) holds (with σ_{4,k} = σ_{5,k} = 0); on the other hand, in this algorithm we can let u_{k+1} = −(1/λ_k)(x_{k+1} − y_k), and therefore

    −(α_k/a_{k+1})u_{k+1} = (α_k/(a_{k+1}λ_k))(x_{k+1} − y_k).

This relation combined with the definition of λ_k and the construction of α_k yields

    −(α_k/a_{k+1})u_{k+1} = x_{k+1} − y_k,

which implies that v_{k+1} = v_k + (x_{k+1} − y_k). From the above discussion, we can deduce that Algorithm 1 is a special case of (GPPA).
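A compact sketch of Algorithm 1 follows, again with the illustrative choice f = ‖·‖₁ so that the Step 2 argmin is a closed-form proximal step; the test function and data are assumptions of this sketch.

import numpy as np

def prox_l1(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def algorithm1(x0, lams, a=1.0):
    # Sketch of Algorithm 1 for f = ||.||_1.
    x, v, a_k = x0.copy(), x0.copy(), a
    for lam in lams:
        alpha = a_k * lam / (1 + a_k * lam)   # Step 1
        y = (1 - alpha) * x + alpha * v
        x = prox_l1(y, lam)                   # Step 2: exact proximal step at y
        v = v + (x - y)
        a_k = (1 - alpha) * a_k
    return x

print(algorithm1(np.array([3.0, -2.0, 0.5]), [1.0] * 20))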

Lemma 3.3 Suppose that {λ_k}_{k=0}^∞, {α_k}_{k=0}^∞ and {β_k}_{k=0}^∞ are generated by Algorithm 1.
(I) For k = 1, 2, ...,

    β_k = 1/(1 + a Σ_{j=0}^{k−1} λ_j).

(II) If {kλ_k/Σ_{j=0}^{k−1} λ_j}_{k=1}^∞ is bounded, then

    lim_{k→∞} kα_k²(Σ_{i=1}^{k} β_iλ_{i−1}) = 0                          (3.18)

and

    lim_{k→∞} β_{k+1}λ_k = 0.                                            (3.19)

Proof: (I) For i ≥ 0, since 1 − α_i = 1/(1 + a_iλ_i) and a_{i+1} = (1 − α_i)a_i, we have

    1 − α_i = (1 + a Σ_{j=0}^{i−1} λ_j)/(1 + a Σ_{j=0}^{i} λ_j).

These equations yield the desired conclusion (I).
(II) Set

    C_k = kα_k²(Σ_{i=1}^{k} β_iλ_{i−1}).

Then, by (I),

    C_k = kα_k² Σ_{i=1}^{k} λ_{i−1}/(1 + a Σ_{j=0}^{i−1} λ_j).

Since {kλ_k/Σ_{j=0}^{k−1} λ_j}_{k=1}^∞ is bounded, there is M_1 > 0 such that for all i ≥ 1, iλ_i/Σ_{j=0}^{i−1} λ_j ≤ M_1. Noting that λ_i > 0 for all i ≥ 0, we have α_k = aβ_{k+1}λ_k ≤ λ_k/Σ_{j=0}^{k−1} λ_j ≤ M_1/k and β_iλ_{i−1} ≤ M_1/(a(i − 1)) for i ≥ 2, so that

    C_k ≤ (M_1²/k)(β_1λ_0 + (M_1/a) Σ_{i=1}^{k−1} 1/i).

Conclusion (3.18) follows using (1/k) Σ_{i=1}^{k−1} 1/i → 0. Since

    β_{k+1}λ_k = λ_k/(1 + a(λ_0 + ... + λ_k)) ≤ (1/(ak)) · kλ_k/(λ_0 + ... + λ_{k−1}),

(3.19) follows from the assumption. □

Lemma 3.4 Suppose that {x_k}_{k=0}^∞ and {y_k}_{k=0}^∞ are generated by Algorithm 1. If the conditions of Lemma 3.3 hold, X^* is a nonempty compact set and

    Σ_{k=0}^∞ λ_k = +∞,                                                  (3.20)

then

    lim_{k→∞} ‖x_{k+1} − y_k‖ = 0.

Proof: From Corollary 2.3, we have

    f(x_k) − f(x) + (1/(2λ_{k−1}))‖x_k − y_{k−1}‖² ≤ β_k[f(x_0) − f(x) + (a/2)‖x − x_0‖²].   (3.21)

From (3.21) and Corollary 2.2, we have that {x_k}_{k=0}^∞ is a bounded sequence; since X^* is nonempty, f^* > −∞, which implies that {f(x_k)}_{k=0}^∞ is bounded from below. Setting x = x_k in (3.21), we have

    (1/(2λ_{k−1}))‖x_k − y_{k−1}‖² ≤ β_k[f(x_0) − f(x_k) + (a/2)‖x_k − x_0‖²],   (3.22)

which implies that {f(x_k)}_{k=0}^∞ is bounded from above. The boundedness of {x_k}_{k=0}^∞ and {f(x_k)}_{k=0}^∞, together with (3.22) and (3.19), yields the desired conclusion. □

Theorem 3.3 Suppose that the conditions of Lemma 3.4 hold. Then {x_k}_{k=0}^∞ is an asymptotically regular sequence.

Proof: Since

    ‖v_k − v_0‖² ≤ (Σ_{i=1}^{k} ‖v_i − v_{i−1}‖)² ≤ k Σ_{i=1}^{k} ‖v_i − v_{i−1}‖²,

we have

    ‖α_k v_k − α_k v_0‖² ≤ kα_k² Σ_{i=1}^{k} ‖v_i − v_{i−1}‖² = kα_k² Σ_{i=1}^{k} ‖x_i − y_{i−1}‖²
      ≤ 2 sup{f(x_0) − f(x_i) + (a/2)‖x_i − x_0‖²: i = 0, ..., k} kα_k² Σ_{i=1}^{k} β_iλ_{i−1}.

Using (3.19), we have α_k = a_kλ_k(1 − α_k) = aβ_{k+1}λ_k → 0. This result and Lemma 3.3 yield that α_k‖v_k‖ → 0. Hence

    ‖y_k − x_k‖ = ‖−α_k x_k + α_k v_k‖ → 0.

This conclusion, together with the fact that ‖x_{k+1} − x_k‖ ≤ ‖x_{k+1} − y_k‖ + ‖y_k − x_k‖ and Lemma 3.4, yields x_{k+1} − x_k → 0. □

From Theorem 3.3, Remark 14.1.1 in [36] and the boundedness of {x_k}, we have the following corollary.

Corollary 3.3 Suppose that the conditions of Lemma 3.4 hold. Then either the accumulation set of {x_k}_{k=0}^∞ is a singleton or it is a connected set.

Remark 3.3 From Lemma 3.3 (I) and (3.10), we can deduce that the convergence rate obtained for Algorithm 1 is lower than the convergence rate obtained for (GPPA-(c)) (c ∈ (0, 2]), so we may hope that, for any c ∈ (0, 2], (GPPA-(c)) (with f_k = f) also has the properties that ‖x_{k+1} − y_k‖ → 0 and ‖x_{k+1} − x_k‖ → 0 if X^* is a nonempty compact set.

Remark 3.4 From (2.30) and (I) of Lemma 3.3, we obtain, for Algorithm 1, the same global convergence rate estimate (1.13) as obtained for (1.2) in [15]. In [15], it was shown that the condition (3.20) is necessary and sufficient for the convergence of the classical proximal point algorithm (1.2), but we do not know whether (3.20) is still a necessary condition for the convergence of Algorithm 1.

4. A new bundle method

In this section, we give a new bundle method by combining (GPPA) with the original version of the bundle method. In iteration k + 1, x_{k+1} is calculated by the formula

    x_{k+1} = argmin{f_{k+1}(x) + (1/(2λ_k))‖x − y_k‖²: x ∈ R^n},

where f_{k+1} is a bundle linearization function of f.

More precisely, for k ≥ 0,

    f_{k+1}(x) = max{f_k(x), f(x_k) + g_k^T(x − x_k)},                   (4.1)

where

    f_0(x) = f(x_0) + g_0^T(x − x_0)                                     (4.2)

and g_k ∈ ∂f(x_k).

The Bundle Method (BM).
Step 0 (Initialization). Select an initial point x_0 ∈ X. Let v_0 = x_0, a_0 = a > 0, f_0(x) = f(x_0) + g_0^T(x − x_0), and k = 0.
Step 1. For x_k, v_k, a_k, choose λ_k > 0 and α_k ∈ (0, 1) such that

    λ_k = α_k/(a_k(1 − α_k)).                                            (4.3)

Set y_k = (1 − α_k)x_k + α_k v_k.
Step 2. Compute g_k ∈ ∂f(x_k). Generate f_{k+1} by the formula (4.1). Compute

    x_{k+1} = argmin{f_{k+1}(z) + (1/(2λ_k))‖z − y_k‖²: z ∈ R^n},
    v_{k+1} = v_k + (x_{k+1} − y_k),

and a_{k+1} = (1 − α_k)a_k.
Step 3. Increase k by 1 and go to Step 1.

It is not hard to show that (BM) is a special case of (GPPA).
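The sketch below implements (BM) in one dimension for f(x) = |x| + (x − 1)², with α_k as in Theorem 4.1 below (shifted to 1 − r^{k+1} so that α_k ∈ (0, 1) and λ_k in (4.3) is positive from k = 0) and the inner argmin taken over a fine grid instead of the small quadratic program one would use in practice; all of these are assumptions of the illustration.

import numpy as np

def bundle_method_1d(f, df, x0, a=1.0, r=0.5, iters=25):
    # Sketch of (BM) on R^1. The bundle model of (4.1) is stored as affine
    # cuts (x_j, f(x_j), g_j); lambda_k comes from (4.3).
    cuts = [(x0, f(x0), df(x0))]                        # defines f_1
    model = lambda z: max(fj + gj * (z - xj) for xj, fj, gj in cuts)
    x, v, a_k = x0, x0, a
    for k in range(iters):
        alpha = 1 - r ** (k + 1)                        # Theorem 4.1 (shifted)
        lam = alpha / (a_k * (1 - alpha))               # (4.3)
        y = (1 - alpha) * x + alpha * v
        grid = np.linspace(y - 10.0, y + 10.0, 4001)    # crude inner solver
        x = grid[np.argmin([model(z) + (z - y) ** 2 / (2 * lam) for z in grid])]
        v = v + (x - y)
        a_k = (1 - alpha) * a_k
        cuts.append((x, f(x), df(x)))                   # new cut, as in (4.1)
    return x

# The minimizer of |x| + (x - 1)^2 is x = 1/2.
print(bundle_method_1d(lambda x: abs(x) + (x - 1) ** 2,
                       lambda x: np.sign(x) + 2 * (x - 1), 3.0))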

From Corollary 2.1 and Lemma 2.4, we have the following convergence result.

Theorem 4.1 Suppose that {x_k}_{k=0}^∞ is generated by (BM). Suppose that for all k ≥ 0, we choose α_k = 1 − r^k, where r ∈ (0, 1) is a constant. If (1) {f(x_k) − f_k(x_k)}_{k=0}^∞ is bounded from above, and (2) there is an index set K such that {‖g_k‖}_{k∈K} is a bounded subsequence and

    lim_{k∈K,k→∞} (x_{k+1} − x_k) = 0,

then {x_k}_{k∈K} is a minimizing sequence for f.

Proof: From the convexity of f and the construction of f_k, we have, for all k ≥ 0 and all x ∈ R^n, f_k(x) ≤ f(x), which implies that

    f_{k+1}(x_k) = max{f_k(x_k), f(x_k)} = f(x_k).

Hence

    |δ_{k+1}^0| = |f_k(x_k) − f_{k+1}(x_k)| = f(x_k) − f_k(x_k).

From the assumption (1), {|δ_k^0|}_{k=0}^∞ is bounded. From the definitions of α_k and λ_k, we can deduce that (2.21) and (2.22) hold by using Lemma 2.4. Using the construction of f_k once again, we have

    f_{k+1}(x_{k+1}) ≥ f(x_k) + g_k^T(x_{k+1} − x_k),

which combined with the assumptions in (2) yields

    limsup_{k∈K,k→∞} f_{k+1}(x_{k+1}) ≥ limsup_{k∈K,k→∞} f(x_k).

Hence, the conclusion follows from Corollary 2.1. □

Remark 4.1 It is worth noting that in the convergence result for (BM), we must assume that α_k is close enough to one and λ_k is big enough. Furthermore, we assume the additional conditions (1) and (2). These are disadvantages of this method, but the method has two differences from the original bundle methods (see [?,?,?] for details): (i) it has no null steps, which means that it is not necessary to have inner iterations from the current point x_k to the next point x_{k+1}; and (ii) the calculations of {x_k}_{k=0}^∞ are based on (1.12), not on (1.8). Since the convergence rates obtained up to now for the original proximal algorithm ({x_k} generated by (1.2)) are lower than the convergence rate obtained for (GPPA) (see [15, 16] and Section 2 of this paper), we hope that (BM) has a higher convergence rate than the original bundle methods.

Remark 4.2 It is possible to give another choice for f_k. In fact, we can choose f_0(x) = f(y_0) + g_0^T(x − y_0) and

    f_{k+1}(x) = max{f_k(x), f(y_k) + g_k^T(x − y_k)},

where g_k ∈ ∂f(y_k). From Corollary 2.1 and Lemma 2.4, we can give a convergence result for this method which is very close to Theorem 4.1.

Acknowledgements. The authors are thankful to Claude Lemaréchal and two referees for their helpful comments. Indeed, condition (II) of Lemma 3.3 was suggested by one referee, while Corollary 3.3 was suggested by another referee.

References

[1] A. Auslender, Numerical methods for nondifferentiable convex optimization, Mathematical Programming Study 30 (1986) 102-126.

[2] K. T. Au, J. L. Higle and S. Sen, Inexact subgradient methods with applications in stochastic programming, Mathematical Programming 63 (1994) 65-82.

[3] D. P. Bertsekas and P. Tseng, Partial proximal minimization algorithms for convex programming, SIAM Journal on Optimization 4 (1994) 551-572.

[4] J. R. Birge and L. Qi, Subdifferential convergence in stochastic programs, SIAM Journal on Optimization 5 (1995) 436-453.

[5] J. R. Birge and R. J-B. Wets, Designing approximation schemes for stochastic optimization problems, in particular, for stochastic programs with recourse, Mathematical Programming Study 27 (1986) 54-102.

[6] J. R. Birge, X. Chen, L. Qi and Z. Wei, A stochastic Newton method for stochastic quadratic programs with recourse, AMR 94/9, Applied Mathematics Report, University of New South Wales, 1994.

[7] J. R. Birge, S. M. Pollock and L. Qi, A quadratic recourse function for the two-stage stochastic program, Technical Report 95-13, Department of Industrial and Operations Engineering, The University of Michigan, 1995.

[8] G. Chen and M. Teboulle, A proximal-based decomposition method for convex minimization problems, Mathematical Programming 64 (1994) 81-101.

[9] X. Chen, L. Qi and R. S. Womersley, Newton's method for quadratic stochastic programs with recourse, Journal of Computational and Applied Mathematics 60 (1995) 29-46.

[10] F. H. Clarke, Optimization and Nonsmooth Analysis, Wiley, New York, 1983.

[11] R. Correa and C. Lemaréchal, Convergence of some algorithms for convex minimization, Mathematical Programming 62 (1993) 261-275.

[12] J. Eckstein and D. P. Bertsekas, On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators, Mathematical Programming 55 (1992) 293-318.

[13] M. Fukushima, A descent algorithm for nonsmooth convex programming, Mathematical Programming 30 (1984) 163-175.

[14] M. Fukushima and L. Qi, A globally and superlinearly convergent algorithm for nonsmooth convex minimization, to appear in: SIAM Journal on Optimization.

[15] O. Güler, On the convergence of the proximal point algorithm for convex minimization, SIAM Journal on Control and Optimization 29 (1991) 403-419.

[16] O. Güler, New proximal point algorithms for convex minimization, SIAM Journal on Optimization 4 (1992) 649-664.

[17] C. D. Ha, A generalization of the proximal point algorithm, SIAM Journal on Control and Optimization 28 (1990) 503-512.

[18] J. L. Higle and S. Sen, Stochastic decomposition: An algorithm for two-stage linear programs with recourse, Mathematics of Operations Research 3 (1991) 650-669.

[19] J.-B. Hiriart-Urruty and C. Lemaréchal, Convex Analysis and Minimization Algorithms I and II (Springer-Verlag, Berlin, 1993).

[20] S. Joe and I. H. Sloan, Imbedded lattice rules for multidimensional integration, SIAM Journal on Numerical Analysis 29 (1992) 1119-1135.

[21] P. Kall and S. T. Wallace, Stochastic Programming, Wiley, New York, 1994.

[22] A. J. King and R. J-B Wets, Epi-consistency of convex stochastic programs, Stochastics and Stochastics Reports 34 (1990) 83-92.

[23] K. C. Kiwiel, An aggregate subgradient method for nonsmooth convex minimization, Mathematical Programming 27 (1983) 320-341.

[24] K. C. Kiwiel, Methods of Descent for Nondifferentiable Optimization, Lecture Notes in Mathematics 1133 (Springer, Berlin, 1985).

[25] K. C. Kiwiel, Proximity control in bundle methods for convex nondifferentiable minimization, Mathematical Programming 46 (1990) 105-122.

[26] C. Lemaréchal, Nonsmooth optimization and descent methods, Research Report RR-78-4, International Institute for Applied Systems Analysis, Laxenburg, Austria, 1977.

[27] C. Lemaréchal, Constructing bundle methods for convex optimization, in: J. B. Hiriart-Urruty, ed., Fermat Days 85: Mathematics for Optimization (North-Holland, Amsterdam, 1986) 201-240.

[28] C. Lemaréchal, Nondifferentiable optimization, in: Handbooks in Operations Research and Management Science, Vol. 1, Optimization, G. L. Nemhauser, A. H. G. Rinnooy Kan and M. J. Todd, eds. (North-Holland, Amsterdam, 1989) 529-572.

[29] C. Lemaréchal and R. Mifflin, A globally and superlinearly convergent algorithm for one-dimensional minimization of convex functions, Mathematical Programming 24 (1982) 241-256.

[30] C. Lemaréchal and C. Sagastizábal, An approach to variable metric methods, in: Proceedings of the 16th IFIP Conference on System Modelling and Optimization, Springer-Verlag, Berlin, to appear.

[31] B. Martinet, Régularisation d'inéquations variationnelles par approximations successives, Revue Française d'Informatique et de Recherche Opérationnelle (1970) 154-159.

[32] R. Mifflin, A quasi-second-order proximal bundle algorithm, to appear in: Mathematical Programming.

[33] J. J. Moreau, Proximité et dualité dans un espace hilbertien, Bulletin de la Société Mathématique de France 93 (1965) 273-299.

[34] H. Niederreiter, Random Number Generation and Quasi-Monte Carlo Methods, Society for Industrial and Applied Mathematics, Philadelphia, Pennsylvania, 1992.

[35] Y. E. Nesterov, On an approach to the construction of optimal methods of minimization of smooth convex functions, Ekonom. i Mat. Metody 24 (1988) 509-517.

[36] J. M. Ortega and W. C. Rheinboldt, Iterative Solution of Nonlinear Equations in Several Variables, Academic Press, 1970.

[37] L. Qi and R. S. Womersley, An SQP algorithm for extended linear-quadratic problems in stochastic programming, Annals of Operations Research 56 (1995) 251-285.

[38] R. T. Rockafellar, Monotone operators and the proximal point algorithm, SIAM Journal on Control and Optimization 14 (1976) 877-898.

[39] R. T. Rockafellar, Augmented Lagrangians and applications of the proximal point algorithm in convex programming, Mathematics of Operations Research 1 (1976) 97-116.

[40] I. H. Sloan and S. Joe, Lattice Methods for Multiple Integration, to be published by Oxford University Press.

[41] J. Spanier and E. H. Maize, Quasi-random methods for estimating integrals using relatively small samples, SIAM Review 36 (1994) 18-44.

[42] C. Zhu, Modified proximal point algorithm for extended linear-quadratic programming, Computational Optimization and Applications 1 (1992) 185-206.