Pure Adaptive Search in Monte Carlo Optimization

Nitin R. Patel, Robert L. Smith, Zelda B. Zabinsky

Technical Report 85-28, September 1985

Abstract

Pure adaptive search constructs a sequence of points uniformly distributed within a corresponding sequence of nested regions of the feasible space. At any stage, the next point in the sequence is chosen uniformly distributed over the region of feasible space containing all points that are equal or superior in value to the previous points in the sequence. We show that for convex programs the number of iterations required to achieve a given accuracy of solution increases at most linearly in the dimension of the problem. This compares to exponential growth in the iterations required for pure random search.

PURE ADAPTIVE SEARCH IN MONTE CARLO OPTIMIZATION

Monte Carlo optimization is concerned with using randomly generated points to converge to optimal solutions of mathematical programs. The simplest of Monte Carlo optimization algorithms is pure random search (Dixon and Szego [1978], Rinnooy Kan and Timmer [1984]). This algorithm generates a sequence of independent, uniformly distributed points in the feasible region and returns the best point found as an approximation to the optimal solution. While the approach is in itself inefficient, it is frequently used to generate starting points for deterministic search algorithms (e.g., Boender, Rinnooy Kan, Stougie and Timmer [1982]). A great merit of pure random search is that its simplicity lends itself to theoretical analysis, and a number of analytic results on its convergence and performance have been obtained (e.g., Archetti, Betro and Steffe [1975], Clough [1965], Rubinstein [1981], De Haan [1981], Patel and Smith [1983]).

In this paper we provide a theoretical analysis of what may be considered the next simplest Monte Carlo algorithm after pure random search. We call this algorithm pure adaptive search. The algorithm proceeds by generating a sequence of points uniformly distributed in a sequence of nested regions of the feasible space. At any stage, the next point in the sequence is uniformly distributed over the region of feasible space containing all points that are equal or superior in value to the previous points in the sequence.

The pure adaptive search algorithm gives a surprising improvement over pure random search. For convex programs, we show that the computational complexity of pure adaptive search increases at most linearly in the dimension of the problem. We also show that the asymptotic distribution of improvement in objective function value using pure adaptive search is bounded from below by a lognormal distribution.
Although at this time there is no known efficient implementation of pure adaptive search, any algorithm that generates solutions that can be shown to be stochastically superior to those of pure adaptive search will also share this property of linear time complexity.

1. Pure Adaptive Search

Consider the standard convex program

    (P)  min f(x)  subject to x ∈ S,

where x is an n-dimensional vector, S is a convex subset of R^n, and f is a real-valued convex function defined over S. We will assume that S is closed and bounded, f is bounded, and there exists a unique solution, denoted x*. Let

    y* = f(x*) = min_{x ∈ S} f(x)   and   ŷ = max_{x ∈ S} f(x).

The algorithm for solving P begins by generating a point X_1 uniformly distributed within the feasible region of P. The associated objective function value is labeled Y_1 = f(X_1). The next point is generated from a uniform distribution over the convex region formed by the intersection of the feasible region with the level set of points with objective function values equal to or less than Y_1. The procedure proceeds iteratively in this fashion until a preset stopping criterion is satisfied. More formally,

Pure Adaptive Search:
Step 0. Set k = 0, S_0 = S, and Y_0 > ŷ.
Step 1. Generate X_{k+1} uniformly distributed in S_{k+1} = {x : x ∈ S_k and f(x) < Y_k}.
Step 2. Set Y_{k+1} = f(X_{k+1}). If the stopping criterion is met, stop. Otherwise set k = k+1 and return to Step 1.

By construction, {Y_k} is a decreasing sequence that almost surely converges to the global minimum y*. Although obvious here, a general approach to formally demonstrating such convergence is given in Solis and Wets [1981].
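Although Step 1 has no known efficient general implementation, a rejection sketch makes the algorithm concrete for a simple feasible region. The problem instance below (minimizing a convex quadratic over the unit ball) and all helper names are our own illustration, not part of the report; resampling S until a candidate improves yields a point uniformly distributed on S_{k+1}, but becomes expensive as the improving region shrinks.

```python
import numpy as np

def sample_unit_ball(rng, n=2):
    """Uniform point in the n-dimensional unit ball, by rejection from the cube."""
    while True:
        x = rng.uniform(-1.0, 1.0, size=n)
        if x @ x <= 1.0:
            return x

def pure_adaptive_search(f, sample_S, n_points, rng):
    """Pure adaptive search: each new point is uniform over the improving
    region S_{k+1} = {x in S_k : f(x) < Y_k}, realized here by rejection."""
    x = sample_S(rng)            # X_1 uniform over S
    y = f(x)                     # Y_1 = f(X_1)
    values = [y]
    while len(values) < n_points:
        cand = sample_S(rng)     # resample S until the candidate improves
        fc = f(cand)
        if fc < y:
            x, y = cand, fc
            values.append(y)
    return x, values

rng = np.random.default_rng(0)
f = lambda x: float(x @ x)       # convex objective on the unit ball; y* = 0 at x* = 0
x_best, values = pure_adaptive_search(f, sample_unit_ball, 15, rng)
assert all(a > b for a, b in zip(values, values[1:]))   # Y_k strictly decreases
```

Each accepted candidate is uniform on {x ∈ S : f(x) < Y_k}, which equals S_{k+1} since the level sets are nested; the growing cost of the rejection loop is precisely the implementation difficulty discussed next.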

However, the more interesting question is the performance of the algorithm as it converges to y*. This issue will be addressed in the next section.

The principal computational effort of the algorithm lies in generating a point uniformly distributed in the improving convex set, as described in Step 1. This is a challenging problem that has as yet yielded no satisfactory solution. Conventional approaches include the rejection and transformation techniques (see Rubinstein [1981], Schmeiser [1981]). The procedure combining the two techniques is first to enclose the set in a simpler region, say a sphere. One then generates a series of points uniformly distributed in the enclosing region by efficient transformational procedures until a point falls in the enclosed region. The accepted point will then be uniformly distributed within the enclosed region. Although the problem of efficiently generating many points uniformly distributed within a single bounded region has recently met with some success (see Smith [1984]), the problem of efficiently generating one uniformly distributed point in each of many bounded regions is still unsolved. Still, if an algorithm can generate random points (uniform or not) with associated objective function values stochastically less than those of uniformly distributed points, then the performance of the new algorithm would be bounded by the performance of the pure adaptive search algorithm. It is here, in non-linear convex optimization, that the proposed pure adaptive search algorithm and its analysis have potential value.

2. A Stochastically Worst Case Problem

The first step in the analysis of pure adaptive search is to identify a stochastically worst case convex program. We first define a conical convex program and then show in Lemma 2.1 and Theorem 2.2 that it represents a stochastically worst case for the algorithm. Consider a standard convex program P.
We formally define a conical convex program Q corresponding to the convex program P as follows:

    (Q)  min g(x)  subject to x ∈ S,

where S is the same feasible region as in P, and

    g(x) = inf { y : (x, y) ∈ convex hull of {(x*, y*)} ∪ (S × {ŷ}) }.

Note that the conical program Q has the same solution x* and the same range of objective function values [y*, ŷ] as P. Also note that the objective function g(x) has a conical shape, and hence the name.

In order to compare the two programs P and Q, we define Y_k^P to be the random variable representing the kth objective function value generated by the algorithm on P, and similarly, let Y_k^Q be the random variable representing the kth objective function value generated by the algorithm on Q.

Lemma 2.1: Let P be a standard convex program, and let Q be the corresponding conical program. Then P has one-step dominance over Q, that is,

    Pr(Y_{k+1}^Q ≤ z | Y_k^Q = w) ≤ Pr(Y_{k+1}^P ≤ z | Y_k^P = w)

for all k = 0, 1, 2, ... and y* ≤ z, w ≤ ŷ.

Proof: First consider the case when y* ≤ w ≤ z ≤ ŷ. Let P_z and Q_z be the level sets at objective function value z for problems P and Q respectively, where P_z = {x : x ∈ S and f(x) ≤ z}. In this case P_w ⊆ P_z and Q_w ⊆ Q_z due to convexity, and therefore Pr(Y_{k+1}^Q ≤ z | Y_k^Q = w) = 1 = Pr(Y_{k+1}^P ≤ z | Y_k^P = w).

Now consider the more interesting case when y* ≤ z ≤ w ≤ ŷ. Then, clearly, Pr(Y_{k+1}^Q ≤ z | Y_k^Q = w) ≤ Pr(Y_{k+1}^P ≤ z | Y_k^P = w) if and only if

    v(P_z)/v(P_w) ≥ v(Q_z)/v(Q_w)   for all y* ≤ z ≤ w ≤ ŷ,

where v(·) denotes the content of the set. In order to prove this equivalent statement, we need to introduce a similarity transformation and some additional notation. Let A_{w,z} : R^n → R^n be an affine function defined by

    A_{w,z}(x) = x* + ((z − y*)/(w − y*)) (x − x*)   for y* < z ≤ w ≤ ŷ.

Let

    P_{w,z} = A_{w,z}(P_w) = { x' : x' = x* + ((z − y*)/(w − y*)) (x − x*) for some x ∈ P_w }

be the level set P_w shrunken by a factor of (z − y*)/(w − y*) and rerooted at x*, so that P_{w,z} is contained in P_z. (See Lemma A.1 in the Appendix for a proof that P_{w,z} ⊆ P_z.) Similarly, let Q_{w,z} = A_{w,z}(Q_w).

Now consider v(Q_z)/v(Q_w). Since Q_{w,z} = Q_z (see Lemma A.2 in the Appendix), we have v(Q_z)/v(Q_w) = v(Q_{w,z})/v(Q_w). Since A_{w,z} is an affine transformation, v(Q_{w,z})/v(Q_w) = |det A_{w,z}|. Hence

    v(Q_z)/v(Q_w) = |det A_{w,z}|   for all y* < z ≤ w ≤ ŷ.

Similarly, consider v(P_z)/v(P_w). Since P_{w,z} ⊆ P_z, we have v(P_z)/v(P_w) ≥ v(P_{w,z})/v(P_w), and also v(P_{w,z})/v(P_w) = |det A_{w,z}|. Hence

    v(P_z)/v(P_w) ≥ |det A_{w,z}|   for all y* < z ≤ w ≤ ŷ.

Consequently, v(P_z)/v(P_w) ≥ v(Q_z)/v(Q_w) for all y* < z ≤ w ≤ ŷ. ∎

Lemma 2.1 states that, for any specified objective function values, the conditional probability that a point chosen uniformly from P_w lies in the improved region P_z is greater than the conditional probability that a point chosen uniformly from Q_w lies in the improved region Q_z. This establishes that, on any specific iteration, pure adaptive search performs stochastically better in one step on P than on Q. Theorem 2.2 states that the entire sequence of points generated by the algorithm on P is stochastically better (less) than the sequence of points generated by the algorithm on Q.
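Lemma 2.1 can be checked numerically on a concrete pair of programs. The instance below is our own illustration, not from the report: S is the unit disk in R², f(x) = ‖x‖² for P, and the corresponding conical objective is g(x) = ‖x‖ (so x* = 0, y* = 0, ŷ = 1); the level sets are disks of radius √z for P and of radius z for Q.

```python
import numpy as np

rng = np.random.default_rng(1)

def improve_prob(radius_w, accept, n=200_000):
    """Pr(accept(X)) for X uniform on the disk of the given radius."""
    pts = rng.uniform(-radius_w, radius_w, size=(n, 2))
    pts = pts[(pts**2).sum(axis=1) <= radius_w**2]   # keep points uniform in the disk
    return float(np.mean(accept(pts)))

w, z = 0.8, 0.2
# On P, the level set at w is the disk of radius sqrt(w); improving means f(X) <= z.
p_P = improve_prob(np.sqrt(w), lambda x: (x**2).sum(axis=1) <= z)
# On Q, the level set at w is the disk of radius w; improving means g(X) <= z.
p_Q = improve_prob(w, lambda x: np.sqrt((x**2).sum(axis=1)) <= z)

print(p_P, z / w)        # Monte Carlo estimate vs exact ratio v(P_z)/v(P_w) = z/w
print(p_Q, (z / w)**2)   # Monte Carlo estimate vs exact ratio v(Q_z)/v(Q_w) = (z/w)^2
assert p_P > p_Q         # one-step dominance of P over Q (Lemma 2.1)
```

Here the exact ratios are z/w for P and (z/w)² for Q, and z/w ≥ (z/w)² for z ≤ w, exactly the inequality v(P_z)/v(P_w) ≥ v(Q_z)/v(Q_w) of the proof.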

Theorem 2.2: {Y_k^Q} ≥_st {Y_k^P} for all k = 1, 2, ....

Proof: The proof is by induction on k. We show that Pr(Y_k^Q > z) ≥ Pr(Y_k^P > z) for all k = 1, 2, ... and z ∈ [y*, ŷ]. Note that Pr(Y_1^P ≤ z) = Pr(Y_1^P ≤ z | Y_0^P = ŷ) and Pr(Y_1^Q ≤ z) = Pr(Y_1^Q ≤ z | Y_0^Q = ŷ) by the definition of the algorithm. But then Pr(Y_1^Q ≤ z) ≤ Pr(Y_1^P ≤ z) for all y* ≤ z ≤ ŷ follows immediately from Lemma 2.1. Hence Y_1^Q ≥_st Y_1^P.

We assume the induction hypothesis that Y_k^Q ≥_st Y_k^P for some k ≥ 1, and show that Y_{k+1}^Q ≥_st Y_{k+1}^P. By Lemma 2.1, we have

    Pr(Y_{k+1}^Q ≤ z | Y_k^Q = w) ≤ Pr(Y_{k+1}^P ≤ z | Y_k^P = w)

for all z, w ∈ [y*, ŷ]. For ease of notation, for any fixed k and z ∈ [y*, ŷ], let q(w) = Pr(Y_{k+1}^Q > z | Y_k^Q = w) and p(w) = Pr(Y_{k+1}^P > z | Y_k^P = w). Since q(w) is a non-decreasing function, we have E(q(Y_k^Q)) ≥ E(q(Y_k^P)) (see Ross [1983], p. 252). Also, since q(w) ≥ p(w), we have E(q(Y_k^P)) ≥ E(p(Y_k^P)). Therefore E(q(Y_k^Q)) ≥ E(p(Y_k^P)). But E[q(Y_k^Q)] = E[Pr(Y_{k+1}^Q > z | Y_k^Q)] = Pr(Y_{k+1}^Q > z), and similarly E[p(Y_k^P)] = Pr(Y_{k+1}^P > z). Therefore Pr(Y_{k+1}^Q > z) ≥ Pr(Y_{k+1}^P > z) for all y* ≤ z ≤ ŷ. ∎

Theorem 2.2 states that the objective function values generated on Q are stochastically larger than those generated on P. We can conclude that a conical convex program is a stochastically worse case than its corresponding convex program. We now turn to the analysis of performance of the pure adaptive search algorithm on the conical program.

3. Analytic Performance Bounds for Pure Adaptive Search

The second part of the analysis is to measure the performance of the pure adaptive search algorithm on the conical convex program and in this manner obtain bounds on performance for all convex programs. In this section we define a measure of complexity that is appropriate for a stochastic algorithm, and then provide two computational complexity results. Analyzing the performance of the algorithm for this conical convex program then

provides a worst case distribution for the number of iterations necessary to achieve a desired accuracy of solution. The first result gives the probability distribution for the quantity of improvement per iteration, and shows that the asymptotic minimum (scaled) objective function value obtained from pure adaptive search is stochastically bounded from above by a lognormal distribution. The second result is that the necessary number of iterations grows at most linearly in the dimensionality of the problem.

In order to compare performance of the algorithm across convex programs, we define a standardized measure of improvement. Given any feasible objective function value y of P, where y* ≤ y ≤ ŷ, define the standardized objective function value of y as

    z = (y − y*)/(ŷ − y*).

Let Z_k^P be the standardized objective function value corresponding to the kth point X_k^P generated by the algorithm on P,

    Z_k^P = (f(X_k^P) − y*)/(ŷ − y*).

A standardized value of z = 1/m, where m is a positive integer, will be referred to as an m-fold improvement. We say that the algorithm achieves m-fold improvement by the kth iteration if Z_k^P ≤ 1/m, or equivalently (ŷ − y*)/(Y_k^P − y*) ≥ m. Notice that while the objective function value Y_k^P ranges over [y*, ŷ], the standardized value Z_k^P ranges over [0, 1].

We now define a measure of performance that will be evaluated for the pure adaptive search algorithm. Let K_{α,m}^P be the number of iterations of the pure adaptive search algorithm applied to P necessary to ensure an m-fold improvement with probability at least 1 − α. Formally,

    K_{α,m}^P = min { k : Pr(Z_k^P ≤ 1/m) ≥ 1 − α }.

Let K_{α,m}^Q be the same quantity applied to the conical convex program Q. The following result states that the conical convex program provides a worst case distribution for standardized improvement and thus provides an upper bound on the number of iterations necessary to achieve a desired accuracy of solution. It follows as an immediate corollary to Theorem 2.2.

Corollary 3.1: {Z_k^P} ≤_st {Z_k^Q} for all k = 1, 2, ..., and K_{α,m}^P ≤ K_{α,m}^Q for all m ≥ 1, 0 < α < 1.

We now turn to an analysis of algorithm performance on conical convex programs to establish worst case performance criteria. Our first computational complexity result characterizes the probability distribution of improvement on the conical program. The result, stated below in Theorem 3.2, is that the asymptotic distribution of Z_k^Q is lognormal.

Theorem 3.2: Let Z_k^Q be the random variable corresponding to the standardized objective function value on the kth iteration for a conical convex program Q. Let Z_0^Q = 1. Then

    μ_k = E(Z_k^Q) = (n/(n+1))^k   and   σ_k² = Var(Z_k^Q) = (n/(n+2))^k − (n/(n+1))^{2k}

for all k, with Z_k^Q approximately lognormal for large k.

Proof: First we drop the superscript Q, with the understanding that this proof concerns the conical program Q. Write Z_k = Y_1 Y_2 ··· Y_k, where Y_k = Z_k / Z_{k-1}. The Y_k are independent, identically distributed random variables with mean n/(n+1) and variance n/((n+2)(n+1)²) (see Appendix, Lemma A.3). Thus E(Z_k) = (n/(n+1))^k and E(Z_k²) = (n/(n+2))^k, and hence Var(Z_k) = (n/(n+2))^k − (n/(n+1))^{2k}. It remains to show that Z_k approaches a lognormal distribution. Since Z_k is the product of k independent, identically distributed random variables, we can write ln Z_k = Σ_{i=1}^k ln Y_i, and by the central limit theorem ln Z_k approaches a normal distribution for large k; that is, Z_k is approximately lognormal for large k. ∎

Our second result gives an upper bound on K_{α,m}^Q, and hence on K_{α,m}^P.

Theorem 3.3: K_{α,m}^P ≤ 2(n+1) ln(m(1 + 1/√α)) for all m ≥ 1 and 0 < α < 1.

Proof: From Chebyshev's inequality,

    Pr( E(Z_k) − (1/√α)σ(Z_k) ≤ Z_k ≤ E(Z_k) + (1/√α)σ(Z_k) ) ≥ 1 − α,

where Z_k is the standardized objective function value on the kth iteration for the conical program Q. For m-fold improvement, we should have Pr(Z_k ≤ 1/m) ≥ 1 − α.

From Theorem 3.2, E(Z_k) = (n/(n+1))^k and

    σ(Z_k) = [ (n/(n+2))^k − (n²/(n²+2n+1))^k ]^{1/2}.

Combining all of these, we find that at most k iterations are required for an m-fold improvement, where k is the smallest integer that satisfies

    h(k) = (n/(n+1))^k + (1/√α) [ (n/(n+2))^k − (n²/(n²+2n+1))^k ]^{1/2} ≤ 1/m.

Let g(k) = (1 + 1/√α)(n/(n+2))^{k/2}. Note that (i) g(k) is monotone decreasing in k for k > 0; (ii) g(k) ≥ h(k), since

    g(k) = (n/(n+2))^{k/2} + (1/√α)(n/(n+2))^{k/2} ≥ (n/(n+1))^k + (1/√α)[(n/(n+2))^k − (n²/(n²+2n+1))^k]^{1/2} = h(k).

From (i) and (ii) it is clear that if k' is an integer satisfying g(k') ≤ 1/m, then all integers k greater than or equal to k' satisfy h(k) ≤ 1/m, and thus k' provides an upper bound on K_{α,m}^Q. Since 2 ln(m(1 + 1/√α)) / ln(1 + 2/n) is the root of g(x) = 1/m, we may take k' equal to the smallest integer greater than or equal to this root. Therefore, since ln(1 + 2/n) ≥ 1/(n+1) for n ≥ 1,

    K_{α,m}^P ≤ 2 ln(m(1 + 1/√α)) / ln(1 + 2/n) ≤ 2(n+1) ln(m(1 + 1/√α)). ∎

The importance of a bound for K_{α,m}^P lies not only in providing a stopping criterion for terminating the algorithm; it also has implications for the computational effectiveness of the algorithm. Theorem 3.3 states that the number of iterations needed in the worst case grows at most linearly in the number of dimensions. In fact, since this is not a tight upper bound, the performance is better than linear. Table 1 gives upper bounds on the number of iterations versus dimension for a million-fold improvement with 99% certainty.

The linear bound on complexity for pure adaptive search is perhaps a surprising improvement over the exponential complexity of pure random search. It suggests that steady improvement is an important feature for an algorithm (Adler [1983]), and that even a "blind" search using a uniform distribution yields good performance. If another algorithm could be shown to be stochastically better than the pure adaptive search algorithm, then the same analysis and results would apply.
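The gap between the Chebyshev bound and actual behavior can be seen empirically. The sketch below (our own illustration, with arbitrarily chosen parameters) simulates Z_k on a conical program directly, using the fact proved in Lemma A.3 that the ratios Z_k/Z_{k-1} are i.i.d. with distribution function y^n on [0, 1], so each ratio can be drawn by inverse transform as U^{1/n}.

```python
import numpy as np

# Empirical K for a conical program Q (our own sketch): how many PAS
# iterations until Pr(Z_k <= 1/m) >= 1 - alpha, estimated over many runs.
rng = np.random.default_rng(3)
n, m, alpha, reps = 2, 10, 0.05, 50_000

Z = np.ones(reps)                        # Z_0 = 1 in every simulated run
k = 0
while np.mean(Z <= 1.0 / m) < 1.0 - alpha:
    Z *= rng.random(reps) ** (1.0 / n)   # one iteration: Z_k = Z_{k-1} * U^(1/n)
    k += 1

bound = 2 * (n + 1) * np.log(m * (1 + 1 / np.sqrt(alpha)))
print(k, bound)
assert k <= bound
```

With these parameters the empirical K comes out well below the bound of Theorem 3.3, consistent with the observation above that the bound is not tight.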

In fact, other random search optimization algorithms have experimentally exhibited linear behavior on convex programs, as in Solis and Wets [1981], Schumer and Steiglitz [1968], and Schrack and Borowski [1972]. These findings suggest a possible extension of the pure adaptive search algorithm's linear behavior to more practical algorithms.

    Dimension n | Upper bound for K_{α,m}^P (number of iterations)
    1           | 65
    2           | 98
    5           | 195
    10          | 357
    50          | 1,654
    100         | 3,276
    500         | 16,246
    1,000       | 32,460
    5,000       | 162,167
    10,000      | 324,301

Table 1. An upper bound for K_{α,m}^P versus dimension for convex programs, for million-fold improvement with 99% certainty (α = .01, m = 10^6).
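The entries of Table 1 follow directly from the bound of Theorem 3.3. The sketch below (our own illustration) reproduces them, rounding the bound up to the nearest integer as in the choice of k' in the proof.

```python
import math

def pas_iteration_bound(n, m=10**6, alpha=0.01):
    """Upper bound of Theorem 3.3 on the iterations needed for an m-fold
    improvement with probability at least 1 - alpha, in dimension n."""
    return math.ceil(2 * (n + 1) * math.log(m * (1 + 1 / math.sqrt(alpha))))

for n in (1, 2, 5, 10, 50, 100, 500, 1000, 5000, 10000):
    print(n, pas_iteration_bound(n))
```

Note that the bound is linear in n for fixed m and α, and only logarithmic in the accuracy m.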

4. Conclusion

We have shown that pure adaptive search greatly improves on pure random search: the number of iterations for pure adaptive search grows linearly in the dimension for convex programs. However, at the present time we do not have efficient procedures for generating uniformly distributed points in general convex regions. For this reason, despite its substantial improvement over pure random search, pure adaptive search cannot compete with existing algorithms for the general case, although for special regions it can be more effective. More importantly, our analysis clearly indicates that simple linear-time Monte Carlo algorithms for convex optimization are possible if we can develop efficient procedures to generate points uniformly distributed over convex regions.

References

1. Adler, I. [1983], private communication with R. L. Smith, University of California, Berkeley, California.
2. Archetti, F., B. Betro and S. Steffe [1975], "A Theoretical Framework for Global Optimization via Random Sampling", Quaderno dei gruppi di ricerca matematica del C.N.R., University of Pisa.
3. Boender, C. G. E., A. H. G. Rinnooy Kan, L. Stougie and G. T. Timmer [1982], "A Stochastic Method for Global Optimization", Mathematical Programming, Vol. 22, pp. 125-140.
4. Clough, D. J. [1965], "An Asymptotic Extreme-Value Sampling Theory for Estimation of a Global Maximum", CORS Journal, Vol. 7, pp. 96-109.
5. De Haan, L. [1981], "Estimation of the Minimum of a Function Using Order Statistics", Journal of the American Statistical Association, Vol. 76, pp. 467-469.
6. Dixon, L. C. W. and G. P. Szego [1978], eds., Towards Global Optimization 1 and 2, North-Holland, Amsterdam.
7. Patel, N. R. and R. L. Smith [1983], "The Asymptotic Extreme Value Distribution of the Sample Minimum of a Concave Function Under Linear Constraints", Operations Research, Vol. 31, pp. 789-794.
8. Rinnooy Kan, A. H. G. and G. T. Timmer [1984], "Stochastic Methods for Global Optimization", American Journal of Mathematical and Management Sciences, Vol. 4, pp. 7-39.
9. Ross, S. M. [1983], Stochastic Processes, John Wiley and Sons, New York.
10. Rubinstein, R. Y. [1981], Simulation and the Monte Carlo Method, John Wiley and Sons, New York.
11. Schmeiser, B. W. [1981], "Random Variate Generation: A Survey", Simulation With Discrete Methods: A State of the Art View, T. I. Oren, C. M. Shab, P. F. Roth (eds.), IEEE, pp. 79-104.
12. Schrack, G. and N. Borowski [1972], "An Experimental Comparison of Three Random Searches", Numerical Methods for Non-linear Optimization, F. Lootsma, ed., pp. 137-147, Academic Press, London.
13. Schumer, M. A. and K. Steiglitz [1968], "Adaptive Step Size Random Search", IEEE Transactions on Automatic Control, Vol. AC-13, pp. 270-276.
14. Smith, R. L. [1984], "Efficient Monte Carlo Procedures for Generating Points Uniformly Distributed Over Bounded Regions", Operations Research, Vol. 32, pp. 1296-1308.
15. Solis, F. J. and R. J. Wets [1981], "Minimization by Random Search Techniques", Mathematics of Operations Research, Vol. 6, pp. 19-30.

Appendix

Lemma A.1: For any convex program P, the shrunken set P_{w,z} = A_{w,z}(P_w) is contained in its level set P_z; P_{w,z} ⊆ P_z for all y* < z ≤ w ≤ ŷ.

Proof: Let x' ∈ P_{w,z}. By the definition of P_{w,z} = A_{w,z}(P_w), there exists x ∈ P_w such that

    x' = A_{w,z}(x) = x* + ((z − y*)/(w − y*))(x − x*).

Therefore, since f is a convex function,

    f(x') ≤ ((w − z)/(w − y*)) f(x*) + ((z − y*)/(w − y*)) f(x).

We have that f(x*) = y* and f(x) ≤ w since x ∈ P_w, hence

    f(x') ≤ ((w − z)/(w − y*)) y* + ((z − y*)/(w − y*)) w = z(w − y*)/(w − y*) = z.

Therefore f(x') ≤ z, and x' ∈ S by convexity, giving the result: x' ∈ P_z for all y* < z ≤ w ≤ ŷ. ∎

Lemma A.2: For any conical convex program Q, the shrunken set Q_{w,z} = A_{w,z}(Q_w) is identical to its level set Q_z; Q_{w,z} = Q_z for all y* < z ≤ w ≤ ŷ.

Proof: (i) First, Q_{w,z} ⊆ Q_z for all y* < z ≤ w ≤ ŷ by Lemma A.1, because Q is a convex program.

(ii) Now, we show that Q_z ⊆ Q_{w,z}. If z = y*, then Q_z = {x*} = Q_{w,z} by definition. Consider y* < z ≤ w ≤ ŷ, and let x ∈ Q_z with y = g(x). Define

    x' = x* + ((w − y*)/(z − y*))(x − x*).

Note that

    A_{w,z}(x') = x* + ((z − y*)/(w − y*))(x' − x*) = x* + (x − x*) = x.

Since x = A_{w,z}(x'), if x' ∈ Q_w we would have x ∈ Q_{w,z}. It remains to show that x' ∈ Q_w.

Since (x, y) ∈ convex hull of {(x*, y*)} ∪ (S × {ŷ}), we know that there exists x̄ ∈ S such that (x, y) is on the line segment connecting (x*, y*) and (x̄, ŷ); x = (1 − r)x* + r x̄, where r = (y − y*)/(ŷ − y*) (0 ≤ r ≤ 1). By our definition of x', we have x' = (1 − s)x* + s x, where s = (w − y*)/(z − y*) and s ≥ 1. Substituting in for x,

    x' = (1 − s)x* + s(1 − r)x* + s r x̄ = (1 − sr)x* + s r x̄.

Now sr = ((w − y*)/(z − y*)) ((y − y*)/(ŷ − y*)), and y* ≤ y ≤ z, hence

    sr ≤ ((w − y*)/(z − y*)) ((z − y*)/(ŷ − y*)) = (w − y*)/(ŷ − y*) ≤ 1.

Therefore x' ∈ S, because it is on a line segment connecting x* and x̄, and S is convex. Now, g(x') ≤ (1 − sr) g(x*) + sr g(x̄), since g is a convex function. Hence g(x') ≤ (1 − sr)y* + sr ŷ ≤ w, as is easily shown. Since x' ∈ S, we have x' ∈ Q_w. ∎

Lemma A.3: Let Y_k = Z_k / Z_{k-1} be the ratio of successive improvements on a conical program Q. Then {Y_k, k = 1, 2, ...} are independent, identically distributed random variables. Also, for k = 1, 2, ...,

    E[Y_k] = n/(n+1)   and   E[Y_k²] = n/(n+2).

Proof: Now Pr(Y_k ≤ y) = E_{Z_{k-1}}[ Pr(Z_k ≤ y Z_{k-1} | Z_{k-1}) ]. Since Z_k = (Y_k^Q − y*)/(ŷ − y*), the conditional distribution of Z_k given Z_{k-1} is

    Pr(Z_k ≤ z | Z_{k-1} = w) = Pr(Y_k^Q ≤ z' | Y_{k-1}^Q = w'),

where z' = z(ŷ − y*) + y* and w' = w(ŷ − y*) + y*. When 0 ≤ z ≤ w ≤ 1,

    Pr(Y_k^Q ≤ z' | Y_{k-1}^Q = w') = v(Q_{z'})/v(Q_{w'}) = |det A_{w',z'}| = ((z' − y*)/(w' − y*))^n = (z/w)^n,

since Q is a conical convex program in n dimensions. Therefore,

    Pr(Z_k ≤ z | Z_{k-1} = w) = (z/w)^n for z ≤ w, and 1 for z > w, with 0 ≤ w, z ≤ 1.

Let f_{Z_{k-1}}(x) be the density function for Z_{k-1}. Then

    Pr(Y_k ≤ y) = ∫₀¹ Pr(Z_k ≤ x y | Z_{k-1} = x) f_{Z_{k-1}}(x) dx = ∫₀¹ (xy/x)^n f_{Z_{k-1}}(x) dx = y^n ∫₀¹ f_{Z_{k-1}}(x) dx = y^n

for 0 ≤ y ≤ 1. Thus the random variables Y_k, k = 1, 2, ..., are identically distributed.

To establish the independence of the Y_k, we need to show that Pr(Y_{k+1} ≤ y | Y_k, ..., Y_1) = Pr(Y_{k+1} ≤ y) for all y and k. This immediately implies that Y_1, Y_2, ... are independent random variables, i.e., for all k,

    Pr(Y_{k+1} ≤ y, Y_k ≤ y_k, ..., Y_1 ≤ y_1) = Pr(Y_{k+1} ≤ y) Pr(Y_k ≤ y_k) ··· Pr(Y_1 ≤ y_1).

Now, for 0 ≤ y ≤ 1 and 0 ≤ y_i ≤ 1, i = 1, 2, ..., k, and for all k,

    Pr(Y_{k+1} ≤ y | Y_k = y_k, ..., Y_1 = y_1)
      = Pr(Z_{k+1} ≤ y Z_k | Z_k = y_k y_{k-1} ··· y_1, ..., Z_1 = y_1)
      = Pr(Z_{k+1} ≤ y y_k y_{k-1} ··· y_1 | Z_k = y_k y_{k-1} ··· y_1)
        (since Z_{k+1} is conditionally independent of Z_{k-1}, ..., Z_1 given Z_k)
      = ( y y_k y_{k-1} ··· y_1 / (y_k y_{k-1} ··· y_1) )^n = y^n = Pr(Y_{k+1} ≤ y).

Finally, the rth moment of Y_k is easily determined:

    E[Y_k^r] = ∫₀¹ y^r n y^{n−1} dy = n ∫₀¹ y^{r+n−1} dy = n/(r+n).

In particular,

    E[Y_k] = n/(n+1)   and   E[Y_k²] = n/(n+2). ∎
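These moment formulas, and the product representation used in Theorem 3.2, can be checked numerically. The sketch below is our own illustration (including the arbitrarily chosen n and k); it samples the ratios by inverse transform from the law Pr(Y_k ≤ y) = y^n and compares the sample moments of Z_k with Theorem 3.2.

```python
import numpy as np

# Numerical check of Lemma A.3 / Theorem 3.2 (our own sketch): on a conical
# program in n dimensions the ratios Y_k are i.i.d. with CDF y^n, so each
# can be sampled by inverse transform as U**(1/n).
rng = np.random.default_rng(2)
n, k, reps = 5, 10, 100_000

Y = rng.random((reps, k)) ** (1.0 / n)   # i.i.d. ratios with CDF y^n
Z = Y.prod(axis=1)                        # Z_k = Y_1 * Y_2 * ... * Y_k

mean_exact = (n / (n + 1)) ** k           # E(Z_k) = (n/(n+1))^k
second_exact = (n / (n + 2)) ** k         # E(Z_k^2) = (n/(n+2))^k
print(Z.mean(), mean_exact)
print((Z**2).mean(), second_exact)
assert abs(Z.mean() - mean_exact) < 0.005
```

The sample mean of Y itself approximates n/(n+1), the r = 1 case of E[Y^r] = n/(r+n) above.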