AN ADAPTIVE RANDOM SEARCH ALGORITHM WITH LINEAR COMPLEXITY IN DIMENSION

Zelda B. Zabinsky
Industrial Engineering Program, FU-20
University of Washington
Seattle, Washington 98195

Robert L. Smith
Department of Industrial & Operations Engineering
The University of Michigan
Ann Arbor, MI 48109-2117

Technical Report 90-15
April 27, 1990

Abstract

Random search algorithms have been developed to solve global optimization problems with many local minima. Such methods are only attractive if their computational effort does not grow exponentially with dimension. In this paper, a sequential random search algorithm is analyzed on a restricted subset of convex problems and shown to have linear complexity in dimension.

Key Words: Random search, Monte Carlo optimization, algorithm complexity.

† The work of Zelda B. Zabinsky was partially supported by the Graduate School Research Fund at the University of Washington.

1 Introduction

Random search techniques show promise as efficient optimization methods for a large class of problems. Recent research [13,23] shows that it is theoretically possible for a random search algorithm to achieve a complexity that is, on the average, linear in dimension. In this paper, we define and implement a sequential random search algorithm named the Adaptive Mixing Algorithm (AMA). We analyze the computational efficiency of the AMA for a restricted subset of convex programs, and use theoretical and computational results to show that an upper bound on the number of iterations is linear in dimension.

Sequential random search procedures are designed to address a standard optimization problem,

(P) min f(x) subject to x ∈ S

where x is an n-dimensional vector, S is a convex, compact subset of R^n, and f is a real-valued continuous function defined over S. All sequential random search procedures generate a sequence of random points {X_k} which may depend on the previous point or several of the previous points. The concept underlying sequential step size algorithms [14,15,12,19,17,9,18] is to generate the next random point X_{k+1} by taking a specified step in a random direction from the previous point X_k. These algorithms are based on the iterative formula, for k = 1, 2, ...,

X_{k+1} = X_k + S_k D_k   if f(X_k + S_k D_k) < f(X_k)
X_{k+1} = X_k             otherwise

where D_k is the direction vector, S_k is the step size, and X_k is the point generated in the kth iteration. The direction vector is usually obtained by sampling from a uniform distribution on a unit hypersphere. The method of choosing the step size is specific to each algorithm. Several of the step size algorithms have been reported in the literature to have complexity that increases linearly in dimension. Complexity is measured as the number of function evaluations needed to reach a specified neighborhood of the solution.
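The iterative formula above can be sketched in code. This is a hedged illustration rather than any one published algorithm: the fixed step size `step` stands in for the algorithm-specific step-size rules, and all function names are ours.

```python
import math
import random

def random_direction(n):
    """Uniform direction on the unit hypersphere: normalized i.i.d. standard normals."""
    d = [random.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(di * di for di in d))
    return [di / norm for di in d]

def step_size_search(f, x0, step=0.1, iters=1000):
    """Generic sequential step size random search:
    accept X_k + S_k D_k only if it improves the objective, else stay at X_k."""
    x = list(x0)
    fx = f(x)
    for _ in range(iters):
        d = random_direction(len(x))
        cand = [xi + step * di for xi, di in zip(x, d)]
        fc = f(cand)
        if fc < fx:          # improving step accepted
            x, fx = cand, fc
    return x, fx
```

Because a candidate is accepted only when it improves, the sequence of objective values is monotone nonincreasing by construction.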
Three key references are: (a) Schumer and Steiglitz [19] prove that the average number of function evaluations for an optimum relative step size random search, restricted to an unconstrained hyperspherical objective function, is asymptotically linear in dimension. They also provide experimental evidence that the number of function evaluations increases linearly with dimension for their adaptive step size algorithm on the following three test functions:

Σ_{i=1}^n x_i^2,   Σ_{i=1}^n x_i^4,   Σ_{i=1}^n a_i x_i^2.

(b) Schrack and Borowski [17] report experimental results showing that doubling the dimension doubles the number of function evaluations required for another random search algorithm. They also used a hyperspherical test function, Σ_{i=1}^n x_i^2. (c) Solis and Wets [21] experimentally verified a linear correlation between the number of function evaluations and dimension for their own variation of the step size algorithm on a hyperspherical test function. They provided a justification of this linearity based on the tendency of these algorithms to maintain a constant probability of successful improvement.

All of these findings empirically suggest that sequential random search procedures are appropriate for large dimensional quadratic programs. An analysis of a random search procedure called pure adaptive search [13,23] substantiates these findings, and proves that it is theoretically possible for a random search procedure to achieve linear complexity. In Zabinsky and Smith [23], the linear complexity, measured as the expected number of iterations to get arbitrarily close to the solution, was generalized to global optimization problems. Pure adaptive search constructs a sequence of points uniformly distributed within a corresponding sequence of nested improving regions. As pointed out in [13,23], pure adaptive search is difficult to implement directly due to the problem of efficiently generating a point according to a uniform distribution in a general region. We seek an algorithm that can be implemented directly, with a theoretical analysis showing that the number of iterations is again linear in dimension.

2 Adaptive Mixing Algorithm

The Adaptive Mixing Algorithm is intended to be easy to implement and to approximate the efficiency of pure adaptive search. The question is: how can an improving point be easily generated that randomly samples the improving level set? One answer to this question is to use Hit-and-Run methods [20,1,2,7].
Hit-and-Run generates a sequence of random points by providing a random direction and then a random point in that direction. This sequence of points asymptotically approaches a uniform distribution [20]. Thus, an implementation of pure adaptive search is to use a sequence of Hit-and-Run points on each iteration. Now a decision must be made regarding the best length of the Hit-and-Run sequence per iteration. In fact, a class of algorithms can be parametrically defined by the length of the Hit-and-Run sequences. In one extreme, when the Hit-and-Run sequences are very long and provide a close approximation to sampling from a uniform distribution, we have a good approximation to pure adaptive search and can expect the number of iterations to be linear in dimension. In the other extreme, consider an algorithm where the Hit-and-Run sequences are very short, with a length of one. Now what is the expected number of iterations? We will call the algorithm with Hit-and-Run sequences of length one the Adaptive Mixing Algorithm. The analysis will express the number of iterations as a

function of dimension.

The Adaptive Mixing Algorithm also fits into the framework of a sequential random search procedure. Given an initial feasible point, X_0, AMA generates a random direction vector that originates at X_0, and then generates an improving and feasible point X_1 uniformly distributed along the line segment associated with that direction. Thus, the next point is in a random direction and a random distance from the previous point. The procedure proceeds iteratively to produce a sequence of improving points until satisfying a stopping criterion. More formally,

Adaptive Mixing Algorithm (AMA)

Step 0. Initialize X_0 ∈ S, Y_0 = f(X_0), and k = 0.

Step 1. Generate a direction vector D_k uniformly distributed over a unit hypersphere.

Step 2. Generate X_{k+1} uniformly on L_k, the improving feasible line segments in direction D_k, i.e.

L_k = {x : x = X_k + λD_k, λ ∈ R, f(x) < f(X_k) and x ∈ S}.

Set Y_{k+1} = f(X_{k+1}). If L_k = ∅, go to Step 1.

Step 3. If the stopping criterion is met, stop. Otherwise increment k and return to Step 1.

Details of the implementation of the AMA are given in the appendix. Generating random directions, as defined in Step 1, is straightforward. However, implementing Step 2 is more involved, and we experimented with several approximating forms. Our implementation tries to find a random point on the improving feasible line segment, L_k. If it fails after several tries, it returns to Step 1 to generate another random direction. The computer program stops if no improving point is found after a user-defined maximum number of tries.

3 Complexity Analysis

3.1 Definitions and Notation

Before proceeding with the analysis, we define a complexity measure for evaluating AMA. For the optimization problem (P), let (x_*, y_*) denote the solution, where

x_* = argmin_{x ∈ S} f(x)

and

y_* = f(x_*) = min_{x ∈ S} f(x).

The value x_* need not be unique; any point x_* ∈ S that attains the minimum, f(x_*) = y_*, may be used for a solution point. It will also be convenient to define the maximum,

y^* = max_{x ∈ S} f(x).

Let P_y denote the level set of the problem (P) at objective function value y,

P_y = {x : f(x) ≤ y, x ∈ S} for y_* ≤ y ≤ y^*.

Let {X_k}, X_k ∈ S, k = 0, 1, 2, ... be the sequence of points generated by the Adaptive Mixing Algorithm on (P), with the subscript k denoting iteration count. Let {Y_k}, y_* ≤ Y_k ≤ y^*, k = 0, 1, 2, ... be the corresponding sequence of objective function values generated by the algorithm on (P), Y_k = f(X_k). The sequences {X_k} and {Y_k} are random variables due to the stochastic nature of a Monte Carlo algorithm.

The distribution of improvement is defined to be the probability that the kth objective function value Y_k is at most y. This is the probability that the kth point lies within the level set P_y, so that

P(Y_k ≤ y) = P(X_k ∈ P_y) for k = 0, 1, 2, ... and y_* ≤ y ≤ y^*.

The conditional distribution of improvement, denoted

P(Y_{k+1} ≤ y | Y_k = w) for k = 0, 1, 2, ... and y_* ≤ y ≤ y^*,

is defined to be the probability of obtaining an objective function value of at most y in a single iteration, starting from objective function value w.

Let {Z_k}, 0 ≤ Z_k ≤ 1, k = 0, 1, 2, ... be the sequence of normalized improvements associated with the points generated by the algorithm on (P), where

Z_k = (Y_k − y_*)/(y^* − y_*).

The probability of achieving a normalized improvement of z by the kth iteration is

P(Z_k ≤ z) = P(Y_k ≤ y) for 0 ≤ z ≤ 1 and z = (y − y_*)/(y^* − y_*).

We say an algorithm has achieved m-fold improvement on (P) when the normalized improvement is within 1/m of the optimal value, for m > 1:

0 ≤ Z_k ≤ 1/m

or equivalently, when y_* ≤ Y_k ≤ y_* + (1/m)(y^* − y_*).
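The normalized improvement and the m-fold criterion translate directly into code. The following is a small utility sketch; the function names are ours.

```python
def normalized_improvement(y, y_min, y_max):
    """Z = (y - y_*) / (y^* - y_*), which lies in [0, 1]."""
    return (y - y_min) / (y_max - y_min)

def m_fold_achieved(y, y_min, y_max, m):
    """m-fold improvement: the normalized improvement is within 1/m of the optimum."""
    return normalized_improvement(y, y_min, y_max) <= 1.0 / m
```

For example, with y_* = 0 and y^* = 10, an objective value of 0.05 achieves 100-fold improvement, since its normalized improvement 0.005 is below 1/100.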

As a measure of computational complexity, an iteration count, K_{α,m}, is defined as the smallest number of iterations needed to achieve m-fold improvement with 1 − α certainty,

K_{α,m} = min_{k=1,2,...} {k : P(Z_k ≤ 1/m) ≥ 1 − α}

for m > 1 and 0 < α < 1. This iteration count, K_{α,m}, will be used as a measure of computational complexity throughout the paper. It is the same complexity measure used in [13]. Other measures of complexity count the number of function evaluations, gradient evaluations, or arithmetic operations, whereas K_{α,m} counts the number of iterations. The use of normalized improvement in K_{α,m} is similar to the error measure of an upper bound used in [3]. They also scale a deviation from the optimum by a range of objective function values. A contrasting complexity measure is the rate of convergence, which characterizes the tail of the sequence converging to the solution [10]. Instead of measuring the tail, K_{α,m} measures the number of iterations needed to get arbitrarily close to the solution.

3.2 Theoretical Analysis

We now turn to an analysis of the complexity of the Adaptive Mixing Algorithm. For the analysis, we define Z_0 = 1 and Y_0 = y^* as initial starting conditions. To begin the analysis, we must specify the conditional distribution of improvement for the Adaptive Mixing Algorithm. The conditional probability of making a specified improvement on a single iteration depends on the position of the current point,

P(Y_{k+1} ≤ y | Y_k = w) = E[P(Y_{k+1} ≤ y | X_k, Y_k = w)] = E[P(X_{k+1} ∈ P_y | X_k, Y_k = w)].

This probability can be expressed in terms of the random direction generated from the point x, and the ratio of the lengths of improving line segment(s) in that direction,

P(X_{k+1} ∈ P_y | X_k = x, Y_k = w) = ∫ ||L_{P_y}(d, x)|| / ||L_{P_w}(d, x)|| dF_D(d)

where ||L_{P_y}(d, x)|| is the combined length of the line segments formed by the intersection of the level set P_y with the direction vector d originating at x. Also, F_D(·) is the cumulative distribution function for the random direction vector.
In words, the probability of making an improvement to y, starting at point x with f(x) = w, is the probability of landing within P_y given direction d, i.e. ||L_{P_y}(d, x)|| / ||L_{P_w}(d, x)||, integrated over all feasible improving directions.

For general convex programs, the conditional probability of improvement depends on the exact location of X_k, which makes it difficult to derive a general expression. However, for a

subset of convex programs with symmetrical level sets, we can continue the analysis. This class of programs includes all the test functions for which linearity has been cited [19,17,21]. Consider the class of quadratic programs with "spherical" level sets, that is, unconstrained quadratic programs with a constant diagonal Hessian (cI, where c > 0 and I is the identity matrix), which we will refer to as spherical quadratic programs. With spherical quadratic programs, the symmetry of the level sets ensures that the conditional probability of improvement is invariant for all points on a level set;

P(X_{k+1} ∈ P_y | X_k = x, Y_k = w) = P(X_{k+1} ∈ P_y | X_k = x', Y_k = w)

for any x and x' with f(x) = f(x') = w. The conditional probability of improvement for the Adaptive Mixing Algorithm on spherical quadratic programs is:

P(Y_{k+1} ≤ y | Y_k = w) = E[P(X_{k+1} ∈ P_y | X_k, Y_k = w)]
= P(X_{k+1} ∈ P_y | X_k = x, Y_k = w) for any x with f(x) = w,
= ∫ ||L_{P_y}(d, x)|| / ||L_{P_w}(d, x)|| dF_D(d).

Thus, the conditional probability of improvement depends on the ratio of the lengths of line segments and the objective function value (w), but not on the specific point x.

Of course, this class of spherical quadratic programs may be trivial to solve with conventional methods, but the Adaptive Mixing Algorithm is not meant to compete with quadratic programming algorithms. However, if a spherical quadratic program were perturbed slightly, due to noisy data for example, or perhaps a sinusoidal component, it might be very difficult to solve with conventional approaches. On the other hand, we would expect the performance of AMA to be insensitive to perturbations of this type.

Finally, to obtain an upper bound on AMA's iteration count K_{α,m}, we define a worst case problem. For any spherical quadratic program (P), define (Q) to be its corresponding conical program,

(Q) min g(x) subject to x ∈ S

where g(x) = inf{y : (x, y) ∈ C}, and C = convex hull of (x_*, y_*) and (S, y^*).
This conical program is a worst case problem, in the sense that K_{α,m} on (Q) is greater than or equal to K_{α,m} on (P). The proof is included in the appendix (see also [13]). Now, let μ be the expected ratio of normalized improvement on the conical program,

μ = E[Z^Q_{k+1}/Z^Q_k].

For the AMA on the class of conical spherical quadratic programs, μ does not depend on the value of k. In fact, Z^Q_{k+1}/Z^Q_k is independent and identically distributed for all k = 0, 1, 2, ...

(refer to Lemma 3.0 in the appendix). However, μ is a function of the dimension of the problem and may be written

μ(n) = E[Z^Q_{k+1}/Z^Q_k].

We are now ready to state an analytical bound on complexity for the AMA.

Theorem 3.1 Given the Adaptive Mixing Algorithm and any spherical quadratic program (P) with μ(n) for its conical program (Q), then for any m > 1 and 0 < α < 1,

K_{α,m} ≤ 2 ln(m(1 + α^{-1/2})) / ln(1/μ(n)).

Proof: See Appendix.

Theorem 3.1 provides an upper bound on the iteration count as a function of dimension (n), the amount of improvement (m), and the degree of certainty (α). To see how efficient the Adaptive Mixing Algorithm is in terms of dimension, we need to hold m and α constant, and express μ in terms of dimension. Unfortunately, μ(n) is difficult to express analytically. However, we can experimentally approximate it as a function of dimension.

3.3 Parameter Estimation

The parameter μ(n) = E[Z^Q_{k+1}/Z^Q_k] was estimated by collecting normalized improvements Z_k from 2 runs of 100 iterations each on a conical program. Thus, the estimates μ̂(n) were based on 200 ratios at each dimension, n = 2, 4, 6, 8, 10, 20, 30, 40, and 50. The estimates are presented in Table 1 (analytically, μ(1) = 0.5). The experiments were run on the following conical program,

(Q.1) minimize 10 (Σ_{i=1}^n (x_i − 5)^2)^{1/2}
      subject to 0 ≤ x_i ≤ 10 for i = 1, 2, ..., n.

Notice that the test function (Q.1) is a conical program corresponding to a hyperspherical test function, such as (P.1) defined later. Now the estimates μ̂(n) will be used to establish the effect of dimension on the bound on iteration count. As stated in Theorem 3.1,

K_{α,m} ≤ 2 ln(m(1 + α^{-1/2})) / ln(1/μ(n)),

so the only term involving dimension is 1/ln(1/μ(n)). Figure 1 plots this function of the estimates, 1/ln(1/μ̂(n)),

versus dimension. There is one point per run. A linear model was fit to the experimental data, yielding

1/ln(1/μ̂(n)) = (3.5 ± 0.2)n + (3.2 ± 4.8).

Adding a quadratic term to the above regression did not significantly improve the fit. The quadratic model fit to the experimental data was

1/ln(1/μ̂(n)) = (−0.02 ± 0.02)n^2 + (4.5 ± 0.8)n + (3.2 ± 6.7).

Using the coefficients from the linear model, we conclude that the bound on iteration count can be well approximated by a linear function of dimension,

K_{α,m} ≤ 2(3.5n + 3.2) ln(m(1 + α^{-1/2})).

As an example, Table 2 provides this estimated bound on iteration count, K_{α,m}, for spherical quadratic programs in various dimensions, with a million-fold improvement (m = 10^6) and 99% certainty (α = 0.01). The numbers were calculated as the next greatest integer of 2(3.5n + 3.2) ln(m(1 + α^{-1/2})). Similar calculations were made to graph the bound on K_{α,m} in Figures 2 and 3.

Thus far, the analysis has concentrated on evaluating the effect of dimension on the iteration count while keeping m and α constant. It is also interesting to note the effect of the accuracy of solution (m and α) on the iteration count. The m-fold improvement and 1 − α degree of certainty are both represented logarithmically in the upper bound on iteration count. This might explain the experience with the AMA and other algorithms that large improvements are made very efficiently, while refinements of the solution become costly.

4 Computational Results

4.1 Computational Experiments

We report two computational experiments to observe the effect of dimension on computation and to give a rough comparison to previous algorithms [17,19,21]. The first experiment was on the "conical" convex program (Q.1). This function was chosen because it is a worst case problem for AMA, as previously discussed. The Adaptive Mixing Algorithm was run using ten random seeds at each dimension, from n = 2 to n = 50, to compute an average number of iterations.
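The estimated bound of Section 3.3 can be evaluated directly when comparing it against experimental iteration counts. The following sketch (the function name is ours) reproduces the calculation behind Table 2 and the bound lines in Figures 2 and 3, with the fitted coefficients as defaults:

```python
import math

def ama_iteration_bound(n, m, alpha, c1=3.5, c0=3.2):
    """Estimated bound on K_{alpha,m}: the next greatest integer of
    2(c1*n + c0) ln(m(1 + alpha^{-1/2})), where c1 and c0 are the fitted
    coefficients of the linear model for 1/ln(1/mu(n))."""
    return math.ceil(2.0 * (c1 * n + c0) * math.log(m * (1.0 + alpha ** -0.5)))
```

For instance, n = 10 with m = 10^6 and α = 0.01 gives a bound on the order of 10^3 iterations, and the bound grows linearly as n increases.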
The experiment used a stopping criterion of 100-fold improvement, and a starting position on a side of the feasible region, (5, 5, ..., 5, 10). The results are shown in Figure 2, where the average number of iterations is graphed as a function of dimension. The theoretical upper bound on complexity (as derived in Section 3) is indicated by the heavy line. This bound is for 50% certainty of 100-fold

improvement (K_{α,m} with m = 100 and α = 0.5), while the data shows 100% certainty of 100-fold improvement. All of the observed points satisfy the bound. It is clear from this experiment that the upper bound is not tight. The light line on the graph represents a linear regression on the mean number of iterations. The regression coefficients were 27n − 83 with r^2 of 0.994.

The second experiment was on a hyperspherical problem. This function was chosen because it was the only test function common to [17,19,21] that had been varied by dimension. The Adaptive Mixing Algorithm was run on the following hyperspherical problem,

(P.1) minimize Σ_{i=1}^n x_i^2
      subject to −10 ≤ x_i ≤ 10 for i = 1, 2, ..., n

with a starting point on a side of the feasible region, (10, 0, ..., 0). Five random seeds were used at each dimension, from n = 2 to n = 40. The experiment used a stopping criterion of 1,000-fold improvement. The graph in Figure 3 provides the average number of iterations as a function of dimension. The upper bound on complexity is for 50% certainty of 1,000-fold improvement (K_{α,m} with m = 10^3 and α = 0.5) and is indicated by the heavy line in the figure. The data of 1,000-fold improvement with 100% certainty again satisfies the bound. Again, the light line on the graph is the linear regression line on the mean number of iterations, estimated as 1 n − 15 with r^2 of 0.993. To summarize, the experiments indicate that the complexity of the AMA is well approximated by a linear function in dimension.

4.2 Comparisons Between Algorithms

A direct comparison between the AMA and other algorithms is difficult for a number of reasons. The first difficulty in comparing these algorithms is that they were tested on different problems. Although the conical program gives a worst case and hence an upper bound on complexity, we also tested the hyperspherical test function to be consistent with [17,19,21].
A second difficulty in comparing the performance of these algorithms is due to the dependency on the accuracy of solution. Our complexity analysis shows an exact dependency of the number of iterations on the m-fold improvement to be achieved. Because of this predicted dependency, the AMA used 10^3-fold improvement as a stopping criterion on the hyperspherical test function to be consistent with Solis and Wets [21]. Schumer and Steiglitz [19] used ε = 10^-8 as a stopping criterion. This is only approximately 10^8-fold improvement due to the various starting positions used. Schrack and Borowski [17] used the rate of average convergence as a comparison, instead of a specified level of improvement.

A third difficulty in comparing algorithms across dimension is that they were all tested over different ranges of dimension. The AMA and Schumer and Steiglitz's algorithm were tested on dimensions up to 40 on the hyperspherical function, while Solis and Wets' algorithm was tested up to 10 dimensions. Schrack and Borowski only compared n = 5 with n = 10.

The fourth difficulty in comparing performance was that [19,21] both reported the average number of function evaluations, while the analysis of the AMA focused on the number of iterations. For comparison purposes, the average number of function evaluations for the AMA on the hyperspherical test function, to achieve 10^3-fold improvement for dimensions up to 40, was approximately 137n. Schumer and Steiglitz [19] reported the average number of function evaluations of their Adaptive Step Size Algorithm on a hyperspherical problem, to get within ε = 10^-8 of the solution for dimensions up to 40, as approximated by 80n. Solis and Wets [21] tested their version of an adaptive step size algorithm on the same hyperspherical problem up to 10 dimensions, and reported the average number of function evaluations to achieve 10^3-fold improvement as roughly 33n. Schrack and Borowski [17] did not attempt to quantify the effect of dimension, but reported "for a given reduction of the function evaluations, a doubling of dimension doubles the number of function evaluations".

The attraction of the AMA is that it is relatively simple to implement, and has potential for more efficient implementations. The focus in this paper and in the development of the algorithm was aimed at iterations, rather than function evaluations. This gives some support to the value of short Hit-and-Run sequences within an optimizing framework. Also, in comparison with the linearity results reported, AMA has a linear bound on complexity in terms of iterations for a broader class of functions than just the hyperspherical function.

5 Summary

The Adaptive Mixing Algorithm has been shown to have a search effort that is nearly linear in dimension for the class of spherical quadratic programs. The functional bound on complexity also includes the accuracy of solution as a parameter, with a specified degree of certainty.
The class of spherical quadratic programs includes test functions for which other adaptive random search algorithms have reported linear complexity in dimension. Thus the theoretical and computational results describing the near linear search effort of the AMA may provide some intuitive explanation for the performance of similar algorithms. When viewing AMA as an optimizing version of Hit-and-Run, the complexity results give some insight into the value of short sequences while maintaining an iteration count that is linear in dimension.

Intuitively, why might the AMA be linear in dimension? The key to the AMA's performance, from the complexity analysis, is the proportional reduction in volume of the region to be searched for the optimum. This is consistent with the intuition behind the volume reduction of pure adaptive search, which achieves linear complexity on global optimization problems [23]. Thus the AMA can be expected to exhibit the same agreeable performance on a nonlinear, possibly global, optimization problem with level sets that are approximately spherical in shape. Conventional approaches that rely on the objective function's derivative, by contrast, will most likely be very sensitive to the starting position, and get easily trapped in a

local minimum that may be far from the global minimum. The real usefulness of the AMA is for this type of large dimensional perturbed quadratic program. An example might be a function that is basically convex but with minor local minima, such as a function with a sine wave component or an estimate of noisy data, and hence a global optimization problem. This research provides support for the hope that there exists a random search algorithm for global optimization that is linear in dimension.

References

[1] H.C.P. Berbee, C.G.E. Boender, A.H.G. Rinnooy Kan, C.L. Scheffer, R.L. Smith, and J. Telgen, "Hit-and-Run Algorithms for the Identification of Nonredundant Linear Inequalities," Mathematical Programming 37 (1987) 184-207.

[2] A. Boneh, "A Probabilistic Algorithm for Identifying Redundancy by a Random Feasible Point Generator (RFPG)," in: M.H. Karwan, V. Lotfi, J. Telgen and S. Zionts, eds., Redundancy in Mathematical Programming (Springer-Verlag, Berlin, 1983).

[3] G. Cornuejols, M.L. Fisher, and G.L. Nemhauser, "Location of bank accounts to optimize float: an analytic study of exact and approximate algorithms," Management Science 23 (1977) 789-810.

[4] L.C.W. Dixon and G.P. Szegö, eds., Towards Global Optimization (North-Holland, Amsterdam, 1975).

[5] L.C.W. Dixon and G.P. Szegö, eds., Towards Global Optimization 2 (North-Holland, Amsterdam, 1978).

[6] W. Feller, An Introduction To Probability Theory And Its Applications, Volume 2, 2nd Edition (John Wiley and Sons, New York, 1971).

[7] M.H. Karwan, V. Lotfi, J. Telgen, and S. Zionts, eds., Redundancy in Mathematical Programming (Springer-Verlag, Berlin, 1983) 108-134.

[8] D.E. Knuth, The Art of Computer Programming, Vol. 2 (Addison-Wesley, Reading, Massachusetts, 1969) 116.

[9] J.P. Lawrence III and K. Steiglitz, "Randomized pattern search," IEEE Transactions On Computers C-21 (1972) 382-385.

[10] D.G. Luenberger, Introduction To Linear And Nonlinear Programming, 2nd Edition (Addison-Wesley, Reading, Massachusetts, 1984).

[11] K.G. Murty, Linear Programming (John Wiley and Sons, New York, 1983).

[12] V.A. Mutseniyeks and L. Rastrigin, "Extremal control of continuous multi-parameter systems by the method of random search," Engineering Cybernetics 1 (1964) 82-90.

[13] N.R. Patel, R.L. Smith, and Z.B. Zabinsky, "Pure adaptive search in Monte Carlo optimization," Mathematical Programming 43 (1988) 317-328.

[14] L.A. Rastrigin, "Extremal control by the method of random scanning," Automation and Remote Control 21 (1960) 891-896.

[15] L.A. Rastrigin, "The convergence of the random method in the extremal control of a many-parameter system," Automation and Remote Control 24 (1963) 1337-1342.

[16] S.M. Ross, Stochastic Processes (John Wiley and Sons, New York, 1983).

[17] G. Schrack and N. Borowski, "An experimental comparison of three random searches," in: F. Lootsma, ed., Numerical Methods For Nonlinear Optimization (Academic Press, London, 1972) pp. 137-147.

[18] G. Schrack and M. Choit, "Optimized Relative Step Size Random Searches," Mathematical Programming 10 (1976) 270-276.

[19] M.A. Schumer and K. Steiglitz, "Adaptive step size random search," IEEE Transactions On Automatic Control AC-13 (1968) 270-276.

[20] R.L. Smith, "Efficient Monte Carlo procedures for generating points uniformly distributed over bounded regions," Operations Research 32 (1984) 1296-1308.

[21] F.J. Solis and R.J.-B. Wets, "Minimization by random search techniques," Mathematics Of Operations Research 6 (1981) 19-30.

[22] Z.B. Zabinsky, Computational Complexity Of Adaptive Algorithms In Monte Carlo Optimization (Dissertation, The University of Michigan, Ann Arbor, MI, 1985).

[23] Z.B. Zabinsky and R.L. Smith, "Pure adaptive search in global optimization," Mathematical Programming (forthcoming).

[24] W.I. Zangwill, Nonlinear Programming: A Unified Approach (Prentice-Hall, New Jersey, 1969).

[25] G. Zoutendijk, Methods of Feasible Directions (Elsevier Publishing Company, Amsterdam, 1960).

[26] G. Zoutendijk, "Nonlinear Programming: A Numerical Survey," SIAM Journal On Control Theory And Applications 4 (1966) 194-210.

Appendix

A Implementation Details

The implementation of the first step of the Adaptive Mixing Algorithm is straightforward. The direction vector D_k is of unit length, sampled from a uniform distribution over an n-dimensional hypersphere as follows,

D_k = (d_1, d_2, ..., d_n) (Σ_{i=1}^n d_i^2)^{-1/2}

where the d_i, i = 1, 2, ..., n are sampled independently from a standard normal distribution, N(0,1) [8].

The implementation of Step 2 is more involved. The implementation used in our computational experiments assumed the original problem had linear constraints, as follows:

minimize f(x)
subject to Ax ≤ b

where f(x) is a continuous function, A is an m × n matrix, x is an n-dimensional vector, and b is an m-dimensional vector.

Step 2 generates X_{k+1} uniformly over L_k using a line search. Line search methods are themselves an important area of research [10], but will be kept simple here. The search proceeds as follows. First, the direction D_k is checked to be improving and feasible a distance ε away by evaluating the following equations:

(1) f(X_k + εD_k) < f(X_k), and
(2) X_k + εD_k ∈ S,

where ε represents a small step (the value of ε is a parameter in the computer program; ε = 10^-5 was used in most experiments). If both equations are satisfied, then a step of length ε in direction D_k is possible. Equation (1) determines if direction D_k is improving. If it is not satisfied, then −D_k is checked. In the general case, for regions that are differentiable at X_k, the probability that both directions ±D_k are not improving approaches zero as ε approaches zero. There is a positive probability that both directions are not improving if the region is not differentiable at X_k, or ε is large relative to the curvature of the improving region at X_k. If no improving direction is found after a maximum number of tries, the computer program terminates with a message to the user. Equation (2) determines if an ε step in direction D_k is feasible.
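The pieces of this appendix (the normalized-Gaussian direction of Step 1, the ε check of equations (1) and (2), and the minimum ratio test with the Λ-shrinking line search described below) can be assembled into one sketch. This is a hedged illustration under our own naming, specialized to box constraints lo ≤ x_i ≤ hi rather than general Ax ≤ b:

```python
import math
import random

def random_direction(n):
    """Step 1: uniform direction on the unit hypersphere,
    D_k = (d_1,...,d_n)(sum d_i^2)^(-1/2) with d_i ~ N(0,1)."""
    d = [random.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(di * di for di in d))
    return [di / norm for di in d]

def max_feasible_distance(x, d, lo, hi):
    """Minimum ratio test on the box lo <= x_i <= hi: the largest Lambda
    such that x + theta*d is feasible for all 0 <= theta <= Lambda."""
    lam = math.inf
    for xi, di in zip(x, d):
        if di > 0:
            lam = min(lam, (hi - xi) / di)
        elif di < 0:
            lam = min(lam, (lo - xi) / di)
    return lam

def ama(f, x0, lo, hi, iters=500, eps=1e-5, max_tries=50):
    """Adaptive Mixing Algorithm sketch: Step 2 draws delta ~ U(0, Lambda)
    and shrinks Lambda to delta after each failure (valid for convex f)."""
    x = list(x0)
    fx = f(x)
    for _ in range(iters):
        d = random_direction(len(x))
        # Equations (1)-(2): an eps step along d or -d must improve and stay feasible.
        step = [xi + eps * di for xi, di in zip(x, d)]
        if not (f(step) < fx and all(lo <= si <= hi for si in step)):
            d = [-di for di in d]
            step = [xi + eps * di for xi, di in zip(x, d)]
            if not (f(step) < fx and all(lo <= si <= hi for si in step)):
                continue                  # back to Step 1 with a new direction
        lam = max_feasible_distance(x, d, lo, hi)
        for _ in range(max_tries):        # equations (3)-(4)
            delta = random.uniform(0.0, lam)
            cand = [xi + delta * di for xi, di in zip(x, d)]
            fc = f(cand)
            if fc < fx:
                x, fx = cand, fc
                break
            lam = delta                   # retain delta as the new Lambda
    return x, fx
```

The Λ-shrinking step relies on the improving set along the line being an interval when f is convex, so a failed δ certifies that no improving point lies beyond it.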
In particular, if it is not satisfied, the algorithm returns to Step 1 to generate another direction. Because the points are chosen randomly, the probability of landing on

a constraint is zero; however, the point may be within ε of the edge in direction D_k. This problem of "jamming" seemed especially evident in solving linear programs. Zangwill [24] (see also [25,26]) discusses jamming, where the points cluster in a corner of the feasible region, in the context of feasible direction methods. These authors also discuss antijamming procedures such as a perturbation method. Such an antijamming technique has not been incorporated into this version of the Adaptive Mixing Algorithm, although it is intended for future research.

Given an improving feasible direction D_k, the line search next generates an improving feasible point X_{k+1} uniformly within L_k satisfying:

(3) f(X_{k+1}) < f(X_k), and
(4) X_{k+1} ∈ S.

To ensure feasibility (equation (4)) for the problem with linear constraints, a minimum ratio test is used to determine the maximum feasible distance, Λ, in direction D_k such that X_k + θD_k is feasible for all 0 ≤ θ ≤ Λ [11]. Then, a random distance δ is generated according to a uniform distribution on (0, Λ) until equation (3) is satisfied. Thus the new point X_k + δD_k is improving, and is accepted as X_{k+1}.

One way to increase the efficiency of the line search for problems with convex objective functions is to retain the value of δ as the new maximum distance, Λ = δ, and then generate another distance until an improving point is found. This modified line search will still produce uniformly distributed improving points due to the nesting of the convex level sets. This modification was used in the two computational experiments reported earlier. There are many possible line searches that may be more efficient than this simple line search, but our analysis in terms of number of iterations is valid for any line search that implements Step 2 of the algorithm.

B Lemma 3.0

Lemma 3.0 Given the Adaptive Mixing Algorithm and any conical spherical quadratic program (Q), the ratios of normalized improvements are independent and identically distributed,

Z_{k+1}/Z_k i.i.d.
for k = 0, 1, 2, .... Let the mean and variance of the ratio of normalized improvements be written as E[Z_{k+1}/Z_k] = μ and Var[Z_{k+1}/Z_k] = σ²; then the mean and variance of the normalized improvement are

E[Z_k] = μ^k, and

Var[Z_k] = (σ² + μ²)^k − μ^{2k} = (σ² + μ²)^k (1 − (μ²/(σ² + μ²))^k).

Proof: First we reiterate that this proof applies to any conical spherical quadratic program (Q). Then, for k = 0, 1, 2, ... and 0 ≤ z ≤ 1,

P(Z_{k+1}/Z_k ≤ z) = E_{Z_k}[P(Z_{k+1} ≤ zZ_k | Z_k)]
= ∫_0^1 P(Z_{k+1} ≤ zx | Z_k = x) f_{Z_k}(x) dx,

where f_{Z_k}(x) is the density function of Z_k, and since Z_k = (Y_k − y_*)/(y^* − y_*),

= ∫_0^1 P(Y_{k+1} ≤ (zx)' | Y_k = (x)') f_{Z_k}(x) dx,

where (x)' = x(y^* − y_*) + y_* and (zx)' = zx(y^* − y_*) + y_*. As discussed in Section 3, the conditional probability of improvement on (Q) is invariant and depends on the ratio of the lengths of line segments,

P(Y_{k+1} ≤ (zx)' | Y_k = (x)') = ∫ ||L_{Q_{(zx)'}}(d, p)|| / ||L_{Q_{(x)'}}(d, p)|| dF_D(d)

for any point p with f(p) = (x)'. Now, since the ratio of lengths is preserved by a linear transformation, this ratio is constant for all 0 < x ≤ 1 and depends only on the scaling factor z. Thus the conditional probability P(Y_{k+1} ≤ (zx)' | Y_k = (x)') is constant for all x and depends only on z. We write

P(Y_{k+1} ≤ (zx)' | Y_k = (x)') = L(z).

Using this notation in the expression above,

P(Z_{k+1}/Z_k ≤ z) = ∫_0^1 P(Y_{k+1} ≤ (zx)' | Y_k = (x)') f_{Z_k}(x) dx = ∫_0^1 L(z) f_{Z_k}(x) dx = L(z) ∫_0^1 f_{Z_k}(x) dx = L(z)

for 0 ≤ z ≤ 1. Thus the random variables Z_{k+1}/Z_k for k = 0, 1, 2, ... are identically distributed.
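Before turning to independence, the Step 2 line search described earlier (a minimum-ratio-style bound Λ on the feasible distance, uniform sampling of δ on (0, Λ), and the convex-case shrink Λ ← δ on rejection) can be sketched in runnable form. The test objective f(x) = ||x||² on a ball, the starting point, and the seed below are illustrative stand-ins of our choosing, not the (P) or (Q) of the paper:

```python
import math
import random

def uniform_direction(n, rng):
    """Direction uniformly distributed on the unit hypersphere."""
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(c * c for c in g))
    return [c / norm for c in g]

def ama_step(x, rng, radius=1.0):
    """One iteration of the sketched line search on f(x) = ||x||^2
    over the ball ||x|| <= radius (an illustrative test problem).

    Step 1: draw random directions until one is improving.
    Step 2: bound the feasible distance by Lam, sample delta ~ U(0, Lam),
            and shrink Lam <- delta on rejection; this is valid for
            convex f because the level sets are nested."""
    fx = sum(c * c for c in x)
    while True:
        d = uniform_direction(len(x), rng)
        xd = sum(a * b for a, b in zip(x, d))
        if xd < 0.0:          # for this f, improving directions have x.d < 0
            break
    # Largest Lam with ||x + Lam d|| <= radius (root of a quadratic in Lam).
    lam = -xd + math.sqrt(xd * xd + radius * radius - fx)
    while True:
        delta = rng.uniform(0.0, lam)
        y = [a + delta * b for a, b in zip(x, d)]
        if sum(c * c for c in y) < fx:
            return y
        lam = delta           # retain delta as the new Lam and resample

rng = random.Random(7)
x = [0.9] + [0.0] * 9         # X0 with f(X0) = 0.81, n = 10
ratios = []
for _ in range(30):
    x_next = ama_step(x, rng)
    ratios.append(sum(c * c for c in x_next) / sum(c * c for c in x))
    x = x_next
print(sum(ratios) / len(ratios))
```

With y_* = 0 and y^* = f(X_0) here, the ratio f(X_{k+1})/f(X_k) equals Z_{k+1}/Z_k, so the printed sample mean is a crude estimate of μ for this toy problem.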

To establish independence of the ratios Z_{k+1}/Z_k, we need to show that

P(Z_{k+1}/Z_k ≤ z | Z_k/Z_{k−1}, ..., Z_2/Z_1, Z_1/Z_0) = P(Z_{k+1}/Z_k ≤ z)

for all k = 0, 1, 2, ... and 0 ≤ z ≤ 1. Now, since Z_0 = 1 we have, for 0 ≤ z ≤ 1 and 0 ≤ z_i ≤ 1, i = 1, 2, ..., k, and for all k = 0, 1, 2, ...,

P(Z_{k+1}/Z_k ≤ z | Z_k/Z_{k−1} = z_k, ..., Z_1/Z_0 = z_1)
= P(Z_{k+1} ≤ zZ_k | Z_k = z_k z_{k−1} ⋯ z_1, ..., Z_1 = z_1)
= P(Z_{k+1} ≤ z z_k z_{k−1} ⋯ z_1 | Z_k = z_k z_{k−1} ⋯ z_1)
    since Z_{k+1} is conditionally independent of Z_{k−1}, ..., Z_1 given Z_k
= P(Z_{k+1} ≤ zx | Z_k = x)   where x = z_k z_{k−1} ⋯ z_1
= P(Y_{k+1} ≤ (zx)' | Y_k = (x)')
= L(z)
= P(Z_{k+1}/Z_k ≤ z).

Thus the random variables Z_{k+1}/Z_k for k = 0, 1, 2, ... are independent.

Now that the ratios Z_{k+1}/Z_k have been shown to be independent and identically distributed, let μ and σ² be the mean and variance, respectively, of their common distribution. Notice the mean and variance do not depend on the value of k. We now derive the mean and variance of the random variable Z_k as a function of k, μ, and σ². Since Z_0 = 1,

Z_k = (Z_k/Z_{k−1})(Z_{k−1}/Z_{k−2}) ⋯ (Z_1/Z_0)

and by independence,

E[Z_k] = E[Z_k/Z_{k−1}] E[Z_{k−1}/Z_{k−2}] ⋯ E[Z_1/Z_0] = μ^k.

Similarly, the second moment can be written

E[Z_k²] = E[(Z_k/Z_{k−1})²] E[(Z_{k−1}/Z_{k−2})²] ⋯ E[(Z_1/Z_0)²].

Also,

Var[Z_{k+1}/Z_k] = σ² = E[(Z_{k+1}/Z_k)²] − μ²

implies that, for k = 0, 1, 2, ...,

E[(Z_{k+1}/Z_k)²] = σ² + μ²,

hence

E[Z_k²] = (σ² + μ²)^k.

Thus,

Var[Z_k] = E[Z_k²] − E[Z_k]² = (σ² + μ²)^k − μ^{2k},

and factoring out the (σ² + μ²)^k term, we conclude with

Var[Z_k] = (σ² + μ²)^k (1 − (μ²/(σ² + μ²))^k). □

C Theorem 3.1

Theorem 3.1 Given the Adaptive Mixing Algorithm and any spherical quadratic program (P) with μ(n) for its conical program (Q), then for any m > 1 and 0 < α < 1,

K_{α,m} ≤ 2 ln(m(1 + α^{−1/2})) / ln(1/μ(n)).

Proof: The proof is separated into four parts that establish that (Q) is a more difficult problem than (P), and uses (Q) to provide an upper bound on K_{α,m}. The superscripts P and Q are used in the notation to specify the problem. The first part shows conditional dominance,

P(Z^Q_{k+1} ≤ z | Z^Q_k = w) ≤ P(Z^P_{k+1} ≤ z | Z^P_k = w)

for all k = 0, 1, 2, ... and 0 ≤ z, w ≤ 1. This is used in part 2 to show stochastic dominance,

{Z^Q_{k+1}} ≥_st {Z^P_{k+1}}

for all k = 1, 2, ..., which is used in part 3 to show that (Q) is stochastically more complex than (P),

K^Q_{α,m} ≥ K^P_{α,m}

for any m > 1 and 0 < α < 1. Part 4 establishes an upper bound on the iteration count for (Q), and combining with part 3 yields

K^P_{α,m} ≤ K^Q_{α,m} ≤ 2 ln(m(1 + α^{−1/2})) / ln(1/μ(n)),

which completes the proof of the theorem.

Part 1 of proof: Show conditional dominance, i.e., show that

P(Z^Q_{k+1} ≤ z | Z^Q_k = w) ≤ P(Z^P_{k+1} ≤ z | Z^P_k = w)

for all k = 0, 1, 2, ... and 0 ≤ z, w ≤ 1. First consider the case when 0 ≤ w ≤ z ≤ 1. Then, by the definition of the Adaptive Mixing Algorithm,

P(Z^Q_{k+1} ≤ z | Z^Q_k = w) = 1 = P(Z^P_{k+1} ≤ z | Z^P_k = w)

for k = 0, 1, 2, .... Second, consider the more interesting case when 0 ≤ z < w ≤ 1. Now, using the definition of Z_k, clearly for either problem (P) or (Q), and for k = 0, 1, 2, ...,

P(Z_{k+1} ≤ z | Z_k = w) = P(Y_{k+1} ≤ z' | Y_k = w')

where z' = z(y^* − y_*) + y_* and w' = w(y^* − y_*) + y_*. It then remains to show that

P(Y^Q_{k+1} ≤ z | Y^Q_k = w) ≤ P(Y^P_{k+1} ≤ z | Y^P_k = w)

for k = 0, 1, 2, ... and y_* ≤ z < w ≤ y^*. In order to prove this equivalent statement, we need to introduce a similarity transformation and additional notation. Let A_{w,z}: R^n → R^n be an affine function defined by

A_{w,z}(x) = x^* + ((z − y_*)/(w − y_*))(x − x^*)   for y_* ≤ z < w ≤ y^*.

Let P_{w,z} be the level set P_w shrunken by a factor of (z − y_*)/(w − y_*) and rerooted at x^* so that all of P_{w,z} is contained in P_z, i.e.,

P_{w,z} = {x : x = A_{w,z}(x̄) for x̄ ∈ P_w}.

Similarly, let

Q_{w,z} = {x : x = A_{w,z}(x̄) for x̄ ∈ Q_w}.
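As a quick numeric illustration of the shrink-and-reroot map A_{w,z}, consider two hypothetical stand-in objectives rooted at x^* = 0 with y_* = 0: a spherical objective ||x||² and a conical objective 2||x|| (these are our illustrative choices, not necessarily the test problems (P.1)/(Q.1) used later):

```python
import math

def affine_map(x, x_star, y_star, w, z):
    """A_{w,z}(x) = x* + ((z - y_*)/(w - y_*)) (x - x*)."""
    s = (z - y_star) / (w - y_star)
    return [xs + s * (xi - xs) for xi, xs in zip(x, x_star)]

x_star, y_star = [0.0, 0.0], 0.0
w, z = 4.0, 1.0

# Spherical objective f_P(x) = ||x||^2: a boundary point of the level
# set P_w maps strictly inside P_z (P_{w,z} is contained in P_z).
p = [2.0, 0.0]                      # f_P(p) = 4 = w
fp = sum(c * c for c in affine_map(p, x_star, y_star, w, z))
print(fp)   # 0.25, strictly below z = 1

# Conical objective f_Q(x) = 2||x||: the same map carries the boundary
# of Q_w exactly onto the boundary of Q_z (Q_{w,z} = Q_z).
q = [2.0, 0.0]                      # f_Q(q) = 4 = w
fq = 2 * math.sqrt(sum(c * c for c in affine_map(q, x_star, y_star, w, z)))
print(fq)   # 1.0, exactly z
```

This is the geometric contrast the proof exploits: under A_{w,z} the quadratic's level set lands strictly inside the target level set, while the cone's matches it exactly.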

Notice that P_{w,z} is contained in P_z, and Q_{w,z} = Q_z (the proofs are in [13]). Using this, we have

P(Y^Q_{k+1} ≤ z | Y^Q_k = w) = P(X^Q_{k+1} ∈ Q_{w,z} | Y^Q_k = w)   (1)

and

P(Y^P_{k+1} ≤ z | Y^P_k = w) ≥ P(X^P_{k+1} ∈ P_{w,z} | Y^P_k = w).   (2)

Now,

P(X^P_{k+1} ∈ P_{w,z} | Y^P_k = w) = ∫ ||L_{P_{w,z}}(d, p)|| / ||L_{P_w}(d, p)|| dF_D(d)

for any point p with g(p) = w, and similarly,

P(X^Q_{k+1} ∈ Q_{w,z} | Y^Q_k = w) = ∫ ||L_{Q_{w,z}}(d, q)|| / ||L_{Q_w}(d, q)|| dF_D(d)

for any point q with g(q) = w. Due to the linear similarity transformation A_{w,z}, for every d that intersects P_{w,z}, the corresponding d also intersects Q_{w,z}. Also, for any feasible improving direction d, the linear similarity transformation preserves the ratio of the lengths of these corresponding line segments,

||L_{P_{w,z}}(d, p)|| / ||L_{P_w}(d, p)|| = ||L_{Q_{w,z}}(d, q)|| / ||L_{Q_w}(d, q)||.

Therefore,

P(X^Q_{k+1} ∈ Q_{w,z} | Y^Q_k = w) = P(X^P_{k+1} ∈ P_{w,z} | Y^P_k = w),

and using equations (1) and (2) above, we have the desired result

P(Y^Q_{k+1} ≤ z | Y^Q_k = w) ≤ P(Y^P_{k+1} ≤ z | Y^P_k = w)

for k = 0, 1, 2, ... and y_* ≤ z < w ≤ y^*.

Part 2 of proof: Show stochastic dominance, i.e., show that {Z^Q_{k+1}} ≥_st {Z^P_{k+1}} for all k = 1, 2, ..., or equivalently,

P(Z^Q_k > z) ≥ P(Z^P_k > z)

for all k = 1, 2, ... and 0 ≤ z ≤ 1.

The proof uses induction on k. To show the first iteration, recall that the initial values are defined to be equal to 1, Z^Q_0 = Z^P_0 = 1. Thus for any z, 0 ≤ z ≤ 1,

P(Z^Q_1 > z) = P(Z^Q_1 > z | Z^Q_0 = 1)
≥ P(Z^P_1 > z | Z^P_0 = 1)   by conditional dominance (part 1)
= P(Z^P_1 > z).

Therefore {Z^Q_1} ≥_st {Z^P_1}. To continue the induction argument, we assume that Z^Q_k ≥_st Z^P_k for some integer k ≥ 1, and show that Z^Q_{k+1} ≥_st Z^P_{k+1}. For any fixed z, 0 ≤ z ≤ 1, let

q(w) = P(Z^Q_{k+1} > z | Z^Q_k = w), and
p(w) = P(Z^P_{k+1} > z | Z^P_k = w)

for 0 ≤ w ≤ 1. Then q(w) ≥ p(w) for 0 ≤ w ≤ 1 by conditional dominance (part 1). Also, q(w) is a non-decreasing function because the conditional distribution of improvement is non-increasing in w. (This is a technical point but can be readily proved using properties of a cumulative distribution function and the fact that the ratios Z^Q_{k+1}/Z^Q_k are independent and identically distributed, as shown in Lemma 3.0.) Thus, because q(w) is non-decreasing and Z^Q_k ≥_st Z^P_k, we have [16]

E[q(Z^Q_k)] ≥ E[q(Z^P_k)] ≥ E[p(Z^P_k)]   by q(w) ≥ p(w) for 0 ≤ w ≤ 1,

and thus E[q(Z^Q_k)] ≥ E[p(Z^P_k)]. Now,

E[q(Z^Q_k)] = E[P(Z^Q_{k+1} > z | Z^Q_k)] = P(Z^Q_{k+1} > z)

and similarly,

E[p(Z^P_k)] = E[P(Z^P_{k+1} > z | Z^P_k)] = P(Z^P_{k+1} > z).

Thus P(Z^Q_{k+1} > z) ≥ P(Z^P_{k+1} > z) for all 0 ≤ z ≤ 1, and therefore {Z^Q_{k+1}} ≥_st {Z^P_{k+1}}. By induction, we have {Z^Q_k} ≥_st {Z^P_k} for all k = 1, 2, ....
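The key expectation step above, E[q(Z^Q_k)] ≥ E[q(Z^P_k)] for any non-decreasing q whenever Z^Q_k ≥_st Z^P_k [16], is easy to check numerically. The support points and masses below are arbitrary choices for illustration, not quantities from the paper:

```python
# Z^Q puts more mass on large values than Z^P, so Z^Q >=_st Z^P:
# P(Z^Q > z) >= P(Z^P > z) for every threshold z.
support = [0.2, 0.5, 0.8]
p_mass = [0.5, 0.3, 0.2]          # distribution of Z^P_k
q_mass = [0.2, 0.3, 0.5]          # distribution of Z^Q_k

def tail(mass, z):
    """P(Z > z) for a discrete distribution on `support`."""
    return sum(m for v, m in zip(support, mass) if v > z)

# Verify stochastic dominance at several thresholds.
assert all(tail(q_mass, z) >= tail(p_mass, z) for z in [0.1, 0.2, 0.5, 0.8])

def expect(mass, fn):
    return sum(m * fn(v) for v, m in zip(support, mass))

q = lambda w: w * w               # any non-decreasing function q(w)
print(expect(q_mass, q), expect(p_mass, q))   # first value is the larger
```

Under dominance, the expectation of any non-decreasing function is at least as large under Q as under P, which is exactly how the induction step propagates.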

Part 3 of proof: Show that (Q) is stochastically more complex than (P), i.e., show that

K^Q_{α,m} ≥ K^P_{α,m}

for any m > 1 and 0 < α < 1. By stochastic dominance (part 2 of the proof), we have that

P(Z^Q_k ≤ z) ≤ P(Z^P_k ≤ z)

for k = 1, 2, ... and 0 ≤ z ≤ 1, and letting z = 1/m implies

P(Z^Q_k ≤ 1/m) ≤ P(Z^P_k ≤ 1/m)

for k = 1, 2, ... and m > 1. Let

M = K^Q_{α,m} = min_{k=1,2,...} {k : P(Z^Q_k ≤ 1/m) ≥ 1 − α}

for m > 1 and 0 < α < 1. Then

P(Z^P_M ≤ 1/m) ≥ P(Z^Q_M ≤ 1/m) ≥ 1 − α,

hence

K^P_{α,m} = min_{k=1,2,...} {k : P(Z^P_k ≤ 1/m) ≥ 1 − α} ≤ M.

Therefore K^Q_{α,m} ≥ K^P_{α,m} for any m > 1 and 0 < α < 1.

Part 4 of proof: For any m > 1 and 0 < α < 1, establish an upper bound on the complexity measure on (Q) as follows,

K^Q_{α,m} ≤ 2 ln(m(1 + α^{−1/2})) / ln(1/μ).

To establish this upper bound, let μ_k = E[Z^Q_k] and σ²_k = Var[Z^Q_k]. From Chebyshev's inequality [6],

P(|Z^Q_k − μ_k| ≤ t) ≥ 1 − σ²_k/t²

for any t > 0. Letting t² = σ²_k/α for 0 < α < 1 implies

P(|Z^Q_k − μ_k| ≤ σ_k α^{−1/2}) ≥ 1 − α,

or

P(μ_k − σ_k α^{−1/2} ≤ Z^Q_k ≤ μ_k + σ_k α^{−1/2}) ≥ 1 − α.

Now, using the definition of complexity,

K^Q_{α,m} = min_{k=1,2,...} {k : P(Z^Q_k ≤ 1/m) ≥ 1 − α},

to bound the complexity we seek k such that P(Z^Q_k ≤ 1/m) ≥ 1 − α. Since 0 ≤ Z^Q_k ≤ 1,

P(Z^Q_k ≤ 1/m) ≥ P(0 ≤ Z^Q_k ≤ 1/m),

and letting 1/m ≥ μ_k + σ_k α^{−1/2},

P(0 ≤ Z^Q_k ≤ μ_k + σ_k α^{−1/2}) ≥ P(μ_k − σ_k α^{−1/2} ≤ Z^Q_k ≤ μ_k + σ_k α^{−1/2}).

Combining this with the expression from Chebyshev's inequality yields: if k is such that 1/m ≥ μ_k + σ_k α^{−1/2}, then P(Z^Q_k ≤ 1/m) ≥ 1 − α, and hence K^Q_{α,m} ≤ k. Now, using Lemma 3.0 and the expressions given for μ_k and σ²_k,

μ_k = E[Z^Q_k] = μ^k,
σ²_k = Var[Z^Q_k] = (σ² + μ²)^k − μ^{2k} = (σ² + μ²)^k (1 − (μ²/(σ² + μ²))^k),

we next perform a series of substitutions to find some k that satisfies 1/m ≥ μ_k + σ_k α^{−1/2}, or equivalently, we seek k that satisfies

1/m ≥ μ^k + [(σ² + μ²)^k (1 − (μ²/(σ² + μ²))^k)]^{1/2} α^{−1/2}.

Because 0 < μ < 1 and σ² > 0, we have σ² + μ² > μ² > 0, which implies μ^{2k}/(σ² + μ²)^k > 0, which implies 1 − (μ²/(σ² + μ²))^k < 1. Therefore, if k' satisfies

1/m ≥ μ^{k'} + [(σ² + μ²)^{k'}]^{1/2} α^{−1/2},

then

1/m ≥ μ^{k'} + [(σ² + μ²)^{k'} (1 − (μ²/(σ² + μ²))^{k'})]^{1/2} α^{−1/2},

and hence k' ≥ K^Q_{α,m}. Because 0 < μ < 1, we have for k = 0, 1, 2, ...,

μ^k ≤ μ^{k/2}.

Therefore, if k″ satisfies

1/m ≥ μ^{k″/2} + [(σ² + μ²)^{k″}]^{1/2} α^{−1/2},

then k″ ≥ K^Q_{α,m}. Now, σ² + μ² = E[(Z_{k+1}/Z_k)²] ≤ E[Z_{k+1}/Z_k] = μ, since 0 ≤ Z_{k+1}/Z_k ≤ 1; this implies, for k = 0, 1, 2, ...,

(σ² + μ²)^{k/2} ≤ μ^{k/2}.

Therefore, if k‴ satisfies

1/m ≥ μ^{k‴/2} + μ^{k‴/2} α^{−1/2},

or

1/m ≥ μ^{k‴/2} (1 + α^{−1/2}),

then k‴ ≥ K^Q_{α,m}. Letting κ = k‴ (for convenience in notation), and manipulating the last inequality gives

1 ≥ m μ^{κ/2} (1 + α^{−1/2}),

which implies

0 = ln(1) ≥ (κ/2) ln(μ) + ln(m(1 + α^{−1/2})),

which implies

−(κ/2) ln(μ) ≥ ln(m(1 + α^{−1/2})),

which implies

κ ≥ −2 ln(m(1 + α^{−1/2})) / ln(μ) = 2 ln(m(1 + α^{−1/2})) / ln(1/μ).

Therefore, if κ = 2 ln(m(1 + α^{−1/2})) / ln(1/μ), then 1/m ≥ μ^{κ/2} + μ^{κ/2} α^{−1/2}, and thus P(Z^Q_κ ≤ 1/m) ≥ 1 − α. Hence κ provides an upper bound for K^Q_{α,m} and, combining this bound with the bound on K^P_{α,m} from part 3 (writing μ = μ(n) for the conical program (Q)), we have

K^P_{α,m} ≤ K^Q_{α,m} ≤ 2 ln(m(1 + α^{−1/2})) / ln(1/μ(n))

for any m > 1 and 0 < α < 1, which completes the proof. □
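The resulting bound is easy to evaluate numerically. In the helper below, the sample values of m, α, and μ are illustrative (μ = 0.969 is the Table 1 estimate for n = 10); they are not necessarily the constants used to build Table 2:

```python
import math

def iteration_bound(m, alpha, mu):
    """Theorem 3.1 upper bound: 2 ln(m (1 + alpha^(-1/2))) / ln(1/mu)."""
    assert m > 1 and 0.0 < alpha < 1.0 and 0.0 < mu < 1.0
    return 2.0 * math.log(m * (1.0 + alpha ** -0.5)) / math.log(1.0 / mu)

# Reaching within 1/m = 10^-6 of the initial gap with probability
# 1 - alpha = 0.99, given mean improvement ratio mu = 0.969, needs on
# the order of 10^3 iterations under the bound.
print(iteration_bound(10**6, 0.01, 0.969))
```

Note how weakly the bound depends on the accuracy and confidence demands (logarithmically in m and α^{−1/2}) compared with its sensitivity to μ through ln(1/μ).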

n       2      4      6      8      10     20     30     40     50
μ̂(n)    0.910  0.945  0.956  0.965  0.969  0.986  0.991  0.994  0.994

Table 1: Estimates For Expected Ratio Of Normalized Improvements (μ̂(n)) As A Function Of Dimension (n)

n         K_{0.01,10^6}
1         218
2         331
5         672
10        1,239
50        5,779
100       11,454
500       56,851
1,000     113,598
5,000     567,573
10,000    1,135,043

Table 2: Bound On Complexity For The Adaptive Mixing Algorithm (K_{0.01,10^6}) As A Function Of The Number Of Dimensions (n)
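The roughly linear growth in Table 2 follows from the bound whenever 1 − μ(n) shrinks like c/n, which the Table 1 estimates suggest. A sketch under that assumed model (c = 0.3 is our rough fit to Table 1, not a value from the paper, so the printed numbers will not reproduce Table 2 exactly):

```python
import math

def bound_linear_model(n, m=10**6, alpha=0.01, c=0.3):
    """Theorem 3.1 bound under the assumed model mu(n) = 1 - c/n."""
    mu = 1.0 - c / n
    return 2.0 * math.log(m * (1.0 + alpha ** -0.5)) / math.log(1.0 / mu)

# ln(1/mu(n)) ~ c/n for large n, so the bound grows (nearly) linearly in n:
# multiplying n by 10 multiplies the bound by almost exactly 10.
for n in (10, 100, 1000, 10000):
    print(n, round(bound_linear_model(n)))
```

This makes the linear-in-dimension claim concrete: the bound is (2/c) ln(m(1 + α^{−1/2})) · n up to lower-order terms.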

[Figure: plotted estimates ("x") with a linear regression fit; x-axis: number of dimensions (0 to 60).]

Figure 1: Estimates Of Average Normalized Improvement Versus Dimension

[Figure: upper bound curve plotted with data and a linear regression fit; x-axis: dimension (0 to 50); y-axis: number of iterations (0 to 3000).]

Figure 2: Average Number of Iterations for AMA On The Conical Program (Q.1).

[Figure: data with fitted line; x-axis: dimension (0 to 40); y-axis: number of iterations (0 to 3000).]

Figure 3: Average Number of Iterations for AMA On The Hyperspherical Problem (P.1).