CONTINUOUS APPROXIMATION SCHEMES FOR STOCHASTIC PROGRAMS

John R. Birge
Department of Industrial and Operations Engineering
The University of Michigan
Ann Arbor, MI 48109-2117

Liqun Qi
School of Mathematics
The University of New South Wales
Kensington, N.S.W. 2033 Australia

Technical Report 92-59
December 1992

CONTINUOUS APPROXIMATION SCHEMES FOR STOCHASTIC PROGRAMS †

John R. Birge *
Department of Industrial and Operations Engineering, University of Michigan, Ann Arbor, MI 48109 USA

and

Liqun Qi **
School of Mathematics, The University of New South Wales, Kensington, N.S.W. 2033 Australia

(Revised on December 18, 1992)

Abstract. One of the main methods for solving stochastic programs is approximation by discretizing the probability distribution. Discretization, however, may destroy differentiability of expectational functionals, and the complexity of discrete approximation schemes grows exponentially with the dimension of the random vector. Stochastic methods, on the other hand, can solve stochastic programs of larger dimension, but their convergence is only in the sense of probability one. In this paper, we study the differentiability properties of stochastic two-stage programs and discuss continuous approximation methods for stochastic programs. We show that the objective function of a stochastic two-stage program with a continuous probability distribution has a locally Lipschitzian derivative. We present several ways to calculate and estimate this derivative. We then design several continuous approximation schemes and study their convergence behavior and implementation. The methods include several types of truncation approximation, lower dimensional approximation and limited basis approximation.

Keywords: approximation, derivative, continuous distribution, stochastic programming.

† The authors thank Andrew Snelson, Sanjeev Acharya, and Samer Takriti for their assistance in performing the computational tests.
* His work is supported by Office of Naval Research Grant N0014-86-K-0628 and the National Science Foundation under Grants ECS-8815101 and DDM-9215921.
** His work is supported by the Australian Research Council.

Section 1. Introduction

Consider the following stochastic program with recourse:

    minimize   z = cx + Ψ(x)
    subject to Ax = b                                              (1.1)
               x ≥ 0,

where Ψ(x) = E(ψ(Tx − ξ)) and ψ is the recourse function, defined by

    ψ(ω) = inf{qy : Wy = −ω, y ≥ 0}.                               (1.2)

Then

    Ψ(x) = ∫ ψ(Tx − ξ) P(dξ).                                      (1.3)

The dimensions in (1.1), (1.2) and (1.3) are: x, c ∈ R^n, b ∈ R^m, y, q ∈ R^k, ξ ∈ R^l. The random l-vector ξ is defined on a probability space (Ξ, A, P).

A popular approach for solving (1.1) is to discretize P to obtain a large-scale linear programming approximation to (1.1) [2][6][7][13][15][21][24][31]. The resulting approximation of Ψ is in general nondifferentiable. However, if P is a continuous distribution, i.e.,

    P(dξ) = p(ξ)dξ,                                                (1.4)

where p is a Lebesgue integrable function, then Ψ is differentiable. In fact, by (1.2), ψ is a piecewise linear function. Hence, ∇ψ is piecewise constant except on a set of Lebesgue measure zero. If P is discrete, then this Lebesgue null set may have positive measure under P, so Ψ may be nondifferentiable. If P is continuous, then we may omit this set and have

    ∇Ψ(x) = ∫ ∇ψ(Tx − ξ) p(ξ)dξ.                                   (1.5)

This observation leads to continuous distribution approximations for solving (1.1). Convergence properties and uses in algorithms are discussed in [4]. To make this approach feasible, one needs to approximate (1.4) efficiently. This paper considers such schemes.

Section 2 further studies ∇Ψ and shows that it is locally Lipschitzian if P is continuous. Section 3 presents approaches to estimate ∇Ψ(x) based on direct probability approximations. Section 4 discusses truncation approximations and gives convergence rates. Section 5 presents convergence results and approaches for lower dimensional approximations. Section 6 gives results on using combinations of the previous approaches, and Section 7 provides some examples and illustrative results.
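To make the setting concrete, the following sketch shows how the recourse value ψ(Tx − ξ) in (1.2) can be evaluated as a small linear program and how Ψ(x) in (1.3) can be estimated crudely by sampling ξ. The data q, W, T and the uniform distribution here are illustrative placeholders, not a problem from this paper; the sketch only assumes a standard LP solver.

    import numpy as np
    from scipy.optimize import linprog

    # Illustrative recourse data (placeholders, not from the paper):
    q = np.array([1.0, 2.0])        # recourse costs
    W = np.array([[1.0, -1.0]])     # recourse matrix (one constraint, two variables)
    T = np.array([[1.0, 0.5]])      # technology matrix

    def psi(omega):
        """Recourse value psi(omega) = inf { q y : W y = -omega, y >= 0 } as in (1.2)."""
        res = linprog(q, A_eq=W, b_eq=-omega, bounds=[(0, None)] * len(q), method="highs")
        return res.fun if res.success else np.inf

    def Psi_estimate(x, n_samples=1000, rng=None):
        """Crude Monte Carlo estimate of Psi(x) = E[psi(Tx - xi)] for a continuous xi
        (here an assumed uniform distribution on [-0.5, 1.5]^l)."""
        rng = rng or np.random.default_rng(0)
        xi = rng.uniform(-0.5, 1.5, size=(n_samples, T.shape[0]))
        return np.mean([psi(T @ x - xi_k) for xi_k in xi])

    print(Psi_estimate(np.array([0.2, 0.1])))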

Section 2. The Local Lipschitzian Property of the Derivative

Suppose that P is a continuous distribution defined by (1.4). In the introduction we have already observed that Ψ is differentiable in this case. The following proposition gives a formula for ∇Ψ(x) based upon basic optimal dual solutions of (1.2).

Proposition 2.1. Let the basic dual optimal solutions of (1.2) be {π_j : j = 1, ..., N}. Let the basis matrix of (1.2) associated with π_j be B_j. Then

    ∇Ψ(x) = Σ_{1≤j≤N} α_j π_j T,                                   (2.1)

where

    α_j = P(ξ | π_j optimal) = P(ξ | B_j^{-1}(Tx − ξ) ≥ 0) = P(ξ | B_j^{-1}Tx ≥ B_j^{-1}ξ).   (2.2)

Proof: By (1.4) and (1.5),

    ∇Ψ(x) = Σ_{1≤j≤N} π_j T P(ξ | π_j optimal) = Σ_{1≤j≤N} α_j π_j T.   (2.3)

This formula holds for a continuous distribution since the set of ξ for which the dual problem of (1.2) has no unique solution has zero measure under P in this case. By the optimality condition of (1.2),

    α_j = P(ξ | π_j optimal) = P(ξ | B_j^{-1}(Tx − ξ) ≥ 0) = P(ξ | B_j^{-1}Tx ≥ B_j^{-1}ξ).

This proves (2.2). ∎

By this, we may show the following theorem.

Theorem 2.2. If P is continuous, then ∇Ψ exists and is locally Lipschitzian.

Proof: We know the existence of ∇Ψ by the previous arguments. Let (B_j^{-1})_i be the i-th row of B_j^{-1}. Let η_{ji} = (B_j^{-1})_i ξ and s_{ji}(x) = (B_j^{-1})_i Tx. Denote the event η_{ji} ≤ s_{ji}(x) by A_{ji}(x). Then

    α_j = P(ξ | B_j^{-1}Tx ≥ B_j^{-1}ξ) = P(A_{j1}(x) ··· A_{jl}(x)).

Let x' and x'' be two points close to x. Then

    |α_j(x') − α_j(x'')|
      = |P(A_{j1}(x') ··· A_{jl}(x')) − P(A_{j1}(x'') ··· A_{jl}(x''))|
      ≤ Σ_{i=1}^{l} |P(A_{j1}(x'') ··· A_{j,i−1}(x'') A_{ji}(x') A_{j,i+1}(x') ··· A_{jl}(x'))
            − P(A_{j1}(x'') ··· A_{j,i−1}(x'') A_{ji}(x'') A_{j,i+1}(x') ··· A_{jl}(x'))|
      = Σ_{i=1}^{l} P(A_{j1}(x'') ··· A_{j,i−1}(x'') A_{j,i+1}(x') ··· A_{jl}(x'), s_{ji}^min ≤ η_{ji} ≤ s_{ji}^max),

where s_{ji}^min = min{s_{ji}(x'), s_{ji}(x'')} and s_{ji}^max = max{s_{ji}(x'), s_{ji}(x'')}. Then

    |α_j(x') − α_j(x'')| ≤ Σ_{i=1}^{l} ∫_{s_{ji}^min}^{s_{ji}^max} p_{ji}(η_{ji}) dη_{ji},   (2.4)

where p_{ji} is the marginal density function of η_{ji}. Since the distribution P is continuous, p_{ji} is also continuous. By this and (2.4), we see that α_j is locally Lipschitzian for all j. Since π_j and T are constants, by (2.3), ∇Ψ is locally Lipschitzian. This completes the proof. ∎

This shows that ∇Ψ has a nonempty and bounded generalized subdifferential at every x [3][9]. A function with a locally Lipschitzian derivative is called a C^{1,1} function or an LC^1 function [20][25][26]. This opens a way to develop second-order analysis and algorithms for (1.1).
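Formula (2.1) suggests a direct way to estimate ∇Ψ(x) once the dual vertices π_j and basis matrices B_j of (1.2) are available: estimate each α_j in (2.2) from draws of ξ and form the weighted sum. The sketch below assumes the bases and duals have already been enumerated and uses plain Monte Carlo for the probabilities; it is an illustration of the formula, not the estimation schemes developed in Sections 3-6, and all argument names are assumptions.

    import numpy as np

    def gradient_estimate(x, T, bases, duals, sample_xi, n=20000, rng=None):
        """Monte Carlo illustration of (2.1)-(2.2): alpha_j is estimated as the
        fraction of draws xi with B_j^{-1} T x >= B_j^{-1} xi.  A rough sanity
        check is that the estimated alpha_j sum to about one."""
        rng = rng or np.random.default_rng(0)
        xi = sample_xi(n, rng)                     # (n, l) draws from the continuous P
        grad = np.zeros(T.shape[1])
        for B, pi in zip(bases, duals):
            Binv = np.linalg.inv(B)
            lhs = Binv @ (T @ x)                   # B_j^{-1} T x
            rhs = xi @ Binv.T                      # rows are B_j^{-1} xi_k
            alpha = np.mean(np.all(rhs <= lhs, axis=1))   # estimate of alpha_j in (2.2)
            grad += alpha * (pi @ T)               # alpha_j * pi_j T as in (2.1)
        return grad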

Section 3. Calculation and Estimation of the Derivative

Suppose that P is a continuous distribution defined by (1.4). We now discuss some practical ways to calculate ∇Ψ(x). The first approach is based on approximating probabilities as in the Boole-Bonferroni approach of Prékopa [23].

Theorem 3.1. In (2.1), we have

    α_j = P(ξ | π_j optimal)
        = 1 − a_j + (2/l) b_j + t_j [ ((c_j − 1)/(c_j + 1)) a_j + (2(l − c_j(c_j + 1))/(l c_j(c_j + 1))) b_j ],   (3.1)

where

    a_j = Σ_{1≤i≤l} P(η_{ji} > s_{ji}(x)),
    b_j = Σ_{1≤i<i'≤l} P(η_{ji} > s_{ji}(x), η_{ji'} > s_{ji'}(x)),
    c_j = 1 + ⌊2b_j / a_j⌋,
    0 ≤ t_j ≤ 1,
    η_{ji} = (B_j^{-1})_i ξ,   s_{ji}(x) = (B_j^{-1})_i Tx,

and (B_j^{-1})_i is the i-th row of B_j^{-1}.

Proof: Let Ā_{ji} denote the complement of A_{ji} = A_{ji}(x). Then

    P(ξ | B_j^{-1}Tx ≥ B_j^{-1}ξ) = P(A_{j1} ··· A_{jl}) = 1 − P(Ā_{j1} + ··· + Ā_{jl}).   (3.2)

By the inequality of Dawson and Sankoff ((7) of [23]),

    P(Ā_{j1} + ··· + Ā_{jl}) ≥ (2/(c_j + 1)) a_j − (2/(c_j(c_j + 1))) b_j,   (3.3)

where

    a_j = Σ_{1≤i≤l} P(Ā_{ji}) = Σ_{1≤i≤l} P(η_{ji} > s_{ji}(x)),
    b_j = Σ_{1≤i<i'≤l} P(Ā_{ji} Ā_{ji'}) = Σ_{1≤i<i'≤l} P(η_{ji} > s_{ji}(x), η_{ji'} > s_{ji'}(x)),
    c_j − 1 = ⌊2b_j / a_j⌋.

Similarly, by the inequality of Sathe, Pradhan and Shah ((8) of [23]),

    P(Ā_{j1} + ··· + Ā_{jl}) ≤ a_j − (2/l) b_j.   (3.4)

Combining (3.2)-(3.4), we get the conclusions of this theorem. ∎

We may use (3.1) to approximate ∇Ψ(x) by assigning t_j a value in [0, 1], e.g., 0.5. We may also use higher orders of the inclusion-exclusion formula for α_j (see Section 7 for further discussion of this approach).

The random variable η_{ji} is one-dimensional. If we know the marginal distribution of η_{ji} and the joint distribution of η_{ji} and η_{ji'}, where 1 ≤ i < i' ≤ l, then we may calculate a_j and b_j. In general, η_{ji} and η_{ji'} are linear combinations of {ξ_h, h = 1, ..., l}. Suppose that η = Σ_{1≤h≤l} β_h ξ_h and η' = Σ_{1≤h≤l} β'_h ξ_h, where ξ_h, h = 1, ..., l, are one-dimensional random variables and β_h, β'_h, h = 1, ..., l, are real numbers. By probability theory,

    E(η) = Σ_{1≤h≤l} β_h E(ξ_h),                                                       (3.5)
    Var(η) = Σ_{1≤h≤l} β_h² Var(ξ_h) + 2 Σ_{1≤h<h'≤l} β_h β_{h'} Cov(ξ_h, ξ_{h'}),     (3.6)
    Cov(η, η') = Σ_{1≤h≤l} β_h β'_h Var(ξ_h) + Σ_{1≤h,h'≤l, h≠h'} β_h β'_{h'} Cov(ξ_h, ξ_{h'}).   (3.7)

Therefore, we may calculate the expectations, variances and covariances of η_{ji} and η_{ji'} from the expectations, variances and covariances of {ξ_h, h = 1, ..., l}. If the ξ_h, h = 1, ..., l, are normally distributed, then η_{ji} and (η_{ji}, η_{ji'}) in (3.5)-(3.7) are normally and bivariate normally distributed, respectively. Thus, we may calculate a_j and b_j in Theorem 3.1 by some single and double integrals. Otherwise, we may use (3.5)-(3.7) to calculate the expectations, variances and covariances of the random variables η_{ji} and η_{ji'}, and then use the method in [7] to approximate the relevant probabilities.
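The following sketch evaluates the estimate (3.1) under the normality assumption just described: each η_{ji} is treated as normal and each pair (η_{ji}, η_{ji'}) as bivariate normal, with moments obtained from (3.5)-(3.7). The mean vector and covariance matrix of η_j and the thresholds s_{ji}(x) are inputs supplied by the user, as is t_j; the routine is an illustration of the bound only, under that assumed normal model.

    import numpy as np
    from math import floor
    from scipy.stats import norm, multivariate_normal

    def alpha_boole_bonferroni(s, mean, cov, t=1.0):
        """Estimate alpha_j via (3.1) for eta_j ~ N(mean, cov) with thresholds s = s_j(x)."""
        cov = np.asarray(cov, dtype=float)
        l = len(s)
        # a_j: sum of marginal exceedance probabilities P(eta_i > s_i)
        a = sum(norm.sf(s[i], loc=mean[i], scale=np.sqrt(cov[i, i])) for i in range(l))
        # b_j: sum of pairwise joint exceedance probabilities P(eta_i > s_i, eta_i' > s_i')
        b = 0.0
        for i in range(l):
            for j in range(i + 1, l):
                c2 = [[cov[i, i], cov[i, j]], [cov[j, i], cov[j, j]]]
                # P(X > s_i, Y > s_j) = P(-X < -s_i, -Y < -s_j) for (X, Y) bivariate normal
                b += multivariate_normal(mean=[-mean[i], -mean[j]], cov=c2).cdf([-s[i], -s[j]])
        c = 1 + floor(2 * b / a) if a > 0 else 1                    # c_j = 1 + floor(2 b_j / a_j)
        lower = 1 - a + 2 * b / l                                   # from (3.4)
        upper = 1 - 2 * a / (c + 1) + 2 * b / (c * (c + 1))         # from (3.3)
        return lower + t * (upper - lower)                          # (3.1) with parameter t_j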

In fact, with gaussian variables, we may also take advantage of techniques designed specifically for these distributions, as in Gassmann [17] and Deák [10]. Through the transformation η_j = B_j^{-1} ξ, we obtain:

    α_j = ∫ ··· ∫_{η_{ji} ≤ s_{ji}(x), i = 1, ..., l} p(η_{j1}, ..., η_{jl}) dη_{jl} ··· dη_{j1},   (3.8)

where the variables have the same definitions as in Theorem 3.1. Notice that p(η_j) is a normal density, so the calculation of α_j reduces to evaluating the probability of the s_j(x) translation of the lower orthant for the normally distributed vector η_j.

We may also approximate other distributions using gaussian random variables. By fixing the means, variances and covariances of the relevant random variables, we can use a gaussian random vector with the same characteristics. Any lack of precision then comes from higher order moments. Since it is often difficult to determine exact multivariate distributions, the use of normal distributions seems especially relevant even if they are not completely justified by the practical situation.

Section 4. Truncation Approximations

By truncation approximations, we mean any approximation of the form:

    P^ν(A) = P(A ∩ Ξ^ν) / P(Ξ^ν),                                  (4.1)

for some Ξ^ν ⊂ Ξ, for all A ∈ A, and where P(Ξ \ Ξ^ν) = δ^ν. In other words, P^ν is the restriction of P to Ξ^ν, or it results from truncating Ξ \ Ξ^ν. These types of approximations have the advantage that they depend only on the set Ξ^ν, which can be chosen in many ways. In each case, however, we not only have the convergence cited above but we can also describe the rate of that convergence. We need only suppose the following condition on ψ.

(i) The set D and ∇ψ(Tx − ξ) are bounded so that, wherever it is defined, |(∇ψ(Tx − ξ))_i| ≤ Λ for all i, ξ ∈ Ξ and x ∈ D.

The result is that the convergence of any subgradient in ∂Ψ^ν to an element of ∂Ψ can be bounded.

Theorem 4.1. Suppose (i) holds and that P^ν is generated as in (4.1). Then for any η^ν ∈ ∂Ψ^ν(x), there exists η ∈ ∂Ψ(x) such that

    ‖η^ν − η‖ ≤ 2√n Λ δ^ν.                                         (4.2)

Proof: First, note that ‖η − η^ν‖ can be written as

    { Σ_i [ ∫ η_i(ξ) P^ν(dξ) − ∫ η_i(ξ) P(dξ) ]² }^{1/2},

where we choose the same η(ξ) ∈ ∂ψ(Tx − ξ) for each ξ. For this, we observe that

    | ∫ η_i(ξ) P^ν(dξ) − ∫ η_i(ξ) P(dξ) |
      = | ∫_{Ξ^ν} η_i(ξ) (1/P(Ξ^ν) − 1) P(dξ) − ∫_{Ξ\Ξ^ν} η_i(ξ) P(dξ) |
      ≤ Λ (1 − P(Ξ^ν)) + Λ P(Ξ \ Ξ^ν)
      = 2Λ δ^ν.                                                    (4.3)

From (4.3) and the definition, we have ‖η − η^ν‖ ≤ 2√n Λ δ^ν. ∎

Theorem 4.1 gives conditions under which distributions of the form in (4.1) converge at a given rate. Other rates may be obtained using results about the difference between function values for different distributions, as for example in Römisch and Schultz [28]. The rate then depends on some metric between the approximating and true measures. In the lower dimensional approximations, we again use this to obtain ‖η^ν − η‖ ≤ a^ν, where a^ν → 0.

The choice of Ξ^ν is the crucial step in these types of approximations. In general, we would like fast convergence by having δ^ν converge to zero quickly, but we would also like relatively easy computations. The following alternatives are considered for the truncation method:

1) Ball truncation: Let Ξ^ν = {ξ | ‖ξ − ξ̄‖ ≤ γ^ν}, where γ^ν → ∞. This method is easy to describe, but integrals over the ball may be difficult to evaluate.

2) Box truncation: Let Ξ^ν = {ξ | |ξ(i) − ξ̄(i)| ≤ γ^ν(i)}, where γ^ν(i) → ∞ for all i. The only difference from the ball approximation is that we use boxes of possibly varying shapes. Depending on the distribution, these integrals may be easier than those in ball truncation.

3) Scaling truncation: Suppose Ξ is bounded. Let Ξ^ν = a^ν(Ξ − ξ̄) + ξ̄, where a^ν → 1. In this case, we use the shape of Ξ to determine the form of Ξ^ν. This may present an advantage over the previous approaches in that each component of ξ is treated similarly relative to Ξ.

4) Level set truncation: We assume that, similarly to the scaling truncation, ξ is distributed so that for some ξ̂, P(a(Ξ⁰ − ξ̂) + ξ̂) = 1 − δ(a), where δ → 0 as a → ∞. We then let Ξ^ν = a(ν)(Ξ⁰ − ξ̂) + ξ̂ for some sequence a(ν) → ∞ as ν → ∞. These values may, for example, correspond to choices of ν such that δ(a(ν)) = 1/ν. For unimodal continuous distributions with mode ξ̂, the Ξ^ν correspond to level sets of the density function. This approach may be especially useful for gaussian distributions with ellipsoidal level regions.

5) Limited basis truncation: In this approach, we suppose that Π^ν = {π_i : i = 1, ..., ν}, where each π_i is optimal in the dual of (1.2) for some ω = Tx − ξ_i. We then let Ξ^ν(x) = {ξ | ψ(Tx − ξ) = max_{i=1,...,ν} π_i(Tx − ξ)}, i.e., the set of ξ such that the dual of (1.2) is optimized by some π ∈ Π^ν. Note that this approximation also depends on x, but it is finitely convergent in ν for any fixed x. Many strategies may be used to determine the set Π^ν. By choosing only π with high probability of optimality in (1.2), we may be able to limit ν and still produce accurate results. To alleviate the difficulty, we may combine this procedure with the probability approximation of Section 3 or with the lower dimensional approximations in Section 5 below. The combined approximations will still converge, as shown in Section 6.

Each of these procedures has the convergence rate in Theorem 4.1. They also are continuous and retain the differentiability properties mentioned above. The form of the gradient also allows the use of an algorithm that only evaluates the probability that individual bases or dual vectors π_j are optimal in (1.2) and its dual. Since probabilities of these convex polyhedral regions may be relatively easier to evaluate than the conditional expectations required for function evaluations, these procedures are especially attractive.
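As an illustration of the box truncation in (4.1), the sketch below builds a sampler for P^ν by simple rejection onto a box Ξ^ν and estimates δ^ν = P(Ξ \ Ξ^ν), which enters the bound (4.2). The sampler for P, the box limits lo and hi, and the sample sizes are user-supplied assumptions; the resulting sampler can be fed to the gradient sketch given after Section 2.

    import numpy as np

    def truncated_sampler(sample_xi, lo, hi):
        """Sampler for the box truncation P^nu of (4.1): draws from P restricted to the
        box Xi^nu = {xi : lo <= xi <= hi}, renormalized implicitly by rejection.
        (A sketch only: it loops forever if the box has negligible probability.)"""
        def sample(n, rng):
            draws = []
            while len(draws) < n:
                xi = sample_xi(n, rng)
                keep = np.all((xi >= lo) & (xi <= hi), axis=1)
                draws.extend(xi[keep])
            return np.array(draws[:n])
        return sample

    def delta_nu(sample_xi, lo, hi, n=100000, rng=None):
        """Monte Carlo estimate of delta^nu = P(Xi \\ Xi^nu) for the error bound (4.2)."""
        xi = sample_xi(n, rng or np.random.default_rng(0))
        return 1.0 - np.mean(np.all((xi >= lo) & (xi <= hi), axis=1))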

Section 5. Lower Dimensional Approximation

In addition to the methods considered above, which use full dimensional but hopefully simpler integration, we may also approximate the distributions using lower dimensional distributions that still retain differentiability properties and avoid higher dimensional integrals. These procedures are generalizations of discrete point approximations and, as we show below, obtain improved convergence rates over discrete approximations.

We may begin by replacing ξ by H^ν η, where H^ν ∈ R^{l×N_ν}, or, in other words, we may let P^ν be:

    P^ν(A) = Q^ν{η | H^ν η ∈ A},                                   (5.1)

where Q^ν is a probability measure on R^{N_ν}. Building on this, we can construct generalizations of discrete approximations. We may in fact create several lower dimensional H_i^ν with different probabilities and some translation, i.e., H_i^ν η + β_i would be used in (5.1) and given a probability p_i, so that (5.1) becomes:

    P^ν(A) = Σ_{i=1}^{K(ν)} p_i Q_i^ν{η | H_i^ν η + β_i ∈ A},      (5.2)

where Q_i^ν is a probability measure on R^{N_ν}. In this way, we obtain direct generalizations of the discrete distributions.

We obtain convergence of these distributions by allowing the number of lower dimensional approximations K(ν) or the dimension N_ν to increase sufficiently. We concentrate on increasing the number of approximations, since increases in the dimension N_ν eventually lead to integrations as difficult as the original.

We establish weak convergence of the measures P^ν to P and, moreover, convergence of moments, using Wasserstein metrics. These results build on Römisch and Schultz's work in [29]. In our context, we let P_1 be the class of Borel probability measures on R^l. This is further restricted to:

    M_p = {Q ∈ P_1 | ∫ ‖ξ‖^p Q(dξ) < ∞}.

We then let D(Q_1, Q_2) = {Q ∈ P_2 | Q ∘ Π_1^{-1} = Q_1, Q ∘ Π_2^{-1} = Q_2}, where P_2 is the class of Borel probability measures on R^{2l}, Π_1(ξ) is the projection onto the first l coordinates of ξ and Π_2(ξ) is the projection onto the second l coordinates. For p ≥ 1 the L_p-Wasserstein metric is defined as:

    W_p(Q_1, Q_2) = [ inf { ∫ ‖ξ_1 − ξ_2‖^p Q(dξ_1, dξ_2) | Q ∈ D(Q_1, Q_2) } ]^{1/p}.

In the following, we will use several results given previously. We use →_w to denote weak convergence.

Theorem 5.1. For M_p defined above, (M_p, W_p) is a metric space.
Proof: See Givens and Shortt [18]. ∎

Theorem 5.2. If Q ∈ M_p and Q_n ∈ M_p (n = 1, ...), then W_p(Q_n, Q) → 0 as n → ∞ if and only if Q_n →_w Q and lim_{n→∞} ∫ ‖z‖^p Q_n(dz) = ∫ ‖z‖^p Q(dz).
Proof: See Rachev [27]. ∎

Theorem 5.3. If h : R^l → R is Lipschitzian on bounded subsets of R^l, then, for all P, Q ∈ M_p,

    | ∫ h(x) P(dx) − ∫ h(x) Q(dx) | ≤ {M_q(P) + M_q(Q)} W_p(P, Q),

where 1/p + 1/q = 1, p > 1, M_q(P) = { ∫ L_h(‖x‖)^q P(dx) }^{1/q}, and L_h(r) = sup{ |h(x) − h(x')| / ‖x − x'‖ : x ≠ x', ‖x‖ ≤ r, ‖x'‖ ≤ r }.
Proof: See Römisch and Schultz [29]. ∎
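For intuition about the metric used in Theorems 5.1-5.3 (and in Theorem 5.4 below), the following sketch computes W_p in the special case of two one-dimensional, equal-weight empirical measures, where the optimal coupling simply matches order statistics. It is only an illustration of the definition under that assumption; the convergence analysis itself does not require computing W_p.

    import numpy as np

    def wasserstein_1d(sample1, sample2, p=2):
        """L_p-Wasserstein distance between two equal-size empirical measures on the line:
        W_p^p = mean of |x_(i) - y_(i)|^p over matched order statistics."""
        a = np.sort(np.asarray(sample1, dtype=float))
        b = np.sort(np.asarray(sample2, dtype=float))
        assert a.shape == b.shape, "sketch assumes equal sample sizes"
        return np.mean(np.abs(a - b) ** p) ** (1.0 / p)

    # e.g., distance between draws from P and from an approximating measure P^nu
    rng = np.random.default_rng(0)
    print(wasserstein_1d(rng.uniform(-0.5, 1.5, 1000), rng.uniform(0.0, 1.0, 1000)))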

We now construct the lower dimensional approximations to employ the results above and obtain convergence results. We make the following assumptions relating the distributions of P and the Q_i^ν. We let H^{ν+} = (H^{νT} H^ν)^{-1} H^{νT} be the projection onto the affine space spanned by H^ν and let Π_i^ν(η) = {ξ | H^{ν+}(ξ − β_i) = η}.

(i) The distribution P is continuous, P ∈ M_p for some p > 1, and P(A) = ∫_{ξ∈A} p(ξ) dξ.

(ii) The distribution P^ν is constructed as in (5.2) such that
(a) there exists a partition of Ξ into A_1 ∪ A_2 ∪ ··· ∪ A_{K(ν)}, where A_i ∩ A_j = ∅ for i ≠ j;
(b) each A_i is partitioned into disjoint subsets A_i^1 and A_i^2 such that sup_{ξ ∈ A_i^1 ∩ Π_i^ν(η)} ‖ξ − (H_i^ν η + β_i)‖^p ≤ ε(ν) for all i and η, and ∫_{A_i^2} ‖ξ‖^p P(dξ) ≤ δ(ν), with ε(ν) → 0 and δ(ν) → 0 as ν → ∞;
(c) the weights p_i = P(A_i) and Q_i^ν(B) = ∫_B q_i^ν(η) dη, where

    q_i^ν(η) dη = (1/p_i) ∫_{A_i ∩ Π_i^ν(η)} p(ξ) dξ.

The conditions in (i) and (ii) may appear difficult to satisfy in general. We, however, wish to consider distributions that can be constructed as linear transformations of independent random variables. This is always possible, for example, with multivariate normal distributions. Other multivariate distributions may be explicitly constructed this way. In this case, if ξ = H^ν η + H^{ν'} ζ, where η and ζ are independently distributed and H^{ν'} spans the null space of H^ν, then q_i^ν(η) dη = ∫_{H^ν η + H^{ν'} ζ + β_i ∈ A_i} p_1(η) p_2(ζ) dη dζ, which is p_1(η) dη multiplied by the relative probability that H^ν η + H^{ν'} ζ + β_i ∈ A_i given η. By choosing A_i = {ξ | ξ = H^ν η + H^{ν'} ζ + β_i for some η and ‖H^{ν'} ζ − β_i‖ ≤ γ_i}, the relative probability can be calculated easily. In fact, we can also just let the β_i result from the product of H^{ν'} and some random ζ, as a generalization of the empirical measure. The next theorem shows that this distribution satisfies the conditions for weak convergence and convergence of the p-th moments.

Theorem 5.4. If P and P^ν satisfy (i) and (ii), then W_p(P, P^ν) ≤ ε(ν) + δ(ν).

Proof: We will construct a distribution Q ∈ D(P, P^ν) that leads to the conclusion. For ξ ∈ R^{2l}, let ξ_1 = Π_1(ξ) and ξ_2 = Π_2(ξ). We let A_i^ν = {ξ ∈ A_i | ξ − β_i − H_i^ν H_i^{ν+}(ξ − β_i) = 0} and A_i(η) = {ξ | ξ − β_i = H_i^ν η}. We then define

    Q(ξ_1, ξ_2 ∈ A) = Σ_{i=1}^{K(ν)} ∫_{A_i(η) = ξ_1; ξ_1 ∈ A} ∫_{ξ_2 ∈ A_i ∩ Π_i^ν(η)} p(ξ_2) dξ_2 dη.

First, we wish to show that Q ∈ D(P, P^ν). For this, we write p(ξ) dξ as p'(η, ζ) dη dζ (where p'(η, ζ) = |[H^ν H^{ν'}]| p(H^ν η + H^{ν'} ζ)); then

    Q ∘ Π_1^{-1}(A) = Σ_{i=1}^{K(ν)} ∫_{A_i(η) ∈ A} p_i q_i^ν(η) dη = P^ν(A),   (5.3)

and

    Q ∘ Π_2^{-1}(A) = Σ_{i=1}^{K(ν)} ∫ ∫_{ξ ∈ A ∩ A_i ∩ Π_i^ν(η)} p(ξ) dξ dη = P(A).   (5.4)

We next use the assumptions to obtain bounds on the Wasserstein metric. We have

    ∫ ‖ξ_1 − ξ_2‖^p Q(dξ_1, dξ_2)
      = Σ_{i=1}^{K(ν)} ∫_{A_i(η) = ξ_1} ∫_{ξ_2 ∈ A_i ∩ Π_i^ν(η)} ‖ξ_2 − H_i^ν η − β_i‖^p p(ξ_2) dξ_2 dη
      ≤ Σ_{i=1}^{K(ν)} [ ∫ ∫_{ξ_2 ∈ A_i^1 ∩ Π_i^ν(η)} ε(ν) p(ξ_2) dξ_2 dη + ∫_{A_i^2} ‖ξ_2‖^p P(dξ_2) ]
      ≤ ε(ν) + δ(ν).

This completes the proof. ∎

Using Theorems 5.4 and 5.3 plus the Lipschitz continuity of the stochastic programming recourse function (see Wang [30]), we obtain convergence of Ψ^ν values. With Theorems 5.4 and 5.2, we also see that weak convergence is established and hence convergence of subdifferentials as in [3, Theorem 3.1]. Moreover, depending on the properties of the distributions, we also obtain Lipschitz continuity of directional derivatives and hence convergence rates for gradient convergence as in Theorem 4.1 for truncation approximations.

As mentioned above, the advantage of this approximation concerns the maintenance of differentiability and improved convergence over discrete approximations. If we, for example, suppose the strong convexity conditions used in Römisch and Schultz [29], then we can use the random generation of β_i in the null space of some matrix H, such that η is independent of ζ as above, to obtain an expected Hausdorff distance between minimizers under P and P^ν that is O(n^{-1/κ}), where κ = Nullity(H), instead of l as with the discrete empirical measure. To realize the differentiability result, we suppose the simplest case of a single H^ν and generation of β_i in the null space of H^ν. In this case, we have the following result.

Theorem 5.5. If P^ν is generated as in (5.2) and satisfies (i) and (ii), where H^ν = H for all ν, H ∈ R^{l×r}, rank H = r, Hβ_i = 0 for all i, and no row of B_j^{-1}H is a zero vector for any optimal basis B_j, j = 1, ..., N, of (1.2), then Ψ^ν(x) is differentiable for all ν and all x ∈ int(dom Ψ).

Proof: Suppose that Ψ^ν(x) is not differentiable for some x ∈ int(dom Ψ). Then there exist some i, j, k such that Q_i^ν(η ∈ C) > 0, where C = {η | B_k^{-1}(Tx − Hη − β_i) ≥ 0, B_j^{-1}(Tx − Hη − β_i) ≥ 0 for optimal bases B_k and B_j with k ≠ j}. Since Q_i^ν is a continuous distribution, this is only possible if the dimension of this set is r.

For B_k and B_j both to be optimal for all η ∈ C, there must exist some row t such that (B_j^{-1})_t (Tx − Hη − β_i) = 0. Since the dimension of C is r,

    dim{η | (B_j^{-1})_t (Hη) = (B_j^{-1})_t (Tx − β_i)} = r.      (5.6)

It follows from (5.6) that (B_j^{-1})_t H = 0, contradicting the assumption and completing the proof. ∎

The result in Theorem 5.5 gives a simple way to check whether differentiability is maintained. A limited basis approximation can also be applied in conjunction with this approach to guarantee differentiability for that approximation without requiring that all optimal bases be available.

We can see further advantages of differentiability by examining the structure of the approximate objective function gradient, ∇Ψ^ν. We can write this as:

    ∇Ψ^ν(x) = Σ_{1≤i≤K(ν)} Σ_{1≤j≤N} p_i α_j(β_i) π_j T,           (5.7)

where α_j(β_i) = Q_i^ν(η | π_j is optimal in the dual of (1.2) for ω = Tx − H^ν η − β_i). We write the gradient in this manner to show that it is an expectation of α_j values taken at different observations β_i. The true value is the expectation over all possible β_i. This observation allows the use of numerical integration ideas in the choice of the β_i. With differentiability established, we may also ask for higher order derivatives based now on the properties of the q_i^ν density functions. With appropriate distributions, we may then use the Peano kernel to bound our approximation and, in the event of a polynomial density, to guide the choice of β_i to establish rapid convergence.

The lower dimensional approximations can also be combined with the separable function approaches as in Birge and Wets [6] and Birge and Wallace [5]. These approaches create an upper bounding function on ψ that is separable in its components. For example, suppose that ψ can be decomposed into ψ_1(η) and ψ_2(ζ). The separable function approach is to create an upper bounding function q such that q(η) is, for example, Σ_i q_i(η_i). Computing gradients of q then involves only finding the relative probabilities of different intervals of each η_i individually. In this way, we can quickly calculate α_j(β_i) in (5.7). The goal in lower dimensional approximations may then be to find transformations of independent random variables η and ζ that yield ξ and to choose η so that the recourse function is most nearly separable in its components. This requires some knowledge about the specific problem but may be possible in certain instances. Note that if the recourse function ψ is also separable in ζ, then only a single α_j(β_i) needs to be calculated and then only updated with the relevant probability p_i.
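To illustrate (5.7) in the simplest setting, which is also the one used in the examples of Section 7 (a one-dimensional η), the sketch below evaluates each α_j(β_i) by a one-dimensional rectangle rule over a grid for η and assembles the weighted gradient. The bases, dual vertices, points β_i, weights p_i, the vector H and the density of η are all user-supplied inputs; the names and the quadrature choice are illustrative assumptions.

    import numpy as np

    def lower_dim_gradient(x, T, bases, duals, H, betas, weights, eta_grid, eta_pdf):
        """Sketch of (5.7) with scalar eta: alpha_j(beta_i) is a single integral over eta,
        approximated on a uniform grid eta_grid with density values eta_pdf(eta_grid)."""
        d_eta = eta_grid[1] - eta_grid[0]                  # assumes a uniform grid
        grad = np.zeros(T.shape[1])
        for beta, p in zip(betas, weights):
            for B, pi in zip(bases, duals):
                Binv = np.linalg.inv(B)
                # indicator that pi_j is optimal: B_j^{-1}(Tx - H*eta - beta_i) >= 0
                feas = np.array([np.all(Binv @ (T @ x - H * eta - beta) >= 0)
                                 for eta in eta_grid])
                alpha = np.sum(feas * eta_pdf(eta_grid)) * d_eta   # alpha_j(beta_i)
                grad += p * alpha * (pi @ T)
        return grad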

Section 6. Combined Approximation

In the last two sections, we discussed approaches for approximating a continuous distribution by a simpler continuous distribution. In this section, we discuss approximating a continuous distribution by a combination of several simple distributions. The combined approximation is of the form

    P^ν(dξ) = Σ_{1≤j≤N_ν} λ_j^ν P_j^ν(dξ),                         (6.1)

where λ_j^ν ≥ 0 and Σ_j λ_j^ν = 1. For brevity, we write λ_j and P_j instead of λ_j^ν and P_j^ν, though they actually depend upon ν. By (1.3), (1.5) and (6.1), Ψ(x) and ∇Ψ(x) are now approximated by

    Ψ^ν(x) = Σ_{1≤j≤N_ν} λ_j Ψ_j(x)                                (6.2)

and

    ∇Ψ^ν(x) = Σ_{1≤j≤N_ν} λ_j ∇Ψ_j(x),                             (6.3)

where

    Ψ_j(x) = ∫ ψ(Tx − ξ) P_j(dξ)                                   (6.4)

and

    ∇Ψ_j(x) = ∫ ∇ψ(Tx − ξ) P_j(dξ).                                (6.5)

Each P_j should be simple enough that the right-hand sides of (6.4) and (6.5) are easy to calculate. Depending upon the choices of P_j, we have different variants of combined approximations:

1) Discrete Approximation: Let P_j(dξ) = 1 if ξ = ξ^j, and 0 otherwise, where ξ^1, ..., ξ^{N_ν} are points in Ξ. Then we have a discrete approximation, which has been discussed intensively in the literature of stochastic programming.

2) Combined Box Approximation: Let P_j(dξ) = 1/(2δ^ν)^l if ‖ξ − ξ^j‖_∞ ≤ δ^ν, and 0 elsewhere. Then ∇Ψ_j(x) = (1/(2δ^ν)^l) ∫_{‖ξ−ξ^j‖_∞ ≤ δ^ν} ∇ψ(Tx − ξ) dξ. Here, ∇ψ(Tx − ξ) only assumes a small number of values from {π_1 T, ..., π_N T}, yet ∇Ψ_j is continuous.

3) Combined Normal Distribution Approximation: Let P_j be a multivariate normal distribution such that the expectations, variances and covariances of {ξ_1, ..., ξ_l} are specified. Then we may use Theorem 3.1 to estimate ∇Ψ_j(x).
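A minimal sketch of the combined box variant of (6.3): the gradient is a λ-weighted sum of component gradients ∇Ψ_j, each of which averages ∇ψ(Tx − ξ) over a small box around a point ξ^j (approximated here by sub-sampling the box). The function grad_psi, the points, the box half-width and the weights are user-supplied assumptions, not data from the paper.

    import numpy as np

    def combined_box_gradient(x, grad_psi, points, delta, lam=None, m=64, rng=None):
        """Combined box approximation of (6.3): grad Psi^nu(x) = sum_j lambda_j * grad Psi_j(x),
        with grad Psi_j(x) averaged over the box of half-width delta around xi_j."""
        rng = rng or np.random.default_rng(0)
        K = len(points)
        lam = np.full(K, 1.0 / K) if lam is None else np.asarray(lam)   # weights lambda_j
        grad = 0.0
        for w, xi_j in zip(lam, points):
            box = xi_j + rng.uniform(-delta, delta, size=(m, len(xi_j)))
            grad = grad + w * np.mean([grad_psi(x, xi) for xi in box], axis=0)
        return grad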

Section 7. Examples

In this section, we consider some specific examples and demonstrate how the basic procedures can be implemented. The two examples are a simple two-dimensional problem, used to demonstrate the geometry, and a slightly larger problem with three random variables that comes from a power planning problem (see Louveaux and Smeers [22]). The methods are chosen to be representative and to give some idea of the efficiency of the continuous approximations presented above.

The first example has the following form:

    ψ(ω) = min  5y_1 + 10|y_2| + 10|y_3|
           s.t. y_1 + y_2 = ξ_1 − x_1,                             (7.1)
                y_1 + y_3 = ξ_2 − x_2,
                y_1 ≥ 0,

where −ω = ξ − Tx, T = (1, 1)^T. In the computational tests below, we assume that x_1 = x_2 = 0 and that the random variables ξ_i are independently uniformly distributed on [−0.5, 1.5].

The second example is described in Louveaux and Smeers [22]. The first period variables x correspond to investments made in capacity available in the second period. We assume the second period demands in three different sectors are random. This example has the following form:

    ψ(ω) = min  40y_1 + 45y_2 + 32y_3 + 55y_4 + 24y_5 + 27y_6 + 19.2y_7 + 33y_8 + 4y_9 + 2.5y_10 + 3.2y_11 + 5.5y_12
           s.t. y_1 + y_5 + y_9   ≤ x_1,
                y_2 + y_6 + y_10  ≤ x_2,
                y_3 + y_7 + y_11  ≤ x_3,                           (7.2)
                y_4 + y_8 + y_12  ≤ x_4,
                y_1 + y_2 + y_3 + y_4     ≥ ξ_1,
                y_5 + y_6 + y_7 + y_8     ≥ ξ_2,
                y_9 + y_10 + y_11 + y_12  ≥ ξ_3,
                y_i ≥ 0, i = 1, ..., 12,

where −ω again represents the right-hand side coefficient vector. The x_i correspond to input capacities, while the random variables ξ_i correspond to demands in the three operating modes. In the computational tests below, we assume that x_1 = 2, x_2 = 5, x_3 = 5, x_4 = 6, and that the random variables ξ_i are independently uniformly distributed on [3, 7] for i = 1, on [2, 6] for i = 2, and on [1, 5] for i = 3.
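As a concrete illustration, the following sketch evaluates the recourse value of Example 7.1 for one draw of ξ by solving the linear program (7.1). The absolute values |y_2| and |y_3| are handled by the usual split into positive and negative parts, which is our reformulation for the solver rather than anything stated in the text.

    import numpy as np
    from scipy.optimize import linprog

    def psi_example_71(x, xi):
        """Example 7.1: min 5 y1 + 10|y2| + 10|y3| s.t. y1 + y2 = xi1 - x1,
        y1 + y3 = xi2 - x2, y1 >= 0.  Variables: [y1, y2+, y2-, y3+, y3-]."""
        c = np.array([5.0, 10.0, 10.0, 10.0, 10.0])
        A_eq = np.array([[1.0, 1.0, -1.0, 0.0, 0.0],
                         [1.0, 0.0, 0.0, 1.0, -1.0]])
        b_eq = np.array([xi[0] - x[0], xi[1] - x[1]])
        res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * 5, method="highs")
        return res.fun

    rng = np.random.default_rng(0)
    xi = rng.uniform(-0.5, 1.5, size=2)      # the distribution assumed in the tests
    print(psi_example_71(np.zeros(2), xi))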

We will compare the following techniques: (i) sampling distribution; (ii) refinements of Jensen and Edmundson-Madansky bounds; (iii) Boole-Bonferroni probability approximation; (iv) box truncation; (v) limited basis truncation; (vi) lower-dimensional approximation. In each example, for an approximation with objective Ψ^ν, we consider the gradient error ‖∇Ψ^ν − ∇Ψ‖. This gradient error was used as the metric because our main goal in this exercise is evaluating gradient-based methods.

(i) Sampling distribution: Here we sample the random vector. In our tests, since we had low dimensions, we used a low discrepancy quasi-random sequence to choose the samples. We followed the Hammersley sequence (see Fox [14], Deák [11]), which has been shown to produce small errors in low dimensions.
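A minimal sketch of one common Hammersley construction follows: the first coordinate of point i is i/n and the remaining coordinates are radical inverses in successive prime bases. The exact low-discrepancy sequence used in the tests is described only as a Hammersley sequence (see Fox [14]), so the bases and the mapping to the sampling range are assumptions here.

    import numpy as np

    def radical_inverse(i, base):
        """Van der Corput radical inverse of the integer i in the given base."""
        inv, f = 0.0, 1.0 / base
        while i > 0:
            inv += f * (i % base)
            i //= base
            f /= base
        return inv

    def hammersley(n, dim):
        """First n points of a Hammersley set in [0,1]^dim (dim <= 7 with these bases)."""
        primes = [2, 3, 5, 7, 11, 13]
        pts = np.empty((n, dim))
        for i in range(n):
            pts[i, 0] = i / n
            for d in range(1, dim):
                pts[i, d] = radical_inverse(i, primes[d - 1])
        return pts

    # e.g., map [0,1]^2 points onto the uniform [-0.5, 1.5]^2 distribution of Example 7.1
    samples = -0.5 + 2.0 * hammersley(1000, 2)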

Figure 1. Simulation errors for Example 7.1.

The errors from using the quasi-random sample appear in Figure 1 for Example 7.1 and in Figure 2 for Example 7.2 as functions of the number of observations. Figure 2 also includes the error from using a lower-dimensional sample described below.

Figure 2. Simulation errors using Full and Lower Dimension for Example 7.2.

Note that the errors fluctuate, indicating some difficulty in providing coverage with a sampling procedure. The errors are approximately 1% for an error of 0.01 in Example 7.1 and 0.1 in Example 7.2. Note that this level of accuracy is achieved in about 1000 iterations for Example 7.2, although the lower dimensional Example 7.1 requires more than 2000 iterations to achieve a similar percentage error. The difference is attributable to higher overall variance in the sample gradients in Example 7.1.

(ii) Refinements of Jensen and Edmundson-Madansky Bounds: These bounds result from extremal measures in the space of measures satisfying the same first moment condition as the true distribution (see [6]). We consider the refinements suggested in Frauendorfer and Kall [16] to observe their convergence to the optimal objective value and their effectiveness in predicting the gradient values. These bounds are taken as examples of bounds using discrete measures, although refinements exist (see [8]). The gradient errors from the E-M and Jensen bound approximations appear in Figure 3 for Example 7.1 and Figure 4 for Example 7.2. Here, iterations refer to the number of partitions used for these approximations.

Figure 3. Bound errors for Example 7.1.

Figure 4. Bound errors for Example 7.2.

The errors in function value in each case are reduced to less than 0.5% for each bound by 40 iterations, but the gradient errors may still remain high. The main reason for this, in terms of the upper bound in particular, is that degenerate bases are chosen at extreme points. Their incorporation into the bound may always lead to some error. The lower bound estimate does not suffer this effect so dramatically (because these estimates do converge to the true values, see [4]), but evaluations at nondifferentiable points still affect the bound. The overall observation here is that the bounding approximations are not indicated for gradient-based methods, although they still appear quite useful in cutting plane methods (see, e.g., [1]).

(iii) Probability approximation: For this approach, we use the Boole-Bonferroni method to approximate the probability of a basis's optimality.

This approximation is not useful for the first problem, with only two random variables, but Example 7.2 contains 7 constraints, so that evaluating the probability of each basis's optimality involves evaluating the probability of a region corresponding to the intersection of 7 half-spaces. Some preprocessing was required to find each optimal basis for the range of possible demand outcomes given. To obtain these, we use the following extreme point generation approach.

Step 1. Solve ψ(Tx − ξ̄) to obtain an optimal basis B_1 (which remains optimal as long as (B_1)^{-1}(ξ − Tx) ≥ 0). Let k = 1, r = 1, i = 1.

Step 2. If ((B_i)^{-1}(ξ − Tx))_r < 0 for some ξ ∈ Ξ, perform a dual simplex pivot step to drop the variable that is basic in row r for B_i (maintaining dual feasibility), and let the new basis be B*; go to 3. Otherwise, let r = r + 1 and go to 4.

Step 3. If B* ∈ {B_1, ..., B_k}, let r = r + 1 and go to 4. Otherwise, go to 5.

Step 4. If r ≤ l, go to 2. Otherwise, let i = i + 1; if i ≤ k, let r = 1 and go to 2; if i > k, STOP: all optimal bases have been identified.

Step 5. If {ξ ∈ Ξ | (B*)^{-1}(ξ + εe_i − Tx) ≥ 0} ≠ ∅ for i = 1, ..., l and some ε > 0, then let B_{k+1} = B*, k = k + 1, r = r + 1, and go to 4. Otherwise, let r = r + 1 and go to 4.

The use of variations of ε in each component of ξ ensures that B* is feasible with some positive probability (assuming Ξ has full dimension), so that zero probability bases are not included. The above procedure identifies all such optimal bases for the linear program in (1.2) with −ω = ξ − Tx by ensuring that any infeasible component leads to an additional iteration, with a branch ending only when all adjacent dual feasible bases are either infeasible or have already been identified. Under degeneracy, several bases B* may be adjacent to a given basis. We assume that a lexicographic ordering is used to avoid this difficulty. The result is then an unambiguous listing of the bases.

Given identification of the bases, the Boole-Bonferroni approach calculates bounds using the formula in Proposition 2.1. In Example 7.2, the α_j probabilities were calculated exactly using t_j = 1 for five of the six alternative optimal bases. For the other basis, the optimal value was t_j = 0.75. The result for the five bases with t_j = 1 is that it is sufficient to calculate α_j using only a_j and b_j; that is, there is no probability that three regions of the form η_{ji} > s_{ji}(x) intersect. It appears that performing calculations with combinations of up to three intersecting regions η_{ji} > s_{ji}(x) yields good results. In Example 7.2, all basis probabilities were calculated exactly using three (of a possible seven) terms of the inclusion-exclusion approach, i.e.,

    α_j = 1 − a_j + b_j − Σ_{1≤i<i'<i''≤l} P(η_{ji} > s_{ji}(x), η_{ji'} > s_{ji'}(x), η_{ji''} > s_{ji''}(x)).

If t_j = 1 was used for all bases, the gradient error was 0.66, or approximately 5%. For t_j = 0.5, some basis probability estimates became negative. In general, it has been observed (see [23]) that the bound with t_j = 1 (due to Dawson and Sankoff) is much sharper than with t_j = 0, so that one might generally bias toward higher t_j values and use t_j = 1 in most instances. Given the difficulty in choosing t_j universally, it appears best to consider the values of Σ_{1≤i<i'<i''≤l} P(η_{ji} > s_{ji}(x), η_{ji'} > s_{ji'}(x), η_{ji''} > s_{ji''}(x)) if this additional computation is not difficult and the distributions of the η_{ji} are known (as in the normal case).

It appears that relatively small numbers of intersecting regions need to be considered to obtain fairly accurate bounds on a basis's probability of optimality. This is again consistent with Prékopa's observations. The remaining difficulty in this probability approximation is the identification of the optimal bases, whose number may be too great for easy computation. For this reason, we consider methods for reducing the number of optimal bases considered. The next approaches do this explicitly.

(iv) Box truncation: We use the box approximation method by considering boxes centered at the mean value of the random vector and considering progressively larger fractions of the full range of each component of the random vector. The errors for Example 7.1 as a function of the fraction of range covered in each component appear in Figure 5. The corresponding errors for Example 7.2 appear in Figure 6.

Figure 5. Box truncation errors for Example 7.1.

Figure 6. Box truncation errors for Example 7.2.

Note that the errors do not improve until new bases beyond those immediately adjacent to the mean point are added. In each case, this improvement only begins when half of each component's region is considered. The error then improves almost linearly to zero. The advantage of box truncation would be in conjunction with a basis probability approximation; for smaller regions, fewer bases are required, so less computation is necessary. In Example 7.1, two bases are optimal for component fractions less than 0.5, but all five bases are optimal for fractions above 0.5. In Example 7.2, four bases are optimal until the fraction covered in each component exceeds 0.5. At this point, five bases are optimal until the fraction exceeds 0.75, when all six optimal bases are included.

The relatively small range of these numbers of bases indicates that box truncation may not have significant advantages here, although in larger problems the number of optimal bases may become significant.

(v) Limited basis truncation: We considered the sets of bases identified near the mean values, as in the box truncation step. For Example 7.1, using the two bases optimal at the mean ξ̄ = (0.5, 0.5)^T yields the same error, 2.43, as in Figure 5 for fractions less than 0.5. The only other consistent choice would be to include all optimal bases, with resulting zero error. In this case, limited bases do not appear too useful. In Example 7.2, the error using the single optimal basis found at the mean ξ̄ is 4.89, or more than 40%, while the error with the four neighboring bases is 1.09, or about 10%. Note that this error is higher than the box truncation error when these bases alone are optimal. This difference results from the lack of symmetry in including the complete region where each of these four bases is optimal. It appears from this observation that symmetry makes box truncation more effective for computing gradients with a limited number of bases than using the full probability of each basis's optimality.

(vi) Lower-dimensional approximation: We suppose that the η vector from Section 5 is one-dimensional, so that all computations of Q_i^ν(B) involve only single integrals. In both Examples 7.1 and 7.2, we used the last random component for η and used the quasi-random procedure from the sampling approach to determine the A_i areas (implicitly defined as equal probability regions around each point β_i). For Example 7.1, the use of this lower dimensional computation led to rapid convergence. Errors were 0.78 after ten iterations but became less than 0.001 after thirty iterations. This represents a considerable improvement over the results in Figure 1. Some additional calculation was necessary for each Q_i^ν(B), but this integration simply involved a single parametric linear program solution with little additional work (at most two pivot steps) over the linear program required for each function evaluation.

For Example 7.2, the results of the lower dimensional approximation appear in Figure 2, as mentioned earlier. The results here are less conclusive. The lower dimensional approximation does have somewhat lower error after 600 iterations (by approximately 50% over iterations from 600 to 2500), but the improvement is not as dramatic as in the smaller example. For larger problems, the modest improvement of the lower dimensional approach for a single gradient evaluation may still hold. One key to its effectiveness is choosing the most critical component as the η variable. The other advantage of the lower dimensional approximation is that it is possible to maintain differentiability and to achieve better algorithm performance based on this attribute. This aspect of the approximations is a subject for future study.

Section 8. Conclusions

This paper presented several results on using continuous distribution functions in approximations for stochastic programs. The main motivation is to provide differentiable approximate value functions that lead to more stable computational implementations. The results are based on the ability to approximate probabilities accurately in higher dimensions and on using low dimensional integration combined with discrete approximation to achieve differentiable but computable estimates. Some example results indicate that probability and gradient approximations can be accurate with separate low dimensional integrals. Future research will investigate these procedures in the context of optimization methods.

REFERENCES

[1] J. Birge, "Decomposition and partitioning methods for multi-stage stochastic linear programs," Operations Research 33 (1985) 989-1007.
[2] J.R. Birge and L. Qi, "Computing block-angular Karmarkar projections with applications to stochastic programming," Management Science 34 (1988) 1472-1479.
[3] J.R. Birge and L. Qi, "Semiregularity and generalized subdifferentials with applications for optimization," Mathematics of Operations Research, to appear.
[4] J.R. Birge and L. Qi, "Subdifferentials in approximation for stochastic programs," SIAM Journal on Optimization, to appear.
[5] J.R. Birge and S.W. Wallace, "A separable piecewise linear upper bound for stochastic linear programs," SIAM Journal on Control and Optimization 26 (1988) 725-739.

[6] J.R. Birge and R.J-B. Wets, "Designing approximation schemes for stochastic optimization problems, in particular, for stochastic programs with recourse," Mathematical Programming Study 27 (1986) 54-102.
[7] J.R. Birge and R.J-B. Wets, "Computing bounds for stochastic programming problems by means of a generalized moment problem," Mathematics of Operations Research 12 (1987) 149-162.
[8] J.R. Birge and R.J-B. Wets, "Sublinear upper bounds for stochastic programs with recourse," Mathematical Programming 43 (1989) 131-149.
[9] F.H. Clarke, Optimization and Nonsmooth Analysis, Wiley, New York, 1983.
[10] I. Deák, "Three digit accurate multiple normal probabilities," Numerische Mathematik 35 (1980) 369-380.
[11] I. Deák, Random Number Generators and Simulation, Akadémiai Kiadó, Budapest, 1990.
[12] Y. Ermoliev, "Stochastic quasigradient methods and their application to system optimization," Stochastics 9 (1983) 1-36.
[13] Y. Ermoliev and R. Wets, eds., Numerical Techniques for Stochastic Optimization, Springer-Verlag, Berlin, 1988.
[14] B.L. Fox, "Implementation and relative efficiency of quasirandom sequence generators," ACM Transactions on Mathematical Software 12 (1986) 362-376.
[15] K. Frauendorfer, "Stochastic Two-Stage Programming," Habilitationsschrift, University of Zurich, 1992.
[16] K. Frauendorfer and P. Kall, "A solution method for SLP recourse problems with arbitrary multivariate distributions - the independent case," Problems of Control and Information Theory 17 (1988) 177-205.
[17] H.I. Gassmann, "Conditional probability and conditional expectation of a random vector," in: Y. Ermoliev and R. Wets, eds., Numerical Techniques for Stochastic Optimization, Springer-Verlag, Berlin, 1988, pp. 237-254.
[18] C.R. Givens and R.M. Shortt, "A class of Wasserstein metrics for probability distributions," Michigan Mathematical Journal 31 (1984) 231-240.
[19] J.L. Higle and S. Sen, "Stochastic decomposition: An algorithm for two-stage stochastic linear programs with recourse," Mathematics of Operations Research 16 (1991) 650-669.
[20] J.B. Hiriart-Urruty, J.J. Strodiot and V.H. Nguyen, "Generalized Hessian matrix and second-order optimality conditions for problems with C^{1,1} data," Applied Mathematics and Optimization 11 (1984) 43-56.

[21] P. Kall, "Stochastic programming - an introduction," tutorial talk at the Sixth International Conference on Stochastic Programming, Udine, Italy, September 1992.
[22] F.V. Louveaux and Y. Smeers, "Optimal investments for electricity generation: A stochastic model and a test-problem," in: Y. Ermoliev and R. Wets, eds., Numerical Techniques for Stochastic Optimization, Springer-Verlag, Berlin, 1988.
[23] A. Prékopa, "Boole-Bonferroni inequalities and linear programming," Operations Research 36 (1988) 145-162.
[24] A. Prékopa and R.J-B. Wets, eds., Stochastic Programming 84, Mathematical Programming Study 27 & 28 (1986).
[25] L. Qi, "LC^1 functions and LC^1 optimization problems," Applied Mathematics Preprint AM 91/21, University of New South Wales, Australia, June 1991.
[26] L. Qi, "Superlinearly convergent quasi-Newton methods for LC^1 optimization problems," Applied Mathematics Preprint AM 91/35, University of New South Wales, Australia, September 1991.
[27] S.T. Rachev, "The Monge-Kantorovich mass transference problem and its stochastic applications," Theory of Probability and its Applications 29 (1984) 647-676.
[28] W. Römisch and R. Schultz, "Distribution sensitivity in stochastic programming," Mathematical Programming 50 (1991) 197-226.
[29] W. Römisch and R. Schultz, "Stability analysis for stochastic programs," Annals of Operations Research 31 (1991) 241-266.
[30] J. Wang, "Lipschitz continuity of objective functions in stochastic programs with fixed recourse and its applications," Mathematical Programming Study 27 (1986) 145-152.
[31] R.J-B. Wets, "Stochastic programming: Solution techniques and approximation schemes," in: A. Bachem, M. Grötschel and B. Korte, eds., Mathematical Programming: The State of the Art - Bonn 1982, Springer-Verlag, Berlin, 1983, pp. 566-603.
