Extension of Some Results for Channel Capacity Using a Generalized Information Measure

Aharon Ben-Tal*' and Marc Teboulle**

Department of Industrial and Operations Engineering
The University of Michigan
Ann Arbor, MI 48109

Technical Report 86-38
September 1986

*Technion, Israel Institute of Technology and Department of Industrial and Operations Engineering, The University of Michigan, Ann Arbor, MI 48109, U.S.A.
**Department of Mathematics, Statistics and Computing Science, Dalhousie University, Halifax, Nova Scotia, Canada B3H 3J5.
'Research supported by NSF Grant ECS-8604354.


Abstract

A new formulation of the channel capacity problem is derived by using the duality theory of convex programming. The simple nature of this dual representation makes it suitable for computational purposes. The results are derived in a unified way by formulating the channel capacity problem as a special case of a general class of concave programming problems involving a generalized information measure recently introduced by Burbea and Rao [10].

Key words: information theory, channel capacity, convex programming duality, generalized information measure.

1. INTRODUCTION

Channel capacity, a basic concept in information theory, was introduced by Shannon [26] to specify the maximum rate at which information can be reliably conveyed by a channel. Roughly speaking, the basic theorem of information theory, the so-called "noisy channel coding theorem", states that if a given noisy channel has capacity C, it is possible to transmit over this channel messages of sufficiently large length and still be able to decode them with an arbitrarily small probability of error, provided the rate of transmission is less than C.

Methods for computing the capacity C of a discrete channel have been studied by Muroga [21], Cheng [12] and Takano [28]. The best known algorithm, however, is the one introduced independently by Arimoto [2] and Blahut [7]. A somewhat similar iterative procedure, based on the method of quasi-concave programming, was proposed by Meister and Oettli [20]. In these works the computational schemes are derived from the primal formulation of the channel capacity problem.

In this paper the classical channel capacity problem is embedded in a family $(P_\phi)$ of linearly constrained concave programming problems, each member of which is determined by the choice of a convex function $\phi$; the classical case corresponds to $\phi(t) = t \log t$. The objective function in $(P_\phi)$ is the generalized average mutual information measure recently introduced by Burbea and Rao [10], and the optimal value of $(P_\phi)$ is our generalized channel capacity $C_\phi$. A duality theory is developed for $(P_\phi)$, resulting in a dual representation of $C_\phi$. As a special case, a new formulation of the classical channel capacity is obtained. The dual of $(P_\phi)$ (denoted $(D_\phi)$) is a specially structured unconstrained minimax problem, thus lending itself to efficient computational methods.

The dual formulation is also very useful for obtaining upper bounds on $C_\phi$.

The paper is organized as follows. In Section 2 the formulation of the classical channel capacity problem is given and the iterative method of Arimoto [2] and Blahut [7] is briefly reviewed. In Section 3 we formulate the generalized capacity $C_\phi$ and develop the theory leading to two different dual representations; the first (Theorem 3.1) is suitable for computations and the second (Corollary 3.1) is useful for deriving upper bounds. The bounds are given in Section 4, where it is also shown that for symmetric channels the bound is attained, i.e., an explicit formula for $C_\phi$ is obtained. Section 5 contains concluding remarks and a brief discussion of possible extensions.

2. THE CHANNEL CAPACITY PROBLEM

Consider a communication channel described by an input alphabet $A = \{1,2,\ldots,m\}$, an output alphabet $B = \{1,2,\ldots,n\}$ and a probability transition matrix $Q = \{Q_{kj}\}$, where $Q_{kj}$ is the probability of receiving the output letter $k \in B$ when the input letter $j \in A$ was transmitted, i.e., $\sum_{k=1}^{n} Q_{kj} = 1$ for all $j \in A$ and $Q_{kj} \ge 0$ for all $k \in B$, $j \in A$. The capacity of the channel is defined as

$$C := \max_{p \in P_m} I(p,Q) := \max_{p \in P_m} \sum_{j=1}^{m} \sum_{k=1}^{n} p_j Q_{kj} \log \frac{Q_{kj}}{\sum_{i=1}^{m} p_i Q_{ki}} \tag{2.1}$$

where

$$P_m := \Big\{ p \in \mathbb{R}^m : p_j \ge 0 \ \forall j \in A, \ \sum_{j=1}^{m} p_j = 1 \Big\} \tag{2.2}$$

is the set of all discrete finite probability measures on the channel input, and $I(p,Q)$ is known as the average mutual information between the channel input and the channel output, considered here as a function of $p$.
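For illustration, $I(p,Q)$ of (2.1) is straightforward to evaluate numerically. The following is a minimal sketch of ours (not part of the original report); the function name is a hypothetical choice:

```python
import numpy as np

def mutual_information(p, Q):
    """I(p,Q) of (2.1). Q[k, j] = Prob(output k | input j); columns of Q sum to 1.
    Terms with p_j Q_kj = 0 are dropped (the 0 log 0 = 0 convention)."""
    q = Q @ p                           # output distribution q_k = sum_j p_j Q_kj
    total = 0.0
    for j in range(Q.shape[1]):
        for k in range(Q.shape[0]):
            if p[j] > 0 and Q[k, j] > 0:
                total += p[j] * Q[k, j] * np.log(Q[k, j] / q[k])
    return total

Q = np.array([[0.9, 0.2],               # a small illustrative channel
              [0.1, 0.8]])
print(mutual_information(np.array([0.5, 0.5]), Q))   # I in nats
```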

The utility of the concept of capacity is widely discussed in the literature; for more details the reader is referred to Shannon [26], Gallager [15], Jelinek [16] and to the more recent book of Csiszar and Korner [14].

For a given probability transition matrix $Q$, it is shown in Gallager [15] that $I(\cdot,Q)$ is a concave function of $p$, and therefore problem (2.1) is a concave programming problem over the simplex $P_m$; thus any of a number of readily available nonlinear programming codes can be used to compute $C$. However, as reported by Blahut [7], computational experience with nonlinear programming codes applied to problem (2.1) has proved inefficient even for small alphabet sizes and impractical for larger alphabet sizes. This motivated Arimoto [2] and Blahut [7], independently, to develop a systematic iterative method for computing the capacity. This was done by exploiting the special structure of the objective function $I(\cdot,Q)$. More specifically, let $P = (P_{jk})$ denote a transition matrix from the channel output alphabet to the channel input alphabet; then

$$I(p,Q) = \max_{P \in T} J(p,P;Q), \qquad J(p,P;Q) := \sum_{j=1}^{m} \sum_{k=1}^{n} p_j Q_{kj} \log \frac{P_{jk}}{p_j} \tag{2.3}$$

where

$$T := \Big\{ P \in \mathbb{R}^{m \times n} : \sum_{j=1}^{m} P_{jk} = 1 \text{ for all } k, \ P_{jk} \ge 0 \text{ for all } j \text{ and } k \Big\}.$$

This can be verified by noting that the maximizing $P$ is given by

$$P_{jk} = \frac{Q_{kj}\, p_j}{\sum_{i=1}^{m} Q_{ki}\, p_i}.$$

The Arimoto-Blahut algorithm can be summarized as follows:

(0) Choose an initial probability vector $p^{(0)} \in P_m$.

At iteration $r$, where $p^{(r)}$ is given:

(i) Compute $P^{(r)} = \arg\max_{P \in T} J(p^{(r)},P;Q)$.

(ii) Update $p^{(r+1)} = \arg\max_{p \in P_m} J(p,P^{(r)};Q)$.   (2.4)

(iii) Iterate $r \leftarrow r+1$.

The solution of (2.4) is explicitly given by

$$p_j^{(r+1)} = \frac{p_j^{(r)}\, c_j(p^{(r)})}{\sum_{i=1}^{m} p_i^{(r)}\, c_i(p^{(r)})} \tag{2.5}$$

where, for any $p \in P_m$,

$$c_j(p) = \exp\Big( \sum_{k=1}^{n} Q_{kj} \log \frac{Q_{kj}}{\sum_{i=1}^{m} p_i Q_{ki}} \Big).$$

For a proof that the method (2.5) converges, i.e., that $\lim_{r \to \infty} I(p^{(r)},Q) = C$, see Arimoto [2] and Blahut [7]. A sketch of the iteration is given below. The amount of computation involved depends upon the size of the channel matrix; some conditions under which it can be reduced are discussed in Cheng [12] and Takano [28].

In this paper we suggest a dual formulation of the channel capacity problem. The simple nature of this dual problem opens the possibility of applying many recent numerical schemes available in the mathematical programming literature, in particular for large alphabet sizes (see e.g. [27]); this will be discussed in the next sections.
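The following is a minimal NumPy sketch of the iteration (2.4)-(2.5) (ours, not code from the original report; the uniform starting point and stopping tolerance are arbitrary choices):

```python
import numpy as np

def log_c(p, Q):
    # log c_j(p) = sum_k Q_kj log(Q_kj / q_k), with the 0 log 0 = 0 convention
    q = Q @ p
    with np.errstate(divide='ignore', invalid='ignore'):
        t = np.where(Q > 0, Q * np.log(Q / q[:, None]), 0.0)
    return t.sum(axis=0)

def arimoto_blahut(Q, tol=1e-12, max_iter=100_000):
    """Channel capacity (in nats) and optimal input p by the update (2.5)."""
    m = Q.shape[1]
    p = np.full(m, 1.0 / m)              # step (0): start from the uniform input
    for _ in range(max_iter):
        c = np.exp(log_c(p, Q))
        p_new = p * c / (p @ c)          # update (2.5)
        if np.abs(p_new - p).max() < tol:
            p = p_new
            break
        p = p_new
    return float(p @ log_c(p, Q)), p     # I(p,Q) = sum_j p_j log c_j(p)

Q = np.array([[0.9, 0.1],
              [0.1, 0.9]])               # BSC with crossover probability 0.1
C, p_opt = arimoto_blahut(Q)
print(C)                                 # ~ log 2 - H(0.1) = 0.3681 nats
```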

3. THE GENERALIZED CAPACITY PROBLEM AND ITS DUALS

In this section we derive in a unified way a dual representation for the channel capacity problem. Recall that the channel capacity $C$ is given as the optimal value of the following optimization problem:

$$\sup_{p \in P_m} \sum_{j=1}^{m} \sum_{k=1}^{n} p_j Q_{kj} \log \frac{Q_{kj}}{\sum_{i=1}^{m} p_i Q_{ki}}. \tag{3.1}$$

Denote by $q_k$ the output probabilities, $q_k := \sum_{j=1}^{m} p_j Q_{kj}$; then $q_k \ge 0$ for all $k = 1,\ldots,n$ and $\sum_{k=1}^{n} q_k = 1$, i.e., with our notation, $q \in P_n$. In the decision variables $(p_j, q_k)$, problem (3.1) can be written equivalently as

$$(P) \quad \sup_{p \in P_m} \sup_{q \in P_n} \ \sum_{j=1}^{m} p_j \sum_{k=1}^{n} Q_{kj} \log Q_{kj} - \sum_{k=1}^{n} q_k \log q_k \tag{3.2}$$

$$\text{s.t.} \quad \sum_{j=1}^{m} p_j Q_{kj} = q_k, \quad k = 1,\ldots,n. \tag{3.3}$$

The objective function in (3.2) is concave in $(p,q)$; therefore problem (P) is a linearly constrained concave program. Also note that the feasible set of (P) is a compact convex polyhedron in $\mathbb{R}^{m+n}$, hence the sup in (P) is actually attained. The special structure of the objective function formulated in (3.2) (linear in $p$ minus strictly convex in $q$)

motivates us to consider the following general class of concave programming problems:

$$(P_\phi) \quad \max_{p \in P_m} \max_{q \in P_n} \ \sum_{j=1}^{m} p_j \sum_{k=1}^{n} \phi(Q_{kj}) - \sum_{k=1}^{n} \phi(q_k) \tag{3.4}$$

$$\text{s.t.} \quad \sum_{j=1}^{m} p_j Q_{kj} = q_k, \quad k = 1,\ldots,n.$$

Throughout the rest of this paper we assume that $\phi$ is a given twice continuously differentiable strictly convex function defined on an interval containing $[0,1]$, normalized such that $\phi(0) = \phi(1) = 0$, $\phi'(1) < \infty$, and satisfying the additional assumption

$$\lim_{t \to 0^+} \phi'(t) = -\infty.$$

Note that the latter assumption holds if $\phi$ is essentially smooth on $[0,+\infty)$ (see e.g. Rockafellar [24]). We denote the class of such $\phi$ by $\Phi$; accordingly, $C_\phi$ will denote the optimal value of problem $(P_\phi)$. An important example of functions $\phi \in \Phi$ is provided by the family of functions $\phi_\alpha$ (parametrized by $\alpha$):

$$\phi_\alpha(t) = \begin{cases} \dfrac{1}{1-\alpha}\,(t - t^\alpha) & \text{for } 0 < \alpha < 1 \\[4pt] t \log t & \text{for } \alpha = 1. \end{cases}$$

Clearly, with $\phi_1(t) = t \log t$, problem $(P_\phi)$ is just the classical channel capacity problem (P). The objective function used in $(P_\phi)$ is exactly the generalized average mutual information measure introduced and studied by Burbea and Rao [10]. A related generalized measure of information was also recently introduced by Ben-Tal and Teboulle [6], where the associated rate distortion function was studied.
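For concreteness, the family $\phi_\alpha$ and the objective of $(P_\phi)$ can be evaluated as follows (a sketch of ours; the names `phi_alpha` and `generalized_objective` are hypothetical, and $q$ is taken as the output distribution induced by $p$ so that the constraint of (3.4) holds):

```python
import numpy as np

def phi_alpha(t, alpha):
    """phi_alpha(t) = (t - t**alpha)/(1 - alpha) for 0 < alpha < 1, t log t at alpha = 1."""
    t = np.asarray(t, dtype=float)
    if alpha == 1.0:
        return np.where(t > 0, t * np.log(np.where(t > 0, t, 1.0)), 0.0)
    return (t - t**alpha) / (1.0 - alpha)

def generalized_objective(p, Q, phi):
    """Objective of (P_phi): sum_j p_j sum_k phi(Q_kj) - sum_k phi(q_k)."""
    q = Q @ p
    return float(p @ phi(Q).sum(axis=0) - phi(q).sum())

Q = np.array([[0.9, 0.1],
              [0.1, 0.9]])
p = np.array([0.5, 0.5])
for a in (0.5, 0.9, 1.0):                     # at alpha = 1 this equals I(p,Q)
    print(a, generalized_objective(p, Q, lambda t: phi_alpha(t, a)))
```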

For additional generalizations and applications of generalized information measures the reader is referred to Aczel [1], Arimoto [3], Burbea [8], [9], Burbea and Rao [11], Csiszar [13], Rao and Nayak [22], Renyi [23], and Ziv and Zakai [30].

The dual representation of $(P_\phi)$ will be derived via Lagrangian duality. Before stating the main result of this section we introduce the following notation and definitions. For any $\phi \in \Phi$, let $f: \mathbb{R}^n \to \mathbb{R}$ be defined by

$$f(v) = \inf_{\eta \in \mathbb{R}} \Big\{ \eta + \sum_{k=1}^{n} \phi^*(v_k - \eta) \Big\} \tag{3.5}$$

where $\phi^*(y) = \sup_t \{ ty - \phi(t) \}$ denotes the usual convex conjugate of $\phi$. Also let $\ell: \mathbb{R}^n \to \mathbb{R}$ be defined by

$$\ell(v) = \max_{1 \le j \le m} \{ \ell_j(v) + b_j \} \tag{3.6}$$

where the $\ell_j$ are the linear functions given by $\ell_j(v) = -\sum_{k=1}^{n} v_k Q_{kj}$ and the $b_j$ denote the constants $b_j = \sum_{k=1}^{n} \phi(Q_{kj})$.

Theorem 3.1. The dual problem of $(P_\phi)$, for $\phi \in \Phi$, is given by

$$(D_\phi) \quad \inf_{v \in \mathbb{R}^n} \{ f(v) + \ell(v) \}. \tag{3.7}$$

If $(P_\phi)$ is feasible, the minimum in $(D_\phi)$ is attained and the optimal values coincide: $C_\phi = \max(P_\phi) = \min(D_\phi)$.
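Before turning to the proof, here is a purely numerical illustration of the quantities $f$ and $\ell$ in (3.5)-(3.6) (a sketch of ours; the conjugate is computed by a bounded one-dimensional search, and the search intervals are ad hoc truncations adequate for the small classical example shown):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def conjugate(phi, y, t_hi=1e6):
    """phi*(y) = sup_{t >= 0} { t y - phi(t) }, approximated on [0, t_hi]."""
    res = minimize_scalar(lambda t: phi(t) - y * t, bounds=(0.0, t_hi), method='bounded')
    return -res.fun

def f_of_v(phi, v):
    """f(v) of (3.5): inf_eta { eta + sum_k phi*(v_k - eta) }, by a 1-D search."""
    obj = lambda eta: eta + sum(conjugate(phi, vk - eta) for vk in v)
    return minimize_scalar(obj, bounds=(min(v) - 10.0, max(v) + 10.0),
                           method='bounded').fun

def ell_of_v(Q, phi, v):
    """l(v) of (3.6): max_j { -sum_k v_k Q_kj + sum_k phi(Q_kj) }."""
    b = np.array([sum(phi(Q[k, j]) for k in range(Q.shape[0]))
                  for j in range(Q.shape[1])])
    return float(np.max(b - Q.T @ v))

phi = lambda t: t * np.log(t) if t > 0 else 0.0   # the classical phi
v = np.array([0.2, -0.1])
print(f_of_v(phi, v), np.log(np.exp(v).sum()))    # f(v) vs. its closed form (Section 3)
```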

Proof. The Lagrangian for problem $(P_\phi)$ is

$$L(p,q,v) = \sum_{j=1}^{m} p_j \sum_{k=1}^{n} \phi(Q_{kj}) - \sum_{k=1}^{n} \phi(q_k) + \sum_{k=1}^{n} v_k \Big( q_k - \sum_{j=1}^{m} p_j Q_{kj} \Big)$$

and is separable in the decision variables $(p,q)$. The dual objective function is then

$$h(v) = \max_{p \in P_m} \sum_{j=1}^{m} p_j \sum_{k=1}^{n} \big( \phi(Q_{kj}) - v_k Q_{kj} \big) + \max_{q \in P_n} \sum_{k=1}^{n} \big( q_k v_k - \phi(q_k) \big) \tag{3.8}$$

and the dual problem is defined as $\min_{v \in \mathbb{R}^n} h(v)$. The first "max" in (3.8) is easily computed:

$$\max_{p \in P_m} \sum_{j=1}^{m} p_j \sum_{k=1}^{n} \big( \phi(Q_{kj}) - v_k Q_{kj} \big) = \max_{1 \le j \le m} \sum_{k=1}^{n} \big( \phi(Q_{kj}) - v_k Q_{kj} \big). \tag{3.9}$$

To evaluate the second "max" in (3.8) we note that a Lagrangian dual of

$$\alpha = \max_{q \in P_n} \sum_{k=1}^{n} \big( q_k v_k - \phi(q_k) \big) \tag{3.10}$$

is given by

$$\beta = \min_{\eta \in \mathbb{R}} \Big\{ \eta + \sum_{k} \max_{q_k \ge 0} \{ q_k (v_k - \eta) - \phi(q_k) \} \Big\} \tag{3.11}$$

where $\eta$ is the Lagrange multiplier for the equation $\sum_k q_k = 1$. Problem (3.10) is a linearly constrained concave program satisfying trivially the Slater constraint qualification, and hence by standard duality arguments [24] we have $\alpha = \beta$. To compute $\beta$, we consider the problem

$$e(y) = \sup_{t \ge 0} \{ ty - \phi(t) \}.$$

By the monotonicity of $\phi'$, it follows that

$$e(y) = \begin{cases} \phi^*(y) & \text{if } y \ge \phi'(0) \\ -\phi(0) & \text{if } y < \phi'(0), \end{cases}$$

where $\phi'(0)$ denotes $\lim_{t \to 0^+} \phi'(t)$. But since we assumed that $\phi'(0) = -\infty$, we conclude that in fact $e(y) = \phi^*(y)$ for all $y$. Using this result in (3.11) we get

$$\beta = \min_{\eta \in \mathbb{R}} \Big\{ \eta + \sum_{k=1}^{n} \phi^*(v_k - \eta) \Big\}. \tag{3.12}$$

Substituting (3.9) and (3.12) in (3.8) and using the notation (3.5), (3.6), we obtain $h(v) = f(v) + \ell(v)$, and hence the dual representation (3.7) is proved. Further, for $\phi \in \Phi$, $\phi^*$ is convex, and then $\eta + \sum_k \phi^*(v_k - \eta)$ is jointly convex in $(\eta,v) \in \mathbb{R} \times \mathbb{R}^n$; hence (by Rockafellar [25], Theorem 1) $f(v)$ is convex. Finally, since $(P_\phi)$ is assumed feasible and is linearly constrained, the Slater regularity condition holds trivially, and it then follows from standard duality results (see e.g. [24]) that the inf in $(D_\phi)$ is attained and $\max(P_\phi) = \min(D_\phi)$. $\square$

The next result shows how to obtain the optimal solution of the primal problem $(P_\phi)$ from an optimal solution of the dual.

Theorem 3.2. Let $\bar v$ be an optimal solution of $(D_\phi)$. Then the optimal solution $(\bar p, \bar q)$ of the primal problem $(P_\phi)$ is computed as follows:

$$\bar q_k = (\phi')^{-1}(\bar v_k - \bar\eta), \quad k = 1,\ldots,n, \tag{3.13}$$

where $\bar\eta$ is the unique solution of the equation

$$\sum_{k=1}^{n} (\phi')^{-1}(v_k - \eta) = 1, \tag{3.14}$$

and $\bar p$ is an optimal solution of the linear program

$$(L_\phi) \quad \max \ \sum_{j=1}^{m} p_j \sum_{k=1}^{n} \phi(Q_{kj})$$
$$\text{s.t.} \quad \sum_{j=1}^{m} p_j Q_{kj} = \bar q_k, \quad k = 1,\ldots,n,$$
$$\qquad \sum_{j=1}^{m} p_j = 1, \quad p_j \ge 0, \quad j = 1,\ldots,m.$$

Proof: The expression (3.13) for $\bar q_k$ is just the optimality condition for $q_k = \bar q_k$ to solve the inner maximization in (3.11) (recall that the optimal $q_k$ cannot be zero, as explained in the proof of Theorem 3.1). The optimality condition for $\eta = \bar\eta$ to solve the convex unconstrained problem (3.12) is

$$\sum_{k=1}^{n} (\phi^*)'(v_k - \eta) = 1. \tag{3.15}$$

But $(\phi^*)' = (\phi')^{-1}$ (see e.g. [24], Section 26), and thus (3.15) is exactly (3.14). Now, since $\phi \in \Phi$, $\phi'(0) = -\infty$ and $\phi'(1) < \infty$, implying $(\phi')^{-1}(-\infty) = 0$ and $(\phi')^{-1}(\infty) > 1$. Thus equation (3.14) has a solution $\eta = \eta(v)$ for all $v$, which is moreover unique since $\phi'$ (and hence $(\phi')^{-1}$) is strictly monotone. The statement concerning $\bar p$ follows immediately from (3.4). $\square$
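Numerically, (3.13)-(3.14) can be carried out by bisection, since the left-hand side of (3.14) is strictly decreasing in $\eta$. A sketch of ours (the bracketing offsets are arbitrary choices, adequate for the classical case shown):

```python
import numpy as np

def recover_q(phi_prime_inv, v, iters=200):
    """Solve sum_k (phi')^{-1}(v_k - eta) = 1 for eta (eq. (3.14)) by bisection
    and return qbar of (3.13). phi_prime_inv is the inverse of phi'."""
    g = lambda eta: sum(phi_prime_inv(vk - eta) for vk in v) - 1.0
    lo, hi = min(v) - 50.0, max(v) + 50.0   # g(lo) > 0 > g(hi); g decreasing in eta
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    eta = 0.5 * (lo + hi)
    return np.array([phi_prime_inv(vk - eta) for vk in v])

# Classical case phi(t) = t log t: phi'(t) = 1 + log t, so (phi')^{-1}(s) = exp(s - 1),
# and (3.13)-(3.14) reduce to qbar = softmax(v).
q = recover_q(lambda s: np.exp(s - 1.0), np.array([0.1, -0.3, 0.2]))
print(q, q.sum())                           # a point of the simplex P_n
```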

The dual problem (3.7) is an unconstrained discrete minimax problem. For such problems many algorithms have been proposed in the nonsmooth optimization literature; see e.g. Wolfe [29], Lemarechal [18], [19] and, more recently, Kiwiel [17]. Alternatively, the dual problem (3.7) can be reformulated as a linearly constrained convex program in $\mathbb{R}^{n+1}$. Indeed, by defining $v_{n+1} := \max_{1 \le j \le m} \{ \ell_j(v) + b_j \}$, problem (3.7) is equivalent to

$$\min_{v \in \mathbb{R}^{n+1}} \ f(v) + v_{n+1} \tag{3.16}$$
$$\text{s.t.} \quad \ell_j(v) + b_j - v_{n+1} \le 0 \quad \text{for all } j = 1,\ldots,m,$$

in which case many nonlinear programming codes are readily available to solve the dual formulation (3.16) (see e.g. Schittkowski [27] or the recent generalized reduced gradient code of Lasdon). The special structure of the dual problem

$$\min_{v \in \mathbb{R}^n} \Big\{ f(v) + \max_{1 \le j \le m} \{ \ell_j(v) + b_j \} \Big\}$$

suggests a method in which $f$ is approximated at the $r$-th iteration by a polyhedral function (i.e., a pointwise maximum of finitely many affine functions) $\pi_r(v)$, and the next iteration point $v^{r+1}$ is the optimal solution of

$$\min_{v} \Big\{ \pi_r(v) + \max_{j} \{ \ell_j(v) + b_j \} \Big\}. \tag{3.17}$$

Since (3.17) is a linear minimax problem, it can be solved efficiently with simplex-like algorithms (e.g. [4], [5]). The new polyhedral approximation $\pi_{r+1}(v)$ is obtained by

$$\pi_{r+1}(v) = \max \{ \pi_r(v), s_{r+1}(v) \}$$

where $s_{r+1}(\cdot)$ is the affine support of $f$ at $v^{r+1}$; a sketch of this scheme for the classical case is given below.
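The scheme (3.17) might be sketched as follows for the classical case, where $f(v) = \log \sum_k e^{v_k}$ (as derived next) and its affine supports come from the gradient. This is a Kelley-type sketch of ours; the box constraint, iteration count, and use of an off-the-shelf LP solver are pragmatic choices, not part of the original:

```python
import numpy as np
from scipy.optimize import linprog

def cutting_plane_capacity(Q, iters=60, box=20.0):
    """min_v f(v) + max_j (l_j(v) + b_j) for f(v) = log sum_k exp(v_k)."""
    n, m = Q.shape
    with np.errstate(divide='ignore'):
        b = np.where(Q > 0, Q * np.log(Q), 0.0).sum(axis=0)   # b_j = sum_k phi(Q_kj)
    v, cuts, best = np.zeros(n), [], np.inf
    for _ in range(iters):
        w = np.exp(v)
        fv, g = np.log(w.sum()), w / w.sum()       # f(v) and its gradient
        cuts.append((g.copy(), fv - g @ v))        # affine support s(u) = g.u + offset
        best = min(best, fv + np.max(b - Q.T @ v)) # true objective at the iterate
        # LP in (v, t, s): min t + s  s.t.  g_i.v + c_i <= t,  -Q[:,j].v + b_j <= s
        c = np.concatenate([np.zeros(n), [1.0, 1.0]])
        A = [np.concatenate([gi, [-1.0, 0.0]]) for gi, _ in cuts] + \
            [np.concatenate([-Q[:, j], [0.0, -1.0]]) for j in range(m)]
        rhs = [-ci for _, ci in cuts] + [-b[j] for j in range(m)]
        res = linprog(c, A_ub=np.array(A), b_ub=np.array(rhs),
                      bounds=[(-box, box)] * n + [(None, None)] * 2, method='highs')
        v = res.x[:n]
    return best                                    # approaches C from above

Q = np.array([[0.9, 0.2],
              [0.1, 0.8]])
print(cutting_plane_capacity(Q))   # compare with the Arimoto-Blahut value for this Q
```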

Let us now derive the dual representation of the classical channel capacity $C$. This is done simply by substituting $\phi(t) = t \log t$ in problem $(D_\phi)$. The conjugate is $\phi^*(t^*) = e^{t^*-1}$, and so, by simple calculus,

$$f(v) = \min_{\eta \in \mathbb{R}} \Big\{ \eta + \sum_{k=1}^{n} e^{v_k - \eta - 1} \Big\} = \log \sum_{k=1}^{n} e^{v_k}.$$

Then, using Theorem 3.1, a dual representation of $C$ is given by

$$C = \min_{v \in \mathbb{R}^n} \Big\{ \log \sum_{k=1}^{n} e^{v_k} + \max_{1 \le j \le m} \sum_{k=1}^{n} \big( Q_{kj} \log Q_{kj} - Q_{kj} v_k \big) \Big\}. \tag{3.18}$$
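Formula (3.18) is easy to try numerically. Since the second term is a pointwise maximum, the sketch below (ours) uses a derivative-free minimizer rather than a gradient method:

```python
import numpy as np
from scipy.optimize import minimize

def dual_capacity(Q):
    """Minimize the dual objective (3.18) over v in R^n."""
    n, m = Q.shape
    with np.errstate(divide='ignore'):
        b = np.where(Q > 0, Q * np.log(Q), 0.0).sum(axis=0)   # sum_k Q_kj log Q_kj
    obj = lambda v: np.log(np.exp(v).sum()) + np.max(b - Q.T @ v)
    res = minimize(obj, np.zeros(n), method='Nelder-Mead',
                   options={'xatol': 1e-10, 'fatol': 1e-12, 'maxiter': 20_000})
    return float(res.fun)

Q = np.array([[0.9, 0.1],
              [0.1, 0.9]])
print(dual_capacity(Q))        # ~ 0.3681 nats, matching the Arimoto-Blahut value
```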

In the last part of this section we present an alternative dual problem to $(P_\phi)$ which is given directly in terms of the problem's data.

Corollary 3.1. Let $\phi \in \Phi$. Then

$$C_\phi = \min_{y \in P_n} \max_{1 \le j \le m} \sum_{k} \big\{ \phi'(y_k)(y_k - Q_{kj}) + \phi(Q_{kj}) - \phi(y_k) \big\}. \tag{3.19}$$

Proof: From the proof of Theorem 3.2 it follows that $f(v)$ defined in (3.5) is given by

$$f(v) = \eta(v) + \sum_{k} \phi^*(v_k - \eta(v)) \tag{3.20}$$

where $\eta(v)$ is the unique solution of the equation

$$\sum_{k} (\phi^*)'(v_k - \eta) = 1.$$

Define new variables $y_k = (\phi^*)'(v_k - \eta(v))$. Then $\sum_k y_k = 1$ and $y_k \ge 0$, since $(\phi^*)'(-\infty) = 0$ and $(\phi^*)'$ is increasing; hence $y \in P_n$. From the definition of $y_k$ and using $(\phi^*)' = (\phi')^{-1}$,

$$v_k - \eta(v) = \phi'(y_k). \tag{3.21}$$

Substituting (3.21) in (3.7) and using (3.20) we obtain

$$C_\phi = \eta(v) + \sum_{k} \phi^*(\phi'(y_k)) + \max_{1 \le j \le m} \sum_{k} \big\{ \phi(Q_{kj}) - Q_{kj} \big( \eta(v) + \phi'(y_k) \big) \big\}.$$

But $\sum_k Q_{kj} = 1$, hence

$$C_\phi = \sum_{k} \phi^*(\phi'(y_k)) + \max_{1 \le j \le m} \sum_{k} \big\{ \phi(Q_{kj}) - Q_{kj}\, \phi'(y_k) \big\}. \tag{3.22}$$

Finally, using in (3.22) a simple fact concerning conjugate functions, $\phi^*(\phi'(t)) = t\phi'(t) - \phi(t)$, we get the desired result (3.19). $\square$

Applying Corollary 3.1 to the classical case $\phi(t) = t \log t$, a little algebra shows that a dual representation of $C$ is

$$C = \min_{y \in P_n} \max_{1 \le j \le m} \sum_{k} Q_{kj} \log \frac{Q_{kj}}{y_k},$$

and we recover here a result given in Meister and Oettli [20].

4. UPPER BOUND FOR $C_\phi$

The dual representation derived in Corollary 3.1 is also useful for deriving an upper bound on $C_\phi$.

Theorem 4.1. For any $\phi \in \Phi$,

$$C_\phi \le \sum_{k} \big\{ \phi'(\bar q_k)\,\bar q_k - \phi(\bar q_k) \big\} + \max_{1 \le j \le m} \sum_{k} \big\{ \phi(Q_{kj}) - \phi'(\bar q_k)\,Q_{kj} \big\}, \qquad \bar q_k := \frac{1}{m} \sum_{j=1}^{m} Q_{kj}. \tag{4.1}$$

Proof: From the dual representation (3.19),

$$C_\phi \le \max_{1 \le j \le m} \sum_{k=1}^{n} \big\{ \phi'(y_k)(y_k - Q_{kj}) + \phi(Q_{kj}) - \phi(y_k) \big\} \tag{4.2}$$

for every particular $y$ satisfying $\sum_{k=1}^{n} y_k = 1$ and $y_k \ge 0$, $k = 1,\ldots,n$; in particular for $y_k = \bar q_k = \frac{1}{m} \sum_{j} Q_{kj}$. Substituting this special choice in (4.2) and rearranging terms (the terms not depending on $j$ may be taken outside the max), we obtain the upper bound of the theorem. $\square$

Example 4.1: Let us consider the family $\Phi$ with the functions $\phi_\alpha$ given earlier:

$$\phi_\alpha(t) = \begin{cases} \dfrac{1}{1-\alpha}\,(t - t^\alpha) & 0 < \alpha < 1 \\[4pt] t \log t & \alpha = 1. \end{cases}$$

We denote by $C_\alpha$ and $U_\alpha$, respectively, the corresponding capacity and its upper bound. Using Theorem 4.1 we get

$$C_\alpha \le U_\alpha$$

with

$$U_\alpha = \begin{cases} \dfrac{1}{m^\alpha} \displaystyle\sum_{k} \Big( \sum_{j} Q_{kj} \Big)^{\alpha} + \dfrac{1}{1-\alpha} \max_{1 \le j \le m} \sum_{k} \Big\{ \alpha \Big( \dfrac{1}{m} \sum_{i} Q_{ki} \Big)^{\alpha-1} Q_{kj} - Q_{kj}^{\alpha} \Big\} & \text{if } 0 < \alpha < 1 \\[10pt] \log m + \max_{1 \le j \le m} \displaystyle\sum_{k} Q_{kj} \log \dfrac{Q_{kj}}{\sum_{i} Q_{ki}} & \text{if } \alpha = 1. \end{cases}$$

In particular, we see that the classical upper bound derived in Arimoto [2] is recovered, i.e., $C \equiv C_1 \le U_1$. A numerical sketch of these bounds follows.
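The following short sketch (ours) implements the expression for $U_\alpha$ displayed above; it assumes every output letter is reachable, i.e., $\sum_j Q_{kj} > 0$ for all $k$:

```python
import numpy as np

def upper_bound_alpha(Q, alpha):
    """U_alpha of Example 4.1 (upper bound on C_alpha), for 0 < alpha <= 1."""
    m = Q.shape[1]
    s = Q.sum(axis=1)                          # s_k = sum_j Q_kj, assumed > 0
    if alpha == 1.0:
        with np.errstate(divide='ignore'):
            t = np.where(Q > 0, Q * np.log(m * Q / s[:, None]), 0.0)
        return float(np.max(t.sum(axis=0)))    # = log m + max_j sum_k Q_kj log(Q_kj/s_k)
    inner = (alpha * (s[:, None] / m) ** (alpha - 1.0) * Q - Q ** alpha) / (1.0 - alpha)
    return float(np.sum((s / m) ** alpha) + np.max(inner.sum(axis=0)))

Q = np.array([[0.9, 0.1],
              [0.1, 0.9]])
print(upper_bound_alpha(Q, 1.0))   # here s_k = 1, so U_1 = log 2 - H(0.1): the bound is tight
```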

An interesting special case in which the upper bound is attained is that of a symmetric channel, i.e., one whose columns all have the same set of entries, and likewise its rows, up to permutations. In that case we have

$$\sum_{j=1}^{m} Q_{kj} = \text{const.} = \delta \quad \text{for all } k.$$

Theorem 4.2. If the channel is symmetric, then for any $\phi \in \Phi$, $C_\phi$ is equal to the upper bound, i.e.,

$$C_\phi = -n\,\phi\Big(\frac{1}{n}\Big) + \frac{1}{m} \sum_{j,k} \phi(Q_{kj}). \tag{4.3}$$

Proof: From the primal formulation of $C_\phi$ (see (3.4)), a lower bound is given by

$$C_\phi \ge -\sum_{k=1}^{n} \phi(q_k) + \sum_{j=1}^{m} p_j \sum_{k=1}^{n} \phi(Q_{kj}) \tag{4.4}$$

for any $(p,q) \in P_m \times P_n$ satisfying

$$\sum_{j} p_j Q_{kj} = q_k, \quad k = 1,\ldots,n. \tag{4.5}$$

Since the channel is symmetric, $\sum_j Q_{kj} = \delta$ for all $k$, and this implies that $m = n\delta$. Thus $p_j^* = \frac{1}{m}$ and $q_k^* = \frac{1}{n}$ satisfy (4.5). Substituting $(p^*,q^*)$ in (4.4) we obtain the lower bound

$$C_\phi \ge -n\,\phi\Big(\frac{1}{n}\Big) + \frac{1}{m} \sum_{j,k} \phi(Q_{kj}). \tag{4.6}$$

From Theorem 4.1, using the fact that $m = n\delta$ (so that $\bar q_k = \frac{1}{n}$ for all $k$), we have the upper bound

$$C_\phi \le -n\,\phi\Big(\frac{1}{n}\Big) + \max_{1 \le j \le m} \sum_{k} \phi(Q_{kj}). \tag{4.7}$$

Since the channel is symmetric, it has the same set of entries in each column; thus $\sum_k \phi(Q_{kj})$ is constant in $j$, and hence

$$\max_{1 \le j \le m} \sum_{k} \phi(Q_{kj}) = \frac{1}{m} \sum_{j,k} \phi(Q_{kj}). \tag{4.8}$$

Therefore, substituting (4.8) in (4.7), we see that the upper bound for $C_\phi$ coincides with the lower bound given in (4.6). $\square$

Example 4.2. Consider the Binary Symmetric Channel (BSC) defined by

$$Q = \begin{pmatrix} 1-\beta & \beta \\ \beta & 1-\beta \end{pmatrix}.$$

Using Theorem 4.2 we obtain, for any $\phi \in \Phi$,

$$C_{BSC} = -2\,\phi\Big(\frac{1}{2}\Big) + \phi(1-\beta) + \phi(\beta).$$
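The closed form (4.3) is immediate to evaluate. The sketch below (ours) checks the BSC formula just derived, including the classical specialization given next:

```python
import numpy as np

def symmetric_capacity(Q, phi):
    """C_phi for a symmetric channel via (4.3): -n phi(1/n) + (1/m) sum_{j,k} phi(Q_kj)."""
    n, m = Q.shape
    return float(-n * phi(1.0 / n) + phi(Q).sum() / m)

phi1 = lambda t: np.where(t > 0, t * np.log(np.where(t > 0, t, 1.0)), 0.0)  # t log t
beta = 0.1
Q = np.array([[1 - beta, beta],
              [beta, 1 - beta]])
print(symmetric_capacity(Q, phi1))   # = -2 phi(1/2) + phi(1-beta) + phi(beta) ~ 0.3681
```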

In particular, for $\phi(t) = t \log t$ we get the well-known result (see e.g. [15], [16]):

$$C_{BSC} = \log 2 + (1-\beta)\log(1-\beta) + \beta \log \beta.$$

5. CONCLUSIONS AND EXTENSIONS

A new formulation of the channel capacity problem has been obtained by using the duality theory of convex programming. This new dual representation seems useful for computational purposes and for the derivation of bounds. Furthermore, the results in the paper demonstrate that the new information measure of Burbea and Rao [10] can be used successfully to develop a generalized channel capacity theory.

Finally, we remark that our duality framework can easily be extended to the multiple-constrained channel capacity problem (i.e., one which includes additional linear inequality constraints on the input probability $p$; see [5]) to produce a simple dual formulation. Also, at the price of some additional technicalities, the continuous alphabet channel problem [7], [15] can be treated via a duality theory for infinite-dimensional optimization problems, to obtain the continuous version of Theorem 3.1 and its corollary.

References

[1] Aczel, J. (1984). "Measuring information beyond communication theory - Why some generalized information measures may be useful, others not", Aequationes Mathematicae, 27, pp. 1-19.

[2] Arimoto, S. (1972). "An algorithm for computing the capacity of arbitrary discrete memoryless channels", IEEE Trans. on Information Theory, Vol. IT-18, pp. 14-20.

[3] Arimoto, S. (1977). "Information measures and capacity of order α for discrete memoryless channels", in Topics in Information Theory, edited by I. Csiszar and P. Elias, North-Holland Publishing Company, Amsterdam, pp. 41-52.

[4] Armstrong, R.D. and Godfrey, J.R. (1979). "Two linear programming algorithms for the discrete ℓ1 norm problem", Mathematics of Computation, 33, pp. 289-300.

[5] Barrodale, I. and Roberts, F.D.K. (1973). "An improved algorithm for discrete ℓ1 linear approximation", SIAM J. Numer. Anal., 10, pp. 839-848.

[6] Ben-Tal, A. and Teboulle, M. (1985). "Rate distortion theory with generalized information measures via convex programming duality", to appear in IEEE Trans. on Information Theory.

[7] Blahut, R.E. (1972). "Computation of channel capacity and rate distortion functions", IEEE Trans. on Information Theory, Vol. IT-18, pp. 460-473.

[8] Burbea, J. (1983). "J-divergence and related topics", Encycl. Statist. Sci., 4, pp. 290-296, Wiley, New York.

[9] Burbea, J. (1984). "The Bose-Einstein entropy of degree α and its Jensen difference", Utilitas Mathematica, 25, pp. 225-240.

[10] Burbea, J. and Rao, C.R. (1982). "On the convexity of some divergence measures based on entropy functions", IEEE Trans. on Information Theory, Vol. IT-28, pp. 489-495.

[11] Burbea, J. and Rao, C.R. (1982). "Entropy differential metric, distance and divergence measures in probability spaces - a unified approach", J. of Multivariate Analysis, 12, pp. 575-596.

[12] Cheng, M.C. (1979). "On the computation of capacity of a discrete memoryless channel", Inform. Control, 24, pp. 292-298.

[13] Csiszar, I. (1967). "Information-type measures of difference of probability distributions and indirect observations", Studia Sci. Math. Hungar., 2, pp. 299-318.

[14] Csiszar, I. and Korner, J. (1981). Information Theory: Coding Theorems for Discrete Memoryless Systems, Academic Press, New York.

[15] Gallager, R.G. (1968). Information Theory and Reliable Communication, Wiley, New York.

[16] Jelinek, F. (1968). Probabilistic Information Theory, McGraw-Hill, New York.

[17] Kiwiel, K.C. (1985). "Methods of descent for nondifferentiable optimization", Lecture Notes in Math., No. 1133, Springer-Verlag, Berlin.

[18] Lemarechal, C. (1975). "An extension of Davidon methods to nondifferentiable problems", Math. Progr. Study, 3, pp. 95-109.

[19] Lemarechal, C. (1980). "Nondifferentiable optimization", in Dixon, Spedicato and Szego (eds.), Nonlinear Optimization: Theory and Algorithms, Birkhauser, Boston.

[20] Meister, B. and Oettli, W. (1967). "On the capacity of a discrete constant channel", Inform. Control, 11, pp. 341-351.

[21] Muroga, S. (1953). "On the capacity of a discrete channel", J. Phys. Soc. Japan, 8, pp. 484-494.

[22] Rao, C.R. and Nayak, T.K. (1985). "Cross entropy, dissimilarity measures, and characterizations of quadratic entropy", IEEE Trans. on Information Theory, 31, pp. 589-593.

[23] Renyi, A. (1961). "On measures of entropy and information", in Proc. Fourth Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, Berkeley, Vol. 1, pp. 547-561.

[24] Rockafellar, R.T. (1970). Convex Analysis, Princeton University Press, Princeton, N.J.

[25] Rockafellar, R.T. (1974). Conjugate Duality and Optimization, Regional Conference Series in Applied Mathematics, No. 16, SIAM, Philadelphia.

[26] Shannon, C.E. (1948). "A mathematical theory of communication", Bell Syst. Tech. Journal, Vol. 27, pp. 379-423 and pp. 623-656.

[27] Schittkowski, K. (1980). Nonlinear Programming Codes, Lecture Notes in Economics and Mathematical Systems, Vol. 183, Springer-Verlag, Berlin.

[28] Takano, S. (1975). "On a method of calculating the capacity of a discrete memoryless channel", Inform. Control, 29, pp. 327-336.

[29] Wolfe, P. (1975). "A method of conjugate subgradients for minimizing nondifferentiable convex functions", Math. Program. Study, 3, pp. 145-173.

[30] Ziv, J. and Zakai, M. (1973). "On functionals satisfying a data-processing theorem", IEEE Trans. on Information Theory, IT-19, pp. 275-283.