Division of Research
Graduate School of Business Administration
The University of Michigan

EXPLORATORY ANALYSIS OF MULTIVARIATE DATA FOR BUSINESS RESEARCH

Working Paper No. 130

by

Roger L. Wright and W. Allen Spivey
The University of Michigan

May 1976

(c) 1976 by The University of Michigan

FOR DISCUSSION PURPOSES ONLY
None of this material is to be quoted or reproduced without the express permission of the Division of Research.

TABLE OF CONTENTS

Introduction
Multiple Regression Analysis
Graphical Analysis of Residuals
Selecting Explanatory Variables
Transforming Variables
Hypothesis Testing in Data Analysis
Appendices
  A. Absenteeism Data Base
  B. Absenteeism Data Base Documentation

Introduction

In an enormous variety of business research problems one is naturally interested in analyzing relationships between two or more variables. If data on such variables are available, one has a multivariate data set, and statistical analysis provides many tools for analyzing such data. Perhaps the most important analytic tool is multiple regression analysis, but it is often difficult to comprehend the full power of multiple regression in a conventional discussion of the subject. One is usually presented with a situation in which a regression problem has already been formulated (the dependent and explanatory variables having been selected) and one merely uses the techniques that are introduced to perform the necessary final calculations. Such simplifications obscure the important fact that much of statistical analysis is a multistage process of trial and error, that a good deal of exploratory work must be done to select appropriate variables for study and to determine relationships between them, and that a variety of statistical tests and other procedures must be performed and sound judgments made before one arrives at a satisfactory choice of dependent and explanatory variables.

In these notes we place much emphasis on using multiple regression analysis in conjunction with graphical techniques, on methods of selecting variables and constructing new variables by means of transformations, and on the use of statistical tests as a guide in model building. Although each of these methods makes a unique contribution to the analysis, the combination of them, used in an interactive mode, provides a powerful means of exploring, analyzing, and summarizing useful relationships in the data.

Access to a computer is, of course, essential for effective application of these concepts. If appropriate programs are readily available, the computer can execute the necessary graphical and statistical analyses quickly and inexpensively. With its storage devices the computer can provide convenient access to the large data sets often encountered in business research and can also retain new variables constructed in the course of an investigation. Moreover, the computer relieves the researcher of the need to involve himself personally in the internal algebraic complexities of the methods used, so that he can concentrate on the applied aspects of his investigation.

How can one learn to use these methods effectively? As in the use of many other statistical

concepts, one develops an understanding of important technical aspects of each method and one practices and gains experience by applying these methods in realistic and challenging situations, using data of the kind encountered on the job and, of course, using appropriate computer programs. Experience has shown that each of these methods has some characteristics which the user must keep in mind and a multitude of additional properties which, although interesting to the statistician, may not be particularly useful to the business researcher. In studying these methods it will therefore be appropriate for us to focus our attention on their especially useful characteristics and, whenever possible, to minimize discussion of mathematically interesting but inessential details.

Several of the sections that follow describe the basic methods of statistical data analysis that will be especially useful to us: multiple regression, graphical analysis of residuals or errors, methods of selecting variables, transformation of variables, and testing of hypotheses. Our discussion, although concise, comprehends many important features of these concepts, and we utilize throughout an illustrative example which is typical of a variety of problems encountered in business research.

Multiple Regression Analysis

Nearly everyone interested in business research has seen an application of multiple regression analysis and knows some of its properties. Nevertheless, a brief review in the context of a problem may be useful.

Suppose we are interested in studying absenteeism among employees of the ABX Company. The data in Table 1(a) provide three characteristics of a sample of 77 production employees in the company. The first column shows the number of occasions of absenteeism during 1975 for each of these employees. We will regard the dependent variable in this analysis to be absenteeism, as seems natural, and denote it as Y. Job complexity, denoted as X1, and employee seniority, denoted as X2, are regarded as explanatory variables. X1 is an index ranging from 0 to 100 and measures the complexity of the activities making up the job; X2 is the number of complete years that the employee has been with the company (see Table 1(b)).

The regression equation relating absenteeism of employees to level of job complexity and seniority for the data in Table 1(a) is found by an appropriate computer program to be

Table 1(a). Absenteeism, Job Complexity, and Seniority of 77 Employees
(Y = absenteeism, X1 = job complexity, X2 = seniority)

 Case   Y   X1   X2     Case   Y   X1   X2     Case   Y   X1   X2
   1    0   45    3       26   3   89   18       51   0    8    3
   2    1   76   10       27   3   21    2       52   1   45    2
   3    0   56    9       28   0   34    4       53   3   43    5
   4    2   76    7       29   2   12    6       54   6   23    1
   5    0   70   14       30   3   70    2       55   3    1    7
   6    1   69    9       31   1   69   11       56   2   82    1
   7    1   56    3       32   4   13    1       57   2    1    1
   8    1   56    1       33   2   30   13       58   4    1    1
   9    2   43    9       34   1   43    1       59   3   70    4
  10    1   76    1       35   3    8    2       60   0   76    6
  11    3   30    1       36   2   69    2       61   0   82    7
  12    2   50    9       37   4   30    1       62   1   50    9
  13    1   10    1       38   4   23    1       63   1   70    8
  14    3   69    4       39   4   16    1       64   1   81    5
  15    2   67    3       40   3   11    1       65   2   70    9
  16    0   69    4       41   2   16    1       66   3    1    2
  17    4   70    8       42   6   50    2       67   2    8    1
  18    7   13    1       43   3   50    2       68   2   23    2
  19    3   16    3       44   1   69    4       69   2   21   12
  20    2   52    5       45   2   10    2       70   2   82    7
  21    2   52   16       46   1   43   26       71   1   67   28
  22    4    3    2       47   1   12    1       72   0   81   18
  23    2    6    4       48   3   76    5       73   1   43    6
  24    0   67    6       49   2   56    2       74   4    6    3
  25    3   10    1       50   0    6    8       75   3   13    8
  76    2   52    7       77   3   52    1

SOURCE: Computer simulation by one of the authors.

Table 1(b). Description of Variables for Which Data Are Shown in Table 1(a)

Variable Name     Description

Absenteeism       The number of distinct occasions that the worker was absent
                  during 1975. Each occasion consists of one or more
                  consecutive days of absence.

Job Complexity    An index ranging from zero to one hundred, measured
                  according to procedures developed by Turner and Lawrence.*

Seniority         Number of complete years with the company on
                  December 31, 1975.

*Arthur N. Turner and Paul R. Lawrence, Industrial Jobs and the Worker (Boston: Harvard University Press, 1965).

(1)    \hat{Y} = 3.07 - .015 X_1 - .063 X_2 .

This equation can be interpreted as providing an estimate of mean absenteeism for a given level of job complexity and seniority. Moreover, if seniority is held fixed, the equation shows that mean absenteeism tends to fall by .015 for each unit increase in job complexity. Also, if job complexity is held fixed, it shows that mean absenteeism tends to fall by .063 for each unit increase in seniority. It is clear that such information provides a useful summary of the data.

How does the computer determine a regression equation of the form (1)? First, let us consider a more general equation of which (1) is a special case,

(2)    \hat{Y} = b_0 + b_1 X_1 + b_2 X_2 .

The i-th of the n observations (n = 77 in the example) can be denoted as Y_i, X_{i1}, X_{i2}, i = 1, ..., n. The residual e_i of the i-th observation is defined as

(3a)    e_i = Y_i - \hat{Y}_i
(3b)        = Y_i - (b_0 + b_1 X_{i1} + b_2 X_{i2}) .

Table 2 illustrates residuals determined by equation (1).

Table 2. Construction of Selected Residuals from the Regression Equation Ŷ = 3.07 - .015X1 - .063X2

 Case   X1   X2   Y     Ŷ       e
   1    45    3   0    2.21   -2.21
   2    76   10   1    1.30   -0.30
   3    56    9   0    1.66   -1.66
   4    76    7   2    1.49    0.51
   5    70   14   0    1.14   -1.14
   6    69    9   1    1.47   -0.47
   7    56    3   1    2.04   -1.04
   8    56    1   1    2.17   -1.17
   9    43    9   2    1.86    0.14
  10    76    1   1    1.87   -0.87
   .     .    .   .      .       .
  76    52    7   2    1.85    0.15
  77    52    3   1    2.23   -1.23
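For a modern reader, the arithmetic behind Table 2 can be sketched in a few lines of Python. The sketch below is illustrative only and is not part of the original discussion; it assumes the numpy library, uses the first five cases of Table 1(a), and takes the coefficients of equation (1) as given. The final lines indicate how the least squares coefficients themselves could be obtained from the full 77-case data set.

```python
import numpy as np

# First five cases of Table 1(a): job complexity X1, seniority X2, absenteeism Y.
X1 = np.array([45, 76, 56, 76, 70], dtype=float)
X2 = np.array([3, 10, 9, 7, 14], dtype=float)
Y = np.array([0, 1, 0, 2, 0], dtype=float)

# Fitted values and residuals from equation (1): Y-hat = 3.07 - .015 X1 - .063 X2.
Y_hat = 3.07 - 0.015 * X1 - 0.063 * X2
e = Y - Y_hat
print(np.round(Y_hat, 2))   # 2.21, 1.30, 1.66, 1.49, 1.14 -- as in Table 2
print(np.round(e, 2))       # -2.21, -0.30, -1.66, 0.51, -1.14

# With all 77 cases, the coefficients b0, b1, b2 of equation (2) could be found
# directly; lstsq chooses them to minimize the sum of squared residuals.
A = np.column_stack([np.ones_like(X1), X1, X2])
b, *_ = np.linalg.lstsq(A, Y, rcond=None)
```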

The regression coefficients b0, b1, and b2 are chosen so as to minimize the sum of squares of these residuals, denoted SSE,

(4)    SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 .

The residuals (3a) and the sum of their squares (4) are extremely important entities in regression analysis. They are the basis for the well-known least squares criterion, and in this context they are considered to be functions of the regression coefficients b0, b1, and b2. When the values of these coefficients have been determined from the data, both the residuals e_i and SSE become fixed numbers and can be used to supplement the regression equation in summarizing the data. Various steps will be introduced below to make both the residuals and their sums of squares more meaningful for this purpose.

The sum of squared residuals is the basis of two important statistical measures of the discrepancy between the observed values of Y and the corresponding values of Ŷ determined by the fitted regression equation: the standard error of the regression equation and the multiple correlation coefficient.

The standard error of the linear regression equation, which we denote by s, is the square root of the sum of squared residuals adjusted appropriately (or divided) by the number of degrees of freedom. The latter is calculated by subtracting the number of regression coefficients in (1) from n, the total number of observations. We thus write

(5)    s = \sqrt{\frac{SSE}{n-3}} = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n-3}} ,

and s equals 1.36 in the absenteeism example. Under appropriate conditions, about 95 percent of the residuals will be smaller than 2s in absolute value, which means that for the corresponding observations Y will be within 2s of Ŷ.

The multiple correlation coefficient R can be thought of as the square root of the "fraction of the variation in Y (from its mean) explained by the linear regression equation." It is obtained by comparing the sum of squared residuals SSE to another quantity, the total sum of squares, SST. The total sum of squares is the sum of the squared deviations of Y about the mean \bar{Y},

(6)    SST = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 .
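As a small illustration (again in Python with numpy, and not part of the original text), the standard error s and the 2s rule of thumb can be computed directly from an array of residuals; the array e is assumed to hold the residuals for all 77 cases.

```python
import numpy as np

def standard_error(e, n_coefficients=3):
    """s = sqrt(SSE / (n - number of regression coefficients)), as in (5)."""
    sse = np.sum(e ** 2)
    return np.sqrt(sse / (len(e) - n_coefficients))

# For the absenteeism data the result is about 1.36, and roughly 95 percent of
# the residuals should satisfy |e| < 2s:
# s = standard_error(e)
# share_within_band = np.mean(np.abs(e) < 2 * s)
```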

The multiple correlation coefficient can be defined by the equation

(7)    R = \sqrt{\frac{SST - SSE}{SST}} .

When there is only one explanatory variable in the regression analysis, we call this simple regression, and the corresponding correlation coefficient is called a simple correlation. In this case the correlation coefficient is given the sign of the regression coefficient of the explanatory variable. It is useful to observe that the multiple correlation coefficient of Y with X1 and X2 is equal to the simple correlation coefficient of Y with Ŷ.

These ideas can be extended easily to the more general case of regression analysis involving p variables, X1, X2, ..., X_{p-1}, and Y. The regression equation then has the form

(8)    \hat{Y} = b_0 + b_1 X_1 + \cdots + b_{p-1} X_{p-1} .

The residuals are defined by (3a) and form the basis for the least squares criterion as well as for the definitions of the standard error of the regression equation and the multiple correlation coefficient. The only modification

required in equations (4) through (7) is appropriate adjustment of the number of degrees of freedom in the definition of s. Because there are p coefficients in the linear regression equation (8), the number of degrees of freedom is n - p, so that

(9)    s = \sqrt{\frac{SSE}{n-p}} .

Graphical Analysis of Residuals

Graphical examination of the data provides an effective means of assuring that the regression equation and its standard error and correlation coefficient summarize the data adequately. Without the use of graphs one runs a risk of being seriously misled by a linear regression analysis. Indeed, graphical examination of the relationship between the data and the regression equation should be regarded as essential and integral to regression analysis.

In the case of simple regression involving two variables X and Y, the scatterplot is the central tool. In addition to plotting the sample data points (X, Y), it is helpful to graph the regression line Y = a + bX as well as the error bands determined by the two lines Y = a + bX ± 2s, as is shown in Figure 1(a), where the linear regression seems appropriate for the data.

[Fig. 1(a)-(e). Various illustrative scatterplots]
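A scatterplot of the kind shown in Figure 1(a) is easy to produce with standard plotting software. The Python sketch below uses numpy and matplotlib (neither of which is part of the original discussion) and synthetic data chosen only for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic (X, Y) data for illustration.
rng = np.random.default_rng(0)
x = rng.uniform(0, 100, 50)
y = 3.0 - 0.02 * x + rng.normal(0.0, 0.8, 50)

# Fit the simple regression line Y = a + bX and compute s with n - 2 degrees of freedom.
b, a = np.polyfit(x, y, 1)                       # slope, intercept
s = np.sqrt(np.sum((y - (a + b * x)) ** 2) / (len(y) - 2))

# Plot the data, the regression line, and the error bands a + bX +/- 2s.
grid = np.linspace(x.min(), x.max(), 100)
plt.scatter(x, y, s=12)
plt.plot(grid, a + b * grid, label="Y = a + bX")
plt.plot(grid, a + b * grid + 2 * s, "--", label="+2s")
plt.plot(grid, a + b * grid - 2 * s, "--", label="-2s")
plt.legend()
plt.show()
```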

Recall that in linear regression we assume that for any given value of X, the observed errors are a sample from a normally distributed (conditional) population with mean or expected value 0 and with variance σ², which does not vary with the given value of X. This implies that the error band Ŷ ± 2s should contain approximately 95 percent of all the data points and that for any given X approximately 95 percent of the corresponding points are also within the band.

With this background in mind, it is easy to see how the situations illustrated in Figures 1(b)-1(e), which are of the kind commonly found in practice, render the linear regression equation or its standard error misleading. Figure 1(b) shows a situation in which it appears that the variables X and Y have a nonlinear relationship. Moreover, the mean of the errors does not appear to be zero for this sample; the mean (of the conditional distribution) appears to be positive for X values at the extremes of the data and negative in the middle of the data. Figure 1(c) indicates the existence of outliers, and these suggest that the population is not normal for the corresponding values of X. In Figure 1(d) the scatterplot suggests that the distribution of errors is skewed, and Figure 1(e), it seems reasonable to believe, shows that the

variances of Y for larger values of X are not equal to those for smaller values of X. Any of these situations calls for caution in interpreting the regression analysis. In serious cases appropriate corrective steps (some of which will be discussed below) should be taken.

In the case of multiple regression it is more difficult to obtain effective graphical displays of the data. When three or more variables are involved in the analysis, several two-dimensional scatterplots are required to depict adequately the relationship between the data and the regression equation. One useful scatterplot is obtained by plotting the predicted values Ŷ and the actual values Y, i.e., the points (Ŷ, Y), as in Figure 2.* One should include on the scatterplot a graph of the line Y = Ŷ and the error bands Y = Ŷ ± 2s. Any point on the line corresponds to a situation in which Y = Ŷ; the vertical distance from the line to any point represents an error e = Y - Ŷ.

*On examination, Figure 2 indicates a possible problem in using linear regression with the absenteeism data. The errors are not a sample from a normal distribution. This occurs because the dependent variable Y is constrained to be integer valued. This situation, not uncommon in practice, causes little difficulty in the present example.

In addition to plots of Ŷ against Y, it is important to examine scatterplots describing the

[Fig. 2. Scatterplot of (Ŷ, Y) using regression equation (1) and the data of Table 1(a)]

relationship between the residuals e and each of the explanatory variables X_j, as illustrated in Figure 3. Here the points (X_j, e) are plotted as well as the line e = 0 and the band e = ± 2s. Sometimes these scatterplots are difficult to interpret, and one can be assisted by a simple but infrequently used device called a component scatterplot.

For any explanatory variable X_j included in the multiple regression equation, let Ŷ_j denote b_j X_j; then the regression can be rewritten as

(10)    \hat{Y} = b_0 + \hat{Y}_1 + \cdots + \hat{Y}_{p-1} ,

and the actual values of the dependent variable can be expressed with an error term as

(11)    Y = b_0 + \hat{Y}_1 + \cdots + \hat{Y}_{p-1} + e .

The variables Ŷ_1, ..., Ŷ_{p-1} are called the systematic components of the dependent variable Y. The component scatterplot associated with an explanatory variable X_j is then obtained by plotting the points (X_j, Ŷ_j + e), and it is also helpful to graph the line Ŷ_j = b_j X_j and the error band Ŷ_j = b_j X_j ± 2s. These are illustrated in Figure 4 for the explanatory variable X1.

[Fig. 3. Scatterplot of the regression error e with job complexity, X1]

[Fig. 4. Component scatterplot of Y, absenteeism, with job complexity, X1, for the regression equation (1)]
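A component scatterplot of the kind shown in Figure 4 can be sketched as follows (Python with numpy and matplotlib, illustrative only). The arrays X1, X2, and Y are assumed to hold the full 77-case data of Table 1(a); the coefficients of equation (1) and s = 1.36 are taken from the text.

```python
import numpy as np
import matplotlib.pyplot as plt

b0, b1, b2 = 3.07, -0.015, -0.063   # equation (1)
s = 1.36                            # standard error reported in the text

def component_scatterplot(X1, X2, Y):
    e = Y - (b0 + b1 * X1 + b2 * X2)       # residuals from equation (1)
    y1 = b1 * X1                           # systematic component of X1
    grid = np.linspace(X1.min(), X1.max(), 100)
    plt.scatter(X1, y1 + e, s=12)          # points (X1, Y1-hat + e)
    plt.plot(grid, b1 * grid)              # line Y1-hat = b1*X1
    plt.plot(grid, b1 * grid + 2 * s, "--")
    plt.plot(grid, b1 * grid - 2 * s, "--")
    plt.xlabel("Job Complexity, X1")
    plt.show()
```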

In a component scatterplot we are assuming that b_0 = 0 and that Ŷ_i = 0 for i ≠ j in (11); this enables us to examine the relationship between the dependent variable Y = Ŷ_j + e and X_j in temporary isolation. One can then use these scatterplots to detect problems such as those discussed in relation to Figures 1(b) through 1(e). For example, a component scatterplot which has an appearance like that of Figure 1(b) may indicate a nonlinear association between the variable Y and the explanatory variable X_j, and a component scatterplot like that of Figure 1(e) would suggest that the error variances for various values of X_j are not equal. When the existence of such problems is detected, it may be possible to correct for them by means of appropriate transformations of variables, as discussed below, or by using weighted regression. It is important to remember that none of the statistics provided by the usual tabular regression output are indicators of the presence of these kinds of problems and that graphical examination of the sort discussed above is essential.

Selecting Explanatory Variables

In developing a regression analysis of cross-section data, the dependent variable is often not

difficult to choose. In many cases it is identified early in the study and usually remains fixed throughout the analysis. Although in a complex investigation there may be several dependent variables of interest, each is usually analyzed separately and serves as the center of attention in its own part of the project. The problem of selecting explanatory variables is more difficult and typically occupies the researcher throughout most of his project.

In the formulation stage of the typical project, the researcher builds up his personal understanding of the problem under study by reviewing relevant literature, discussing the problem with experienced people, extending his own direct observation of it, and assimilating underlying theory. Even at this early stage he is seeking factors that may help explain important features of the problem. Out of this work he chooses the dependent variable and a set of candidates for explanatory variables. An important part of this process is the investigation and development of appropriate measurement techniques. At the conclusion of this stage a sample of individual entities is often selected and a measurement of the dependent variable and each of the candidates for explanatory variables is obtained for each individual in the sample. These data are then transcribed and stored in the

computer files which comprise the data base for the study. Appendix A presents a small data base of this type for the absenteeism study, in which data on other candidate explanatory variables besides job complexity and seniority are contained.

Working with such data, the researcher uses statistical methods, especially regression analysis, to investigate relationships between the dependent variable and the candidate explanatory variables. He may begin by inspecting a table of simple correlation coefficients which measure the association between each pair of the variables (Table 3). He may then wish to obtain and evaluate a large number of multiple regression equations and scatterplots involving the dependent variable and various subsets of the candidate explanatory variables. In most cases it will be impractical to examine all possible multiple regression equations, so his selection must utilize his understanding of the problem at hand in addition to various statistical aids.

Typically, one of the researcher's goals is to select a regression equation which yields a high multiple correlation coefficient but which utilizes only a few, carefully chosen, and well-understood explanatory variables. He wants a high multiple correlation coefficient, because this measures the association between the dependent variable Y and the corresponding variable Ŷ determined by the regression (or between the dependent variable and the explanatory variables jointly).

Table 3. Correlation Matrix for Absenteeism Study

                         Absen-   Job      Base    Foreman   Senior-           No. of
                         teeism   Compl.   Pay     Satisf.   ity       Age     Depend.
Absenteeism               1.00    -.36     -.23     -.19     -.34      -.31     -.05
Job Complexity            -.36    1.00      .50     -.25      .37       .28     -.08
Base Pay                  -.23     .50     1.00     -.02      .49       .33      .06
Foreman Satisfaction      -.19    -.25     -.02     1.00     -.01       .20      .16
Seniority                 -.34     .37      .49     -.01     1.00       .75      .15
Age                       -.31     .28      .33      .20      .75      1.00      .15
Number of Dependents      -.05    -.08      .06      .16      .15       .15     1.00
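A table such as Table 3 is a routine computation. As an illustration (not part of the original text), the following Python fragment assumes that data is a 77-by-7 numpy array whose columns are the seven variables in the order shown above.

```python
import numpy as np

def correlation_table(data):
    """Matrix of simple correlation coefficients, rounded as in Table 3."""
    return np.round(np.corrcoef(data, rowvar=False), 2)
```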

Roughly speaking, the higher the multiple correlation coefficient, the more useful is the regression equation. On the other hand, the researcher needs to limit the number of explanatory variables chosen from among the candidates, because including too many variables in the regression equation complicates interpretation and application of the regression equation, reduces its statistical reliability, and, of course, increases the cost of data collection and manipulation.

A useful guide in selecting explanatory variables from the candidates is their partial correlation coefficients with the dependent variable. To define this statistic, suppose that the regression equation relating Y to the explanatory variables X_1, X_2, ..., X_{p-1} has a multiple correlation coefficient R_1 and that a second regression equation relating Y to these same variables together with the additional variable X_p has a multiple correlation coefficient R_2. The partial correlation coefficient of X_p with Y, adjusted for the variables X_1, X_2, ..., X_{p-1}, is defined to be

    r = \pm \sqrt{\frac{R_2^2 - R_1^2}{1 - R_1^2}} ,

with the sign taken to be the same as the sign of b_p in the second regression equation.
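The computation mirrors the formula just given. The Python sketch below (illustrative only, with numpy assumed) takes the two multiple correlation coefficients and the sign of the coefficient of X_p in the second regression.

```python
import numpy as np

def partial_correlation(R1, R2, sign_of_bp):
    """Partial correlation of Xp with Y, adjusted for X1, ..., X(p-1)."""
    r = np.sqrt((R2 ** 2 - R1 ** 2) / (1.0 - R1 ** 2))
    return r if sign_of_bp >= 0 else -r
```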

In selecting variables the absolute value of the partial correlation coefficient is used: the greater the partial correlation coefficient in absolute value, the greater is the increase in the multiple correlation coefficient obtained by appending X_p to the explanatory variables already included in the regression equation.

Partial correlation coefficients can also provide insights into the effect of removing one of the explanatory variables already in a multiple regression equation. Most regression computer programs include as part of their output the partial correlation coefficient of each individual variable with the dependent variable, adjusted for all other explanatory variables included in the regression equation. The variable whose removal from the regression equation will cause the smallest decrease in the multiple correlation coefficient is the one having the smallest (absolute) partial correlation coefficient.

There are several computer programs for automatically selecting explanatory variables from among specified candidates using partial correlation coefficients calculated from the data base of the study. Three that are commonly used are forward selection,

stepwise regression, and backward elimination. The forward selection program proceeds in an iterative fashion by first selecting a single explanatory variable from those specified by the researcher, then appending a second explanatory variable, and so on. The first variable that is selected is the candidate having the greatest (absolute) simple correlation coefficient with the dependent variable. In each subsequent step the forward selection program chooses for inclusion the candidate variable having the greatest (absolute) partial correlation coefficient with the dependent variable, adjusted for the other explanatory variables already selected. Selection continues as long as a variable can be found having a sufficiently large partial correlation coefficient.

Stepwise regression proceeds in a manner similar to that of the forward selection procedure but with one important difference. At each step beyond the first, after the new variable is appended, the program re-examines all explanatory variables currently in the regression equation to determine if any can be removed without unduly decreasing the value of the multiple correlation coefficient. This is accomplished by evaluating the partial correlation coefficient of each included explanatory variable with the dependent variable,

adjusted for all other explanatory variables included in the equation. The explanatory variable with the smallest (absolute) partial correlation coefficient is removed, provided its partial correlation coefficient is sufficiently close to zero. Consequently, the final result of stepwise regression is less dependent upon the early steps of the process than is the case with the forward selection program.

The other principal selection process is backward elimination. This procedure begins with a multiple regression equation which includes all of the candidate explanatory variables and then removes explanatory variables one at a time, using the same criterion as that applied in the deletion stage of stepwise regression. Although there seems to be wide agreement that stepwise regression is preferred over forward selection, there is no general rule to determine whether stepwise regression or backward elimination is preferable. Many researchers try both, compare the results, and engage in further experimentation when the two differ.

There is general agreement that no automatic selection procedure should be used uncritically. None of these procedures will always arrive at the best selection from among candidate variables in terms of the highest possible multiple correlation coefficient for a fixed number of candidate variables.
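A bare-bones version of forward selection can be sketched as follows (Python with numpy, illustrative only and not one of the programs discussed in the text). At each step the candidate whose inclusion most increases the multiple correlation coefficient (equivalently, the candidate with the largest absolute partial correlation) is appended. The stopping rule here is a fixed cutoff on the gain in R-squared; the programs described in the text stop instead when no remaining candidate has a significant coefficient, a criterion discussed in the final section.

```python
import numpy as np

def r_squared(columns, y):
    """Multiple R-squared of y regressed (with an intercept) on the given columns."""
    A = np.column_stack([np.ones(len(y))] + list(columns))
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    e = y - A @ b
    return 1.0 - np.sum(e ** 2) / np.sum((y - y.mean()) ** 2)

def forward_selection(candidates, y, min_gain=0.01):
    """candidates: dict mapping variable names to 1-D data arrays."""
    selected, current = [], 0.0
    while True:
        best_name, best_r2 = None, current
        for name, column in candidates.items():
            if name in selected:
                continue
            r2 = r_squared([candidates[s] for s in selected] + [column], y)
            if r2 > best_r2:
                best_name, best_r2 = name, r2
        if best_name is None or best_r2 - current < min_gain:
            return selected
        selected.append(best_name)
        current = best_r2
```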

More importantly, they may fail to identify the regression equation most consistent with the researcher's understanding of the problem. But, used with care, scepticism, and willingness to experiment further, they can be effective tools. In any case they are no replacement for the graphical methods of examination described in the preceding section, which should be used in conjunction with them.

Transforming Variables

One way of greatly extending the capabilities of linear regression analysis is to make use of nonlinear transformations of variables. A strategy for choosing such transformations is evident from a simple example. Consider the data in the scatterplot of Figure 5; no straight line given by the equation

(12)    \hat{Y} = b_0 + b_1 X

can summarize these data adequately. The scatterplot suggests that as X increases, Y tends to increase but at a diminishing rate. Among the many equations which

[Fig. 5. A scatterplot of an apparently nonlinear relationship]

summarize data having this property, some of the simplest are of the form

(13)    \hat{Y} = b_0 + b_1 (1/X) .

How can one choose b_0 and b_1 in (13)? The approach developed earlier for handling equation (12) can be applied to (13) with very little modification. It is natural to define the sum of squared errors of (13) to be

    SSE = \sum_{i=1}^{n} (Y_i - b_0 - b_1/X_i)^2

and to require that b_0 and b_1 minimize SSE as before. It might appear that a new computational procedure is needed to find the values of b_0 and b_1, but it turns out that this is not so. All that is necessary is to proceed with a new variable X* given by

(14)    X^* = 1/X .

If SSE is rewritten in terms of X* as

    SSE = \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i^*)^2 ,

it is clear that the desired values of b_0 and b_1 can be

found by obtaining the linear regression equation relating Y to X*,

    \hat{Y} = b_0 + b_1 X^* .

One can therefore fit a nonlinear equation to the data following the least squares criterion by first transforming the variable X to the new variable X* using (14) and then developing the ordinary linear regression equation relating Y to X*.

Operationally, the actual values of X* can be either computed and stored as values of a new variable in the data base or temporarily computed when needed by the regression program. In the first case, the researcher utilizes a separate program for computing the transformed variables required and then uses a standard multiple regression program with these new variables. In the second case, the researcher uses a single program incorporating transformations and regression analysis; he specifies the variables to be included in the regression equation and any preliminary transformation of these variables that is required.

In either mode of operation, many transformations in addition to the reciprocal transformation (14) can be employed. Three simple classes of transformations are commonly used in business research:

(a) reciprocals: X* = 1/X;
(b) powers: X*_1 = X, X*_2 = X^2, X*_3 = X^3, etc.;
(c) logarithms: X* = log(X).

Figure 6 shows some of the equations that can be fitted by combining linear regression with these transformations. The reciprocal transformation has already been discussed. A polynomial

    \hat{Y} = b_0 + b_1 X + b_2 X^2 + \cdots + b_{p-1} X^{p-1}

of any degree p-1 can be fitted by the use of power transformations together with multiple regression, although polynomials of degree two or three suffice in many cases. The logarithm transformation is often useful in dealing with a variable whose values are all greater than zero.

These transformations can also be applied to several different explanatory variables. For example, one could compute a regression equation of the form

(15)    \hat{Y} = b_0 + b_1 (1/X_1) + b_2 X_2 + b_3 X_2^2 .

The main difficulty is that graphs cannot be so easily prepared in these multivariate cases. Despite this difficulty, appropriate graphs are extremely important in visualizing these multivariate, nonlinear regression equations and their relationships to data.

Fig. 6. Some Nonlinear Equations Obtainable Through Transformations

Equation                                        Alternate (Linear) Form
(1) Y = b0 + b1(1/X)   (X ≠ 0; usually X > 0)   none
(2) Y = b0 + b1X + b2X^2                        none
(3) Y = b0 + b1 log(X)   (X > 0)                none
(4) Y = a0(a1)^X   (a0 > 0, a1 > 0)             log Y = b0 + b1X   (bj = log aj, j = 0, 1)
(5) Y = aX^b   (X > 0, a > 0)                   log Y = b0 + b1 log X   (b0 = log a)

[The graph sketches in the original distinguish the cases b1 > 0 and b1 < 0 for equations (1) and (3), a1 > 1 and a1 < 1 for equation (4), and 0 < b < 1 and b < 0 (with a > 0) for equation (5).]
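In Python (numpy assumed; the fragment is illustrative and not part of the original text), fitting equation (13) amounts to nothing more than regressing Y on the transformed variable X* = 1/X, and the exponential form (4) of Fig. 6 is handled the same way after taking logarithms of Y.

```python
import numpy as np

def fit_reciprocal(x, y):
    """Fit Y = b0 + b1*(1/X) by ordinary least squares on X* = 1/X."""
    x_star = 1.0 / x
    A = np.column_stack([np.ones_like(x_star), x_star])
    (b0, b1), *_ = np.linalg.lstsq(A, y, rcond=None)
    return b0, b1

def fit_exponential(x, y):
    """Fit Y = a0*(a1**X) via log(Y) = b0 + b1*X, returning a0 and a1 by antilogs."""
    A = np.column_stack([np.ones_like(x), x])
    (b0, b1), *_ = np.linalg.lstsq(A, np.log(y), rcond=None)
    return np.exp(b0), np.exp(b1)
```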

Most of the techniques described earlier can be used with appropriate modifications, and component scatterplots are especially useful. For example, the component scatterplot associated with X_2 in (15) can be obtained by graphing the equation

    \hat{Y}_2 = b_2 X_2 + b_3 X_2^2

as well as the points (X_2, Ŷ_2 + e), where, as usual, the residual e is determined by Y - Ŷ.

Some of the equations shown in Figure 6 involve transformations of the dependent variable as well as of the explanatory variables. For example, the equation Y = a_0 (a_1)^X can be reformulated by taking natural logarithms of both sides of the equation to get

    log(Y) = b_0 + b_1 X ,

where b_j = log(a_j) for j = 0, 1. Thus b_0 and b_1 can be computed by finding the ordinary regression equation

relating the transformed dependent variable log(Y) to X. Then a_0 and a_1 in the original equation are found by means of the antilogarithms of b_0 and b_1.

Care must be taken when a transformation of the dependent variable is utilized. In interpreting such an analysis it is best to return to the original equation rather than using the one containing the transformed dependent variable. In particular, the correlation coefficient associated with the transformed equation can be misleading, and it is better to compute directly the correlation coefficient of Y with Ŷ. It should also be recognized that the least squares criterion itself is altered by a nonlinear transformation of the dependent variable. One should carefully inspect the residuals of the transformed equation to determine the suitability of the regression analysis, as discussed previously in conjunction with Figure 1.

A particularly useful type of variable is an indicator variable, sometimes called a dummy variable. Any variable having exactly two values, zero and one, is called an indicator variable. Such a variable is used to record the presence or absence of a particular characteristic or condition of each observation. A simple example is the use of an indicator variable in the answer to a yes/no question, in which the integer one indicates

yes and zero indicates no.

It is frequently useful in regression analysis to create an indicator variable by means of a transformation of some variable already included in the data base. For example, an indicator variable S1 can be established from the variable Foreman Satisfaction, denoted S and indicative of a worker's satisfaction with his foreman, which is included in the absenteeism data base shown in Appendix A. The Foreman Satisfaction variable takes on values according to the following coding:

    1 = very dissatisfied,
    2 = somewhat dissatisfied,
    3 = neither satisfied nor dissatisfied,
    4 = fairly well satisfied,
    5 = very satisfied.

An indicator variable S1 can be developed which indicates whether or not an employee is very dissatisfied with his foreman. S1 is made to take on the value one whenever S takes on the value one (the employee is very dissatisfied), and S1 is given the value zero for all other values of S. Table 4 shows the variables S and S1.

An indicator variable can, of course, be included among the explanatory variables in a multiple regression equation.

Table 4. Illustration of Indicator Variables Based on the Explanatory Variable S, Foreman Satisfaction

        Value of Foreman       Values of Indicator Variables
Case    Satisfaction, S        S1   S2   S3   S4   S5
  1           4                0    0    0    1    0
  2           4                0    0    0    1    0
  3           1                1    0    0    0    0
  4           3                0    0    1    0    0
  5           3                0    0    1    0    0
  6           3                0    0    1    0    0
  7           4                0    0    0    1    0
  8           4                0    0    0    1    0
  9           1                1    0    0    0    0
 10           3                0    0    1    0    0
  .           .                .    .    .    .    .
 76           1                1    0    0    0    0
 77           3                0    0    1    0    0

SOURCE: Appendix A
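Constructing the indicator variables of Table 4 is a one-line transformation. The Python fragment below (illustrative only) assumes S is a numpy array holding the foreman satisfaction codes 1 through 5 for the 77 cases.

```python
import numpy as np

def indicator_variables(S):
    """Return the indicator variables S1, ..., S5 of Table 4 as 0/1 arrays."""
    return {f"S{k}": (S == k).astype(int) for k in range(1, 6)}

# Only S1 through S4 would be entered in the regression equation; including all
# five leads to the singularity discussed later in this section.
```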

Standard regression computer programs can handle an indicator variable in the same manner as any other kind of explanatory variable. For example, in using the absenteeism data one can find the regression equation relating absenteeism (A) to job complexity (C), seniority (SE), and the indicator variable S1 introduced above, which indicates whether the employee is very dissatisfied with his foreman:

    \hat{A} = 3.06 - .015 C - .064 SE + .220 S_1 .

Interpretation of the coefficient of the indicator variable is straightforward. If we recall that S1 takes on the values zero and one, and note that the coefficient of S1 is .220, then whenever S1 = 1, Â from the equation above is larger by .220 than it is when S1 = 0. Thus we can say that employees having the same level of job complexity and seniority who are very dissatisfied with their foreman have an absenteeism that is .220 higher, on the average, than workers who are not very dissatisfied. It should be clear that in general the regression coefficient of an indicator variable represents the increment to Â associated with the characteristic or condition that the indicator variable represents.

It is often helpful to use several indicator

variables to represent several mutually exclusive conditions or characteristics. For example, the variable foreman satisfaction, S, records the five mutually exclusive conditions: very dissatisfied, somewhat dissatisfied, etc. The indicator variable S2 in Table 4 indicates whether or not an employee is somewhat dissatisfied; S2 takes on the value one if the employee indicates that he is somewhat dissatisfied and the value zero for any other condition. The variables S3, S4, and S5 are defined analogously. Note that only four indicator variables need be used, because the fifth condition, very satisfied, is implied whenever the four other indicator variables are each equal to zero. When the first four indicator variables are included with job complexity and seniority as explanatory variables for absenteeism, we obtain the regression equation

(16)    \hat{A} = 3.048 - .017 C - .042 SE + .175 S_1 + 1.063 S_2 - .181 S_3 - .462 S_4 .

Assistance in interpreting this regression equation can be provided by calculating the values of Â associated with the several foreman satisfaction conditions while job complexity and seniority are held fixed. Table 5 shows that the regression coefficients of the indicator variables S1, S2, S3, and S4 represent the increments to Â associated with the corresponding levels of foreman satisfaction, relative to the value of Â associated with the fifth level of foreman satisfaction.

Table 5. Values of Â from Equation (16) with C = 40 and SE = 5

Foreman Satisfaction, S                    Conditional Mean Absenteeism, Â
1, Very Dissatisfied                       2.333 = 2.158 + .175
2, Somewhat Dissatisfied                   3.221 = 2.158 + 1.063
3, Neither Satisfied nor Dissatisfied      1.977 = 2.158 - .181
4, Fairly Well Satisfied                   1.696 = 2.158 - .462
5, Very Satisfied                          2.158
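Table 5 is simply equation (16) evaluated at C = 40 and SE = 5 with the appropriate indicator variable set to one. A short Python check (illustrative only, with the coefficients taken from the text):

```python
coefficients = {"constant": 3.048, "C": -0.017, "SE": -0.042,
                "S1": 0.175, "S2": 1.063, "S3": -0.181, "S4": -0.462}

def conditional_mean_absenteeism(s_level, C=40, SE=5):
    """Value of A-hat from equation (16) for foreman satisfaction level s_level."""
    base = coefficients["constant"] + coefficients["C"] * C + coefficients["SE"] * SE
    return base + coefficients.get(f"S{s_level}", 0.0)   # level 5 adds nothing

# conditional_mean_absenteeism(1) -> 2.333, (2) -> 3.221, (5) -> 2.158, as in Table 5.
```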

Some appreciation of the advantages of using indicator variables can be gained by comparing the regression equation (16) to one relating absenteeism to job complexity, seniority, and foreman satisfaction, S, itself,

(17)    \hat{A} = 4.435 - .019 C - .055 SE - .415 S .

In (17) we see that Â decreases by .415 whenever S increases by one. This suggests that the average absenteeism of workers who are somewhat dissatisfied is .415 less than that of workers who are very dissatisfied, and that a similar comparison of workers who are very satisfied with those who are fairly well satisfied leads to the same difference in average absenteeism. Although this situation may not be realistic, it is required by the direct use of the variable S in the regression equation (17). The use of indicator variables in equation (16), on the other hand, allows a more flexible representation of the changes in average absenteeism with increasing foreman satisfaction.

In selecting indicator variables for a multiple regression equation, care must be taken to avoid a situation called singularity or unidentifiability. If one attempts to find the regression equation which relates absenteeism to the seven explanatory variables C, SE, S1, S2, S3, S4, and S5 by using a typical regression computer program, then one will get none of the usual output but only a cryptic comment such as "matrix singular" or "equation unidentified." The problem is that no unique regression equation is determined by the least squares criterion; there are in fact infinitely many different equations which fit the data equally well. This situation occurs whenever one of the explanatory variables can be written as a linear function of the other explanatory variables, and it is the case in our example, because we have

    S_5 = 1 - S_1 - S_2 - S_3 - S_4 .

The difficulty can be eliminated by excluding the redundant variable S5 from the equation. In general, whenever a set of indicator variables is used to represent a set of mutually exclusive conditions or circumstances like degrees of foreman satisfaction, one of the logically possible indicator variables should be excluded from

the regression equation.

An indicator variable can also be used as a dependent variable in a regression equation. Suppose it is company policy to review the performance of any employee having three or more occasions of absenteeism. Let A* denote an indicator variable that is one if absenteeism is three or larger and zero otherwise. The regression equation relating A* to job complexity and seniority can be found in the usual way,

    \hat{A}^* = .645 - .041 C - .019 SE .

As usual, Â* is interpreted as the conditional mean of the dependent variable A*, but in this context A* takes on only the values 0 and 1, so its mean is equal to the conditional probability that A* takes on the value 1. Therefore Â* is interpreted as the conditional probability that an employee has three or more occasions of absenteeism, given his level of job complexity and seniority. In general, when the dependent variable of a regression equation is an indicator variable, the regression equation is regarded as giving the conditional probability that the indicator variable is 1, given the levels of the explanatory variables.

Special care must be exercised in using an

indicator variable as a dependent variable. Because the indicator variable takes on only the values zero and one, the errors of the regression equation cannot be a sample from a normal distribution. Moreover, it can be shown that the variance of the errors is not constant. The usual assumptions of linear regression analysis are therefore not satisfied. However, if the sample size is sufficiently large (greater than 30 for many applications), the nonnormality of the errors causes little difficulty. If, in addition, Â* is within the interval .2 to .8 for most of the observations, then the variance will be approximately constant and ordinary regression analysis is usually satisfactory. If either of these two conditions is violated, then the methods of logit or probit analysis can be utilized.

One additional method of expanding the types of equations that can be fitted to data using regression analysis should be discussed, namely, the use of interaction variables. An interaction variable is simply the product of two candidate explanatory variables. The role of an interaction variable can be seen by comparing the following two regression equations,

(18)    \hat{A} = 3.07 - .015 C - .063 SE

and

(19)    \hat{A} = 3.41 - .023 C - .159 SE + .002 C*SE .

Equation (18) is said to be additive because Â is expressed as the sum of two components, one depending on job complexity and the other depending on seniority. Equation (19) is said to be nonadditive because of the interaction variable C*SE. Table 6 shows illustrative values of Â calculated from both the additive equation (18) and the nonadditive equation (19). In the additive equation the change in Â as one moves from SE equal to 1 to SE equal to 5 is -.25, regardless of the value of C; this is illustrated for three values of C in Table 6(a). However, under the nonadditive equation the corresponding changes in Â vary with the level of C (Table 6(b)). In general, the nonadditive equation defines a much more complex relationship between the dependent variable and the explanatory variables. Because of the complexity of nonadditive equations, it is much more difficult to obtain useful graphs of the data and their relationship to the regression equation.

All of the methods discussed in this section enhance the flexibility of regression analysis and usually require little added effort. It is very important to keep all these techniques in mind when analyzing data.

Table 6. Values of Â for Given Values of C and SE

(a) As Determined by Equation (18)

           C = 20    C = 40    C = 60
 SE = 9     2.20      1.90      1.60
 SE = 5     2.46      2.16      1.86
 SE = 1     2.71      2.41      2.11

(b) As Determined by Equation (19)

           C = 20    C = 40    C = 60
 SE = 9     1.88      1.78      1.68
 SE = 5     2.36      2.10      1.84
 SE = 1     2.83      2.41      1.99
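The entries of Table 6 come directly from equations (18) and (19); the following Python fragment (illustrative only, with numpy assumed) reproduces both panels.

```python
import numpy as np

C, SE = np.meshgrid([20, 40, 60], [9, 5, 1])   # columns C = 20, 40, 60; rows SE = 9, 5, 1

additive = 3.07 - 0.015 * C - 0.063 * SE                      # equation (18)
nonadditive = 3.41 - 0.023 * C - 0.159 * SE + 0.002 * C * SE  # equation (19)

print(np.round(additive, 2))      # Table 6(a)
print(np.round(nonadditive, 2))   # Table 6(b)
```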

However, if skillfully applied, ordinary multiple regression analysis using candidate variables directly will often produce satisfactory results.

Hypothesis Testing in Data Analysis

We introduce the principal problem addressed in this section by an example based on the absenteeism data. Consider the following two regression equations, previously shown as equations (16) and (1), respectively:

(20)    \hat{A} = 3.048 - .017 C - .042 SE + .175 S_1 + 1.063 S_2 - .181 S_3 - .462 S_4

and

(21)    \hat{A} = 3.07 - .015 C - .063 SE .

A question of obvious practical importance arises: Is the first equation better, in some sense, than the second equation? One way to approach this question is to compare the multiple correlation coefficients of the two regression equations, say R for (20) and R' for (21). If R is much larger than R', then the first of the equations would probably be preferred. In practice, however, it is often difficult to decide whether the difference between R and R' is sufficiently large to enable the researcher

to choose between the equations. In this case, for example, R = .5522 and R' = .4214, and it is not clear whether the increase of R for (20) over R' for (21) warrants the inclusion of the four additional variables.

Before continuing our discussion it will be helpful to replace equations (20) and (21) with the following more general equations:

(22)    \hat{Y} = b_0 + b_1 X_1 + \cdots + b_{p'-1} X_{p'-1} + b_{p'} X_{p'} + \cdots + b_{p-1} X_{p-1}

and

(23)    \hat{Y} = b'_0 + b'_1 X_1 + \cdots + b'_{p'-1} X_{p'-1} .

Here (22) is a regression equation involving p variables, and (23) is a regression equation involving p' variables. We call (22) the full equation and (23) the reduced version of (22). It is assumed that p > p' and that all of the explanatory variables of (23) are included in (22). For convenience we also assume that the first p'-1 explanatory variables included in (22) are the explanatory variables of (23). Thus (22) includes all the explanatory variables of (23) together with p - p' additional explanatory variables. Both of these regression equations are assumed to have been computed from data comprised of n observations of the variables. The multiple correlation

coefficients corresponding to (22) and (23) will be denoted as R and R', and the sums of the squared residuals of these two equations as SSE and SSE', respectively.

In order to make further progress, it is necessary to give a careful statement of the experimental situation assumed to underlie the data. We assume that there is some "true" relationship or model between all the variables in the full equation (22) of the form

(24)    Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_{p'-1} X_{p'-1} + \beta_{p'} X_{p'} + \cdots + \beta_{p-1} X_{p-1} + \varepsilon .

The β_0, ..., β_{p-1} are unknown numbers called parameters, and ε is a random variable which is normally distributed with mean zero and standard deviation σ. This implies that we view our data, which consist of n observations of the variables, as having been generated by an underlying process which advanced through the following steps:

(a) Either the researcher or a chance or deterministic mechanism selected the n values of the explanatory variables, and these values are observed by the researcher.

(b) The chance mechanism selected a random sample of n values of ε, drawn independently from a normal distribution having mean zero and standard deviation σ. The researcher does not observe this sample.

(c) Then n values of Y were determined from (24) using the true values of β_0, ..., β_{p-1}. These n values of Y

are observed by the researcher, but the true values of the β_j are unknown to him.

The regression equation (22), computed from the data, is viewed as an estimate of the true relationship or model (24). In other words, the computed regression coefficients b_0, ..., b_{p-1} are estimates of the corresponding parameters β_0, ..., β_{p-1}. These computed regression coefficients are determined by the underlying process; they would vary from sample to sample and are considered to be random variables having some probability distribution. Similarly, the observed multiple correlation coefficient R associated with (22) would vary from sample to sample and is also viewed as a random variable.

Using this formulation, our problem of choosing between the full equation (22) and the reduced version (23) can be approached by means of a statistical hypothesis test. If we can accept the hypothesis H_0 that each of the β_j corresponding to the explanatory variables in (22) which are not in (23) is equal to zero, then clearly (23) is the equation to be chosen. If we reject H_0, then (22) would be the preferred equation. Thus our null hypothesis H_0 is that the true coefficients β_{p'}, ..., β_{p-1} are all equal to zero, and the alternate hypothesis H_1 is that at least one of these coefficients is nonzero.

The hypothesis H_0 can be tested by calculating the F-statistic

    F = \frac{(R^2 - R'^2)/(p - p')}{(1 - R^2)/(n - p)} .

It can be shown that H_0 implies that this statistic has a known probability distribution called the F-distribution, with p - p' and n - p degrees of freedom. Tables of this distribution are readily available. The hypothesis H_0 is rejected in favor of H_1 if the value of the F-statistic for our data is larger than some value determined by the F-distribution and the chosen significance level of the test. Thus, we would prefer (22) over (23) if and only if the value of the F-statistic for the sample is large enough to cause us to reject H_0 at the chosen level of significance.

It can also be shown that the value of the F-statistic may be calculated from the sums of the squared residuals of the two regression equations, SSE and SSE':

    F = \frac{(SSE' - SSE)/(p - p')}{SSE/(n - p)} .
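For readers following along in Python, the F-statistic and its critical value can be computed as below (scipy assumed; the fragment is illustrative only, with the numerical values taken from the absenteeism example discussed next).

```python
from scipy import stats

def f_statistic(R_full, R_reduced, n, p_full, p_reduced):
    """F-statistic for the additional explanatory variables in the full equation."""
    numerator = (R_full ** 2 - R_reduced ** 2) / (p_full - p_reduced)
    denominator = (1.0 - R_full ** 2) / (n - p_full)
    return numerator / denominator

F = f_statistic(0.5522, 0.4214, n=77, p_full=7, p_reduced=3)   # about 3.21
critical_5_percent = stats.f.ppf(0.95, dfn=4, dfd=70)          # about 2.50
reject_H0 = F > critical_5_percent                             # True at the 5 percent level
```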

We return to equations (20) and (21) to illustrate this procedure. Here R = .5522, R' = .4214, p' = 3, p = 7, and n = 77. We see that the two explanatory variables in (21) are the first two explanatory variables in (20). The hypothesis H_0 in this case is

    H_0: \beta_3 = \beta_4 = \beta_5 = \beta_6 = 0 ,

and the alternate hypothesis H_1 is that at least one of the β_3, ..., β_6 is nonzero. Inspection of (20) indicates that β_3, ..., β_6 are the parameters corresponding to the four indicator variables S_1, ..., S_4. We choose the 5 percent level of significance for this test and calculate

    F = \frac{[(.5522)^2 - (.4214)^2]/4}{[1 - (.5522)^2]/70} = 3.21 .

According to tables of the F-distribution, the critical value of F for p - p' = 4 and n - p = 70 degrees of freedom is 2.50. The F-statistic of 3.21 for the sample is greater than this value, so we reject H_0 and accept H_1. Thus we conclude that the difference between R and R' is statistically significant and that the full equation (20) is preferred over the reduced version (21).

It can be seen, however, that H_0 could be accepted at the more stringent 1 percent level of significance;

the difference between R and R' is not now statistically significant, and one would prefer (21) over the full equation (20). As is often the case, the significance level chosen for a test has great influence on the decision that is subsequently made. A widely used level of significance is the 5 percent level.

This hypothesis testing procedure, called an F-test of the significance of additional explanatory variables (or an F-test for choosing between two regression equations of the type (22) and (23)), is more general than it first appears to be. A variety of other conventional F-tests can be placed in the framework above and treated as a problem involving a choice between two regression equations. For example, in fitting a polynomial to data, one can use our procedure to test the significance of one or more higher order terms. We can take the full equation and reduced version to be, respectively,

(25)    \hat{Y} = b_0 + b_1 X + b_2 X^2 + b_3 X^3

and

(26)    \hat{Y} = b'_0 + b'_1 X .

The null hypothesis is H_0: \beta_2 = \beta_3 = 0, and the alternate

hypothesis H_1 is that at least one of the parameters β_2 or β_3 is nonzero.

The significance of the coefficients of the entire set of explanatory variables included in a multiple regression equation can also be tested by using

(27)    \hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3

and

(28)    \hat{Y} = b'_0

as our pair of equations. The null hypothesis in this case is H_0: \beta_1 = \beta_2 = \beta_3 = 0, and H_1 is the alternate hypothesis that at least one of these parameter values is nonzero. In calculating the F-statistic in this case, p' = 1 and R' = 0, and we have p - 1 and n - p degrees of freedom.

One can also test the significance of the coefficient of one or more interaction variables; for example, suppose we have

(29)    \hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_1*X_2

and

(30)    \hat{Y} = b'_0 + b'_1 X_1 + b'_2 X_2 ;

it is clear that we can test H_0: \beta_3 = 0 against H_1: \beta_3 ≠ 0.

Finally, one can test the significance of the coefficient of any single explanatory variable in a regression equation as well. Suppose we have

(31)    \hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3

and

(32)    \hat{Y} = b'_0 + b'_1 X_1 + b'_3 X_3 ;

this enables us to test H_0: \beta_2 = 0 against H_1: \beta_2 ≠ 0. In this instance we have p = 4 and p' = 3, so that p - p' = 1, and we use the F-distribution for 1 and n - p degrees of freedom.

This F-test of a single coefficient is the same as the t-test for the null hypothesis β_2 = 0 against the alternative hypothesis β_2 ≠ 0, because the F-statistic for 1 and n - p degrees of freedom is the square of the t-statistic for n - p degrees of freedom. Some regression computer programs give the value of the F-statistic and others the value of a t-statistic. One can make use of the relationship F = t², perform either an F- or a t-test, and obtain identical results.

The F-test of a single coefficient, or the equivalent t-test, is used by most variable selection programs such as stepwise regression. In the forward selection

phase the explanatory variable having the greatest partial correlation coefficient with the dependent variable is appended only if its regression coefficient is significantly different from zero; otherwise the procedure stops. In the backward elimination phase, the explanatory variable with the smallest partial correlation coefficient is deleted only if its regression coefficient is not significantly different from zero.

APPENDIX A. ABSENTEEISM DATA BASE
(A = absenteeism, J = job classification, C = job complexity, P = base pay, S = foreman satisfaction, SE = seniority, AG = age, D = number of dependents; see Appendix B)

Case   A   J    C    P     S   SE   AG   D
  1    0   14   45   3.86  4    3   28   2
  2    1   22   76   5.74  4   10   42   1
  3    0   21   56   3.08  1    9   40   5
  4    2   22   76   5.74  3    7   34   2
  5    0    9   70   5.92  3   14   39   2
  6    1    7   69   4.31  3    9   44   0
  7    1   21   56   3.78  4    3   40   1
  8    1   21   56   2.70  4    1   35   0
  9    2   19   43   4.99  1    9   32   0
 10    1   22   76   3.63  3    1   41   1
 11    3   11   30   3.02  2    1   27   1
 12    2   15   50   4.88  4    9   40   0
 13    1   17   10   2.80  4    1   30   1
 14    3    7   69   4.48  2    4   35   0
 15    2   12   67   5.61  3    3   33   1
 16    0    7   69   4.44  1    4   32   1
 17    4    9   70   5.34  2    8   37   1
 18    7    1   13   4.17  2    1   26   2
 19    3   25   16   5.87  3    3   36   2
 20    2    8   52   5.39  1    5   28   2
 21    2    8   52   4.87  1   16   40   1
 22    4   24    3   4.04  2    2   26   0
 23    2   18    6   3.38  3    4   38   2
 24    0   12   67   6.42  3    6   33   1
 25    3   17   10   2.66  3    1   26   0
 26    3    6   89   7.51  3   18   48   0
 27    3   23   21   2.83  2    2   34   1
 28    0   28   34   4.27  3    4   26   1
 29    2    4   12   6.47  4    6   40   2
 30    3    9   70   4.71  2    2   34   2
 31    1    7   69   4.39  3   11   49   2
 32    4    1   13   3.77  2    1   35   1
 33    2   11   30   4.28  4   13   51   5
 34    1   19   43   3.19  2    1   25   1
 35    3    5    8   4.40  2    2   29   0
 36    2    3   69   5.03  2    2   34   2
 37    4   11   30   2.84  4    1   36   2
 38    4   16   23   2.81  2    1   31   2

APPENDIX A (continued)

Case   A   J    C    P     S   SE   AG   D
 39    4   25   16   4.57  4    1   28   2
 40    3   26   11   3.60  3    1   32   2
 41    2   25   16   3.89  3    1   30   2
 42    6   15   50   4.24  1    2   30   2
 43    3   15   50   3.61  3    2   28   0
 44    1    3   69   6.79  3    4   31   4
 45    2   17   10   3.53  3    2   34   1
 46    1   19   43   4.86  3   26   54   1
 47    1    4   12   3.93  4    1   28   0
 48    3   22   76   5.20  2    5   28   2
 49    2   21   56   3.22  3    2   35   2
 50    0   18    6   3.88  3    8   43   4
 51    0    5    8   4.73  5    3   29   4
 52    1   14   45   3.37  4    2   32   3
 53    3   19   43   3.84  3    5   31   3
 54    6   16   23   2.64  3    1   26   1
 55    3   27    1   5.18  5    7   46   1
 56    2   10   82   5.58  3    1   23   1
 57    2   27    1   3.94  3    1   20   0
 58    4   27    1   3.84  5    1   35   3
 59    3    9   70   5.80  3    4   32   2
 60    0   22   76   5.00  3    6   34   0
 61    0   10   82   7.47  3    7   38   0
 62    1   15   50   4.21  3    9   33   2
 63    1    9   70   6.56  3    8   45   1
 64    1    2   81   4.60  3    5   27   1
 65    2    9   70   5.61  3    9   33   4
 66    3   27    1   5.35  4    2   30   3
 67    2    5    8   3.51  5    1   32   2
 68    2   16   23   3.27  4    2   24   3
 69    2   23   21   3.67  4   12   47   4
 70    2   20   82   6.54  3    7   33   5
 71    1   12   67   6.82  4   28   54   3
 72    0    2   81   5.00  3   18   45   2
 73    1   19   43   5.50  3    6   40   0
 74    4   18    6   3.58  3    3   21   1
 75    3    1   13   5.44  2    8   29   4
 76    2    8   52   5.24  1    7   31   1
 77    3    8   52   3.26  3    1   27   1

Source: Computer simulation by one of the authors.

APPENDIX B. ABSENTEEISM DATA BASE DOCUMENTATION

Variable Name           Symbol   Description

Case Number             i        (Also called observation number.)

Absenteeism             A        The number of distinct occasions that the worker was absent
                                 during 1975. Each occasion consists of one or more
                                 consecutive days of absence.

Job Classification      J        An integer identifying the twenty-nine different jobs included
                                 in the study: 1 = Foundry Molder, 2 = Automatic Screw Machine
                                 Operator, 3 = Aluminum Extrusion Inspector, 4 = Warehouse Order
                                 Picker, 5 = Heavy Hydraulic Press Operator, etc.

Job Complexity          C        An index ranging from zero to one hundred, measured according
                                 to procedures developed by Turner and Lawrence.*

Base Pay                P        Base hourly pay rate ($).

Foreman Satisfaction    S        Determined by employee response to the question: "How satisfied
                                 are you with your foreman?"
                                 1 = Very dissatisfied
                                 2 = Somewhat dissatisfied
                                 3 = Neither satisfied nor dissatisfied
                                 4 = Fairly well satisfied
                                 5 = Very satisfied

Seniority               SE       Number of complete years with the company on December 31, 1975.

*Turner, Arthur N., and Lawrence, Paul R. Industrial Jobs and the Worker (Boston: Harvard University Press, 1965).

Variable Name           Symbol   Description

Age                     AG       Employee's age on December 31, 1975.

Number of Dependents    D        Determined by employee response to the question: "How many
                                 individuals other than yourself depend on you for most of
                                 their financial support?"