
Multivariate Statistical Methods
A PRIMER

'A journey of a thousand miles begins with a single step.'
Lao Tzu

Bryan F.J. Manly
Department of Mathematics and Statistics
University of Otago
New Zealand

To Jean

LONDON and NEW YORK
Chapman and Hall

First published in 1986 by Chapman and Hall Ltd, 11 New Fetter Lane, London EC4P 4EE. Published in the USA by Chapman and Hall, 29 West 35th Street, New York NY 10001. Reprinted 1988, 1989.



© 1986 Bryan F.J. Manly
Printed in Great Britain by J.W. Arrowsmith Ltd, Bristol
ISBN 0 412 28610 6 (hardback)
ISBN 0 412 28620 3 (paperback)

This title is available in both hardbound and paperback editions. The paperback edition is sold subject to the condition that it shall not, by way of trade or otherwise, be lent, resold, hired out, or otherwise circulated without the publisher's prior consent in any form of binding or cover other than that in which it is published and without a similar condition including this condition being imposed on the subsequent purchaser.

All rights reserved. No part of this book may be reprinted, or reproduced or utilized in any form or by any electronic, mechanical or other means, now known or hereafter invented, including photocopying and recording, or in any information storage and retrieval system, without permission in writing from the publisher.

British Library Cataloguing in Publication Data
Manly, Bryan F.J.
Multivariate statistical methods: a primer.
1. Multivariate analysis
I. Title
519.5'35  QA278
ISBN 0-412-28610-6
ISBN 0-412-28620-3 (pbk.)

Library of Congress Cataloging in Publication Data
Manly, Bryan F.J., 1944-
Multivariate statistical methods: a primer.
Includes bibliographies and indexes.
1. Multivariate analysis. I. Title.
QA278.M35 1986   519.5'35   86-9537
ISBN 0-412-28610-6
ISBN 0-412-28620-3 (pbk.)

Contents

Preface

1 The material of multivariate analysis
1.1 Examples of multivariate data
1.2 Preview of multivariate methods
1.3 The multivariate normal distribution
1.4 Computer programs
References

2 Matrix algebra
2.1 The need for matrix algebra
2.2 Matrices and vectors
2.3 Operations on matrices
2.4 Matrix inversion
2.5 Quadratic forms
2.6 Eigenvalues and vectors
2.7 Vectors of means and covariance matrices
2.8 Further reading
References

3 Tests of significance with multivariate data
3.1 Introduction
3.2 Comparison of mean values for two samples: single variable case
3.3 Comparison of mean values for two samples: multivariate case
3.4 Multivariate versus univariate tests
3.5 Comparison of variation for two samples: single variable case
3.6 Comparison of variation for two samples: multivariate case
3.7 Comparison of means for several samples
3.8 Comparison of variation for several samples
3.9 Computational methods
References

4 Measuring and testing multivariate distances
4.1 Multivariate distances
4.2 Distances between individual observations
4.3 Distances between populations and samples
4.4 Distances based upon proportions
4.5 The Mantel test on distance matrices
4.6 Computational methods
4.7 Further reading
References

5 Principal component analysis
5.1 Definition of principal components
5.2 Procedure for a principal component analysis
5.3 Computational methods
5.4 Further reading
References

6 Factor analysis
6.1 The factor analysis model
6.2 Procedure for a factor analysis
6.3 Principal component factor analysis
6.4 Using a factor analysis program to do principal component analysis
6.5 Options in computer programs
6.6 The value of factor analysis
6.7 Computational methods
6.8 Further reading
References

7 Discriminant function analysis
7.1 The problem of separating groups
7.2 Discrimination using Mahalanobis distances
7.3 Canonical discriminant functions
7.4 Tests of significance
7.5 Assumptions
7.6 Allowing for prior probabilities of group membership
7.7 Stepwise discriminant function analysis
7.8 Jackknife classification of individuals
7.9 Assigning of ungrouped individuals to groups
7.10 Computational methods
7.11 Further reading
References

8 Cluster analysis
8.1 Uses of cluster analysis
8.2 Types of cluster analysis
8.3 Hierarchic methods
8.4 Problems of cluster analysis
8.5 Measures of distance
8.6 Principal component analysis with cluster analysis
8.7 Further reading
References

9 Canonical correlation analysis
9.1 Generalizing a multiple regression analysis
9.2 Procedure for a canonical correlation analysis
9.3 Tests of significance
9.4 Interpreting canonical variates
9.5 Computational methods
9.6 Further reading
References

10 Multidimensional scaling
10.1 Constructing a 'map' from a distance matrix
10.2 Procedure for multidimensional scaling
10.3 Further reading
References

11 Epilogue
11.1 The next step
11.2 Some general reminders
11.3 Graphical methods
11.4 Missing values
References

Appendix: example sets of data
Data Set 1: Prehistoric goblets from Thailand
Data Set 2: Canine groups from Asia
Data Set 3: Protein consumption in Europe
References

Author index
Subject index



Preface



The purpose of this book is to introduce multivariate statistical methods to non-mathematicians. It is not intended to be particularly comprehensive. Rather, the intention is to keep details to a minimum while still conveying a good idea of what can be done. In other words, it is a book to 'get you going' in a particular area of statistical methods.

It is assumed that readers have a working knowledge of elementary statistics, particularly tests of significance using the normal, t, chi-square and F distributions, analysis of variance, and linear regression. The material covered in a standard first-year university service course in statistics should be quite adequate, together with a reasonable facility with ordinary algebra. Also, understanding multivariate analysis requires some use of matrix algebra. However, the amount needed is quite small if one is prepared to accept certain details on faith. Anyone who masters the material in Chapter 2 will have the required basic minimum level of matrix competency.

One of the reasons why multivariate methods are being used so much these days is obviously the availability of statistical packages to do the calculations. For most people access to a package will be a necessity for using the methods. However, statistical packages are not stressed particularly in this book, for two reasons. First, there are so many packages available that it is not realistic to write about them all, or to concentrate on any particular one. Second, the calculations for many analyses are relatively straightforward, and can be programmed on a microcomputer fairly easily. The approach adopted in this book is therefore in most cases to state what calculations have to be done and to give some idea of how these can be programmed. Readers who want to do their own analyses can choose a convenient package and read the manual to find out how to use it.

To some extent the chapters of this book can be read independently of each other. The first four are preliminary reading in that they are






largely concerned with general aspects of handling multivariate data rather than with specific techniques. Chapter 1 introduces some examples that are used in subsequent chapters and briefly describes the six multivariate methods of analysis that this book is primarily concerned with. As mentioned above, Chapter 2 provides the minimum level of matrix competency required for understanding the remainder of the book. Chapter 3 is about tests of significance, and is not crucial as far as understanding the following chapters is concerned. Chapter 4 is about measuring distances with multivariate data. At least the first four sections of this chapter should be read before Chapters 7, 8 and 10.

Chapters 5 to 10 cover what I consider to be the most important multivariate techniques of data analysis. Of these, Chapters 5 and 6 form a natural pair to be read together. However, Chapters 7 to 10 can be read singly and still (I hope) make sense. Finally, in Chapter 11, I have attempted to sum up what has been covered and make some general comments on good practices with the analysis of multivariate data. The Appendix contains three example sets of data for readers to analyse by themselves.

I am indebted to many people for their comments on the various draft versions of this book. Earl Bardsley read early versions of several of the chapters. Anonymous reviewers read all or parts of the work. John Harraway read through the final version. Their comments have led to numerous improvements. However, I take all responsibility for any errors. Mary-Jane Campbell cheerfully typed and retyped the manuscript as I made changes. I am most grateful to her.

B.F.J. Manly
Dunedin, November 1985



CHAPTER ONE



The material of multivariate analysis



1.1 Examples of multivariate data

The statistical methods that are described in elementary texts are mostly univariate methods, because they are only concerned with analysing variation in a single random variable. This is even true of multiple regression, because this technique involves trying to account for variation in one dependent variable. On the other hand, the whole point of a multivariate analysis is to consider several related random variables simultaneously, each one being considered equally important at the start of the analysis. The potential value of this more general approach is perhaps best seen by considering a few examples.



Example 1.1 Storm survival of sparrows

After a severe storm on 1 February 1898, a number of moribund sparrows were taken to the biological laboratory at Brown University, Rhode Island. Subsequently about half of the birds died, and Hermon Bumpus saw this as an opportunity to study the effect of natural selection on the birds. He took eight morphological measurements on each bird and also weighed them. The results for five of the variables are shown in Table 1.1, for females only.

When Bumpus collected his data in 1898 his main interest was in the light that it would throw on Darwin's theory of natural selection. He concluded from studying the data that 'the birds which perished, perished not through accident, but because they were physically disqualified, and that the birds which survived, survived because they possessed certain physical characters.' To be specific, the survivors 'are shorter and weigh less ... have longer wing bones, longer legs, longer sternums and greater brain capacity' than the non-survivors.






Table 1.1 Body measurements of female sparrows (X1 = total length, X2 = alar extent, X3 = length of beak and head, X4 = length of humerus, X5 = length of keel of sternum; all in mm). Birds 1 to 21 survived, while the remainder died.






Bird   X1    X2    X3     X4     X5
  1   156   245   31.6   18.5   20.5
  2   154   240   30.4   17.9   19.6
  3   153   240   31.0   18.4   20.6
  4   153   236   30.9   17.7   20.2
  5   155   243   31.5   18.6   20.3
  6   163   247   32.0   19.0   20.9
  7   157   238   30.9   18.4   20.2
  8   155   239   32.8   18.6   21.2
  9   164   248   32.7   19.1   21.1
 10   158   238   31.0   18.8   22.0
 11   158   240   31.3   18.6   22.0
 12   160   244   31.1   18.6   20.5
 13   161   246   32.3   19.3   21.8
 14   157   245   32.0   19.1   20.0
 15   157   235   31.5   18.1   19.8
 16   156   237   30.9   18.0   20.3
 17   158   244   31.4   18.5   21.6
 18   153   238   30.5   18.2   20.9
 19   155   236   30.3   18.5   20.1
 20   163   246   32.5   18.6   21.9
 21   159   236   31.5   18.0   21.5
 22   155   240   31.4   18.0   20.7
 23   156   240   31.5   18.2   20.6
 24   160   242   32.6   18.8   21.7
 25   152   232   30.3   17.2   19.8
 26   160   250   31.7   18.8   22.5
 27   155   237   31.0   18.5   20.0
 28   157   245   32.2   19.5   21.4
 29   165   245   33.1   19.8   22.7
 30   153   231   30.1   17.3   19.8
 31   162   239   30.3   18.0   23.1
 32   162   243   31.6   18.8   21.3
 33   159   245   31.8   18.5   21.7
 34   159   247   30.9   18.1   19.0
 35   155   243   30.9   18.5   21.3
 36   162   252   31.9   19.1   22.2
 37   152   230   30.4   17.3   18.6
 38   159   242   30.8   18.2   20.5
 39   155   238   31.2   17.9   19.3
 40   163   249   33.4   19.5   22.8
 41   163   242   31.0   18.1   20.7
 42   156   237   31.7   18.2   20.3
 43   159   238   31.5   18.4   20.3
 44   161   245   32.1   19.1   20.8
 45   155   235   30.7   17.7   19.6
 46   162   247   31.9   19.1   20.4
 47   153   237   30.6   18.6   20.4
 48   162   245   32.5   18.5   21.1
 49   164   248   32.3   18.8   20.9

Data source: Bumpus (1898).



He also concluded that 'the process of selective elimination is most severe with extremely variable individuals, no matter in which direction the variations may occur. It is quite as dangerous to be conspicuously above a certain standard of organic excellence as it is to be conspicuously below the standard.' This last statement is saying that stabilizing selection occurred, so that individuals with measurements close to the average survived better than individuals with measurements rather different from the average.

Of course, the development of multivariate statistical methods had hardly begun in 1898 when Bumpus was writing. The correlation coefficient as a measure of the relationship between two variables was introduced by Francis Galton in 1877. However, it was another 56 years before Hotelling described a practical method for carrying out a principal component analysis, which is one of the simplest multivariate analyses that can be applied to Bumpus's data. In fact Bumpus did not even calculate standard deviations. Nevertheless, his methods of analysis were sensible. Many authors have reanalysed his data and, in general, have confirmed his conclusions.

Taking the data as an example for illustrating multivariate techniques, several interesting questions spring to mind. In particular:

1. How are the different measurements related? For example, does a large value for one variable tend to occur with large values for the other variables?
2. Do the survivors and non-survivors have significant differences for the mean values of the variables?



Table 1.2 Measurements on male Egyptian skulls from various epochs (X1 = maximum breadth, X2 = basibregmatic height, X3 = basialveolar length, X4 = nasal height; all in mm, as shown in Fig. 1.1). Measurements were taken on 30 skulls from each of five epochs: early predynastic, late predynastic, 12th and 13th dynasties, Ptolemaic period, and Roman period. (The body of the table is not legible in this copy and is omitted.)



Figure 1.1 The positions of the four measurements X1 to X4 taken on each skull (figure not reproduced).



1.2 Preview of multivariate methods

The five examples just considered are typical of the raw material for multivariate statistical methods. The main thing to note at this point is that in all cases there are several variables of interest, and these are clearly not independent of each other. However, it is useful also to give a brief preview of what is to come in the chapters that follow in relationship to these examples.

Principal component analysis is designed to reduce the number of variables that need to be considered to a small number of indices (called the principal components) that are linear combinations of the original variables. For example, much of the variation in the body measurements of sparrows shown in Table 1.1 will be related to the general size of the birds, and the total

    I1 = X1 + X2 + X3 + X4 + X5

will measure this quite well. This accounts for one 'dimension' in the data. Another index is

    I2 = X1 + X2 + X3 - X4 - X5,

which is a contrast between the first three measurements and the last two. This reflects another 'dimension' in the data. Principal component analysis provides an objective way of finding indices of this type so that the variation in the data can be accounted for as concisely as possible. It may well turn out that two or three principal components provide a good summary of all the original variables. Consideration of the values of the principal components instead of the values of the original variables may then make it much easier to understand what the data have to say. In short, principal component analysis is a means of simplifying data by reducing the number of variables.
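As a concrete, if minimal, illustration of such indices, the short Python sketch below computes I1 and I2 for the first three birds of Table 1.1. The code is my addition rather than part of the original text, and any numerical language would serve equally well.

```python
import numpy as np

# Columns are X1..X5 for the first three birds of Table 1.1.
X = np.array([
    [156, 245, 31.6, 18.5, 20.5],
    [154, 240, 30.4, 17.9, 19.6],
    [153, 240, 31.0, 18.4, 20.6],
])

I1 = X.sum(axis=1)                                # size index X1+X2+X3+X4+X5
I2 = X[:, :3].sum(axis=1) - X[:, 3:].sum(axis=1)  # contrast (X1+X2+X3)-(X4+X5)
print(I1)   # one 'size' value per bird
print(I2)   # one 'shape' value per bird
```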



Factor analysis also attempts to account for the variation in a number of original variables using a smaller number of index variables or factors. It is assumed that each original variable can be expressed as a linear combination of these factors, plus a residual term that reflects the extent to which the variable is independent of the other variables. For example, a two-factor model for the sparrow data assumes that

    X1 = a11 F1 + a12 F2 + e1
    X2 = a21 F1 + a22 F2 + e2
    X3 = a31 F1 + a32 F2 + e3
    X4 = a41 F1 + a42 F2 + e4

and

    X5 = a51 F1 + a52 F2 + e5,



where the aij values are constants, F1 and F2 are the factors, and ei represents the variation in Xi that is independent of the variation in the other X-variables. Here F1 might be the factor of size. In that case the coefficients a11, a21, a31, a41 and a51 would all be positive, reflecting the fact that some birds tend to be large and some birds tend to be small on all body measurements. The second factor F2 might then measure an aspect of the shape of birds, with some positive coefficients and some negative coefficients. If this two-factor model fitted the data well then it would provide a relatively straightforward description of the relationship between the five body measurements being considered.

One type of factor analysis starts by taking a few principal components as the factors in the data being considered. These initial factors are then modified by a special transformation process called 'factor rotation' in order to make them easier to interpret. Other methods for finding initial factors are also used. A rotation to simpler factors is almost always done.

Discriminant function analysis is concerned with the problem of seeing whether it is possible to separate different groups on the basis of the available measurements. This could be used, for example, to see how well surviving and non-surviving sparrows can be separated using their body measurements (Example 1.1), or how skulls from different epochs can be separated, again using size measurements (Example 1.2). Like principal component analysis, discriminant function analysis is based on the idea of finding suitable linear combinations of the original variables.

Cluster analysis is concerned with the identification of groups of similar individuals. There is not much point in doing this type of analysis with data like that of Examples 1.1 and 1.2, since the groups (survivors, non-survivors; epochs) are already known. However, in Example 1.3 there might be some interest in grouping colonies on the






basis of environmental variables or Pgi frequencies, while in Example 1.4 the main point of interest is in the similarity between prehistoric dogs and other animals. Similarly, in Example 1.5 different European countries can be grouped in terms of similarity between employment patterns.

With canonical correlation analysis the variables (not the individuals) are divided into two groups, and interest centres on the relationships between these. Thus in Example 1.3 the first four variables are related to the environment while the remaining six variables reflect the genetic distribution at the different colonies of Euphydryas editha. Finding what relationships, if any, exist between these two groups of variables is of considerable biological interest.

Finally, there is multidimensional scaling. The method begins with data on some measure of the distances apart of a number of individuals. From these distances a 'map' is constructed showing how the individuals are related. This is a useful facility, since it is often possible to measure how far apart pairs of objects are without having any idea of how the objects are related in a geometric sense. Thus in Example 1.4 there are ways of measuring the 'distances' between modern dogs and golden jackals, modern dogs and Chinese wolves, etc. Considering each pair of animal groups gives 21 distances altogether. From these distances multidimensional scaling can be used to produce a 'map' of the relationships between the groups. With a one-dimensional 'map' the groups are placed along a straight line. With a two-dimensional 'map' they are represented by points on a plane. With a three-dimensional 'map' they are represented by points within a cube. Four- and higher-dimensional solutions are also possible, although these have limited use because they cannot be visualized. The value of a one-, two- or three-dimensional map is clear for Example 1.4, since such a map would immediately show which groups prehistoric dogs are similar to and which groups they are different from. Hence multidimensional scaling may be a useful alternative to cluster analysis in this case. A 'map' of European countries by employment patterns might also be of interest in Example 1.5.



"



, . ;i~ " I,/ ':1 :/ 11:



ii:1 r '" '¡¡id; i i'li ',; !.!



¡i':.>,I!!: r O::



li



': . . ¡, ! "



1.3 The multivariate normal distribution

The normal distribution for a single variable should be familiar to readers of this book. It has the well known 'bell-shaped' frequency






curve. Many standard univariate statistical methods are based on the assumption that data are normally distributed. Knowing the prominence of the normal distribution with univariate statistical methods, it will come as no surprise to discover that the multivariate normal distribution has a central position with multivariate statistical methods. Many of these methods require the assumption that the data being analysed have multivariate normal distributions.

The exact definition of a multivariate normal distribution is not too important. The approach of most people, for better or worse, seems to be to regard data as being normally distributed unless there is some reason to believe that this is not true. In particular, if all the individual variables being studied appear to be normally distributed, then it is assumed that the joint distribution is multivariate normal. This is, in fact, a minimum requirement, since the definition of multivariate normality requires more than this.

Cases do arise where the assumption of multivariate normality is clearly invalid. For example, one or more of the variables being studied may have a highly skewed distribution with several outlying high (or low) values; there may be many repeated values; etc. This type of problem can sometimes be overcome by an appropriate transformation of the data, as discussed in elementary texts on statistics. If this does not work then a rather special form of analysis may be required. One important aspect of a multivariate normal distribution is that it is specified completely by a mean vector and a covariance matrix. The definitions of a mean vector and a covariance matrix are given in Section 2.7.
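To make the last point concrete, here is a small example of my own (not from the book): a bivariate normal sample is generated from an assumed mean vector and covariance matrix, and the sample statistics recover both approximately. The numerical values are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([158.0, 241.0])        # assumed mean vector
Sigma = np.array([[13.4, 13.7],      # assumed covariance matrix
                  [13.7, 26.1]])

sample = rng.multivariate_normal(mu, Sigma, size=1000)
print(sample.mean(axis=0))           # close to mu
print(np.cov(sample, rowvar=False))  # close to Sigma
```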






1.4 Computer programs

Practical methods for carrying out the calculations for multivariate analyses were developed from the mid-1930s. However, the application of these methods for more than small numbers of variables had to wait until computing equipment was sufficiently well developed. It is only in the last 20 years or so that analyses have become reasonably easy to carry out. Nowadays there are many standard statistical packages available for calculations, for example BMDP, SAS and SPSS. It is the intention that this book should provide readers with enough information to use any package intelligently, without saying much about any particular one.






Most multivariate analyses are still done using the standard packages on medium or large computers. However, the increasing availability and power of microcomputers suggests that this will not be the case for much longer. Packages will be 'shrunk down' to fit into micros and, also, special-purpose programs will become increasingly available in languages like BASIC. Indeed, it is not difficult to write BASIC programs to do many of the standard multivariate analyses, providing advantage is taken of the availability of published algorithms to do the complicated parts of the calculations. Some limited instructions in this direction are included in the chapters that follow.



References

Bumpus, H.C. (1898) The elimination of the unfit as illustrated by the introduced sparrow, Passer domesticus. Biological Lectures, Marine Biology Laboratory, Woods Hole, 11th Lecture, pp. 209-26.
Euromonitor (1979) European Marketing Data and Statistics. Euromonitor Publications, London.
Higham, C.F.W., Kijngam, A. and Manly, B.F.J. (1980) An analysis of prehistoric canid remains from Thailand. Journal of Archaeological Science 7, 149-65.
McKechnie, S.W., Ehrlich, P.R. and White, R.R. (1975) Population genetics of Euphydryas butterflies. I. Genetic variation and the neutrality hypothesis. Genetics 81, 571-94.
Thomson, A. and Randall-MacIver, R. (1905) Ancient Races of the Thebaid. Oxford University Press.



CHAPTER TWO

Matrix algebra

2.1 The need for matrix algebra

As indicated in the Preface, the theory of multivariate statistical methods can only be explained reasonably well with the use of some matrix algebra. For this reason it is helpful, if not essential, to have a certain minimal knowledge of this area of mathematics. This is true even for those whose interest is solely in using the methods as a tool. At first sight, the notation of matrix algebra is certainly somewhat daunting. However, it is not difficult to understand the basic principles involved, providing that some details are accepted on faith.



2.2 Matrices and vectors

A matrix of size m x n is an array of numbers with m rows and n columns, considered as a single entity, of the form

        [ a11  a12  ...  a1n ]
    A = [ a21  a22  ...  a2n ]
        [ ...  ...  ...  ... ]
        [ am1  am2  ...  amn ]

If m = n, then this is a square matrix. If a matrix only has one column, for instance

        [ c1 ]
    c = [ c2 ]
        [ .. ]
        [ cm ]

then this is called a column vector. If there is only one row, for instance

    r = (r1, r2, ..., rn),

then this is called a row vector.

The transpose of a matrix is found by interchanging the rows and columns. Thus the transpose of A above is

         [ a11  a21  ...  am1 ]
    A' = [ a12  a22  ...  am2 ]
         [ ...  ...  ...  ... ]
         [ a1n  a2n  ...  amn ]

Also, c' = (c1, c2, ..., cm), and r' is a column vector.

There are a number of special kinds of matrix that are particularly important. A zero matrix has all elements equal to zero:

        [ 0  0  ...  0 ]
    0 = [ 0  0  ...  0 ]
        [ 0  0  ...  0 ]

Two matrices are equal only if all their elements agree. For example,

    [ a11  a12  a13 ]   [ b11  b12  b13 ]
    [ a21  a22  a23 ] = [ b21  b22  b23 ]
    [ a31  a32  a33 ]   [ b31  b32  b33 ]

only if a11 = b11, a12 = b12, ..., a33 = b33.

A diagonal matrix is a square matrix that has elements of zero, except down the main diagonal, and so is of the form

        [ t1  0   ...  0  ]
    T = [ 0   t2  ...  0  ]
        [ ...  ...     ...]
        [ 0   0   ...  tn ]

A symmetric matrix is a square matrix that is unchanged when it is transposed, so that A has this property providing that A' = A. Finally, an identity matrix is a diagonal matrix with all diagonal terms being unity:

        [ 1  0  ...  0 ]
    I = [ 0  1  ...  0 ]
        [ ...  ...   ..]
        [ 0  0  ...  1 ]

The trace of a matrix is the sum of the diagonal terms, so that tr(A) = a11 + a22 + ... + ann for an n x n matrix. This is only defined for square matrices.
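These definitions map directly onto any matrix library. The following Python sketch, my addition to the text, checks each of them numerically with small illustrative matrices.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4],
              [5, 6]])           # a 3 x 2 matrix

print(A.T)                       # the transpose A', a 2 x 3 matrix
print(np.zeros((3, 3)))          # a zero matrix
print(np.diag([1.0, 2.0, 3.0]))  # a diagonal matrix
print(np.eye(3))                 # an identity matrix

B = np.array([[4, 1],
              [1, 3]])
print((B == B.T).all())          # True: B is symmetric
print(np.trace(B))               # trace = 4 + 3 = 7 (square matrices only)
```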



2.3 Operations on matrices

The ordinary arithmetic processes of addition, subtraction, multiplication and division have their counterparts with matrices. With addition and subtraction it is simply a matter of working element by element. For example, if A and D are both 3 x 2 matrices, then

            [ a11 + d11   a12 + d12 ]
    A + D = [ a21 + d21   a22 + d22 ]
            [ a31 + d31   a32 + d32 ]

while

            [ a11 - d11   a12 - d12 ]
    A - D = [ a21 - d21   a22 - d22 ]
            [ a31 - d31   a32 - d32 ]

These operations can only be carried out with two matrices that are of the same size.

In matrix algebra, an ordinary number such as 20 is called a scalar. Multiplication of a matrix A by a scalar k is then defined as multiplying every element of A by k. Thus if A is 3 x 2, as shown above, then

         [ k a11   k a12 ]
    kA = [ k a21   k a22 ]
         [ k a31   k a32 ]







Multiplying two matrices together is more complicated. To begin with, A.B is only defined if the number of columns of A is equal to the number of rows of B. If this is the case, say with A of size m x n and B of size n x p, then

          [ a11 ... a1n ]   [ b11 ... b1p ]   [ Σ a1j bj1  ...  Σ a1j bjp ]
    A.B = [ ...      ... ] x [ ...     ... ] = [ ...                  ... ]
          [ am1 ... amn ]   [ bn1 ... bnp ]   [ Σ amj bj1  ...  Σ amj bjp ]

where Σ indicates summation over j = 1, 2, ..., n. Thus the element in the ith row and kth column of A.B is

    Σ aij bjk = ai1 b1k + ai2 b2k + ... + ain bnk.

It is only when A and B are square that A.B and B.A are both defined. However, even if this is true, A.B and B.A are not generally equal. For example,

    [ 2  -1 ]   [ 1  1 ]   [ 2  1 ]
    [ 1   1 ] x [ 0  1 ] = [ 1  2 ]

while

    [ 1  1 ]   [ 2  -1 ]   [ 3  0 ]
    [ 0  1 ] x [ 1   1 ] = [ 1  1 ]
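The same operations, using the two matrices from the example just given, look like this in Python (my illustration, not part of the original text):

```python
import numpy as np

A = np.array([[2, -1],
              [1,  1]])
B = np.array([[1, 1],
              [0, 1]])

print(A + B)    # element-by-element addition
print(A - B)    # element-by-element subtraction
print(20 * A)   # multiplication by a scalar
print(A @ B)    # [[2, 1], [1, 2]], as in the text
print(B @ A)    # [[3, 0], [1, 1]], not equal to A @ B
```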



2.4 Matrix inversion

Matrix inversion is analogous to the ordinary arithmetic process of division. Thus for a scalar k it is, of course, true that k x k⁻¹ = 1. In a similar way, if A is a square matrix then its inverse is A⁻¹, where A x A⁻¹ = I, this being the identity matrix. Inverse matrices are only defined for square matrices. However, not all square matrices have an inverse. When A⁻¹ exists, it is both a right and a left inverse, so that A⁻¹A = AA⁻¹ = I.

An example of an inverse matrix is

    [ 2  1 ]⁻¹   [  2/3  -1/3 ]
    [ 1  2 ]   = [ -1/3   2/3 ]

This can be verified by checking that

    [ 2  1 ]   [  2/3  -1/3 ]   [ 1  0 ]
    [ 1  2 ] x [ -1/3   2/3 ] = [ 0  1 ]

Actually, the inverse of a 2 x 2 matrix, if it exists, can be determined easily. It is given by

    [ a11  a12 ]⁻¹   [  a22/Δ  -a12/Δ ]
    [ a21  a22 ]   = [ -a21/Δ   a11/Δ ]

where Δ = a11 a22 - a12 a21. Here the scalar quantity Δ is called the determinant of the matrix. Clearly the inverse is not defined if Δ = 0, since finding the elements of the inverse then requires a division by zero. For 3 x 3 and larger matrices the calculation of an inverse is a tedious business best done using a computer program.

Any square matrix has a determinant value that can be calculated by a generalization of the equation just given for the 2 x 2 case. If the determinant of a matrix is zero then the inverse does not exist, and vice versa. A matrix with a zero determinant is described as being singular.

Matrices sometimes arise for which the inverse is equal to the transpose. They are then described as being orthogonal. Thus A is orthogonal if A⁻¹ = A'.
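In code, both the determinant and the inverse of the worked example above are single library calls. This Python check is my addition:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

print(np.linalg.det(A))   # determinant = 2*2 - 1*1 = 3, so A is not singular
Ainv = np.linalg.inv(A)
print(Ainv)               # [[ 2/3, -1/3], [-1/3, 2/3]], as in the text
print(A @ Ainv)           # the identity matrix, apart from rounding errors
```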







2.5 Quadratic forms

Let A be an n x n matrix and x be a column vector of length n. Then the expression

    Q = x'Ax

is called a quadratic form. It is a scalar and can be expressed by the alternative equation

    Q = Σ_{i=1}^{n} Σ_{j=1}^{n} xi aij xj,

where xi is the element in the ith row of x, and aij is the element in the ith row and the jth column of A.
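A quick numerical check of the two equivalent expressions, again my addition with illustrative values:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
x = np.array([1.0, 3.0])

Q = x @ A @ x   # the quadratic form x'Ax, a scalar
print(Q)        # 1*2*1 + 1*1*3 + 3*1*1 + 3*2*3 = 26.0
```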



2.6 Eigenvalues and vectors

Consider the set of equations

    a11 x1 + a12 x2 + ... + a1n xn = λ x1
    a21 x1 + a22 x2 + ... + a2n xn = λ x2
    ...
    an1 x1 + an2 x2 + ... + ann xn = λ xn,

which can be written in matrix form as

    Ax = λx,

or

    (A - λI)x = 0,

where I is an n x n identity matrix and 0 is an n x 1 zero vector. It can be shown that these equations can only hold for certain particular values of the scalar λ, which are called the latent roots or eigenvalues of the matrix A. There can be up to n of these roots. Given the ith latent root λi, the equations can be solved by arbitrarily setting x1 = 1. The resulting vector of x values

    xi = (x1i, x2i, ..., xni)'

(or any multiple of it) is called the ith latent vector or the ith eigenvector of the matrix A. The sum of the eigenvalues of A is equal to the trace of A.

Finding eigenvalues and eigenvectors is not a simple matter. Like finding a matrix inverse, it is a job best done on a computer.
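On a computer the whole calculation is one call. The sketch below, my addition using the same illustrative matrix as before, also confirms the defining relation and the trace property:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

values, vectors = np.linalg.eig(A)
print(values)                     # latent roots: 3 and 1
print(values.sum(), np.trace(A))  # their sum equals the trace of A

v = vectors[:, 0]                 # eigenvector for the first root
print(A @ v, values[0] * v)       # A x = lambda x, as in the definition
```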



2.7 Vectors of means and covariance matrices

Populations and samples for a single random variable are often summarized by mean values and variances. Thus suppose that a random sample of size n yields the sample values x1, x2, ..., xn. Then the sample mean is

    x̄ = Σ_{i=1}^{n} xi / n,

while the sample estimate of variance is

    s² = Σ_{i=1}^{n} (xi - x̄)² / (n - 1).

These are estimates of the corresponding population parameters, the population mean μ and the population variance σ².

In a similar way, multivariate populations and samples can be summarized by mean vectors and covariance matrices. These are defined as follows. Suppose that there are p variables X1, X2, ..., Xp, and that the values of these for the ith individual in a sample are xi1, xi2, ..., xip, respectively. Then the sample mean of variable j is

    x̄j = Σ_{i=1}^{n} xij / n,    (2.1)

while the sample variance is

    sj² = Σ_{i=1}^{n} (xij - x̄j)² / (n - 1).    (2.2)

In addition, the sample covariance between variables j and k is defined as

    cjk = Σ_{i=1}^{n} (xij - x̄j)(xik - x̄k) / (n - 1),    (2.3)

this being a measure of the extent to which the two variables are linearly related. The ordinary correlation coefficient for variables j and k, rjk, say, is related to the covariance by the expression

    rjk = cjk / (sj sk).    (2.4)

It is clear from equations (2.3) and (2.4) that ckj = cjk, rkj = rjk and rjj = 1.

The vector of sample means is calculated using equation (2.1):

        [ x̄1 ]
    x̄ = [ x̄2 ]    (2.5)
        [ .. ]
        [ x̄p ]

This can be thought of as the centre of the sample. It is an estimate of the population vector of means

        [ μ1 ]
    μ = [ μ2 ]    (2.6)
        [ .. ]
        [ μp ]

The matrix of variances and covariances

        [ c11  c12  ...  c1p ]
    C = [ c21  c22  ...  c2p ]    (2.7)
        [ ...  ...  ...  ... ]
        [ cp1  cp2  ...  cpp ]

where cjj = sj², is called the sample covariance matrix, or sometimes the sample dispersion matrix. This reflects the amount of variation in the sample and also the extent to which the p variables are correlated. It is an estimate of the population covariance matrix. The matrix of correlations, as defined by equation (2.4), is

        [ 1    r12  ...  r1p ]
    R = [ r21  1    ...  r2p ]    (2.8)
        [ ...  ...  ...  ... ]
        [ rp1  rp2  ...  1   ]

This is called the sample correlation matrix. Like C, this must be symmetric.
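All of the quantities in equations (2.1) to (2.8) are built into standard libraries. As a hedged illustration of mine (using the first three birds of Table 1.1, with rows as individuals and columns as variables):

```python
import numpy as np

X = np.array([
    [156, 245, 31.6, 18.5, 20.5],
    [154, 240, 30.4, 17.9, 19.6],
    [153, 240, 31.0, 18.4, 20.6],
])

xbar = X.mean(axis=0)              # vector of sample means, as in (2.5)
C = np.cov(X, rowvar=False)        # sample covariance matrix, divisor n - 1
R = np.corrcoef(X, rowvar=False)   # sample correlation matrix, as in (2.8)
print(xbar)
print(C)
print(R)
```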



2.8 Further reading

A book by Causton (1977) gives a somewhat fuller introduction to matrix algebra than the one given here, but excludes latent roots and vectors. In addition to the chapter on matrix algebra, the other parts of Causton's book provide a good review of general mathematics. A more detailed account of matrix theory, still at an introductory level, is provided by Namboodiri (1984). Those interested in learning more about matrix inversion and finding eigenvectors and eigenvalues, particularly methods for use on microcomputers, will find the book by Nash (1979) a useful source of information.

References

Causton, D.R. (1977) A Biologist's Mathematics. Edward Arnold, London.
Namboodiri, K. (1984) Matrix Algebra: An Introduction. Sage University Paper Series on Quantitative Applications in the Social Sciences, 07-038. Sage Publications, Beverly Hills.
Nash, J.C. (1979) Compact Numerical Methods for Computers. Adam Hilger, Bristol.



CHAPTER THREE

Tests of significance with multivariate data






3.1 Introduction

The purpose of this chapter is to describe some tests that are available for seeing whether there is any evidence that two or more samples come from populations with different means or different amounts of variation. To begin with, two-sample situations will be covered.

3.2 Comparison of mean values for two samples: single variable case

Consider the data in Table 1.1 on the body measurements of 49 female sparrows. Consider in particular the first measurement, which is total length. A question of some interest might be whether the mean of this variable was the same for survivors and non-survivors of the storm that led to the birds being collected. There is then a sample (hopefully random) of 21 survivors and a second sample (again hopefully random) of 28 non-survivors. We wish to know whether the two sample means are significantly different. A standard approach would be to carry out a t-test.

Thus, suppose that in a general situation there is a single variable X, and two random samples of values are available from different populations. Let xi1 denote the values of X in the first sample, for i = 1, 2, ..., n1, and xi2 denote the values in the second sample, for i = 1, 2, ..., n2. Then the mean and variance for the jth sample are

    x̄j = Σ_{i=1}^{nj} xij / nj    (3.1)

and

    sj² = Σ_{i=1}^{nj} (xij - x̄j)² / (nj - 1).

On the assumption that X is normally distributed in both samples, with a common within-sample variance, a test to see whether the two sample means are significantly different involves calculating the statistic

    t = (x̄1 - x̄2) / √{s²(1/n1 + 1/n2)}    (3.2)

and seeing whether this is significantly different from zero in comparison with the t distribution with n1 + n2 - 2 degrees of freedom (d.f.). Here

    s² = {(n1 - 1)s1² + (n2 - 1)s2²} / (n1 + n2 - 2)    (3.3)

is the pooled estimate of variance from the two samples.

It is known that this test is fairly robust to the assumption of normality. Providing that the population distributions of X are not too different from normal it should be satisfactory, particularly for sample sizes of about 20 or more. The assumption of equal within-sample variances is also not too crucial. Providing that the ratio of the true variances is within the limits 0.4 to 2.5, inequality of variance will have little adverse effect on the test. The test is particularly robust if the two sample sizes are equal, or nearly so (Carter et al., 1979). If the population variances are very different then the t test can be modified to allow for this (Dixon and Massey, 1969, p. 119).

3.3 Comparison of mean values for two samples: multivariate case

Consider again the sparrow data of Table 1.1. The test described in the previous section can obviously be employed for each of the five measurements shown in the table (total length, alar extent, length of beak and head, length of humerus, and length of keel of sternum). In that way it is possible to decide which, if any, of these variables appear to have had different mean values for survivors and non-survivors. However, in addition to these it may also be of some interest to know whether all five variables considered together suggest a difference between survivors and non-survivors. In other words: does the total evidence point to mean differences between survivors and non-survivors?

What is needed to answer this question is a multivariate test. One possibility is Hotelling's T² test. The statistic used is then a generalization of the t statistic of equation (3.2) or, to be more precise, the square of the t statistic.

In a general case there will be p variables X1, X2, ..., Xp being considered, and two samples with sizes n1 and n2. There are then two sample mean vectors x̄1 and x̄2, with each one being calculated as shown in equations (2.1) and (2.5). There are also two sample covariance matrices, C1 and C2, with each one being calculated as shown in equations (2.2), (2.3) and (2.7). Assuming that the population covariance matrices are the same for both populations, a pooled estimate of this matrix is

    C = {(n1 - 1)C1 + (n2 - 1)C2} / (n1 + n2 - 2).    (3.4)

Hotelling's T² statistic is defined as

    T² = n1 n2 (x̄1 - x̄2)' C⁻¹ (x̄1 - x̄2) / (n1 + n2).    (3.5)

A significantly large value for this statistic is evidence that the mean vectors are different for the two sampled populations. The significance or lack of significance of T² is most simply determined by using the fact that in the null hypothesis case of equal population means the transformed statistic

    F = (n1 + n2 - p - 1) T² / {(n1 + n2 - 2) p}    (3.6)

follows an F distribution with p and (n1 + n2 - p - 1) d.f. Since T² is a quadratic form it is a scalar, which can be written in the alternative way

    T² = {n1 n2 / (n1 + n2)} Σ_{i=1}^{p} Σ_{k=1}^{p} (x̄1i - x̄2i) c^{ik} (x̄1k - x̄2k),    (3.7)

which may be simpler to compute. Here x̄ji is the mean of variable Xi in the jth sample, and c^{ik} is the element in the ith row and kth column of the inverse matrix C⁻¹.

Hotelling's T² statistic is based on an assumption of normality and equal within-sample variability. To be precise, the two samples being compared using the T² statistic are assumed to come from multivariate normal distributions with equal covariance matrices. Some deviation from multivariate normality is probably not serious. A moderate difference between population covariance matrices is also not too important, particularly with equal or nearly equal sample sizes (Carter et al., 1979). If the two population covariance matrices are very different, and sample sizes are very different as well, then a modified test can be used (Yao, 1965).
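Equations (3.1) to (3.6) translate almost line for line into a modern language. The sketch below is my own (in Python with numpy and scipy, rather than the BASIC the book has in mind) and is one straightforward reading of them, not the book's implementation:

```python
import numpy as np
from scipy import stats

def two_sample_t(x1, x2):
    """The t statistic of equation (3.2), with the pooled variance (3.3)."""
    n1, n2 = len(x1), len(x2)
    s2 = ((n1 - 1) * np.var(x1, ddof=1)
          + (n2 - 1) * np.var(x2, ddof=1)) / (n1 + n2 - 2)
    return (np.mean(x1) - np.mean(x2)) / np.sqrt(s2 * (1 / n1 + 1 / n2))

def hotelling_t2(X1, X2):
    """Hotelling's T2 test of equations (3.4)-(3.6).

    X1 and X2 are samples with one row per individual and one column
    per variable.  Returns T2, the F statistic and its p-value.
    """
    n1, p = X1.shape
    n2 = X2.shape[0]
    d = X1.mean(axis=0) - X2.mean(axis=0)
    C = ((n1 - 1) * np.cov(X1, rowvar=False)
         + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)  # (3.4)
    T2 = n1 * n2 / (n1 + n2) * d @ np.linalg.inv(C) @ d          # (3.5)
    F = (n1 + n2 - p - 1) * T2 / ((n1 + n2 - 2) * p)             # (3.6)
    return T2, F, stats.f.sf(F, p, n1 + n2 - p - 1)
```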



Example 3.1 Testing mean values for Bumpus's female sparrows

As an example of the use of the univariate and multivariate tests that have been described for two samples, consider the sparrow data shown in Table 1.1. Here it is a question of whether there is any difference between survivors and non-survivors with respect to the mean values of five morphological characters.

First of all, tests on the individual variables can be considered, starting with X1, the total length. The mean of this variable for the 21 survivors is x̄1 = 157.38, while the mean for the 28 non-survivors is x̄2 = 158.43. The corresponding sample variances are s1² = 11.05 and s2² = 15.07. The pooled variance from equation (3.3) is therefore

    s² = (20 x 11.05 + 27 x 15.07)/47 = 13.36,

and the t statistic of equation (3.2) is

    t = (157.38 - 158.43) / √{13.36(1/21 + 1/28)} = -0.99,

with n1 + n2 - 2 = 47 d.f. This is not significantly different from zero at the 5% level, so there is no evidence of a mean difference between survivors and non-survivors with regard to total length. Table 3.1 summarizes the results of tests on all five of the variables in Table 1.1 taken individually. In no case is there any evidence of a mean difference between survivors and non-survivors.






Table 3.1 Comparison of mean values for survivors and non-survivors for Bumpus's female sparrows, with variables taken one at a time.

                          Survivors         Non-survivors
Variable                  x̄1      s1²       x̄2      s2²      t (47 d.f.)
Total length              157.38  11.05     158.43  15.07    -0.99
Alar extent               241.00  17.50     241.57  32.55    -0.39
Length beak & head         31.43   0.53      31.48   0.73    -0.20
Length humerus             18.50   0.18      18.45   0.43     0.33
Length keel of sternum     20.81   0.58      20.84   1.32    -0.10



For tests on all five variables considered together it is necessary to know the sample mean vectors and covariance matrices. The means are given in Table 3.1. The covariance matrices are defined by equation (2.7). For the sample of 21 survivors,

    x̄1 = (157.381, 241.000, 31.433, 18.500, 20.810)'

and

         [ 11.048   9.100  1.557  0.870  1.286 ]
         [  9.100  17.500  1.910  1.310  0.880 ]
    C1 = [  1.557   1.910  0.531  0.189  0.240 ]
         [  0.870   1.310  0.189  0.176  0.133 ]
         [  1.286   0.880  0.240  0.133  0.575 ]

For the sample of 28 non-survivors,

    x̄2 = (158.429, 241.571, 31.479, 18.446, 20.839)'

and

         [ 15.069  17.190  2.243  1.746  2.931 ]
         [ 17.190  32.550  3.398  2.950  4.066 ]
    C2 = [  2.243   3.398  0.728  0.470  0.559 ]
         [  1.746   2.950  0.470  0.434  0.506 ]
         [  2.931   4.066  0.559  0.506  1.321 ]

The pooled sample covariance matrix is then

                              [ 13.358  13.748  1.951  1.373  2.231 ]
                              [ 13.748  26.146  2.765  2.252  2.710 ]
    C = (20 C1 + 27 C2)/47 =  [  1.951   2.765  0.645  0.350  0.423 ]
                              [  1.373   2.252  0.350  0.324  0.347 ]
                              [  2.231   2.710  0.423  0.347  1.004 ]

where, for example, the element in the second row and third column is (20 x 1.910 + 27 x 3.398)/47 = 2.765.

The inverse of the matrix C is found to be

          [  0.2061  -0.0694  -0.2395   0.0785  -0.1969 ]
          [ -0.0694   0.1234  -0.0376  -0.5517   0.0277 ]
    C⁻¹ = [ -0.2395  -0.0376   4.2219  -3.2624  -0.0181 ]
          [  0.0785  -0.5517  -3.2624  11.4610  -1.2720 ]
          [ -0.1969   0.0277  -0.0181  -1.2720   1.8068 ]

This can be verified by evaluating the product C x C⁻¹ and seeing that this is a unit matrix (apart from rounding errors). Substituting the elements of C⁻¹ and the other values into equation (3.7) produces

    T² = {(21 x 28)/(21 + 28)} x [(157.381 - 158.429) x 0.2061 x (157.381 - 158.429)
         + (157.381 - 158.429) x (-0.0694) x (241.000 - 241.571)
         + ... + (20.810 - 20.839) x 1.8068 x (20.810 - 20.839)]
       = 2.824.

Using equation (3.6) this converts to an F statistic of

    F = (21 + 28 - 5 - 1) x 2.824 / {(21 + 28 - 2) x 5} = 0.517,

with 5 and 43 d.f. Clearly this is not significantly large, since a significant F value must exceed unity. Hence there is no evidence of a difference in means for survivors and non-survivors, taking all five variables together.
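For the record, the conversion from T² to F in this example is a one-line check (my addition, using the numbers above):

```python
n1, n2, p, T2 = 21, 28, 5, 2.824
F = (n1 + n2 - p - 1) * T2 / ((n1 + n2 - 2) * p)   # equation (3.6)
print(round(F, 3))   # 0.517, with 5 and 43 degrees of freedom
```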



3.4 Multivariate versus univariate tests

In this last example there were no significant results either for the variables considered individually or for the overall multivariate test. It should be noted, however, that it is quite possible to have insignificant univariate tests but a significant multivariate test. This can occur because of the accumulation of the evidence from the individual variables in the overall test. Conversely, an insignificant multivariate test can occur when some univariate tests are significant, because the evidence of a difference provided by the significant variables is diluted by the lack of evidence from the remaining variables.



... variables being considered, but the variation about the means remained fairly constant.

Example 3.3 Comparison of samples of Egyptian skulls

... Substituting n = 150, ... for which the determinant is |T| = 7.306 x 10¹³. The within-sample matrix of sums of squares and cross-products is found from equation (3.12) to be ...

3.9 Computational methods

The multivariate tests discussed in this chapter are not difficult to program on a microcomputer if standard algorithms are used where possible. The T² statistic of equation (3.5) requires a matrix inversion. This can be done easily using Algorithm 9 of Nash (1979). The ψ test of equation (3.10) requires the calculation of two determinants. Algorithm 5 of Nash (1979) will provide these.
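Nash's published algorithms were aimed at the microcomputers of the time; in a modern linear algebra library the same two building blocks are single calls. The matrix below is illustrative only, not from the book:

```python
import numpy as np

C = np.array([[2.0, 0.5],
              [0.5, 1.0]])

print(np.linalg.inv(C))   # matrix inversion, needed for T2 in equation (3.5)
print(np.linalg.det(C))   # a determinant, as needed for the test of (3.10)
```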



References

Carter, E.M., Khatri, C.G. and Srivastava, M.S. (1979) The effect of inequality of variances on the t-test. Sankhya 41, 216-25.
Dixon, W.J. and Massey, F.J. (1969) Introduction to Statistical Analysis (3rd edn). McGraw-Hill, New York.
Levene, H. (1960) Robust tests for equality of variance. In Contributions to Probability and Statistics (eds I. Olkin, S.G. Ghurye, W. Hoeffding, W.G. Madow and H.B. Mann), pp. 278-92. Stanford University Press, California.
Nash, J.C. (1979) Compact Numerical Methods for Computers. Adam Hilger, Bristol.
Schultz, B. (1983) On Levene's test and other statistics of variation. Evolutionary Theory 6, 197-203.
Srivastava, M.S. and Carter, E.M. (1983) An Introduction to Applied Multivariate Statistics. North-Holland, New York.
Van Valen, L. (1978) The statistics of variation. Evolutionary Theory 4, 33-43. (Erratum: Evolutionary Theory 4, 202.)
Yao, Y. (1965) An approximate degrees of freedom solution to the multivariate Behrens-Fisher problem. Biometrika 52, 139-47.






CHAPTER FOUR

Measuring and testing multivariate distances



Ij 'l







,¡,;



l' ,r, I '~ · '! 1 , ¡¡ 1' 1



:.j:I ~: 1



,í1 !iI'i:.: ¡,;'i:¡j ~



" \' , '1 ,



:'¡:¡j¡ 'J:¡ ( r.: : . '11



l!~" tl,. r



, !.'



,1;



' :,, '1; 1 ' )1



,'., '¡" ' j'l1 · ," ,1



~:



t, • •



¡i.i :y]



,-¡.,,, I







~~ ~ ;"L 1



d ti : ;:1



¡-! ;i:J; ,:;



I ', .: '1



,1



i l ' I!" , ,,;,



I,, 'l.II, :.:¡:,:·¡'



: 'I! ;.:'¡ ' I!' I r ! 1: : :



I; :! ; ,. "1"



:it ,.:¡.i¡I¡ii



"I:i¡ I¡ ',."", ,1'.",II:: f" l;.:.,¡ ', I'/'':¡:;:1':I, ::1'



4.1 Multivariate distances

A large number of problems can be viewed in terms of 'distances' between single observations, or between samples of observations, or between populations of observations. For example, considering the data in Table 1.4 on mandible measurements of dogs, wolves, jackals, cuons and dingos, it is sensible to ask how far one of these groups is from the other six groups. The idea then is that if two animals have similar mean mandible measurements then they are 'close', whereas if they have rather different mean measurements then they are 'distant' from each other. Throughout this chapter it is this concept of 'distance' that is being used.

A large number of distance measures have been proposed and used in multivariate analyses. Here only some of the most common ones will be mentioned. It is fair to say that measuring distances is a topic where a certain amount of arbitrariness seems unavoidable.

One possible situation is that there are n objects being considered, with a number of measurements being taken on each of these, and the measurements are of two types. For example, in Table 1.3 results are given for four environmental variables and six gene frequencies for 16 colonies of a butterfly. Two sets of distances can therefore be calculated between the colonies. One set can be environmental distances and the other set genetic distances. An interesting question is then whether there is a significant relationship between these two sets of distances. Mantel's (1967) test, which is described in Section 4.5, is useful in this context.

4.2 Distances between individual observations

To begin the discussion on measuring distances, consider the simplest case where there are n individuals, each of which has values for p variables X_1, X_2, ..., X_p. The values for individual i can then be denoted by x_{i1}, x_{i2}, ..., x_{ip}, and those for individual j by x_{j1}, x_{j2}, ..., x_{jp}. The problem is to measure the 'distance' between individual i and individual j.

If there are only p = 2 variables then the values can be plotted as shown in Fig. 4.1. Pythagoras' theorem then says that the length, d_{ij}, of the line joining the point for individual i to the point for individual j (the Euclidean distance) is

d_{ij} = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2}.

[Figure 4.1 The Euclidean distance between individuals i and j, with p = 2 variables.]



With p = 3 variables the values can be taken as the coordinates in space for plotting the positions of individuals i and j (Fig. 4.2). Pythagoras' theorem then gives the distance between the two points as

d_{ij} = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + (x_{i3} - x_{j3})^2}.

[Figure 4.2 The Euclidean distance between individuals i and j, with p = 3 variables.]

With more than three variables it is not possible to use the variable values as the coordinates for physically plotting points. However, the two- and three-variable cases suggest that the generalized Euclidean distance

d_{ij} = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2}    (4.1)

may serve as a satisfactory measure for many purposes.



From the form of equation (4.1) it is clear that if one of the variables measured is much more variable than the others then it will dominate the calculation of distances. For example, to take an extreme case, suppose that n men are being compared and that X_1 is their stature and the other variables are tooth dimensions, with all the measurements being in millimetres. Stature differences will then be in the order of perhaps 20 or 30 millimetres, while tooth dimension differences will be in the order of one or two millimetres. Simple calculations of d_{ij} will then provide distances between individuals that are essentially stature differences only, with tooth differences having negligible effects. Clearly there will be a scaling problem.
In practice it is usually desirable for all variables to have about the same influence on the distance calculation. This is achieved by a preliminary scaling of the variables to standardize them. This can be done, for example, by dividing each variable by its standard deviation for the n individuals being compared.
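This standardization, followed by equation (4.1), is easy to program. The following is a minimal Python sketch (an illustration with arbitrary names, not part of the original text; scaling by the sample standard deviation is one common choice):

```python
import numpy as np

def standardized_euclidean_distances(X):
    """Euclidean distances of equation (4.1) computed after dividing
    each variable (column of X) by its standard deviation over the
    n individuals (rows of X); subtracting the column means as well
    would leave the distances unchanged."""
    X = np.asarray(X, dtype=float)
    Z = X / X.std(axis=0, ddof=1)          # scale by sample standard deviations
    diff = Z[:, None, :] - Z[None, :, :]   # all pairwise differences
    return np.sqrt((diff ** 2).sum(axis=-1))   # n x n matrix of d_ij
```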



" ~"?.



;'"



-Jt~".i.,"



' :O." .~t::" ~



¿.



Example 4.1 Distances between dogs and related species



""!"-:-



~;_



As an example of the use of the Eucliaean distance measure, consider the data in Table 1.4 for mean mandible measurements of seven '''~:~:. groups of dogs and related species. It may be recalled from Chapter 1 that the main question with these data is how the prehistoric dogs relate to the other groups. The first step in caJculating distances is to standardize the measurements. Here this will be done by expressing them as deviations from means in units of standard deviations. For example, the first measurement Xl (breadth) has a mean of 10.486 mm and a






Table 4.1 Standardized variable values calculated from the original data in Table 1.4. The rows are the seven groups of Table 1.4.

Group     X1       X2       X3       X4       X5       X6
1        -0.50    -0.50    -0.74    -0.74    -0.49    -0.61
2        -1.52    -1.93    -1.12    -1.39    -0.86    -1.31
3         1.92     1.60     1.84     1.95     1.68     1.62
4         0.65     0.60     1.04     0.74     1.26     0.95
5         0.14     0.33    -0.04     0.00    -1.19    -0.40
6        -0.56     0.03    -0.14    -0.19     0.03     0.66
7        -0.12    -0.13    -0.84    -0.37    -0.43    -0.90
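As an illustration (added here, not part of the original text), the Euclidean distances between the seven groups can be computed directly from the standardized values; the array below simply transcribes Table 4.1:

```python
import numpy as np

# Standardized values from Table 4.1; rows are the seven groups of
# Table 1.4, columns are X1 to X6.
Z = np.array([
    [-0.50, -0.50, -0.74, -0.74, -0.49, -0.61],
    [-1.52, -1.93, -1.12, -1.39, -0.86, -1.31],
    [ 1.92,  1.60,  1.84,  1.95,  1.68,  1.62],
    [ 0.65,  0.60,  1.04,  0.74,  1.26,  0.95],
    [ 0.14,  0.33, -0.04,  0.00, -1.19, -0.40],
    [-0.56,  0.03, -0.14, -0.19,  0.03,  0.66],
    [-0.12, -0.13, -0.84, -0.37, -0.43, -0.90],
])

diff = Z[:, None, :] - Z[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))   # 7 x 7 matrix of distances d_ij
print(np.round(D, 2))
```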



[Table 7.3 Numbers of skulls allocated to the samples that they are closest to on the basis of Mahalanobis distances; each of the five samples contains 30 skulls. The individual entries were not recoverable.]



This is, however, very much an average change. If the 150 skulls are allocated to the samples to which they are closest according to the Mahalanobis distance function of equation (7.1), then only a fairly small proportion are allocated to the samples that they really belong to (Table 7.3). Thus although this discriminant function analysis has been successful in pinpointing the changes in skull dimensions over time, it has not produced a satisfactory method for 'ageing' skulls.
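A minimal sketch of this allocation rule follows (an illustration with arbitrary names, assuming that the group means and the pooled within-sample covariance matrix W have already been computed):

```python
import numpy as np

def allocate_to_groups(X, group_means, W):
    """Allocate each row of X to the group whose mean is closest in
    the Mahalanobis sense, W being the pooled within-sample
    covariance matrix."""
    W_inv = np.linalg.inv(W)
    allocations = []
    for x in X:
        d2 = [(x - m) @ W_inv @ (x - m) for m in group_means]
        allocations.append(int(np.argmin(d2)))
    return np.array(allocations)
```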



Example 7.3 Discriminating between groups of European countries

The data shown in Table 1.5 on employment percentages in nine industry groups in 26 European countries have already been examined by principal component analysis and by factor analysis (Examples 5.2 and 6.1). Here they will be considered from the point of view of the extent to which it is possible to discriminate between groups of countries on the basis of employment patterns.
In particular, three natural groups existed in 1979 when the data were collected. These were: (1) the European Economic Community (EEC) countries at the time, namely Belgium, Denmark, France, West Germany, Ireland, Italy, Luxembourg, the Netherlands and the United Kingdom; (2) the other western European countries of Austria, Finland, Greece, Norway, Portugal, Spain, Sweden, Switzerland and Turkey; and (3) the eastern European communist countries of Bulgaria, Czechoslovakia, East Germany, Hungary, Poland, Romania, USSR and Yugoslavia. These three groups can be used as a basis for a discriminant function analysis.
The percentages in the nine industry groups add to 100% for each of the 26 countries. This means that any one of the nine percentage variables can be expressed as 100 minus the remaining variables. It is therefore necessary to omit one of the variables from the analysis in order to calculate Mahalanobis distances and canonical discriminant functions. The last variable, the percentage employed in transport and communications, was omitted for the analysis that will now be described.
The number of canonical variables is two in this example, this being the minimum of the number of variables (p = 8) and the number of groups minus one (m - 1 = 2). These canonical variables are



Z_1 = 0.73 AGR + 0.62 MIN + 0.63 MAN - 0.16 PS + 0.50 CON + 1.24 SER + 0.72 FIN + 0.52 SPS

and

Z_2 = 0.84 AGR + 2.46 MIN + 0.78 MAN + 1.18 PS + 1.17 CON + 0.83 SER + 0.84 FIN + 1.05 SPS,

the corresponding eigenvalues of W^{-1}B being λ_1 = 7.531 and λ_2 = 1.046. The corresponding chi-square values from equation (7.2) are χ_1^2 = 41.80, with 9 degrees of freedom, and χ_2^2 = 13.96, with 7 degrees of freedom. The chi-square value for Z_1 is significantly large at the 0.1% level. The chi-square value for Z_2 is not quite significantly large at the 5% level.
From the coefficients in the equation for Z_1 it can be seen that this variable will tend to be large when there are high percentages employed in everything except PS (power supplies). There is a particularly high coefficient for SER (service industries). For Z_2, on the other hand, all the coefficients are positive, with that for MIN (mining) being particularly high.
A plot of the countries against their values for Z_1 and Z_2 is shown in Fig. 7.1. The eastern European communist countries appear on the left-hand side, the non-EEC western European countries in the centre, and the EEC countries on the right-hand side of the figure.



[Figure 7.1 Plot of 26 European countries against their values for two canonical discriminant functions.]
It can be clearly seen how most of the separation occurs with the horizontal values of Z_1. As far as values of Z_2 are concerned, it appears that the non-EEC western European countries tend to have lower values than the other two groups. Overall, the degree of separation of the three groups is good. The only 'odd' cases are West Germany, which appears to be more like a non-EEC western European country than an EEC country, and Sweden, which appears to be more like an EEC country than a non-EEC western European country.
The discriminant function analysis has been rather successful in this example. It is possible to separate the three groups of countries on the basis of their employment patterns. Furthermore, the separation using the two canonical discriminant functions is much clearer than the separation shown in Fig. 5.2 (p. 70) for the first two principal components.
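For readers wishing to reproduce this kind of analysis, the canonical coefficients can be obtained from the eigenvectors of W^{-1}B. The sketch below is an illustration only (the names are arbitrary) and assumes that the matrices W and B have already been formed:

```python
import numpy as np

def canonical_variates(W, B, n_keep=2):
    """Largest eigenvalues of W^{-1}B and the corresponding coefficient
    vectors; W^{-1}B is not symmetric, so the general eigenvalue routine
    is used and the (real) results are sorted into decreasing order."""
    values, vectors = np.linalg.eig(np.linalg.inv(W) @ B)
    order = np.argsort(values.real)[::-1]
    values = values.real[order][:n_keep]          # e.g. lambda_1, lambda_2
    vectors = vectors.real[:, order][:, :n_keep]  # columns give Z_1, Z_2
    return values, vectors
```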






7.6 Allowing for prior probabilities of group membership

Computer programs allow many options for varying a discriminant function analysis. One situation is that the probability of membership is inherently different for different groups. For example, if there are two groups it might be known that most individuals fall into group 1 while very few fall into group 2. In that case, if an individual is to be allocated to a group, it makes sense to bias the allocation procedure in favour of group 1. Thus the process of allocating an individual to the group to which it has the smallest Mahalanobis distance should be modified. To allow for this, some computer programs permit prior probabilities of group membership to be taken into account in the analysis.
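One standard way of building priors into the rule, under the usual multivariate normal assumptions, is to penalize the squared Mahalanobis distance to group k by -2 log p_k, where p_k is the prior probability of group k. The sketch below is an illustration of that idea, not a description of any particular package:

```python
import numpy as np

def allocate_with_priors(x, group_means, W_inv, priors):
    """Allocate x to the group minimizing the squared Mahalanobis
    distance penalized by -2 log(prior); a high prior probability
    therefore favours allocation to that group."""
    scores = [(x - m) @ W_inv @ (x - m) - 2.0 * np.log(p)
              for m, p in zip(group_means, priors)]
    return int(np.argmin(scores))
```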



II"I! '



7.7 Stepwise discriminant function analysis






Another possible modification of the basic analysis involves carrying it out in a stepwise manner. In this case variables are added to the discriminant functions one by one until it is found that adding extra variables does not give significantly better discrimination. There are many different criteria that can be used for deciding on which variables to include in the analysis and which to miss out.
A problem with stepwise discriminant function analysis is the bias that the procedure introduces into significance tests. Given enough variables it is almost certain that some combination of them will produce 'significant' discriminant functions by chance alone. If a stepwise analysis is carried out then it is advisable to check its validity by rerunning it several times with a random allocation of individuals to groups to see how significant are the results obtained. For example, with the Egyptian skull data the 150 skulls could be allocated completely at random to five groups of 30, the allocation being made a number of times, and a discriminant function analysis run on each random set of data. Some idea could then be gained of the probability of getting significant results through chance alone.
It should be stressed that this type of randomization to verify a discriminant function analysis is unnecessary for a standard non-stepwise analysis, providing there is no reason to suspect the assumptions behind the analysis. It could, however, be informative in cases where the data are clearly not normally distributed within groups or where the within-group covariance matrix is not the same for each group.
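This randomization check is easy to script. In the illustrative sketch below, `test_statistic` stands for whatever criterion the stepwise analysis uses (for example, the largest eigenvalue of W^{-1}B); the function and its names are assumptions for the purpose of the example:

```python
import numpy as np

def randomization_check(X, labels, test_statistic, n_rounds=999, seed=0):
    """Estimate the probability of a test statistic at least as large
    as the observed one when individuals are allocated to groups
    completely at random."""
    rng = np.random.default_rng(seed)
    observed = test_statistic(X, labels)
    count = sum(test_statistic(X, rng.permutation(labels)) >= observed
                for _ in range(n_rounds))
    return (count + 1) / (n_rounds + 1)
```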



7.8 Jackknife classification of individuals

A moment's reflection will suggest that an allocation matrix such as that shown in Table 7.3 must tend to have a bias in favour of allocating individuals to the group that they really come from. After all, the group means are determined from the observations in that group. It is not surprising to find that an observation is closest to the centre of the group where the observation helped to determine that centre.
To overcome this bias, some computer programs carry out what is called a 'jackknife classification' of observations. This involves allocating each individual to its closest group without using that individual to help determine a group centre. In this way any bias in the allocation is avoided.
In practice there is often not a great deal of difference between the straightforward classification and the jackknife classification. The jackknife classification usually gives a slightly smaller number of correct allocations.
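A direct, if slow, leave-one-out version of the allocation rule can be sketched as follows (an illustration only; each observation is allocated using group means and a pooled covariance matrix computed without that observation):

```python
import numpy as np

def jackknife_classify(X, labels):
    """Allocate each observation to its closest group in the
    Mahalanobis sense, leaving that observation out of the
    calculation of the group means and pooled covariance matrix."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    groups = np.unique(labels)
    assigned = np.empty(len(X), dtype=labels.dtype)
    for i in range(len(X)):
        keep = np.arange(len(X)) != i
        Xk, lk = X[keep], labels[keep]
        means = {g: Xk[lk == g].mean(axis=0) for g in groups}
        resid = np.vstack([Xk[lk == g] - means[g] for g in groups])
        W_inv = np.linalg.inv(np.cov(resid, rowvar=False))
        d2 = {g: (X[i] - means[g]) @ W_inv @ (X[i] - means[g])
              for g in groups}
        assigned[i] = min(d2, key=d2.get)
    return assigned
```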



7.9 Assigning of ungrouped individuals to groups

Some computer programs allow the input of data values for a number of individuals for which the true group is not known. It is then possible to assign these individuals to the group that they are closest to, in the Mahalanobis distance sense, on the assumption that they



have to come from one of the m groups that are sampled. Obviously in these cases it will not be known whether the assignment is correct. However, the errors in the allocation of individuals from known groups give an indication of how accurate the assignment process is likely to be. For example, the results shown in Table 7.3 indicate that allocating Egyptian skulls to different time periods using skull dimensions is liable to result in many errors.



10.1 Constructing a 'map' from a distance matrix

Multidimensional scaling is a technique that is designed to construct a 'map' showing the relationships between a number of objects, given only a table of distances between them. The 'map' can be in one dimension (if the objects fall on a line), in two dimensions (if the objects lie on a plane), in three dimensions (if the objects can be represented by points in space), or in a higher number of dimensions (in which case an immediate geometrical representation is not possible).
The fact that it may be possible to construct a map from a table of distances can be seen by considering the example of four objects A, B, C and D shown in Fig. 10.1. Here the distances apart are given by the array:



        A      B      C      D
A       0     6.0    7.8    9.5
B      6.0     0     3.5    6.0
C      7.8    3.5     0     2.5
D      9.5    6.0    2.5     0

[Figure 10.2 A mirror image of the map shown in Fig. 10.1, for which the distances between objects are the same.]



"' ~.



For example, the distance from A to B. which is the same as the distance from B to A, is 6.0, The distance of objects to themselves is, of course, O.lt seems plausible that the map can be reconstructed from the array of distances. A moments reflection may indicate, howev~r, that a mirror image of the map as shown in Fig. 10.2 will have tht;. same array of d istances between objects. eonsequently, it seems clear that a recovery of the original m'!p wil! be subject to a possible reversal of this type. It is also apparent ¡hat if more than three objccts are imolved then the)' ma y not lie on a plar-e. In that case theií distance mr,líix will implicitl y conta in Ihis information. For exam p\c, the distancc:
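One standard way of carrying out such a reconstruction is classical (metric) scaling; the sketch below applies it to the distance array above, and is offered as an illustration rather than as the method developed in this chapter:

```python
import numpy as np

# The distance array for the four objects A, B, C and D.
D = np.array([[0.0, 6.0, 7.8, 9.5],
              [6.0, 0.0, 3.5, 6.0],
              [7.8, 3.5, 0.0, 2.5],
              [9.5, 6.0, 2.5, 0.0]])

n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n   # centring matrix
G = -0.5 * J @ (D ** 2) @ J           # double-centred squared distances
values, vectors = np.linalg.eigh(G)
order = np.argsort(values)[::-1]      # largest eigenvalues first
top = order[:2]
coords = vectors[:, top] * np.sqrt(np.maximum(values[top], 0.0))
print(np.round(coords, 2))            # a map, unique up to rotation and reflection
```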



Missing values can cause more problems with multivariate data than with univariate data. The trouble is that when there are many variables being measured on each individual it is quite often the case that one or two of these variables have missing values. It may then happen that if individuals with any missing values are excluded from an analysis this means excluding quite a large proportion of individuals, which may be completely impractical. For example, in studying ancient human populations skeletons are frequently broken and incomplete.
Texts on multivariate analysis are often remarkably silent on the question of missing values. To some extent this is because doing something about missing values is by no means a straightforward matter. Seber (1984) gives a discussion of the problem, with references.
In practice, computer packages sometimes include a facility for estimating missing values. For example, the BMDP package (Dixon, 1983) allows missing values to be estimated by several different 'common sense' methods. One possible approach is therefore to estimate missing values and then analyse the data, including these estimates, as if they were complete data in the first place. It seems reasonable to suppose that this procedure will work satisfactorily providing that only a small proportion of values are missing.
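The simplest 'common sense' method is to replace a missing value by the mean of the observed values for the same variable. A minimal sketch of that idea follows (an illustration only, not the BMDP procedure):

```python
import numpy as np

def impute_column_means(X):
    """Replace each missing (NaN) entry by the mean of the observed
    values for the same variable (column)."""
    X = np.array(X, dtype=float)
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X
```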



APPENDIX
Example sets of data

This Appendix contains three sets of data that can be used for trying out the different methods of analysis that are described in this book. In each case some possible approaches for analysing the data are suggested. However, readers are invited to develop alternatives to these.

Data Set 1: Prehistoric goblets from Thailand

Table A.1 shows six measurements on each of 25 pottery goblets excavated from prehistoric sites in Thailand. Figure A.1 illustrates a typical goblet and shows how the measurements were made.
The main questions of interest with this set of data concern the similarities and differences between the individual goblets. Are there any obvious groupings of similar goblets? Is it possible to display the data graphically to show how the goblets are related? Are there any goblets that are particularly unusual?

Table A.1 Measurements, in centimetres, taken on 25 prehistoric goblets from Thailand. The measurements are defined in Fig. A.1.
[Table A.1 data: values of X1 to X6 for the 25 goblets; the entries were not recoverable.]

Data source: Professor C.F.W. Higham, University of Otago.