
Using Subsample Values as Typical Values

Author: J. A. Hartigan
Source: Journal of the American Statistical Association, Vol. 64, No. 328 (Dec., 1969), pp. 1303-1317
Published by: American Statistical Association
Stable URL: http://www.jstor.org/stable/2286069


USING SUBSAMPLE VALUES AS TYPICAL VALUES*

J. A. HARTIGAN
Princeton University

The subsample values of a statistic $t$ are the values of $t$ for subsets of the whole sample. Subsample values may be used as indicators of variability of $t$. For real valued statistics $t$, estimating a parameter $\theta$, the subsample values are defined to form a set of typical values if the intervals between the ordered subsample values each include $\theta$ with equal probability. The typical value property holds exactly in some cases and approximately in others.

1. INTRODUCTION

A statistic $t$ is computed from a sample $(Y_1, Y_2, \ldots, Y_n)$ as an estimate of some parameter $\theta$. A subsample is any subset of the sample $(Y_1, Y_2, \ldots, Y_n)$; the subsample values of $t$ may sometimes be used to give confidence statements, or Bayesian probability statements, for $\theta$. Subsample methods are potentially very widely applicable, since they require only that $t$ be defined for subsamples of the data, and that the data be regarded as a sample from a population. Subsample methods have been used informally for a long time, to give a rough idea of variability of the statistic $t$, and to test homogeneity of the sample. Mahalanobis [5] used subsamples for assessing variability under the name "interpenetrating samples." Also in sample survey work, McCarthy [6] has investigated the properties of half-samples. The Jackknife, Tukey [3], is a subsample method for estimating variance, and reducing bias, of a real valued statistic $t$.

This paper introduces the use of subsample values as typical values for a parameter $\theta$. A set of random variables $X_1, X_2, \ldots, X_N$ is said to be a set of typical values for $\theta$, if the intervals between the ordered random variables $(-\infty, X_{(1)}), (X_{(1)}, X_{(2)}), \ldots, (X_{(N)}, \infty)$ each include $\theta$ with equal probability. For example, suppose that the sample $\{Y_1, Y_2, \ldots, Y_n\}$ is such that the $Y_i$ are independent, continuous, and symmetrically distributed about $\mu$; let $Z_1, Z_2, \ldots, Z_N$, $N = 2^n - 1$, denote the set of subsample means. Then $Z_1, Z_2, \ldots, Z_N$ form typical values for $\mu$. (This result is the confidence version of Fisher's ([2], p. 46) non-parametric test for the difference between two means. Professor J. W. Tukey, of Princeton University, suggested the relationship to the author.) The typical value property holds exactly for subsample values of statistics other than the mean; for example, certain maximum likelihood estimators of $\mu$ in the above case. Since computation of $t$ for all $2^n - 1$ subsamples may be very expensive, it is necessary to look for smaller sets of typical values. One way to do this is to select a set randomly from the set of all subsamples; another way is to choose one of a number of "balanced" sets of subsamples determined by group theory.

For an arbitrary real valued statistic $t$, the interpretation of subsample values as a set of typical values has approximate validity in a wide range of cases: some asymptotic arguments are given to support this statement. An empirical investigation of eigenvector analysis shows that the confidence statements based on subsample values are not as accurate as the ones based on the standard normal asymptotic methods. Nevertheless, the simplicity of subsample methods recommends them as a general tool for assessing variability.

* Research supported by the Office of Naval Research under Contract Nonr-1858(05).

2. DEFINITIONS AND BASIC RESULTS

Let $X_{(1)}, X_{(2)}, \ldots, X_{(N)}$ denote the ordered values of a set of random variables $X_1, \ldots, X_N$. A typical set or set of typical values for a constant $\theta$ is a set of random variables $X_1, \ldots, X_N$ such that the $(N+1)$ intervals $(-\infty, X_{(1)}), (X_{(1)}, X_{(2)}), \ldots, (X_{(N-1)}, X_{(N)}), (X_{(N)}, \infty)$ each include $\theta$ with probability $1/(N+1)$. Given a sample $Y_1, Y_2, \ldots, Y_n$ and a statistic $t(Y_1, \ldots, Y_n)$ estimating $\theta$, we wish to specify a typical set for $\theta$. To obtain the random variables $X_1, \ldots, X_N$ in the typical set, we recompute the statistic $t$ for selected subsamples; a subsample is any subset of $Y_1, Y_2, \ldots, Y_n$. The set of $2^n$ subsamples forms a group $G$ under the product, symmetric difference, i.e., $A \circ B = A \cup B - A \cap B$. The unit of the group is the null set, $\phi$. A set of subsamples $G^*$ is balanced if $(\phi, G^*)$ is a subgroup of $G$.
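The balance condition can be checked mechanically. The following is a minimal sketch, not part of the paper: the function name `is_balanced` and the three-observation example are assumptions made for illustration. It tests whether a collection of subsamples, with the null set added, is closed under symmetric difference; since every element is its own inverse, closure is enough for the subgroup property.

```python
from itertools import combinations

def is_balanced(subsamples):
    """True if the subsamples, together with the null set, are closed under
    symmetric difference (the group product A o B = (A | B) - (A & B))."""
    group = {frozenset(s) for s in subsamples} | {frozenset()}
    return all((a ^ b) in group for a, b in combinations(group, 2))

# Hypothetical example on three observations labelled 1, 2, 3.
print(is_balanced([{1, 2}, {1, 3}, {2, 3}]))  # closed: (12) o (13) = (23)
print(is_balanced([{1, 2}, {1, 3}]))          # not closed: (23) is missing
```

Representing subsamples as sets of observation labels keeps the group product a single operator (`^`), which is the only structure the definition relies on.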

Theorem: Let $\{Y_i\}$ be a set of independent, continuous random variables symmetric about 0. Let $G^*$ be an arbitrary set of subsamples. Define

$$X_A = \sum_{Y_i \in A} Y_i, \quad A \in G^*.$$

Then $\{X_A, A \in G^*\}$ is a typical set for 0, for all sets of random variables $\{Y_i\}$, if and only if $G^*$ is balanced.

Proof: To prove sufficiency, let $G_0 = (\phi, G^*)$, $A_0 \in G_0$. Let $X_\phi = 0$. First we show that the set $\{X_A - X_{A_0}, A \in G_0\}$ has the same distribution as a permutation of the set $\{X_A, A \in G_0\}$. Let $Z_i = Y_i$ if $Y_i \notin A_0$, $Z_i = -Y_i$ if $Y_i \in A_0$. Let

$$U_A = \sum_{Z_i \in A} Z_i.$$

Since $Y_i$ and $-Y_i$ have the same distribution, $(U_A, A \in G_0)$ and $(X_A, A \in G_0)$ have the same distribution. Also $X_A - X_{A_0} = U_{A \circ A_0}$. Since $G_0$ is a subgroup of $G$, the set $(U_{A \circ A_0}, A \in G_0)$ is a permutation of the set $(U_A, A \in G_0)$. Therefore the set $(X_A - X_{A_0}, A \in G_0)$ has the same distribution as a permutation of $(X_A, A \in G_0)$. In particular, the probability that $X_A$


particular set of values $y_1 + c, y_2 + c, \ldots, y_n + c$. Suppose that the kth interval between the means $\{\bar X_A\}$ includes zero; since $Y_i$ lies in any interval $(y_i, y_i + \epsilon)$ with positive probability, the kth interval includes 0 with positive probability. Also, by choosing $c$ suitably, for $1 \le p \le N+1$, the pth interval of $\{\bar X_A + c\}$ includes 0 with positive probability; thus p of the $X_A$


To prove this, let $X_A = \sum_{Y_i \in A} Y_i$, $n_A = \sum_{Y_i \in A} 1$. From the theorem, $\{X_A - n_A\mu, A \in G^*\}$ is a typical set for 0. The probability that exactly k of $\{X_A - n_A\mu\}$ are positive is independent of k. Therefore the probability that exactly k of $\{\bar X_A - \mu\}$ are positive is independent of k, and $\{\bar X_A, A \in G^*\}$ forms a typical set for $\mu$ as required.

It is interesting to note that if $\mu = 0$, this result makes the same proposition as Theorem 1, except that the $\bar X_A$ form the typical set instead of the $X_A$. In the particular case $\{\phi, G^*\} = G$, the result states that the set of all subsample means, excluding the null subsample, is a typical set for $\mu$, i.e., the gaps between the ordered subsample means provide confidence intervals for $\mu$ of equal confidence size. There is a close relationship between this result and the non-parametric test for difference between two means of Fisher ([2], p. 46). In this test, the observations $Y_1, Y_2, \ldots, Y_n$ are supposed symmetric about 0. The sum $\sum Y_i$ is compared with the $(2^n - 1)$ sums $\sum \pm Y_i$; the probability that the sum $\sum Y_i$ is greater than exactly k of the $2^n - 1$ sums is $1/2^n$. The hypothesis of zero mean is rejected, if $|\sum Y_i|$ is excessively large, as measured on this probability scale. Correspondingly, let $Y_1, Y_2, \ldots, Y_n$ be symmetrically distributed about $\mu$; consider the quantities $\sum \pm (Y_i - \mu)$. The probability that $\sum (Y_i - \mu)$ is greater than exactly k of $\sum \pm (Y_i - \mu)$ is $1/2^n$. Now $\sum (Y_i - \mu) > \sum_{Y_i \notin A} (Y_i - \mu) - \sum_{Y_i \in A} (Y_i - \mu)$ is equivalent to $\mu < \bar X_A = \sum_{Y_i \in A} Y_i / \sum_{Y_i \in A} 1$. Therefore $\sum (Y_i - \mu)$ is greater than exactly k of $\sum \pm (Y_i - \mu)$ if and only if $\mu$ is less than exactly k of $\bar X_A$. Therefore the set of all subsample means forms a typical set for $\mu$, as already implied by the theorem. The probability statements available for $\mu$ are similar to the tolerance statements for a new observation $Y$, given observations $Y_1, Y_2, \ldots, Y_n$ from the same distribution, as if the subsample means and $\mu$ are regarded as independent observations from the same population.

We now consider a more general version of the above result. Let $Z$ be an arbitrary random variable with distribution depending on a real valued parameter $\theta$. We will call $e(z, \theta)$ an estimating function if $e(z, \theta)$ is continuous and decreasing as a function of $\theta$, if a solution to the equation $e(z, \theta) = 0$ exists for each $z$, and if $E\{e(Z, \theta)\} = 0$. We may then estimate $\theta$ from $Z_1, Z_2, \ldots, Z_n$ by finding the unique solution to

$$\sum_{i=1}^{n} e(Z_i, \theta) = 0.$$

Result 2: Let $e(Z, \theta)$ be an estimating function such that $e(Z, \theta)$ is continuously and symmetrically distributed about 0. Let $\theta_A$ denote the estimate of $\theta$ based on the subsample $A$. If $G^*$ is a balanced set of subsamples, $\{\theta_A, A \in G^*\}$ is a typical set for $\theta$.

This result is proved in the same way as Result 1. Result 1 is a special case of Result 2 with $e(Z, \theta) = (Z - \theta)$. The most common use of estimating functions is perhaps in maximum likelihood, $e(z, \theta) = \frac{d}{d\theta} \log f(z, \theta)$, where $f$ is the probability density of $z$ given $\theta$. The particular conditions of Result 2 are satisfied if $\theta$ is a location parameter, and $Z$ is continuously symmetrically distributed about $\theta$, and the density $f(z - \theta)$ is such that f"
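The claim that all subsample means form a typical set for $\mu$ can be checked by simulation. The sketch below is illustrative only: the function names, the choice of scaled uniform errors, the sample size n = 4, the seed, and the number of trials are all assumptions, not taken from the paper. It counts, over repeated samples, how many subsample means fall below $\mu$; if the typical value property holds, each of the $2^n$ possible counts should occur about equally often.

```python
import random
from itertools import combinations

def subsample_means(y):
    """Means of all non-empty subsamples of y."""
    return [sum(c) / k
            for k in range(1, len(y) + 1)
            for c in combinations(y, k)]

def interval_counts(n=4, mu=2.0, trials=8000, seed=0):
    """Count how often mu falls in each of the 2**n gaps between ordered
    subsample means (gap k means exactly k means lie below mu)."""
    rng = random.Random(seed)
    counts = [0] * (2 ** n)
    for _ in range(trials):
        # Independent and symmetric about mu, deliberately not identically distributed.
        y = [mu + (i + 1) * rng.uniform(-1.0, 1.0) for i in range(n)]
        below = sum(m < mu for m in subsample_means(y))
        counts[below] += 1
    return counts

counts = interval_counts()
print(counts)  # each entry should be near trials / 2**n = 500
```

The observations are given different scales on purpose, since the result above asks only for independence, continuity and symmetry about $\mu$.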


We may drop the conditions on $e(Z, \theta)$ as a function of $\theta$; the estimate for a particular subsample may then not exist, or be unique. We still have that regions $\Theta_k = \{\theta \mid \text{exactly } k \text{ of } \sum_{Z_i \in A} e(Z_i, \theta) \text{ are positive}\}$ each have confidence size $(1/N)$ where $(N-1)$ is the number of subsamples in $G^*$.

There are many sets of subsamples satisfying the subgroup property, and it is now appropriate to consider a method of narrowing down the range of subsamples from which we should select. A half sample is a subsample containing half the total number of observations.

Theorem: Let $\{Y_i\}$ be a set of n independent normal variables. Let $A^*$ be a balanced set of half samples, and let $\bar X_1, \ldots, \bar X_N$ be the corresponding half sample means. Conditionally on $\bar X$, $\bar X_1, \ldots, \bar X_N$ are distributed as independent normal variables with mean $\bar X$ and variance equal to that of $\bar X$.

Proof: $\bar X_1 - \bar X = \sum \pm Y_i / n$ has mean zero, variance $\sigma^2/n$, and is independent of $\bar X$. Because the set $A^*$ is balanced, any two means $\bar X_1$ and $\bar X_2$ have exactly $(n/4)$ variables in common (so that the product of the corresponding half samples will be a half sample). Thus $\bar X_1 - \bar X$ and $\bar X_2 - \bar X$ are uncorrelated. This concludes the proof.

This theorem may be used to obtain the average length of confidence intervals based on half samples, which are computed from the average value of normal order statistics. A Bayesian version of this theorem exists: if $\mu$ has a uniform prior distribution, then conditionally on $\bar X$, the random variables $\mu, \bar X_1, \ldots, \bar X_N$ are independent identical normal variables. This theorem suggests a definition of typical values: that $Z_1, Z_2, \ldots, Z_N$ are typical values of $\theta$ if $Z_1, Z_2, \ldots, Z_N, \theta$ form a random sample from some distribution given the full sample statistic. The advantage of this definition is that it is meaningful for completely general parameters $\theta$; the disadvantage is that the interpretation is exactly valid only for a rather specific probability model, normal observations with a uniform prior distribution of $\mu$.
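The overlap argument in the proof can be verified exactly for a small case. This is a sketch under stated assumptions: the observation labels 1 to 8, the three generators, and the helper names `generated_set` and `deviation_cov` are illustrative, and the covariance formula assumes independent observations with a common variance $\sigma^2$. It builds a balanced set of half samples, confirms that any two share n/4 observations, and evaluates $\mathrm{Cov}(\bar X_A - \bar X, \bar X_B - \bar X) = \sigma^2\,(|A \cap B| / (|A||B|) - 1/n)$ in exact arithmetic.

```python
from fractions import Fraction
from itertools import combinations

def generated_set(generators):
    """All non-empty products (symmetric differences) of the generators."""
    group = {frozenset()}
    for g in map(frozenset, generators):
        group |= {a ^ g for a in group}
    return group - {frozenset()}

def deviation_cov(a, b, n):
    """Cov(mean_A - mean_all, mean_B - mean_all) in units of sigma**2,
    for n independent observations with common variance."""
    return Fraction(len(a & b), len(a) * len(b)) - Fraction(1, n)

n = 8
# Hypothetical generators on observations labelled 1..8.
half_samples = generated_set([{1, 2, 3, 4}, {1, 3, 5, 7}, {1, 2, 5, 6}])

overlaps = {len(a & b) for a, b in combinations(half_samples, 2)}
covariances = {deviation_cov(a, b, n) for a, b in combinations(half_samples, 2)}
variances = {deviation_cov(a, a, n) for a in half_samples}
print(len(half_samples), overlaps, covariances, variances)
```

For these generators the seven products are all half samples, every pair overlaps in two observations, the covariances vanish, and each deviation has variance $\sigma^2/8$, the same as the variance of $\bar X$.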

4. APPROXIMATE EXTENSIONS OF THEOREMS

Given an arbitrary statistic $t$ based on observations $Y_1, Y_2, \ldots, Y_n$, we may recompute $t$ for subsamples from the set of observations, and use these recomputed values as indications of error in the statistic $t$. If the set of subsamples is a group under the product (symmetric difference, and with the null subsample added as an identity), we have given a precise probabilistic interpretation to the set of recomputed values, under certain conditions on the statistic $t$ and on the observations $Y_1, \ldots, Y_n$. Under these conditions, the intervals between the ordered recomputed values each include the estimated parameter with equal probability. The subsampling technique requires no specific assumptions on the statistic $t$ or on the observations $Y_1, \ldots, Y_n$, except that $t$ be recomputable for some subsamples of observations. It is a method that may be used generally, even though the recomputed values have a precise probabilistic interpretation only under special circumstances.

In order to study the behaviour of the method, we consider only real valued statistics $t$, for various types of observations. We may consider either the validity of the method, by computing the probabilities that the gaps between the ordered recomputed values contain the true value; or the efficiency of the method, by comparing the error estimates based on this method against some other.

We first show that the symmetry requirement of Theorem 1 may be dropped if the number of observations in each subsample is large enough. In the following theorem, we suppose that the observations are identically distributed in order to make the asymptotic behavior simpler. A similar result holds for arbitrary random variables.

Theorem: Let $\{Y_i, i = 1, 2, \ldots\}$ be a sequence of independent identically distributed random variables with zero mean and finite variance. Let $G^*$ be a set of N subsamples such that $\{\phi, G^*\}$ forms a subgroup, and such that every subsample $A$ in $G^*$ contains a limiting proportion $p_A \neq 0$ of the observations. Let

$$X_A = \sum_{i \le n,\; Y_i \in A} Y_i.$$

Let $p_k^n$ denote the probability that 0 is contained in the kth interval of the ordered subsample sums. Then $p_k^n \to 1/(N+1)$ as $n \to \infty$.

Proof: Since each $A$ contains a limiting proportion $p_A$ of the observations, the set $\{X_A/\sqrt{n}, A \in G^*\}$ is asymptotically distributed as a joint normal variable $\{V_A, A \in G^*\}$ where $E(V_A) = 0$, $E[V_A^2] = p_A\sigma^2$, $E[V_{A_1} V_{A_2}] = (E[V_{A_1}^2] + E[V_{A_2}^2] - E[V_{A_1 \circ A_2}^2])/2$. The theorem will follow if we can show that the set $[V_A, A \in G^*]$ is a typical set. Let $g$ be a subgroup of size $(N+1)/2$ and set $Z_g = 2(\sum_{A \notin g} V_A - \sum_{A \in g} V_A)/(N+1)$. For any two different subgroups $g$ and $g'$, $Z_g$ and $Z_{g'}$ are uncorrelated. Also we may write $V_A = \sum_{g:\, A \notin g} Z_g$. Since the $Z_g$ are independent variables symmetric about 0, we apply Theorem 1 to obtain that the $\{V_A, A \in G^*\}$ form a typical set. This concludes the proof.

In order to examine the asymptotic uniformity of the inclusion probabilities, consider the case of observations $\{X_i\}$ from a normal distribution with mean 0 and unknown variance $\sigma^2$;

$$t = \frac{1}{n} \sum_{i=1}^{n} X_i^2$$

is an unbiased estimate of $\sigma^2$. We use m subsamples selected as described in Section 6. The inclusion probabilities, for n = 2, 4, 8, 16, 32, m = 1, 3, 7, 15, 31, are estimated empirically, and given in Table 1. (The number of repeated samples used is 4,000, so the standard deviation of an entry is pq/20.)

From the theorem above, the inclusion probabilities should converge to uniform values for m fixed, as n approaches infinity. The convergence is not dramatic in this example. The inclusion probabilities in either tail are furthest from their supposed asymptotic value. Uniformity is somewhat more quickly approached for large subsamples than small, especially away from the tails. This suggests that we may make more accurate probability statements by taking a large number of subsamples, and combining the intervals, than by using few subsamples.
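The smallest case in Table 1 (n = 2, one subsample) can be reproduced in a few lines. This is a sketch under assumptions that are not stated in the text: the single subsample is taken to be one of the two observations, $\sigma^2$ is set to 1, and the function name, seed and trial count are illustrative. Under those assumptions the subsample statistic is $X_1^2$, and the first-interval probability is $P(\sigma^2 < X_1^2) = P(|Z| > 1)$ for a unit normal $Z$.

```python
import math
import random

def first_interval_probability(trials=4000, seed=1):
    """Estimate P(sigma^2 < t_sub) where t_sub = X1**2 is the variance
    estimate recomputed on a one-observation subsample (assumed setup)."""
    rng = random.Random(seed)
    sigma2 = 1.0
    hits = 0
    for _ in range(trials):
        x1 = rng.gauss(0.0, 1.0)      # the observation kept in the subsample
        hits += sigma2 < x1 ** 2      # sigma^2 lies below the subsample value
    return hits / trials

estimate = first_interval_probability()
exact = math.erfc(1 / math.sqrt(2))   # P(|Z| > 1) for a unit normal Z
print(round(estimate, 3), round(exact, 3))
```

The exact value is about 0.317, in line with the entry .32 reported for n = 2 with one subsample, and well away from the limiting value 1/2.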


TABLE 1. SUBSAMPLE INCLUSION PROBABILITIES FOR A NORMAL VARIANCE
(Probability that $\sigma^2$ lies in the kth interval of the ordered subsample variances)

1 Subsample
Number Obs.   k=1    2
     2        .32   .68
     4        .36   .64
     8        .40   .60
    16        .43   .57
    32        .44   .56

3 Subsamples
Number Obs.   k=1    2     3     4
     2        .10   .27   .16   .46
     4        .11   .33   .10   .45
     8        .14   .33   .14   .40
    16        .17   .31   .17   .35
    32        .19   .29   .19   .33

7 Subsamples
Number Obs.   k=1    2     3     4     5     6     7     8
     4        .04   .13   .06   .19   .13   .04   .09   .32
     8        .05   .13   .10   .22   .06   .08   .06   .28
    16        .06   .13   .11   .18   .10   .09   .09   .24
    32        .08   .13   .12   .16   .11   .10   .10   .21

15 Subsamples
Number Obs.   k=1    2     3     4     5     6     7     8     9    10    11    12    13    14    15    16
     8        .01   .03   .04   .08   .05   .06   .04   .10   .09   .04   .05   .04   .06   .03   .06   .22
    16        .02   .04   .05   .07   .06   .07   .06   .10   .06   .05   .04   .05   .04   .06   .05   .17
    32        .02   .05   .06   .07   .06   .06   .06   .07   .07   .06   .05   .05   .05   .06   .06   .14

31 Subsamples (each row of observations is split over two lines: k = 1 to 16, then k = 17 to 32)
Number Obs.   k=1    2     3     4     5     6     7     8     9    10    11    12    13    14    15    16
    16        .01   .01   .01   .03   .03   .03   .04   .03   .03   .03   .03   .03   .03   .04   .04   .04
              .04   .03   .03   .02   .02   .02   .03   .03   .03   .02   .02   .02   .03   .03   .04   .14
    32        .01   .02   .02   .03   .03   .03   .04   .03   .04   .04   .03   .04   .03   .03   .04   .03
              .03   .03   .03   .03   .03   .03   .02   .02   .03   .03   .03   .02   .03   .03   .04   .09

We next consider an extension which gives the conditions, in the normal case, that confidence intervals based on subsamples have the same asymptotic behaviour as confidence intervals based on the known distribution of the sample mean.

Theorem: Let $\{Y_i\}$ be independent normal variables, with mean $\mu$ and variance $\sigma^2$. Let $G_n^*$ be a balanced set of subsamples on n observations; for each $A \in G_n^*$, let $p_{A,n}$ denote the proportion of observations in $A$, let $e_{\epsilon,n}$ denote the proportion of subsamples in $G_n^*$ with $|p_{A,n} - 0.5| > \epsilon$, and suppose that $e_{\epsilon,n} \to 0$ as $n \to \infty$. Let $\bar Y + z_\alpha \sigma/\sqrt{n}$ denote a confidence interval for $\mu$ of probability $\alpha$; then if the number of subsamples in $G_n^*$ approaches infinity, the proportion of subsamples whose subsample means are less than $\bar Y + z_\alpha \sigma/\sqrt{n}$ approaches $\alpha$.

Proof: The random variable $\bar X_A - \bar Y$ has mean 0, variance

$$\frac{\sigma^2}{n}\left(\frac{1}{p_{A,n}} - 1\right)$$

and covariance, with $\bar X_B - \bar Y$, of

$$\frac{\sigma^2}{n}\left(\frac{p_{A,n} + p_{B,n} - p_{A \circ B,n}}{2\,p_{A,n}\,p_{B,n}} - 1\right).$$


effects (the $2^n$ factor combinations) is partitioned into subsets of 2- effects which are confounded, and these subsets are cosets of the alias subgroup, consisting of sets of factors which have an even number of elements in common with all the treatments. For example, the treatments X, AB, AC, BC, have an alias subgroup X, ABC. There are a number of ways in which the techniques of fractional factorial design are relevant to subsampling. We may think of analysing our responses in the factorial scheme, getting a main effect, effects due to each observation, effects due to each pair, etc. We may choose the balanced sets so that the alias subgroup contains only high order effects. A more concrete and immediate benefit is the use of Yates' algorithm to shorten the computation of subsample sums.

Yates' algorithm operates on a set of $2^p$ numbers $z_1, z_2, \ldots, z_{2^p}$ as follows. First step: replace $z_1, z_2$ by $z_1 + z_2$, $z_1 - z_2$; replace $z_3, z_4$ by $z_3 + z_4$, $z_3 - z_4$, and so on. Second step: replace $z_1, z_3$ by $z_1 + z_3$, $z_1 - z_3$; replace $z_2, z_4$ by $z_2 + z_4$, $z_2 - z_4$; replace $z_5, z_7$ by $z_5 + z_7$, $z_5 - z_7$, and so on. Continue through $p$ steps.

Now suppose we wish to generate the subsample sums $\{X_A, A \in G\}$ where $G$ is a group. If $G$ is generated by $A_1, A_2, \ldots, A_p$, the sums $2X_\phi, 2X_{A_1}, 2X_{A_2}, 2X_{A_2 \circ A_1}, 2X_{A_3}, 2X_{A_3 \circ A_1}, 2X_{A_3 \circ A_2}, 2X_{A_3 \circ A_2 \circ A_1}, \ldots$ may be obtained by using Yates' algorithm on the sums $X_U - X_{\bar A_1 \bar A_2 \bar A_3 \cdots \bar A_p}$, $-X_{A_1 \bar A_2 \bar A_3 \cdots \bar A_p}$, $-X_{\bar A_1 A_2 \bar A_3 \cdots \bar A_p}$, $-X_{A_1 A_2 \bar A_3 \cdots \bar A_p}$, $\ldots$, where $U$ is the whole sample, $\bar A_i$ is the complement of $A_i$, and a product of generators and complements denotes the observations common to all of them. The total number of additions is $p2^p$. A direct computation requires $K(2^p - 1)$ where $K$ is the average subsample size. Since $K$ is usually about $(n/2)$, the proportionate reduction is approximately $(2p)/n$. An example of the use of Yates' algorithm is given in Table 2 for 8 years of December snowfalls in Central Park, New York.
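The procedure can be sketched in code. The following is a minimal illustration, not the paper's program: the function names are assumptions, and the starting vector (the whole-sample total in the first position, reduced by the sum of any observations lying in no generator, followed by minus the sum of each remaining cell) is a reconstruction that is checked here only against direct summation. The data and generators are those listed in Table 2.

```python
def yates(values):
    """Yates' algorithm: repeated sum/difference passes over 2**p numbers."""
    z = list(values)
    size = len(z)
    assert size > 0 and size & (size - 1) == 0, "length must be a power of two"
    step = 1
    while step < size:
        for start in range(0, size, 2 * step):
            for i in range(start, start + step):
                z[i], z[i + step] = z[i] + z[i + step], z[i] - z[i + step]
        step *= 2
    return z

def doubled_subsample_sums(data, generators):
    """Return 2*X_A for every product A of the generators; position k uses
    generator j in the product whenever bit j of k is set."""
    cells = [0.0] * (2 ** len(generators))
    for label, y in data.items():
        index = sum(1 << j for j, g in enumerate(generators) if label in g)
        cells[index] += y             # cell = pattern of generator membership
    start = [-c for c in cells]
    start[0] += sum(data.values())    # total minus the cell outside every generator
    return yates(start)

snowfall = {1957: 8.7, 1958: 3.8, 1959: 15.8, 1960: 18.8,
            1961: 7.7, 1962: 4.5, 1963: 11.3, 1964: 3.1}
generators = [{1957, 1958, 1959, 1960}, {1957, 1959, 1961, 1963},
              {1957, 1958, 1961, 1962}, set(snowfall)]
doubled = doubled_subsample_sums(snowfall, generators)
print([round(v, 1) for v in doubled])
```

For the first generator the direct sum is 8.7 + 3.8 + 15.8 + 18.8 = 47.1, so the doubled value is 94.2, and for the second it is 87.0. The row for the first generator in Table 2 as transcribed shows 103.3, which traces back to the Step 2 entry +30.2 where 15.7 + 4.5 = 20.2 would be expected, so a few table entries may carry arithmetic or transcription slips.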

6. GENERATION OF BALANCED SETS

A balanced set is generated by $S_1, S_2, \ldots, S_p$ if every member of the set is a product of some subset of $S_1, S_2, \ldots, S_p$. Since many possible balanced sets are available, it is desirable to have a convenient scheme for generating balanced sets. A number of criteria are available to decide among the various balanced sets. Perhaps the simplest is the requirement that the expected sample variance of the subsample means be as small as possible. Letting $\bar X_A$ be the mean of the subsample $A$ we wish to minimize

$$V_G = E\Big\{ \sum_{A, A' \in G} (\bar X_A - \bar X_{A'})^2 / 2N(N-1) \Big\}$$

where N is the number of subsamples in $G$. This requirement favours balanced sets whose means are clustered together, so that the confidence intervals based on the typical values will tend to be short. For n observations, the variance of the sample mean is $\sigma^2/n$. Let us define the relative variance of the balanced subsample by

$$r_G = nV_G/\sigma^2 = \Big[ \sum_{A, A' \in G} n_{A \circ A'} / (n_A n_{A'}) \Big] \, [n / 2N(N-1)].$$

In order to keep $r_G$ small we need $G$ to contain as large subsamples as possible.
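The relative variance is straightforward to evaluate for a given set of generators. The sketch below is illustrative: the helper names are assumptions, and the generators are the three letter-coded ones quoted later in this section for 15 observations. It forms all products of the generators and evaluates $r_G$ exactly, summing over ordered pairs as in the definition.

```python
from fractions import Fraction

def generated_set(generators):
    """All non-empty products (symmetric differences) of the generators."""
    group = {frozenset()}
    for g in map(frozenset, generators):
        group |= {a ^ g for a in group}
    return group - {frozenset()}

def relative_variance(subsamples, n):
    """r_G = [sum over ordered pairs of n_{A o A'} / (n_A n_A')] * n / (2N(N-1))."""
    subsamples = list(subsamples)
    count = len(subsamples)
    total = sum(Fraction(len(a ^ b), len(a) * len(b))
                for a in subsamples for b in subsamples)
    return total * Fraction(n, 2 * count * (count - 1))

# Each string is read as a set of single-letter observation labels.
balanced = generated_set(["ACEGIKMO", "BCFGJKNO", "CDEFLMNO"])
sizes = sorted(len(a) for a in balanced)
r_g = relative_variance(balanced, n=15)
print(sizes, r_g)
```

All seven generated subsamples contain eight observations, and the relative variance works out to 15/16: every product of two distinct members is again a member of size eight, so each off-diagonal term equals 8/64.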

TABLE 2. SUBSAMPLE SUMS USING YATES' ALGORITHM. DECEMBER SNOWFALL IN INCHES, CENTRAL PARK, N. Y.

Year   1957   1958   1959   1960   1961   1962   1963   1964
Fall    8.7    3.8   15.8   18.8    7.7    4.5   11.3    3.1

Generators: (1957 1958 1959 1960), (1957 1959 1961 1963), (1957 1958 1961 1962), (1957 1958 1959 1960 1961 1962 1963 1964)

Cell            Start   Step 1   Step 2   Step 3   Step 4   Divisor   Subsample
U                73.7     73.7     73.7     73.7      0         8     φ
A1 Ā2 Ā3 Ā4       0       73.7     73.7     73.7    103.3       8     A1
Ā1 A2 Ā3 Ā4       0        0       73.7     73.7     87.0       9     A2
A1 A2 Ā3 Ā4       0        0       73.7     73.7     84.1       8     A2∘A1
Ā1 Ā2 A3 Ā4       0        0        0       73.7     49.4       8     A3
A1 Ā2 A3 Ā4       0        0        0       73.7    104.5       8     A3∘A1
Ā1 A2 A3 Ā4       0        0        0       73.7     70.8       8     A3∘A2
A1 A2 A3 Ā4       0        0        0      -73.7     85.7       8     A3∘A2∘A1
Ā1 Ā2 Ā3 A4      -3.1    -21.9    -49.0    -73.7    147.4      16     A4
A1 Ā2 Ā3 A4     -18.8    +15.7    +30.2     29.6     44.1       8     A4∘A1
Ā1 A2 Ā3 A4     -11.3    -27.1     +5.2     13.3     60.4       8     A4∘A2
A1 A2 Ā3 A4     -15.8     +4.5    +11.2     10.4     63.3       8     A4∘A2∘A1
Ā1 Ā2 A3 A4      -4.5     -8.3    -24.7    -24.3     98.0       8     A4∘A3
A1 Ā2 A3 A4      -3.8     -0.7     -0.6     30.8     42.8       8     A4∘A3∘A1
Ā1 A2 A3 A4      -7.7    -16.4     +8.1     -2.9     76.6       8     A4∘A3∘A2
A1 A2 A3 A4      -8.7     +1.0     -0.8     12.0     61.7       8     A4∘A3∘A2∘A1

Means: dot plot of the subsample means on an axis running from 5 to 13 (plot not reproduced).

It turns out that the sets of subsamples which are optimal are all nearly half samples. Larger subsamples do not appear because their products are small subsamples which inflate $r_G$.

The generators given in Table 3 have not been shown to give optimal sets of subsamples, but an empirical investigation of their generated sets, for low numbers of observations and numbers of subsamples, has shown them to be satisfactory there. The qth generator consists of the observations in the blocks $[K2^q + q,\; K2^q + 2^{q-1} + q - 1]$, $K = 0, 1, 2, \ldots$. In generating subsamples from p generators, we insert a dummy observation at $K2^p$, $K = 0, 1, 2, \ldots$. As an example, we generate 7 subsamples of 15 observations. Three generators are required, p = 3. The observations, arranged in random order, are denoted by A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P (in position 8, a dummy observation denoted by H is inserted). The generators are (ACEGIKMO), (BCFGJKNO), (CDEFLMNO); the generated subsamples will each contain eight observations.

TABLE 3. Generators (the +/- entries of this table are not recoverable from the transcript).

7. VALIDITY OF SUBSAMPLE METHODS IN EIGENVECTOR ANALYSIS

In order to test the validity of subsampling techniques in eigenvector problems, a Monte Carlo experiment was performed. The data for each run consisted of 20 observations from a three dimensional normal $(X_1, X_2, X_3)$ with var $X_1 = 4$, var $X_2 = 1$, var $X_3 = 0$, all covariances zero. The computation outputs the eigenvalues and eigenvectors of the sample covariance matrix. In a subsample error analysis, we recompute the eigenvalues and eigenvectors for each of three subsamples from the 20 observations; the selection rules, following Section 6, omit every third observation. An example of the output of a single computation, with the subsample outputs, is given in Table 4.

TABLE 4. EIGENVECTORS OF 20 3-DIMENSIONAL OBSERVATIONS, WITH THREE SUBSAMPLE EIGENVECTOR ANALYSES

                                      Subsample Values
                 Main Values          BC                   AC                   AB
Eigenvalues    2.87  1.36  0.00    2.55  0.86  0.00    3.33  1.70  0.00    2.56  1.31  0.00
Eigenvectors    .98  -.23  0.00     .93  -.37   .00     .98  -.18   .00     .98  -.22   .00
               -.23  -.98  0.00    -.37  -.93   .00    -.18  -.98   .00    -.22  -.98   .00
                .00   .00  1.00     .00   .00  1.00     .00   .00  1.00     .00   .00  1.00

We are proposing to interpret the three subsample values as typical values of the true values. This means, for example, that the largest eigenvalue, $\lambda = 4$, should be contained in the intervals between the ordered subsample (largest) eigenvalues with equal probability. For 139 trials, counts of the frequency with which the true values lay in the subsample intervals are given in Table 5. The entries associated with the variable with zero variance are omitted. We first note that the first element of the eigenvector corresponding to the largest eigenvalue is always less than its true value 1; this is inevitable since computed eigenvector elements are always less than or equal to one, and experimental error makes them less than one. The inclusion probabilities for the second element of this eigenvector are satisfactory, reflecting the fact that the observed entries will be symmetrically distributed around the true value zero.

The more informative counts are the inclusion probabilities associated with the eigenvalues. For both first and second eigenvalues the true value is in the highest interval a significantly larger number of times. This means that the typical values tend to be too low. For the eigenvalues computed from the whole sample, the same thing happens: 77/139 of the largest eigenvalues are less than 4, and 94/139 of the small eigenvalues are less than one. This suggests that the departure of the inclusion probabilities from uniformity may be due to the median bias of the eigenvalues computed from the whole sample. Let us then look at the inclusion probabilities of the medians of the whole sample computations, given in Table 6. The counts are much more nearly uniform. The eigenvalue typical values are slightly low, but much less noticeably than before.

TABLE 5. FREQUENCIES OF INCLUSION OF TRUE VALUES IN SUBSAMPLE INTERVALS

               1st Interval    2nd Interval    3rd Interval    4th Interval
Eigenvalues     24   14   *     34   24   *     27   27   *     54   74   *
Eigenvectors     0   39   *      0   39   *      0   34   *    139   27   *
                27    0   *     34    0   *     39    0   *     39  139   *
                 *    *   *      *    *   *      *    *   *      *    *   *

TABLE 6. INCLUSION COUNTS OF MEDIANS OF WHOLE SAMPLE COMPUTATIONS

Whole Sample Medians
Eigenvalues    3.717   .840   0.000
Eigenvectors    .995   .0     .0
                .0     .995   .0
                .0     .0     1.0

               1st Interval    2nd Interval    3rd Interval    4th Interval
Eigenvalues     34   27   *     35   36   *     25   27   *     43   47   *
Eigenvectors    18   39   *     40   39   *     47   34   *     32   27   *
                27   18   *     34   40   *     39   47   *     39   32   *
                 *    *   *      *    *   *      *    *   *      *    *   *

For comparison purposes, we look at the confidence intervals obtained by asymptotic normal theory. We have approximately that $\hat\lambda_i \sim N(\lambda_i, 2\lambda_i^2/n)$ or that $\log \hat\lambda_i/\lambda_i \sim N(0, 2/n)$. Since we have removed median bias in testing the subsample method, we will assume here also that $\lambda_i$ is the median of $\hat\lambda_i$ rather than the true value. Thus $\lambda_1 = 3.72$, $\lambda_2 = .84$. A set of typical values for $\lambda_i$ based on normal confidence theory is $\hat\lambda_i/1.63$, $\hat\lambda_i$, $1.63\hat\lambda_i$; the intervals form 5% confidence intervals for $\lambda_i$. The inclusion counts are given in Table 7. It will be seen that the counts do not depart from uniformity significantly. So in the eigenvalue case, the asymptotic normal theory, alas, seems to work better than the subsampling methods. There is a weak argument for expecting that the subsample method should work about as well as normal theory: asymptotic normality arises from the same summing operations which are used to validate subsampling.

TABLE 7. VALIDITY OF CONFIDENCE INTERVALS BASED ON ASYMPTOTIC NORMALITY

Eigenvalue   1st Interval   2nd Interval   3rd Interval   4th Interval
λ = 4             35             35             37             29
λ = 1             36             30             36             34

8. CONCLUSIONS, DIFFICULTIES, AND FURTHER DEVELOPMENTS

Balanced sets of subsamples may be used to form exact confidence intervals for real parameters in a wide range of cases. For any real-valued statistic $t$, the recomputed values of $t$ for the subsamples of a balanced set provide approximate confidence intervals for the "true value" of $t$. For a statistic with general range space, the confidence interval interpretation of the subsample values is not available; what is needed is a way of making precise the idea that subsample values are the sort of true values of $t$ to be expected; for example, a Bayesian interpretation, which is applicable generally, is that the set of subsample values and the true value of $t$ form a random sample from some distribution, given the whole sample value of $t$.

The term "true value" of $t$ requires some explanation. It may be defined as the value of $t$ for an infinite amount of data, $\lim_{n \to \infty} t(Y_1, \ldots, Y_n)$; the definition requires the assumption of a topology on the range space of $t$, and it requires the existence of the limit with probability one. For real valued statistics $t$, these assumptions are reasonable and verifiable; for general range spaces, the assumptions will have to be specified, before meaning can be given to the "true value" of $t$. For real valued statistics, the "true value" may also be defined as the median of $t$ rather than the limiting value of $t$; this improved the accuracy of the confidence interpretation of subsample values, in the eigenvector analysis example of Section 7.

A similar difficulty arises from the definition of $t$ on subsamples; there is no guaranteed relation between the definitions of $t$ for samples of various sizes, yet the values of $t$ for various subsamples are used to tell about the true value of $t$ (corresponding to an infinite sample). A simple consistency requirement, suggested by a referee, is that $t$ be a function of the empirical distribution function only. It would be useful to know the limiting behaviour of statistics satisfying this condition.

The theorem defining balanced sets of subsamples uses observations $Y_1, \ldots, Y_n$ which are required to be independent and symmetrical about 0, but not necessarily identically distributed. The theorem essentially depends on the group of $2^n$ transformations which leaves the distribution of $Y_1, \ldots, Y_n$ invariant, transformations of the form $Y_1, \ldots, Y_n \to \pm Y_1, \pm Y_2, \ldots, \pm Y_n$. (The independence assumption is unnecessary as long as these transformations are invariant.) A larger group of invariant transformations is obtained if we require the $Y_i$ to be independent, symmetric and identical. Then we have $2^n n!$ transformations of the form $Y_1, \ldots, Y_n \to \pm Y_{i_1}, \pm Y_{i_2}, \ldots, \pm Y_{i_n}$ where $(i_1, \ldots, i_n)$ is a permutation of $(1, 2, \ldots, n)$. As a result there are more sets of balanced subsamples under these conditions. For example with n = 3, (12), (13) is balanced under these extra conditions. Group theory may be used to characterize this wider range of balanced sets. Finally, we make the most detailed assumption, that $Y_1, \ldots, Y_n$ are independent unit normal variables. Invariant transformations are of the form $(Y_1, \ldots, Y_n) \to (Y_1, \ldots, Y_n)T$ where $T$ is an orthogonal matrix. The sets of subsamples balanced under this requirement may be determined by group theory.

Balanced sets of subsamples allow us to construct sets of typical values without having to compute all subsample values, and without very much increase in average length of the confidence intervals obtained. Dr. Ray Mickey of the Biomathematics Department, UCLA, suggests that the same effect may be achieved by taking random sets of subsamples. Indeed, if the sampling is done without replacement, in the mean case with symmetric error distribution, a set of random subsample means forms a set of typical values for $\mu$; this follows from the fact that a random sample (without replacement) from a set of typical values for $\mu$ is itself a set of typical values for $\mu$. The same random selection procedure, applied to any balanced set, preserves the typical value property; this means that if the subsample computations are done in random order, the typical value property holds at every stage, and we can choose to stop at any stage. Monte Carlo studies have shown that the random sets of subsamples are not much worse (in average length of confidence intervals) compared to the balanced subsamples of Section 6, once the number of subsamples exceeds 5.
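The random-selection remark can be explored with a small simulation. This is a sketch under illustrative assumptions: double exponential errors, n = 5 observations, m = 3 randomly chosen subsamples, and the function name, seed and trial count are all choices made here rather than details from the paper. Subsamples are drawn without replacement and without reference to the data, and the position of $\mu$ among the ordered subsample means is tallied.

```python
import random
from itertools import combinations

def gap_counts(n=5, m=3, mu=0.0, trials=8000, seed=2):
    """Tally which of the m + 1 gaps between m randomly chosen subsample
    means contains mu (gap k means exactly k of the means lie below mu)."""
    rng = random.Random(seed)
    subsets = [c for k in range(1, n + 1) for c in combinations(range(n), k)]
    counts = [0] * (m + 1)
    for _ in range(trials):
        # Errors symmetric about mu (double exponential).
        y = [mu + rng.choice((-1, 1)) * rng.expovariate(1.0) for _ in range(n)]
        chosen = rng.sample(subsets, m)   # drawn without replacement, ignoring the data
        means = [sum(y[i] for i in c) / len(c) for c in chosen]
        counts[sum(v < mu for v in means)] += 1
    return counts

random_counts = gap_counts()
print(random_counts)  # each entry should be near trials / (m + 1) = 2000
```

Because the selection ignores the data, stopping after any number m of random subsamples leaves m + 1 gaps that should each contain $\mu$ about equally often, which is the behaviour described above.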

REFERENCES

[1] Fisher, R. A., The Design of Experiments, th Ed. London: Oliver and Boyd, 1935.
[2] Fisher, R. A., "The theory of confounding in factorial experiments in relation to the theory of groups," Annals of Eugenics, 11 (1942) 341-353.
[3] Tukey, J. W., "Bias and confidence in not quite large samples," Annals of Mathematical Statistics, 29 (1958) 614.
[4] Lancaster, H. O., "Kolmogorov's remark on the Hotelling canonical correlations," Biometrika, 53 (1966) 585-588.
[5] Mahalanobis, P. C., "Report on the Bihar crop survey: Rabi season 1943-1944," Sankhya, (1946) 269-280.
[6] McCarthy, P. J., "Replication (An approach to the analysis of data from complex surveys)," National Centre for Health Statistics (14) (1966).