Uni/bivariate Probleme


Transcript of Uni/bivariate Probleme

Page 1: Uni/bivariate Probleme

26/06/2008-1-

Praktische Statistik für Umwelt- und Geowissenschaftler

Uni-/bivariate problems

Checking the prerequisites:

• Independence (runs test)

• Normal distribution (KS test / chi-square test)

• Outliers (Dixon test / Chebyshev's theorem)

Prerequisites met?

• Yes: parametric methods

• No: non-parametric methods

Tasks:

• Distribution test

• Comparison of means with the population parameter

• Analysis of association

• Comparison of 2 independent samples

• Comparison of 2 paired samples

• Comparison of k independent samples

Parametric methods:

• One-sample t-test

• Two-sample t-test, F-test / Levene test

• t-test for paired samples

• Analysis of variance (ANOVA)

• Pearson's correlation analysis / regression analysis

Non-parametric methods:

• KS test / chi-square test

• U-test

• Wilcoxon test

• H-test

• Spearman rank correlation

Multiple comparisons: post hoc tests

Multiple comparisons: Bonferroni correction, Šidák-Bonferroni correction

Page 2: Uni/bivariate Probleme

Categorization of multivariate methods

[Diagram: data analysis / data mining methods arranged under the headings Reduction, Classification and Data Relationships. Methods shown: Principal Component Analysis, Factor Analysis, Correspondence Analysis, Homogeneity Analysis, Non-linear PCA, Procrustes Analysis, Discriminant Analysis, Hierarchical Cluster Analysis, Multidimensional Scaling, K-Means, Artificial Neural Networks (ANN), Multiple Regression, Principal Component Regression, Linear Mixture Analysis, Partial Least Squares-1, Partial Least Squares-2, Canonical Analysis, Support Vector Machines (SVM).]

Page 3: Uni/bivariate Probleme

Procedure for statistical testing:

a) State the H0/H1 hypotheses

b) Decide on a one- or two-sided question

c) Select the test procedure

d) Set the significance level (type I and type II errors)

e) Carry out the test

f) Interpretation

Page 4: Uni/bivariate Probleme

Type I and type II errors

                                 In the population, H0 holds     In the population, H1 holds
Decision from the sample: H0     correct, with 1-α               β error, P(H0|H1) = β
Decision from the sample: H1     α error, P(H1|H0) = α           correct, with 1-β

Page 5: Uni/bivariate Probleme

Determining error probabilities

Let $\bar{x} = 42$ mg K/g be the mean of a normally distributed sample (by the central limit theorem) of unknown origin, with $\sigma = 8$, $n = 100$.

$\alpha = 5\%$

$\mu_{Hunsrück} = 40$ mg K/g

$\mu_{Eifel} = 43$ mg K/g

$H_0: \mu = \mu_{Huns}$: the sample comes from the Hunsrück

$H_1: \mu > \mu_{Huns}$: the sample comes from the Eifel

Page 6: Uni/bivariate Probleme

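Test: one-sample Gauss test (z-test)

$$z = \frac{\bar{x} - \mu_0}{\sigma_{\bar{x}}}, \qquad \text{with } \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = 0.8$$

$$z = \frac{42 - 40}{0.8} = 2.5$$

This value cuts off 0.62% of the normal distribution (p-value = probability of error).

H0 must be rejected!

The p-value becomes smaller

• with a larger difference $\bar{x} - \mu_0$

• with a smaller $\sigma$

• with a larger n

[Figure: normal curve around $\mu_0$ with $\bar{x}$ marked; for α = 5% the critical value is z ≈ 1.65.]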
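The calculation on this slide can be retraced in a few lines. The following is a minimal sketch, assuming the values from the example (mean 42, μ0 = 40, σ = 8, n = 100); the helper name `upper_tail_p` is made up for illustration.

```python
import math

def upper_tail_p(z):
    """P(Z > z) for a standard normal variable (one-sided p-value)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Values taken from the Hunsrück/Eifel example
x_bar, mu0, sigma, n = 42.0, 40.0, 8.0, 100

se = sigma / math.sqrt(n)   # standard error of the mean
z = (x_bar - mu0) / se      # test statistic of the one-sample Gauss test
p = upper_tail_p(z)         # one-sided p-value

print(f"se = {se:.2f}, z = {z:.2f}, p = {p:.4f}")
```

The p-value of about 0.6% is below α = 5%, which is why H0 is rejected.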

Page 7: Uni/bivariate Probleme

Question: Which $\bar{x}$ must be exceeded in order to just reject H0 at $\alpha = 5\%$?

$z = 1.65$ cuts off exactly 5% from the right-hand side of the standard normal distribution (SNV).

$$\bar{x}_{crit(1-\alpha)} = \mu_0 + z \cdot \sigma_{\bar{x}} = 40 + 1.65 \cdot 0.8 = 41.32$$

Page 8: Uni/bivariate Probleme

Two-sided test:

$z = \pm 1.96$ cuts off exactly 2.5% on each side of the standard normal distribution around $\mu_{Huns}$.

$$\bar{x}_{crit(\alpha/2)} = 40 - 1.96 \cdot 0.8 = 38.43 \quad \text{and} \quad 40 + 1.96 \cdot 0.8 = 41.57$$

H0 is rejected by a narrower margin!

The decision between a one-sided and a two-sided test must be made in advance!
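As a cross-check of the critical values, here is a small sketch using the standard library's `statistics.NormalDist`. It uses exact quantiles rather than the rounded 1.65 and 1.96, so the results differ slightly in the third decimal; the variable names are illustrative.

```python
from statistics import NormalDist

mu0, se, alpha = 40.0, 0.8, 0.05
snv = NormalDist()  # standard normal distribution

# One-sided test: cut off alpha on the right-hand side
z_one = snv.inv_cdf(1 - alpha)
crit_one = mu0 + z_one * se

# Two-sided test: cut off alpha/2 on each side
z_two = snv.inv_cdf(1 - alpha / 2)
crit_two = (mu0 - z_two * se, mu0 + z_two * se)

print(f"one-sided: z = {z_one:.3f}, critical mean = {crit_one:.2f}")
print(f"two-sided: z = {z_two:.3f}, critical means = {crit_two[0]:.2f}, {crit_two[1]:.2f}")
```

The observed mean of 42 exceeds both the one-sided and the upper two-sided limit, but the margin is smaller in the two-sided case.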

Page 9: Uni/bivariate Probleme

The β error

It can only be determined for a specific H1!

We test whether the sample ($\bar{x} = 42$ mg K/g) is compatible with the parameter of the Eifel samples:

$$z = \frac{\bar{x} - \mu_1}{\sigma_{\bar{x}}} = \frac{42 - 43}{0.8} = -1.25$$

This value cuts off 10.6% on the left-hand side of the standard normal distribution.

If, on the basis of this result, one decides in favour of H0, one commits a β error with a probability of 10.6%, i.e. one rejects H1 ("the sample comes from the Eifel") although it is true.

Page 10: Uni/bivariate Probleme

The power of a test

The β error probability gives the probability with which H1 is rejected although a difference exists.

1-β gives the probability of deciding in favour of H1 when H1 holds.

Determining the power

We found that the test just becomes significant ("the sample comes from the Eifel") from a value of $\bar{x} = 41.32$ mg K/g.

Page 11: Uni/bivariate Probleme

Determining the power

$$z = \frac{41.32 - 43}{0.8} = -2.1$$

β probability: 0.0179

Power: 1-β = 1-0.0179 = 0.9821

The probability that, given the chosen significance level (α = 5%), we rightly decide in favour of H1 is 98.21%.

Determinants of the power:

• As the difference µ0-µ1 becomes smaller, 1-β decreases

• As n grows, 1-β increases

• As the dispersion of the variable grows, 1-β decreases
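The β error and the power for the specific alternative μ1 = 43 can be recomputed as follows. This is a sketch under the slide's assumptions (critical mean 41.32, standard error 0.8); the variable names are illustrative.

```python
from statistics import NormalDist

mu1 = 43.0      # mean under the specific H1 (Eifel)
se = 0.8        # standard error of the mean
x_crit = 41.32  # critical mean of the one-sided test at alpha = 5%

# beta: probability of staying below the critical mean although H1 is true
z = (x_crit - mu1) / se
beta = NormalDist().cdf(z)
power = 1.0 - beta

print(f"z = {z:.2f}, beta = {beta:.4f}, power = {power:.4f}")
```

Changing `mu1`, `se` (via n or σ) or `x_crit` in this sketch shows the determinants listed above: a smaller difference or a larger standard error lowers the power.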

Page 12: Uni/bivariate Probleme

Why multivariate statistics?

Remember:

• Fancy statistics do not make up for poor planning

• Design is more important than analysis

Page 13: Uni/bivariate Probleme

Categorization of multivariate methods

• Prediction methods

– Use some variables to predict unknown or future values of other variables.

• Description methods

– Find human-interpretable patterns that describe the data.

From [Fayyad et al.], Advances in Knowledge Discovery and Data Mining, 1996

Page 14: Uni/bivariate Probleme

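Multiple Linear Regression Analysis

The General Linear Model

A general linear model can be:

• a straight-line model

• a quadratic model (second-order model)

• a model with more than one independent variable, e.g.

$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$

or

$$E(y_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}$$

[Figure: response surface over the $(x_1, x_2)$ plane with intercept $\beta_0 = 10$, showing an observation $y_i$ at $(x_{i1}, x_{i2})$, its expected value $E(y_i)$ on the surface and the error $\varepsilon_i$ between them.]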

Page 15: Uni/bivariate Probleme

$$y = x_1 + x_2 - x_1 + 2x_1^2 + 2x_2^2$$

Multiple Linear Regression Analysis

Page 16: Uni/bivariate Probleme

Parameter Estimation

The goal of an estimator is to provide an estimate of a particular population parameter based on the data. There are several ways to characterize estimators:

Bias: the expected value of an unbiased estimator equals the true value, so the parameter is neither systematically over- nor underestimated. (An estimator that converges to the true value as the sample size grows is called consistent.)

Likelihood: the maximum likelihood (ML) estimator is the one that makes the observed data most likely. ML estimators are not always unbiased for small N.

Efficiency: an estimator with lower variance is more efficient, in the sense that it is likely to be closer to the true value across samples. The "best" estimator is the one with minimum variance among all unbiased estimators.

Multiple Linear Regression Analysis

Page 17: Uni/bivariate Probleme

A linear model can be written as

$$Y = X\beta + \varepsilon$$

where:

$Y = (y_1, \dots, y_N)^T$ is an N-dimensional column vector of observations,

$\beta = (\beta_0, \dots, \beta_k)^T$ is a (k+1)-dimensional column vector of unknown parameters,

$\varepsilon = (\varepsilon_1, \dots, \varepsilon_N)^T$ is an N-dimensional random column vector of unobserved errors.

Matrix X is written as

$$X = \begin{pmatrix} 1 & X_{11} & \cdots & X_{1k} \\ 1 & X_{21} & \cdots & X_{2k} \\ \vdots & \vdots & & \vdots \\ 1 & X_{N1} & \cdots & X_{Nk} \end{pmatrix}$$

The first column of X is the vector $\mathbf{1} = (1, \dots, 1)^T$, so that the first coefficient $\beta_0$ is the intercept.

The unknown coefficient vector $\beta$ is estimated by minimizing the residual sum of squares

$$RSS = \sum_{t=1}^{N} (y_t - X_t^T \beta)^2$$

where $X_t^T$ denotes the t-th row of X.

Multiple Linear Regression Analysis
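To make the minimization of the residual sum of squares concrete, the following is a minimal pure-Python sketch that solves the normal equations X'Xβ = X'y by Gaussian elimination. The data and the function names `solve` and `ols` are hypothetical and only serve as illustration; in practice a statistics package would be used.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def ols(X, y):
    """Least-squares coefficients from the normal equations X'X b = X'y."""
    k = len(X[0])
    XtX = [[sum(row[a] * row[c] for row in X) for c in range(k)] for a in range(k)]
    Xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(k)]
    return solve(XtX, Xty)

# Hypothetical data generated exactly from y = 10 + 2*x1 - 1*x2
x1 = [0, 1, 2, 3, 4, 5]
x2 = [1, 0, 2, 1, 3, 2]
X = [[1.0, a, c] for a, c in zip(x1, x2)]  # first column of ones = intercept
y = [10 + 2 * a - c for a, c in zip(x1, x2)]

beta_hat = ols(X, y)
rss = sum((yi - sum(bj * xj for bj, xj in zip(beta_hat, row))) ** 2
          for row, yi in zip(X, y))
print([round(v, 6) for v in beta_hat], rss)
```

Because the example data contain no noise, the estimated coefficients reproduce the generating values and the residual sum of squares is numerically zero.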

Page 18: Uni/bivariate Probleme

Model assumptions

The OLS estimator can be considered the best linear unbiased estimator (BLUE) of $\beta$, provided some basic assumptions regarding the error term $\varepsilon_t$ are satisfied:

• The mean of the errors is zero: $E(\varepsilon_t) = 0$

• The errors have a constant variance: $E(\varepsilon_t^2) = \sigma^2$

• Errors from different observations are independent of each other: $E(\varepsilon_t \varepsilon_k) = 0$ for $t \neq k$

• The errors follow a normal distribution.

• The errors are uncorrelated with the explanatory variables: $\forall X: E(X_{kt}\,\varepsilon_t) = 0$

Multiple Linear Regression Analysis

Page 19: Uni/bivariate Probleme

Interpreting the Multiple Regression Model

For a multiple regression model:

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + e_i$$

$\beta_1$ should be interpreted as the change in y when a unit change is observed in $x_1$ and $x_2$ is kept constant. This statement is not very clear when $x_1$ and $x_2$ are not independent.

Misunderstanding: $\beta_i$ always measures the effect of $x_i$ on E(y), independent of the other x variables.

Misunderstanding: a statistically significant $\beta$ value establishes a cause-and-effect relationship between x and y.

[Figure: diagram relating Y, X1 and X2.]

Multiple Linear Regression Analysis

Page 20: Uni/bivariate Probleme

Explanation power by $\beta$

If the model is useful, at least one estimated $\beta$ must be $\neq 0$.

But wait: what is the chance of finding one estimated $\beta$ significant if I have 2 random x?

For each $\beta$, prob(b $\neq$ 0) = 0.05. The chance that at least one b happens to be $\neq 0$ is:

Prob(b1 $\neq$ 0 or b2 $\neq$ 0) = 1 - prob(b1 = 0 and b2 = 0) = 1 - (0.95)^2 = 0.0975

Implication?

Multiple Linear Regression Analysis
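The implication becomes clearer when the same calculation is repeated for more than two predictors. The sketch below assumes independent tests, as on the slide; the function name `familywise_error` is illustrative.

```python
def familywise_error(alpha, m):
    """Chance of at least one false positive among m independent tests."""
    return 1.0 - (1.0 - alpha) ** m

for m in (1, 2, 5, 10, 20):
    print(m, round(familywise_error(0.05, m), 4))
```

With 2 random predictors the chance is 0.0975, as computed above, and it keeps growing with every additional predictor that is tested at the same α.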

Page 21: Uni/bivariate Probleme

Analysis

R² (multiple correlation squared): variation in Y accounted for by the set of predictors.

Adjusted R²: the adjustment takes into account the size of the sample and the number of predictors, to make the value a better estimate of the population value.

Adjusted R² = R² - (k - 1) / (n - k) * (1 - R²)

where:

n = number of observations,
k = number of estimated coefficients (independent variables plus intercept).

Accordingly: a smaller n decreases the adjusted R² value and a larger n increases it; a smaller k increases the adjusted R² value and a larger k decreases it.

The F-test in the ANOVA table is used to judge whether the explanatory variables in the model adequately describe the outcome variable.

The t-test is applied to each partial regression coefficient. A significant t indicates that the variable in question influences the Y response while controlling for the other explanatory variables.

Multiple Linear Regression Analysis
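A short sketch of the adjustment follows. It assumes that k counts all estimated coefficients including the intercept, under which the slide's formula agrees with the common form 1 - (1 - R²)(n - 1)/(n - k); the example numbers are hypothetical.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2; k = number of estimated coefficients incl. intercept."""
    return r2 - (k - 1) / (n - k) * (1.0 - r2)

# Hypothetical model: R^2 = 0.5 with two predictors plus intercept (k = 3)
for n in (10, 21, 100):
    print(n, round(adjusted_r2(0.5, n, 3), 4))
```

For a fixed R² the adjusted value moves toward R² as n grows and away from it as more coefficients are estimated.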

Page 22: Uni/bivariate Probleme

ANOVA

Source of variance    SS     df     MS
Regression            SSR    p-1    MSR = SSR/(p-1)
Error                 SSE    n-p    MSE = SSE/(n-p)
Total                 SST    n-1

$$SSR = \hat{\beta}'X'y - \frac{1}{n}\,y'Jy = (\hat{y} - \bar{y})'(\hat{y} - \bar{y})$$

$$SSE = y'y - \hat{\beta}'X'y = (y - \hat{y})'(y - \hat{y})$$

$$SST = (y - \bar{y})'(y - \bar{y})$$

where J is an n×n matrix of 1s.

Multiple Linear Regression Analysis

Page 23: Uni/bivariate Probleme

The R² statistic measures the overall contribution of the Xs:

$$R^2 = 1 - \frac{SSE}{SST} = \frac{SSR}{SST}$$

Then test the hypothesis:

H0: $\beta_1 = \dots = \beta_k = 0$

H1: at least one parameter is nonzero

Because R² is not tested directly against a probability distribution, the F statistic is used instead.

Multiple Linear Regression Analysis

Page 24: Uni/bivariate Probleme

F-statistics

$$F = \frac{MSR}{MSE} = \frac{SSR/k}{SSE/(n-k-1)} = \frac{R^2/k}{(1-R^2)/(n-k-1)}$$

Rejection region: $F > F_\alpha$, where $\nu_1 = k$ dof and $\nu_2 = n-(k+1)$ dof (k = number of explanatory variables).

Multiple Linear Regression Analysis
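The two forms of the overall F statistic give the same number, which the following sketch illustrates with hypothetical sums of squares (k is taken as the number of explanatory variables; the function names are made up).

```python
def f_from_ss(ssr, sse, n, k):
    """Overall F statistic from the regression and error sums of squares."""
    return (ssr / k) / (sse / (n - k - 1))

def f_from_r2(r2, n, k):
    """Overall F statistic expressed through R^2."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))

# Hypothetical example: SST = 100, SSR = 60, SSE = 40, n = 23, k = 2
ssr, sse, n, k = 60.0, 40.0, 23, 2
r2 = ssr / (ssr + sse)
print(f_from_ss(ssr, sse, n, k), f_from_r2(r2, n, k))
```

The resulting value would then be compared with the tabulated F quantile for k and n-(k+1) degrees of freedom.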

Page 25: Uni/bivariate Probleme

How many variables should be included in the model?

Basic strategies:

• Sequential forward

• Sequential backward

• Force entire

The first two strategies determine a suitable number of explanatory variables using the semi-partial correlation as criterion and a partial F-statistic, which is calculated from the error terms of the restricted (RSS1) and unrestricted (RSS) models:

$$F_{k-k_1,\,N-k} = \frac{(RSS_1 - RSS)/(k - k_1)}{RSS/(N-k)}$$

where k and k1 denote the number of lags of the unrestricted and restricted model, respectively, and N is the number of observations.

Multiple Linear Regression Analysis
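A minimal sketch of the partial F-statistic, with hypothetical residual sums of squares; the function name `partial_f` and the numbers are assumptions for illustration only.

```python
def partial_f(rss_restricted, rss_unrestricted, k, k1, n):
    """Partial F for comparing a restricted (k1) with an unrestricted (k) model."""
    numerator = (rss_restricted - rss_unrestricted) / (k - k1)
    denominator = rss_unrestricted / (n - k)
    return numerator / denominator

# Hypothetical case: adding two terms lowers the RSS from 120 to 100 (N = 55)
print(partial_f(120.0, 100.0, k=5, k1=3, n=55))
```

A large value indicates that the additional terms of the unrestricted model reduce the residual sum of squares by more than would be expected by chance.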

Page 26: Uni/bivariate Probleme

The semi-partial correlation

It measures the relationship between a predictor and the outcome, controlling for the relationship between that predictor and any others already in the model.

It measures the unique contribution of a predictor to explaining the variance of the outcome.

[Figure: diagram relating X, Y and Z.]

Multiple Linear Regression Analysis

Page 27: Uni/bivariate Probleme

Testing the regression coefficients

An unbiased estimator for the variance $\sigma^2$ is

$$s^2 = \frac{RSS}{N-k}$$

The regression coefficients are tested for significance under the null hypothesis $H_0: \beta_i = 0$ using a standard t-test:

$$t_{N-k} = \frac{\hat{\beta}_i}{s\sqrt{c_{ii}}}$$

where $c_{ii}$ denotes the i-th diagonal element of the matrix $C = (X^T X)^{-1}$. $s\sqrt{c_{ii}}$ is also referred to as the standard error of the regression coefficient $\hat{\beta}_i$.

Multiple Linear Regression Analysis

Page 28: Uni/bivariate Probleme

Which X is contributing the most to the prediction of Y?

The relative size of the bs cannot be interpreted, because each b is relative to the scale of its variable, but the $\beta$s (betas; standardized bs) can be interpreted.

$$y' = a + b_1 x_1 + b_2 x_2 + b_3 x_3$$

$$Z_{y'} = \beta_1 (Z_{x_1}) + \beta_2 (Z_{x_2}) + \beta_3 (Z_{x_3})$$

a is the intercept; it is zero when Y and the predictors are standardized.

Multiple Linear Regression Analysis

Page 29: Uni/bivariate Probleme

Cross-validation

Can the regression equation be generalized to other data?

This can be evaluated by randomly separating the data set into two halves: estimate the regression equation with one half, apply it to the other half and see how well it predicts.

Multiple Linear Regression Analysis
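The split-half idea can be sketched as follows, here with a simple one-predictor regression to keep the code short. The data are hypothetical (a straight line plus a small deterministic disturbance), and the seed and helper name `fit_line` are assumptions.

```python
import random

def fit_line(xs, ys):
    """Least-squares intercept and slope for one predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical data: y = 1 + 2x plus a small deterministic disturbance
xs = [float(i) for i in range(40)]
ys = [1.0 + 2.0 * x + 0.1 * ((7 * i) % 5 - 2) for i, x in enumerate(xs)]

# Randomly separate the data set into two halves
idx = list(range(len(xs)))
random.Random(0).shuffle(idx)
train, test = idx[:20], idx[20:]

# Estimate on one half, predict the other half
a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
rmse = (sum((ys[i] - (a + b * xs[i])) ** 2 for i in test) / len(test)) ** 0.5
print(f"intercept = {a:.3f}, slope = {b:.3f}, hold-out RMSE = {rmse:.3f}")
```

A hold-out error that is much larger than the error on the estimation half would suggest that the equation does not generalize well.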

Page 30: Uni/bivariate Probleme

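Residual analysis

The coefficients are chosen to minimize

$$SSE = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_{i1} - \dots - \beta_{p-1} x_{i,p-1})^2$$

The fitted value of y, denoted $\hat{y}$, is:

$$\hat{y} = X\hat{\beta}$$

and the residual terms are:

$$e = y - \hat{y} = y - X\hat{\beta}$$

Since $\sigma^2$ of the population is unknown, we estimate it from the sample:

$$s^2 = \frac{\sum_{i=1}^{n} e_i^2}{n-p} = MSE$$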

Multiple Linear Regression Analysis

Page 31: Uni/bivariate Probleme

The Revised Levene's test

Divide the residuals into two (or more) groups based on the level of x. The variances and the means of the two groups are supposed to be equal. A standard t-test can be used to test the difference in means. A large t indicates non-constancy.

[Figure: residuals e plotted against x or E(y), scattered around 0.]

Multiple Linear Regression Analysis

Page 32: Uni/bivariate Probleme

Detecting Outliers and Influential Observations

Influential points are those whose exclusion causes a major change in the fitted line.

"Leave-one-out" cross-validation. If $|e_i| > 4s$, the observation is considered an outlier.

True outliers should not be removed, but should be explained.

[Figure: Residuals versus the fitted values (response is m1); x-axis: fitted value, y-axis: residual.]

Multiple Linear Regression Analysis

Page 33: Uni/bivariate Probleme

Generalized Least-Squares

Example of a generalized least-squares model, which can be used instead of OLS regression in the case of autocorrelated error terms (e.g. in distributed lag models).

Multiple Linear Regression Analysis

Page 34: Uni/bivariate Probleme

SPSS-Example

Multiple Linear Regression Analysis

Page 35: Uni/bivariate Probleme

SPSS-Example

Multiple Linear Regression Analysis

Page 36: Uni/bivariate Probleme

SPSS-Example: Model evaluation

Multiple Linear Regression Analysis

Page 37: Uni/bivariate Probleme

SPSS-Example: Model evaluation

Studying the residuals helps to detect whether:

• the model is nonlinear in function

• an x variable is missing

• one or more of the model assumptions are violated

• there are outliers

Multiple Linear Regression Analysis

Page 38: Uni/bivariate Probleme


ANalysis Of VAriance 

ANOVA (ONE-WAY)

ANOVA (TWO-WAY)

MANOVA

ANOVA

Page 39: Uni/bivariate Probleme


Comparing more than two groups

• ANOVA deals with situations with one observation per object, and three or more groups of objects

• The most important question is as usual: Do the numbers in the groups come from the same population, or from different populations?

ANOVA

Page 40: Uni/bivariate Probleme

One-way ANOVA: Example

• Assume "treatment results" from 13 soil plots from three different regions:

– Region A: 24, 26, 31, 27

– Region B: 29, 31, 30, 36, 33

– Region C: 29, 27, 34, 26

• H0: The treatment results are from the same population of results

• H1: They are from different populations

ANOVA

Page 41: Uni/bivariate Probleme

Comparing the groups

• Averages within groups:

– Region A: 27

– Region B: 31.8

– Region C: 29

• Total average:

$$\frac{4 \cdot 27 + 5 \cdot 31.8 + 4 \cdot 29}{4 + 5 + 4} = 29.46$$

• Variance around the mean matters for comparison.

• We must compare the variance within the groups to the variance between the group means.

ANOVA

Page 42: Uni/bivariate Probleme

Variance within and between groups

• Sum of squares within groups:

$$SSW = (24-27)^2 + (26-27)^2 + \dots + (29-31.8)^2 + \dots = 94.8$$

• Sum of squares between groups:

$$SSG = (27-29.46)^2 + (27-29.46)^2 + \dots + (31.8-29.46)^2 + \dots = 4(27-29.46)^2 + 5(31.8-29.46)^2 + 4(29-29.46)^2 = 52.43$$

• The number of observations and the sizes of the groups have to be taken into account!

ANOVA

Page 43: Uni/bivariate Probleme

Adjusting for group sizes

$$MSW = \frac{SSW}{n-K}, \qquad MSG = \frac{SSG}{K-1}$$

n: number of observations
K: number of groups

Both are estimates of the population variance of the error under H0.

• If the populations are normal, with the same variance, then we can show that under the null hypothesis

$$\frac{MSG}{MSW} \sim F_{K-1,\,n-K}$$

• Reject at significance level $\alpha$ if

$$\frac{MSG}{MSW} > F_{K-1,\,n-K,\,\alpha}$$

ANOVA

Page 44: Uni/bivariate Probleme

Continuing example

$$MSW = \frac{SSW}{n-K} = \frac{94.8}{13-3} = 9.48$$

$$MSG = \frac{SSG}{K-1} = \frac{52.43}{3-1} = 26.2$$

$$\frac{MSG}{MSW} = \frac{26.2}{9.48} = 2.76 < F_{3-1,\,13-3,\,0.05} = 4.10$$

• -> H0 cannot be rejected

ANOVA
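The numbers of the soil-plot example can be reproduced with a few lines of plain Python. This is a sketch of the calculation shown on the slides, not the output of a statistics package; the critical value 4.10 is the one quoted above.

```python
# Treatment results from the three regions of the example
groups = {
    "A": [24, 26, 31, 27],
    "B": [29, 31, 30, 36, 33],
    "C": [29, 27, 34, 26],
}

n = sum(len(g) for g in groups.values())   # number of observations
K = len(groups)                            # number of groups
grand = sum(sum(g) for g in groups.values()) / n
means = {name: sum(g) / len(g) for name, g in groups.items()}

# Sum of squares within and between groups
ssw = sum((x - means[name]) ** 2 for name, g in groups.items() for x in g)
ssg = sum(len(g) * (means[name] - grand) ** 2 for name, g in groups.items())

msw = ssw / (n - K)
msg = ssg / (K - 1)
F = msg / msw

print(f"SSW = {ssw:.2f}, SSG = {ssg:.2f}, MSW = {msw:.2f}, MSG = {msg:.2f}, F = {F:.2f}")
print("reject H0" if F > 4.10 else "H0 cannot be rejected")
```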

Page 45: Uni/bivariate Probleme

ANOVA table

Source of variation    Sum of squares    Deg. of freedom    Mean squares    F ratio
Between groups         SSG               K-1                MSG             MSG/MSW
Within groups          SSW               n-K                MSW
Total                  SST               n-1

NOTE: SSG + SSW = SST, with

$$SST = (24-29.46)^2 + (26-29.46)^2 + \dots + (26-29.46)^2$$

Page 46: Uni/bivariate Probleme

When to use which method

• In situations where we have one observation per object, and want to compare two or more groups:

– Use non-parametric tests if you have enough data

• For two groups: Mann-Whitney U-test (Wilcoxon rank sum)

• For three or more groups: Kruskal-Wallis test

– If data analysis indicates that the assumption of normally distributed independent errors is OK

• For two groups: t-test (equal or unequal variances assumed)

• For three or more groups: ANOVA

ANOVA

Page 47: Uni/bivariate Probleme

Two-way ANOVA (without interaction)

• In two-way ANOVA, data fall into categories in two different ways: each observation can be placed in a table.

• Example: both the type of fertilization and the crop type should influence soil properties.

• Sometimes we are interested in studying both categories; sometimes the second category is used only to reduce unexplained variance. Then it is called a blocking variable.

ANOVA

Page 48: Uni/bivariate Probleme

Sums of squares for two-way ANOVA

• Assume K categories, H blocks, and one observation $x_{ij}$ for each category i and each block j, so we have n = KH observations.

– Mean for category i: $\bar{x}_{i\cdot}$

– Mean for block j: $\bar{x}_{\cdot j}$

– Overall mean: $\bar{x}$

ANOVA

Page 49: Uni/bivariate Probleme

Sums of squares for two-way ANOVA

$$SSG = H \sum_{i=1}^{K} (\bar{x}_{i\cdot} - \bar{x})^2$$

$$SSB = K \sum_{j=1}^{H} (\bar{x}_{\cdot j} - \bar{x})^2$$

$$SSE = \sum_{i=1}^{K} \sum_{j=1}^{H} (x_{ij} - \bar{x}_{i\cdot} - \bar{x}_{\cdot j} + \bar{x})^2$$

$$SST = \sum_{i=1}^{K} \sum_{j=1}^{H} (x_{ij} - \bar{x})^2$$

$$SSG + SSB + SSE = SST$$

ANOVA
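The decomposition above can be checked numerically. The sketch below uses a small hypothetical table with K = 3 categories (rows) and H = 4 blocks (columns); the numbers are invented for illustration.

```python
# Hypothetical observations x[i][j]: category i (row), block j (column)
x = [
    [12.0, 14.0, 11.0, 13.0],
    [15.0, 17.0, 13.0, 16.0],
    [11.0, 12.0, 10.0, 14.0],
]
K, H = len(x), len(x[0])

grand = sum(map(sum, x)) / (K * H)
row = [sum(r) / H for r in x]                                   # category means
col = [sum(x[i][j] for i in range(K)) / K for j in range(H)]    # block means

ssg = H * sum((m - grand) ** 2 for m in row)
ssb = K * sum((m - grand) ** 2 for m in col)
sse = sum((x[i][j] - row[i] - col[j] + grand) ** 2
          for i in range(K) for j in range(H))
sst = sum((x[i][j] - grand) ** 2 for i in range(K) for j in range(H))

mse = sse / ((K - 1) * (H - 1))
f_groups = (ssg / (K - 1)) / mse
f_blocks = (ssb / (H - 1)) / mse

print(f"SSG + SSB + SSE = {ssg + ssb + sse:.4f}, SST = {sst:.4f}")
print(f"F (groups) = {f_groups:.2f}, F (blocks) = {f_blocks:.2f}")
```

The two F ratios would be compared with the F quantiles for (K-1, (K-1)(H-1)) and (H-1, (K-1)(H-1)) degrees of freedom.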

Page 50: Uni/bivariate Probleme

ANOVA table for two-way data

Source of variation    Sums of squares    Deg. of freedom    Mean squares                F ratio
Between groups         SSG                K-1                MSG = SSG/(K-1)             MSG/MSE
Between blocks         SSB                H-1                MSB = SSB/(H-1)             MSB/MSE
Error                  SSE                (K-1)(H-1)         MSE = SSE/((K-1)(H-1))
Total                  SST                n-1

Test for between-groups effect: compare $MSG/MSE$ to $F_{K-1,\,(K-1)(H-1)}$

Test for between-blocks effect: compare $MSB/MSE$ to $F_{H-1,\,(K-1)(H-1)}$

Page 51: Uni/bivariate Probleme

Two-way ANOVA (with interaction)

• The setup above assumes that the blocking variable influences outcomes in the same way in all categories (and vice versa).

• Interaction between the blocking variable and the categories can be checked by extending the model with an interaction term.

ANOVA

Page 52: Uni/bivariate Probleme

Sums of squares for two-way ANOVA (with interaction)

• Assume K categories, H blocks, and L observations $x_{ij1}, x_{ij2}, \dots, x_{ijL}$ for each category i and each block j, so we have n = KHL observations.

– Mean for category i: $\bar{x}_{i\cdot\cdot}$

– Mean for block j: $\bar{x}_{\cdot j\cdot}$

– Mean for cell ij: $\bar{x}_{ij\cdot}$

– Overall mean: $\bar{x}$

ANOVA

Page 53: Uni/bivariate Probleme

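Sums of squares for two-way ANOVA (with interaction)

$$SSG = HL \sum_{i=1}^{K} (\bar{x}_{i\cdot\cdot} - \bar{x})^2$$

$$SSB = KL \sum_{j=1}^{H} (\bar{x}_{\cdot j\cdot} - \bar{x})^2$$

$$SSI = L \sum_{i=1}^{K} \sum_{j=1}^{H} (\bar{x}_{ij\cdot} - \bar{x}_{i\cdot\cdot} - \bar{x}_{\cdot j\cdot} + \bar{x})^2$$

$$SSE = \sum_{i=1}^{K} \sum_{j=1}^{H} \sum_{l=1}^{L} (x_{ijl} - \bar{x}_{ij\cdot})^2$$

$$SST = \sum_{i=1}^{K} \sum_{j=1}^{H} \sum_{l=1}^{L} (x_{ijl} - \bar{x})^2$$

$$SSG + SSB + SSI + SSE = SST$$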

ANOVA
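The same kind of numerical check works for the model with interaction. The sketch below assumes a balanced, hypothetical design with K = 2 categories, H = 2 blocks and L = 3 replicates per cell.

```python
# Hypothetical data: x[i][j] holds the L replicates of category i in block j
x = [
    [[4.0, 5.0, 6.0], [7.0, 8.0, 9.0]],
    [[5.0, 6.0, 7.0], [12.0, 13.0, 14.0]],
]
K, H, L = len(x), len(x[0]), len(x[0][0])

cell = [[sum(x[i][j]) / L for j in range(H)] for i in range(K)]   # cell means
row = [sum(cell[i]) / H for i in range(K)]                        # category means
col = [sum(cell[i][j] for i in range(K)) / K for j in range(H)]   # block means
grand = sum(row) / K                                              # overall mean

ssg = H * L * sum((m - grand) ** 2 for m in row)
ssb = K * L * sum((m - grand) ** 2 for m in col)
ssi = L * sum((cell[i][j] - row[i] - col[j] + grand) ** 2
              for i in range(K) for j in range(H))
sse = sum((v - cell[i][j]) ** 2
          for i in range(K) for j in range(H) for v in x[i][j])
sst = sum((v - grand) ** 2
          for i in range(K) for j in range(H) for v in x[i][j])

mse = sse / (K * H * (L - 1))
f_group = (ssg / (K - 1)) / mse
f_block = (ssb / (H - 1)) / mse
f_inter = (ssi / ((K - 1) * (H - 1))) / mse

print(ssg, ssb, ssi, sse, sst)
print(f_group, f_block, f_inter)
```

Because the design is balanced, the overall mean equals the mean of the category means, and the four sums of squares add up to SST.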

Page 54: Uni/bivariate Probleme

ANOVA table for two-way data (with interaction)

Source of variation    Sums of squares    Deg. of freedom    Mean squares                F ratio
Between groups         SSG                K-1                MSG = SSG/(K-1)             MSG/MSE
Between blocks         SSB                H-1                MSB = SSB/(H-1)             MSB/MSE
Interaction            SSI                (K-1)(H-1)         MSI = SSI/((K-1)(H-1))      MSI/MSE
Error                  SSE                KH(L-1)            MSE = SSE/(KH(L-1))
Total                  SST                n-1

Test for interaction: compare MSI/MSE with $F_{(K-1)(H-1),\,KH(L-1)}$

Test for block effect: compare MSB/MSE with $F_{H-1,\,KH(L-1)}$

Test for group effect: compare MSG/MSE with $F_{K-1,\,KH(L-1)}$

Page 55: Uni/bivariate Probleme

Notes on ANOVA

• All analysis of variance (ANOVA) methods are based on the assumptions of normally distributed and independent errors.

• The same problems can be described using the regression framework. We get exactly the same tests and results!

• There are many extensions beyond those mentioned.

ANOVA

Page 56: Uni/bivariate Probleme

MANOVA Uses Multiple DVs

• Various measures of soil properties

– Corg, Cmik, N, pH, …

• Various outcome measures following different types of categories

– Fertilization, point in time, crop type, …

           Predictors (IVs)       Criterion (DV(s))
ANOVA      Multiple, discrete     Single, continuous
MANOVA     Multiple, discrete     Multiple, continuous

MANOVA

Page 57: Uni/bivariate Probleme

• Multiple DVs could be analysed using multiple ANOVAs, but:

– The familywise (FW) error rate increases with each ANOVA

– Scores on the DVs are likely correlated

• Non-independent, and taken from the same subjects

• Results are hard to interpret if multiple ANOVAs are significant

• MANOVA solves this by conducting only one overall test

– Creates a 'composite' DV

– Tests for significance of the composite DV

MANOVA

Page 58: Uni/bivariate Probleme

• The composite DV is a linear combination of the DVs

– i.e., a discriminant function, or root

– The weights maximally separate the groups on the composite DV

C = W1Y1 + W2Y2 + W3Y3 + … + WnYn

where C is a subject's score on the composite DV,

Yi are the scores on each of the DVs,

Wi are the weights, one for each DV.

A composite DV is required for each main effect and interaction.

MANOVA

Page 59: Uni/bivariate Probleme

• Considering the DVs together can enhance power

a. Frequency distributions show considerable overlap between groups on the individual DVs

b. The ellipses, which reflect the DVs in combination, show less overlap

c. Small differences on each DV combine to make a larger multivariate difference

MANOVA

Page 60: Uni/bivariate Probleme

• In ANOVA, the sums of squared deviations are partitioned:

$$SS_T = SS_A + SS_B + SS_{A \times B} + SS_{S/AB}, \qquad SS_T = \sum (Y - \bar{Y}_T)^2$$

• In MANOVA, the sum of squares and cross-products matrices are partitioned:

$$S_T = S_D + S_{Tr} + S_{D \times Tr} + S_{S(DTr)}$$

• The SSCP matrices (S) are analogous to the SS

– An SSCP matrix is a squared deviation that also reflects correlations among the DVs

MANOVA

Page 61: Uni/bivariate Probleme

Scores and Means in MANOVA are Vectors

• Y: scores for each subject

• T and D: row and column marginals

• GM: the grand mean

• DTr: the average scores of subjects within cells

MANOVA

Page 62: Uni/bivariate Probleme


MANOVA

Page 63: Uni/bivariate Probleme

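• The deviation score for the first subject is:

$$Y_{111} - GM = \begin{pmatrix} 115 \\ 108 \end{pmatrix} - \begin{pmatrix} 89 \\ 87 \end{pmatrix} = \begin{pmatrix} 26 \\ 21 \end{pmatrix}$$

• The squared deviation is obtained by multiplying by the transpose:

$$(Y_{111} - GM)(Y_{111} - GM)' = \begin{pmatrix} 26 \\ 21 \end{pmatrix} \begin{pmatrix} 26 & 21 \end{pmatrix} = \begin{pmatrix} 670 & 538 \\ 538 & 431 \end{pmatrix}$$

SS are on the diagonal: (25.89)² = 670 and (20.78)² = 431. Cross-products are on the off-diagonals: (25.89)(20.78) = 538. (The vectors above show the deviations rounded to 26 and 21.)

• And:

$$S_T = \sum_i \sum_j \sum_k (Y_{ijk} - GM)(Y_{ijk} - GM)'$$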

MANOVA
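How the total SSCP matrix accumulates from the individual deviation vectors can be sketched as follows. Only the first subject's scores (115, 108) come from the slide; the other scores and the helper names `outer` and `add` are hypothetical.

```python
def outer(v):
    """Product of a column vector with its own transpose."""
    return [[a * b for b in v] for a in v]

def add(A, B):
    """Element-wise sum of two matrices of equal shape."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

# Scores on two DVs; all subjects except the first are invented
scores = [[115.0, 108.0], [98.0, 105.0], [107.0, 98.0], [90.0, 92.0]]

# Grand mean vector GM
gm = [sum(s[j] for s in scores) / len(scores) for j in range(2)]

# S_T: sum over subjects of (Y - GM)(Y - GM)'
S_T = [[0.0, 0.0], [0.0, 0.0]]
for s in scores:
    d = [s[j] - gm[j] for j in range(2)]
    S_T = add(S_T, outer(d))

print(gm)
print(S_T)
```

The diagonal of the result holds the ordinary sums of squares of each DV, the off-diagonal elements hold the cross-products.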

Page 64: Uni/bivariate Probleme

• The squaring of a matrix is carried out by multiplying it by its transpose

• The transpose is obtained by flipping the matrix about its diagonal:

$$A = \begin{pmatrix} a & b & c \\ d & e & f \\ g & h & i \end{pmatrix}, \qquad A' = \begin{pmatrix} a & d & g \\ b & e & h \\ c & f & i \end{pmatrix}$$

• To multiply, the ij-th element of the resulting matrix is obtained as the sum of products of the i-th row of A and the j-th column of A'

• For a vector, the transpose is a row vector, and:

$$A \times A' = \begin{pmatrix} a \\ b \end{pmatrix} \begin{pmatrix} a & b \end{pmatrix} = \begin{pmatrix} (a)(a) & (a)(b) \\ (b)(a) & (b)(b) \end{pmatrix}$$

MANOVA

Page 65: Uni/bivariate Probleme

• Main effects in ANOVA vs. MANOVA:

$$SS_A = n\,b \sum (\bar{Y}_A - \bar{Y}_T)^2$$

$$S_D = n\,t \sum_i (D_i - GM)(D_i - GM)', \qquad S_{Tr} = n\,d \sum_i (Tr_i - GM)(Tr_i - GM)'$$

• The interaction:

$$SS_{Cells} = n \sum (\bar{Y}_{AB} - \bar{Y}_T)^2, \qquad SS_{A \times B} = SS_{Cells} - SS_A - SS_B$$

$$S_{Cells} = n \sum (DTr - GM)(DTr - GM)', \qquad S_{D \times Tr} = S_{Cells} - S_D - S_{Tr}$$

• The error term:

$$SS_{S/AB} = \sum (Y - \bar{Y}_{AB})^2$$

$$S_{S(DTr)} = \sum (Y - DTr)(Y - DTr)'$$

MANOVA

Page 66: Uni/bivariate Probleme

• In ANOVA, variance estimates (MS) are obtained from the SS for significance testing using the F-statistic

• In MANOVA, variance estimates (determinants) are obtained from the SSCP matrices for significance testing, e.g. using Wilks' Lambda ($\Lambda$)

ANOVA    MANOVA
SS       ~ SSCP
MS       ~ |SSCP|
F        ~ $\Lambda$

$$F = \frac{MS_{Effect}}{MS_{Error}}, \qquad \Lambda = \frac{|S_{Error}|}{|S_{Effect} + S_{Error}|}$$

Note that F and $\Lambda$ are inverse to one another: a strong effect gives a large F but a small $\Lambda$.

MANOVA

Page 67: Uni/bivariate Probleme

• The determinant of a 2x2 matrix is given by:

$$\text{if } A = \begin{pmatrix} a & b \\ c & d \end{pmatrix} \text{ then } \det A = |A| = a \cdot d - b \cdot c$$

• The determinants required to test the interaction are:

$$S_{S(DT)} = \begin{pmatrix} 544 & 31 \\ 31 & 539 \end{pmatrix}, \qquad \det S_{S(DT)} = 544 \cdot 539 - 31 \cdot 31 = 292434$$

$$S_{DT} + S_{S(DT)} = \begin{pmatrix} 546 & 36 \\ 36 & 529 \end{pmatrix}, \qquad \det (S_{DT} + S_{S(DT)}) = 546 \cdot 529 - 36 \cdot 36 = 322040$$

• Wilks' Lambda for the interaction is obtained by:

$$\Lambda = \frac{|S_{Error}|}{|S_{Effect} + S_{Error}|} = \frac{|S_{S(DT)}|}{|S_{DT} + S_{S(DT)}|} = \frac{292434}{322040} = 0.908$$

MANOVA

Page 68: Uni/bivariate Probleme

• If the effect is small, then $\Lambda = \frac{|S_{Error}|}{|S_{Effect} + S_{Error}|}$ approaches 1.0

– Here $S_{DT}$ was small, and $\Lambda$ was 0.91

• Eta squared for MANOVA is:

$$\eta^2 = 1 - \Lambda_{Effect} = 1 - 0.91 = 0.09$$

• The interaction accounts for only 9% of the variance in the group means on the composite DV

MANOVA
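The step from SSCP matrices to Wilks' Lambda and eta squared can be sketched for the 2x2 case as follows. The two example matrices are hypothetical; only the two determinant values in the last lines are the ones reported on the slides. The function names are illustrative.

```python
def det2(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    return a * d - b * c

def wilks_lambda(s_error, s_effect):
    """|S_error| / |S_effect + S_error| for 2x2 SSCP matrices."""
    total = [[s_error[i][j] + s_effect[i][j] for j in range(2)] for i in range(2)]
    return det2(s_error) / det2(total)

# Hypothetical SSCP matrices
s_error = [[50.0, 5.0], [5.0, 40.0]]
s_effect = [[10.0, 2.0], [2.0, 8.0]]
lam = wilks_lambda(s_error, s_effect)
print(f"Lambda = {lam:.3f}, eta squared = {1 - lam:.3f}")

# Using the determinants reported on the slides for the interaction
lam_slide = 292434 / 322040
print(f"Lambda = {lam_slide:.3f}, eta squared = {1 - lam_slide:.3f}")
```

A Lambda close to 1 means that adding the effect matrix hardly changes the determinant, i.e. the effect is small.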

Page 69: Uni/bivariate Probleme

MANOVA SPSS Example

MANOVA

Page 70: Uni/bivariate Probleme

MANOVA SPSS Example

MANOVA

Page 71: Uni/bivariate Probleme

MANOVA

Page 72: Uni/bivariate Probleme

MANOVA

Page 73: Uni/bivariate Probleme

MANOVA

Page 74: Uni/bivariate Probleme

26/06/2008-75-

Praktische Statistik für Umwelt- und Geowissenschaftler

Discriminant analysis is used to predict group memberships from a set of continuous predictors

Analogy to MANOVA: in MANOVA, linearly combined DVs are created to answer the question of whether groups can be separated.

The same “DVs” can be used to predict group membership!

Discriminant Analysis

Page 75: Uni/bivariate Probleme


What is the goal of Discriminant Analysis?

− Perform dimensionality reduction “while preserving as much of the class discriminatory information as possible”.

− Seeks to find directions along which the classes are best separated.

− Takes into consideration the scatter within-classes but also the scatter between-classes.

Discriminant Analysis

Page 76: Uni/bivariate Probleme


MANOVA and Discriminant Analysis (DA) are mathematically identical but differ in emphasis:

– DA is usually concerned with grouping of objects (classification) and testing how well objects were classified (one grouping variable, one or more predictor variables)

– Discriminant functions correspond to the canonical variates of a canonical correlation analysis between the groups on one side and the predictors on the other side.

– MANOVA is applied to test if groups significantly differ from each other (one or more grouping variables, one or more predictor variables)

Discriminant Analysis

Page 77: Uni/bivariate Probleme


Discriminant Analysis

Page 78: Uni/bivariate Probleme


Assumptions

– Sample size: a small number of samples might lead to overfitting.

– If there are more DVs than objects in any cell, the covariance matrix of that cell becomes singular and cannot be inverted.

– If there are only a few more cases than DVs, equality of covariance matrices is likely to be rejected.

– With a small objects/DV ratio, power is likely to be very small.

– Multivariate normality: the sampling distributions of the means of the various DVs in each cell, and of all linear combinations of them, are normally distributed.

– Absence of outliers: significance assessment is very sensitive to outlying cases.

– Homogeneity of covariance matrices: DA is relatively robust to violations of this assumption if inference is the focus of the analysis, but not in classification.

Discriminant Analysis

Page 79: Uni/bivariate Probleme


Assumptions

– For classification purposes, DA is highly influenced by violations of the last assumption (homogeneity of covariance matrices), since subjects will tend to be classified into the groups with the largest variance.

– Homogeneity of class variances can be assessed by plotting the discriminant function scores for the first discriminant functions pairwise.

– LDA assumes linear relationships between all predictors within each group. Violations tend to reduce power rather than increase alpha.

– Absence of multicollinearity/singularity in each cell of the design: avoid redundant predictors.

Discriminant Analysis

Page 80: Uni/bivariate Probleme


Interpreting a Two-Group Discriminant Function

In the two-group case, discriminant function analysis is analogous to multiple regression; the two-group discriminant analysis is also called Fisher linear discriminant analysis.

In general, in the two-group case we fit a linear equation of the type:

c = a + d1*x1 + d2*x2 + ... + dm*xm

where a is a constant, d1 through dm are regression coefficients, and c is the predicted class. The interpretation of the results of a two-group problem is straightforward and closely follows the logic of multiple regression: those variables with the largest (standardized) regression coefficients are the ones that contribute most to the prediction of group membership.

Discriminant Analysis
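The regression analogy for the two-group case can be sketched in Python with NumPy. Everything below is illustrative: the toy data, the 0/1 group coding and the 0.5 cut-off are assumptions chosen for the example, and an ordinary least-squares fit stands in for the linear equation c = a + d1*x1 + ... + dm*xm.

```python
import numpy as np

# Hypothetical toy data: two predictors, two groups coded 0 and 1.
X = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5],
              [4.0, 4.5], [4.5, 4.0], [5.0, 5.2], [4.2, 4.8]])
c = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Design matrix with a leading column of ones for the constant a.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, c, rcond=None)
a, d = coef[0], coef[1:]

# Fitted scores; with equal group sizes the midpoint 0.5 separates the codes.
scores = A @ coef
predicted = (scores > 0.5).astype(int)
print(a, d, predicted)
```

To compare the contribution of the predictors as described above, the columns of X would first be standardized so that the coefficients in d are on a common scale.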

Page 81: Uni/bivariate Probleme


Discriminant Functions for Multiple Groups

When there are more than two groups, we can estimate more than one discriminant function. For instance, when there are three groups, there exists one function for discriminating between group 1 and groups 2 and 3 combined, and another function for discriminating between group 2 and group 3.

Canonical analysis. In a multiple group discriminant analysis, the first function is defined such that it provides the most overall discrimination between groups, the second provides second most, and so on.

All functions are independent or orthogonal. Computationally, a canonical correlation analysis is performed that determines the successive functions and canonical roots.

The number of functions that can be calculated is:

min(number of groups - 1, number of variables)

Discriminant Analysis

Page 82: Uni/bivariate Probleme


Eigenvalues

Eigenvalues can be interpreted as the proportion of variance accounted for by the correlation between the respective canonical variates.

Successive eigenvalues will be of smaller and smaller size. First, compute the weights that maximize the correlation of the two sum scores. After this first root has been extracted, you will find the weights that produce the second largest correlation between sum scores, subject to the constraint that the next set of sum scores does not correlate with the previous one, and so on.

Canonical correlations. If the square root of the eigenvalues is taken, then the resulting numbers can be interpreted as correlation coefficients. Because the correlations pertain to the canonical variates, they are called canonical correlations.

Discriminant Analysis

Page 83: Uni/bivariate Probleme


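Suppose there are C classes. Let $\mu_i$ be the mean vector of class i, i = 1, 2, …, C.

Let $M_i$ be the number of samples in class i and M the total number of samples.

Within-class scatter matrix:

$$S_w = \sum_{i=1}^{C} \sum_{j=1}^{M_i} (x_j - \mu_i)(x_j - \mu_i)^T$$

Between-class scatter matrix:

$$S_b = \sum_{i=1}^{C} (\mu_i - \mu)(\mu_i - \mu)^T, \qquad \mu = \frac{1}{C} \sum_{i=1}^{C} \mu_i$$

where $\mu$ is the mean of the entire data set (here computed as the average of the class means), and $S_t = S_b + S_w$.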

Discriminant Analysis
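The two scatter matrices can be computed directly from their definitions. The following NumPy sketch assumes samples are stored in the rows of `X` with one class label per row; the function name `scatter_matrices` and the toy data are hypothetical. The between-class matrix uses the unweighted form with the overall mean taken as the average of the class means.

```python
import numpy as np

def scatter_matrices(X, labels):
    """Return (S_w, S_b) for samples stored in the rows of X."""
    classes = np.unique(labels)
    class_means = np.array([X[labels == k].mean(axis=0) for k in classes])
    mu = class_means.mean(axis=0)  # (1/C) * sum of the class means

    n_features = X.shape[1]
    S_w = np.zeros((n_features, n_features))
    for k, mu_k in zip(classes, class_means):
        centered = X[labels == k] - mu_k  # x_j - mu_i for class i
        S_w += centered.T @ centered      # sum of outer products

    diffs = class_means - mu              # mu_i - mu for each class
    S_b = diffs.T @ diffs
    return S_w, S_b

# Hypothetical toy data: two classes that differ only in the second variable.
X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 4.0], [2.0, 4.0]])
labels = np.array([0, 0, 1, 1])
S_w, S_b = scatter_matrices(X, labels)
print(S_w)
print(S_b)
```

In this toy example all within-class spread lies along the first variable and all between-class spread along the second, so each matrix has a single non-zero diagonal entry.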

Page 84: Uni/bivariate Probleme


• Methodology

– LDA computes a transformation that maximizes the between-class scatter while minimizing the within-class scatter:

$$\max_U \frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \max_U \frac{|U^T S_b U|}{|U^T S_w U|}$$

(the determinants are products of eigenvalues)

$$y = U^T x$$

where U is the projection matrix and $\tilde{S}_b$, $\tilde{S}_w$ are the scatter matrices of the projected data y.

Discriminant Analysis

Page 85: Uni/bivariate Probleme


Linear transformation implied by LDA

– The linear transformation is given by a matrix U whose columns are the eigenvectors of the above problem:

$$\begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_K \end{pmatrix} = \begin{pmatrix} u_1^T \\ u_2^T \\ \vdots \\ u_K^T \end{pmatrix} x = U^T x$$

– The LDA solution is given by the eigenvectors of the generalized eigenvector problem:

$$S_b u_k = \lambda_k S_w u_k$$

– Important: Since $S_b$ has at most rank C-1, the max number of eigenvectors with non-zero eigenvalues is C-1 (i.e., max dimensionality of sub-space is C-1)

Discriminant Analysis
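A minimal NumPy sketch of the eigenvector step follows. The scatter matrices and the observation `x` are hypothetical values for two classes and two variables, and the sketch assumes the within-class scatter matrix is non-singular so that the generalized problem can be solved through a linear solve.

```python
import numpy as np

# Hypothetical scatter matrices for C = 2 classes and two variables.
S_w = np.array([[4.0, 1.0], [1.0, 3.0]])
S_b = np.array([[2.0, 1.0], [1.0, 0.5]])  # rank 1, i.e. C - 1

# Solve S_b u = lambda S_w u via S_w^{-1} S_b, without forming the inverse.
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
eigvals, eigvecs = eigvals.real, eigvecs.real

# Sort by decreasing eigenvalue and keep the C - 1 leading directions.
n_classes = 2
order = np.argsort(eigvals)[::-1]
lam = eigvals[order[:n_classes - 1]]
U = eigvecs[:, order[:n_classes - 1]]

x = np.array([1.0, 2.0])  # one hypothetical observation
b = U.T @ x               # its coordinates in the discriminant sub-space
print(lam, b)
```

Only one eigenvalue is non-zero here, which matches the rank bound of C-1 for the between-class scatter matrix.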

Page 86: Uni/bivariate Probleme


• Does $S_w^{-1}$ always exist?

– If $S_w$ is non-singular, we can obtain a conventional eigenvalue problem by writing:

$$S_w^{-1} S_b u_k = \lambda_k u_k$$

– In practice, $S_w$ is often singular when more variables than cases are involved in the analysis (M << N, with M cases and N variables)

Discriminant Analysis
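Whether the within-class scatter matrix can be inverted is easy to check numerically. The sketch below uses hypothetical random data with fewer cases than variables and a single class, and tests the rank of the resulting scatter matrix with NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cases, n_variables = 3, 5  # hypothetical: fewer cases than variables
X = rng.normal(size=(n_cases, n_variables))

# Within-class scatter for a single class: centered data, outer products.
centered = X - X.mean(axis=0)
S_w = centered.T @ centered

# The rank cannot exceed n_cases - 1, so S_w is not of full rank here.
rank = np.linalg.matrix_rank(S_w)
is_singular = rank < n_variables
print(rank, is_singular)
```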

Page 87: Uni/bivariate Probleme
