Peter Grzybek


Page 1: Peter Grzybek

Peter Grzybek

http://www-gewi.uni-graz.at/quanta

Austrian Research Fund

Project #15485

From the Economy of Language to the Self-Regulation of Cultural Systems

Corpus Linguistics vs. Text Analysis

Exact Literary Studies:

On the Prose of Karel Čapek

What Do the Words in a Verse Do with One Another?

On the Poetry of A.S. Puškin

Page 2: Peter Grzybek

Corpus Linguistics

vs.

Text Analysis

Page 3: Peter Grzybek

Analysis of Letter Frequencies

Methodological Problems in Former Studies

1. Insufficient Data Distinction

(graphemic and phonematic/phonetic data)

2. Insufficient Control of Data Homogeneity

(text / text segments / text mixtures (corpora))

3. Frequency Models: Continuous vs. Discrete

(a) theoretical entropy, repeat rate

(b) Σ p_i = 1

4. Goodness of Fit

Graphics vs. tests, R² vs. χ²

Page 4: Peter Grzybek

Analysis of Letter Frequencies

Methodological Decisions

1. Data Distinction

Graphemic data

2. Control of Data Homogeneity

Text vs. text segments vs. text cumulations vs. text mixtures (corpus)

3. Discrete Frequency Models

Test of relevant models

4. Goodness of Fit

χ² test: C = χ² / N (C < 0.02 = *; C < 0.01 = **)
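The discrepancy coefficient named on this slide, C = χ² / N, can be sketched in a few lines of Python. The function names and the toy counts below are illustrative assumptions, not data or code from the study; only the formula and the two thresholds come from the slide.

```python
def discrepancy_coefficient(observed, expected):
    """Return (chi2, C), where C = chi2 / N and N is the total sample size."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    n_total = sum(observed)
    # Pearson chi-square statistic over all classes.
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, chi2 / n_total


def rating(c):
    """Map C to the star notation of the slide (C < 0.02 = *, C < 0.01 = **)."""
    if c < 0.01:
        return "**"
    if c < 0.02:
        return "*"
    return ""


# Hypothetical toy counts, chosen only to exercise the functions.
observed = [50, 30, 20]
expected = [45.0, 35.0, 20.0]
chi2, c = discrepancy_coefficient(observed, expected)
print(f"chi2 = {chi2:.4f}, C = {c:.4f} {rating(c)}")
```

Dividing χ² by N makes the measure comparable across samples of very different size, which matters here because the samples range from a few thousand to several million letters.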

Page 5: Peter Grzybek

Analysis of Letter Frequencies

Slavic Alphabets

Inventory size:

minimal: 25 (Slovene)

maximal: 46 (Slovak)

medium: 32/33 (Russian, е / ё)

Page 6: Peter Grzybek

Analysis of Letter Frequencies

Russian

А Б В Г Д Е Ё Ж З И Й К Л М Н О П Р С Т У Ф Х Ц Ч Ш Щ Ъ Ы Ь Э Ю Я (letters numbered 1-33)

No.  Author                            Text                                      Chapter                           Abbr.        N
1    A.S. Puškin                       Evgenij Onegin                            1                                 ASP-EO 1     15830
2    A.S. Puškin                       Evgenij Onegin                            2                                 ASP-EO 2     11544
3    A.S. Puškin                       Evgenij Onegin                            3                                 ASP-EO 3     13597
4    A.S. Puškin                       Evgenij Onegin                            4                                 ASP-EO 4     12475
5    A.S. Puškin                       Evgenij Onegin                            5                                 ASP-EO 5     12018
6    A.S. Puškin                       Evgenij Onegin                            6                                 ASP-EO 6     12742
7    A.S. Puškin                       Evgenij Onegin                            7                                 ASP-EO 7     15180
8    A.S. Puškin                       Evgenij Onegin                            8                                 ASP-EO 8     15864
9    A.S. Puškin                       Evgenij Onegin                            1-2                               ASP-EO 1-2   27374
10   A.S. Puškin                       Evgenij Onegin                            1-3                               ASP-EO 1-3   40971
11   A.S. Puškin                       Evgenij Onegin                            1-4                               ASP-EO 1-4   53446
12   A.S. Puškin                       Evgenij Onegin                            1-5                               ASP-EO 1-5   65464
13   A.S. Puškin                       Evgenij Onegin                            1-6                               ASP-EO 1-6   78206
14   A.S. Puškin                       Evgenij Onegin                            1-7                               ASP-EO 1-7   93386
15   A.S. Puškin                       Evgenij Onegin                            complete text                     ASP-EO 1-8   109250
16   L.N. Tolstoj                      Anna Karenina                             complete text                     LNT-AK       1336483
17   L.N. Tolstoj                      Otročestvo                                complete text                     LNT-O        113954
18   F.M. Dostojevskij                 Prestuplenie i nakazanie                  complete text                     FMD-PN       837885
19   F.M. Dostojevskij                 Zapiski iz podpol'ja                      complete text                     FMD-ZAP      188249
20   A.P. Čechov                       Čajka                                     complete text*                    APČ-Č        145735
21   A.P. Čechov                       Djadja Vanja                              complete text*                    APČ-DV       60871
22   M. Gor'kij                        Mat'                                      complete text*                    MG-MA        433177
23   M. Gor'kij                        Na dne                                    complete text                     MG-ND        76039
24   www.rusmet.ru                     Ural'skij rynok metallov                  techn. text                       UR           8061
25   www.phyton.ru                     Instr. sredstva […]                       techn. text                       IN           18711
26   A.S. Puškin                       Evgenij Onegin                            ch. 1 & 8                         ASP-EO1+8    31694
27   L.N. Tolstoj                      Anna Karenina                             pt. 8 (ch. 18) & pt. 1 (ch. 1)    LNT-AK8+1    7720
28   F.M. Dostojevskij                 Prestuplenie i nakazanie                  pt. 1 (ch. 1) & pt. 6 (ch. 8)     FMD-PN1+6    29498
29   A.S. Puškin & L.N. Tolstoj        Evgenij Onegin & Anna Karenina            complete texts                    ASP+LNT      1445733
30   A.S. Puškin & F.M. Dostojevskij   Evgenij Onegin & Prestuplenie i nakazanie complete texts                    ASP+FMD      947135
31   A.S. Puškin & text 24             Evgenij Onegin & Text 24                  complete texts                    ASP+UR       117311
32   L.N. Tolstoj & text 24            Anna Karenina & Text 24                   complete texts                    LNT+UR       1344544
33   F.M. Dostojevskij & text 25       Prestuplenie i nakazanie & Text 25        complete texts                    FMD+IN       856596
34   M. Gor'kij & text 25              Na dne & Text 25                          complete texts                    MG+IN        95312
35   A.S. Puškin                       Evgenij Onegin                            ch. 5, verse 1-5 per ch.          ASP1-5       4323
36   F.M. Dostojevskij                 Prestuplenie i nakazanie                  epilogue, each alternate line     FMD-2        14464
37   L.N. Tolstoj                      Anna Karenina                             pt. 4 (ch. 1-5), every 4th line   LNT-4        7141
38   Complete corpus                                                                                               CC           3328454

Page 7: Peter Grzybek

Zipf (Zeta) distribution

Basic assumption:

$$r \cdot f_r = c \quad\Longrightarrow\quad f_r = \frac{c}{r}$$

$$P_r = \frac{c}{r^{a}}, \quad r = 1, 2, 3, \ldots, \quad a > 1, \quad c^{-1} = \sum_{j=1}^{\infty} \frac{1}{j^{a}}$$

[Figure: observed frequencies f(i) vs. Zeta fit NP(i), by rank]

Page 8: Peter Grzybek

Zipf-Mandelbrot distribution

Basic assumption:

$$f_r = \frac{c}{(r + b)^{a}}$$

$$P_r = \frac{c}{(b + r)^{a}}, \quad r = 1, 2, 3, \ldots, \quad a > 1, \quad b > -1, \quad c^{-1} = \sum_{j=1}^{\infty} \frac{1}{(b + j)^{a}}$$

[Figure: observed frequencies f(i) vs. Zipf-Mandelbrot fit NP(i), by rank]
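A minimal sketch of how the Zipf-Mandelbrot probabilities could be computed for a finite letter inventory. It assumes a right-truncated form, normalising over the n ranks actually present instead of an infinite series; that truncation, the function names, and the parameter values are assumptions made for illustration. Setting b = 0 gives the Zipf (zeta) case.

```python
def zipf_mandelbrot_pmf(a, b, n):
    """P_r proportional to 1 / (b + r)**a for ranks r = 1..n (right-truncated)."""
    weights = [(b + r) ** (-a) for r in range(1, n + 1)]
    total = sum(weights)  # plays the role of 1/c in the slide's notation
    return [w / total for w in weights]


def expected_frequencies(pmf, n_tokens):
    """Theoretical frequencies NP(r) for a sample of n_tokens letters."""
    return [n_tokens * p for p in pmf]


# Hypothetical parameter values for a 33-letter inventory.
zeta = zipf_mandelbrot_pmf(a=1.0, b=0.0, n=33)  # Zipf / zeta special case
zm = zipf_mandelbrot_pmf(a=1.0, b=2.5, n=33)    # b > 0 flattens the top ranks
print(round(zeta[0], 4), round(zm[0], 4))
```

The values returned by `expected_frequencies` are what would be compared with the observed rank frequencies in a χ² test.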

Page 9: Peter Grzybek

Zipf and Zipf-Mandelbrot Distributions: Goodness of Fit

(38 Russian samples)

[Figure: goodness-of-fit values (0.00 to 0.20) for samples 1-38; series: right-truncated Zeta (rt. Zeta), Zipf-Mandelbrot]

Page 10: Peter Grzybek

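Geometric Distribution and Good Distribution

Geometric distribution:

$$P_r = p\,q^{r-1}, \quad r = 1, 2, \ldots \quad (q = 1 - p)$$

Good distribution:

$$P_r = c\,\frac{a^{r}}{r^{b}}, \quad r = 1, 2, \ldots, n, \qquad c^{-1} = \sum_{j=1}^{n} \frac{a^{j}}{j^{b}}$$

[Figure: goodness-of-fit values (0.00 to 0.20) for samples 1-38; series: right-truncated geometric (rt. geometric), Good]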
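The two models on this slide can be sketched as follows, reading the Good distribution as P_r = c · a^r / r^b over ranks 1..n. Function names and parameter values are illustrative assumptions. With b = 0 the Good distribution reduces to a geometric series renormalised over n ranks, which is one way to see how the two models are related.

```python
def geometric_pmf(p, n):
    """First n terms of P_r = p * q**(r - 1), q = 1 - p (not renormalised)."""
    q = 1.0 - p
    return [p * q ** (r - 1) for r in range(1, n + 1)]


def good_pmf(a, b, n):
    """P_r = c * a**r / r**b for r = 1..n, with c normalising the sum to 1."""
    weights = [a ** r / r ** b for r in range(1, n + 1)]
    total = sum(weights)  # equals 1/c
    return [w / total for w in weights]


# Hypothetical parameters for a 33-letter inventory.
geo = geometric_pmf(p=0.1, n=33)
good = good_pmf(a=0.9, b=0.3, n=33)
print(round(sum(geo), 4), round(sum(good), 4))
```

The unnormalised geometric terms sum to less than 1 over a finite inventory, which is presumably why the chart legend refers to a right-truncated geometric model.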

Page 11: Peter Grzybek

Negative Hypergeometric Distribution

$$P_x = \frac{\binom{M + x - 2}{x - 1}\binom{K - M + n - x}{n - x + 1}}{\binom{K + n - 1}{n}}$$

n = inventory size, x = class

2 parameters: K and M

[Figure: observed frequencies f(i) vs. negative hypergeometric fit NP(i), by rank]

Analysis of Russian Letter Frequencies:

Corpus: 37 texts (ca. 8.5 million letters)
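A sketch of the negative hypergeometric probabilities, assuming the 1-displaced form with binomial coefficients C(M+x-2, x-1), C(K-M+n-x, n-x+1) and C(K+n-1, n). Because K and M are real-valued, the binomial coefficients are evaluated through the log-gamma function. In this form the classes run over x = 1, ..., n+1, so for a 33-letter inventory the sketch passes n = 32; that reading, like the function names and parameter values, is an assumption.

```python
from math import exp, lgamma


def log_binom(top, bottom):
    """Log of the generalised binomial coefficient C(top, bottom)."""
    return lgamma(top + 1) - lgamma(bottom + 1) - lgamma(top - bottom + 1)


def neg_hypergeom_pmf(K, M, n):
    """P_x for x = 1..n+1 in the 1-displaced form; requires K > M > 0."""
    if not (K > M > 0):
        raise ValueError("parameters must satisfy K > M > 0")
    log_den = log_binom(K + n - 1, n)
    pmf = []
    for x in range(1, n + 2):
        log_num = log_binom(M + x - 2, x - 1) + log_binom(K - M + n - x, n - x + 1)
        pmf.append(exp(log_num - log_den))
    return pmf


# Hypothetical parameter values; n = 32 yields 33 classes.
pmf = neg_hypergeom_pmf(K=3.0, M=0.8, n=32)
print(len(pmf), round(sum(pmf), 6))
```

Working in log space avoids overflow of the gamma function for larger inventories and keeps the code valid for non-integer K and M.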

Page 12: Peter Grzybek

Analysis of Russian Letter Frequencies

Comparison of Texts, Text Segments, Text Cumulations, Text Mixtures, and Complete Corpus

Negative Hypergeometric Distribution

[Figure, two panels over the samples: constancy of goodness of fit (C); constancy of parameters (K, M)]

Page 13: Peter Grzybek

Analysis of Slovene Letter Frequencies

Corpus: ca. 130,000 letters

Goodness of fit (C = 0.0094)

Negative Hypergeometric Distribution

[Figure: observed frequencies vs. negative hypergeometric fit, by rank]

Page 14: Peter Grzybek

Analysis of Slovene Letter Frequencies

Comparison of Texts, Text Segments, Text Cumulations, Text Mixtures, and Complete Corpus

Negative Hypergeometric Distribution

[Figure, two panels over samples 1-20: constancy of goodness of fit (C); constancy of parameters (K, M)]

Page 15: Peter Grzybek

Analysis of Slovene Letter and Phoneme Frequencies:

Corpus: ca. 130,000

[Figure, two panels, observed frequencies vs. negative hypergeometric fit by rank: Slovene letters; Slovene phonemes]

Page 16: Peter Grzybek

First Tentative Results of Slovak Letter Frequencies

[Figure: observed frequencies vs. negative hypergeometric fit, by rank]

Tasks:

1. Interpretation of parameters: the "foreign letters" Q, W, X influence inventory size

2. Exploration of data basis: texts, text segments, text cumulations, text mixtures

Page 17: Peter Grzybek

The Question of Data Homogeneity

Page 18: Peter Grzybek

“[…] the magnitude of words tends, on the whole, to stand in an inverse (not necessarily proportionate) relationship to the number of occurrences”

Zipf (1935: 25)

Four major problems in research

Page 19: Peter Grzybek

What is the direction of dependence:

Does frequency depend on length or vice versa?

What is the unit of measurement:

Is word length measured in letters, phonemes, syllables, morphemes, ...?

What is frequency:

Absolute occurrence or the rank of words, or of word forms?

What is the text basis:

Corpus data, frequency dictionaries, ..., individual texts?

Page 20: Peter Grzybek

Assuming that

word length is a variable of frequency

Measuring

word length in the number of syllables per word

Analyzing

the absolute occurrence of words

the influence of the text basis shall be tested:

Individual texts vs. text cumulations vs. corpus data

DATA HOMOGENEITY

Page 21: Peter Grzybek

Intertextual Inhomogeneity vs. Intratextual Inhomogeneity

Intertextual: combination (“mixture”) of different texts

• different languages

• different authors

• different text types

Intratextual: a ‘text’ in itself does not consist of homogeneous elements

• complete novel, composed of chapters

• complete book of a novel, consisting of several chapters

• individual chapters

• dialogical vs. narrative sequences within a text

Page 22: Peter Grzybek

$$\frac{dy}{y - 1} = -b\,\frac{dx}{x}$$

$$y = a x^{-b} + 1$$

$$x = A\,(y - 1)^{-B}, \quad \text{with } A = a^{1/b},\ B = \frac{1}{b}$$
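The step from the differential equation to the curve is not spelled out on the slide. Reading the relation as dy/(y − 1) = −b dx/x, a short worked derivation under that assumption runs as follows, with the integration constant written as ln a:

```latex
\begin{aligned}
\frac{dy}{y-1} &= -b\,\frac{dx}{x}
  && \text{relative change of } y-1 \text{ is proportional to that of } x \\
\ln (y-1) &= -b \ln x + \ln a
  && \text{integrate both sides} \\
y &= a\,x^{-b} + 1
  && \text{exponentiate and add 1} \\
x &= a^{1/b}\,(y-1)^{-1/b}
  && \text{solve for } x \text{, i.e. } A = a^{1/b},\ B = 1/b
\end{aligned}
```

The additive constant 1 reflects that a word cannot be shorter than one syllable, so mean length approaches 1 as frequency grows.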

Page 23: Peter Grzybek

Russian: Anna Karenina (ch. 1)

x (frequency)   y (mean length, observed)   y (theoretical)
1               2.92                        3.03
2               2.14                        2.04
3               2.05                        1.70
4               1.50                        1.53
5               1.33                        1.43
6               1.50                        1.36
7               1.67                        1.31
8               1.00                        1.27
9               1.00                        1.24
10              1.00                        1.22
13              1.00                        1.17
19              1.00                        1.12
20              1.00                        1.11
37              1.00                        1.06

a = 2.0261, b = 0.9660, R² = 0.88, N = 397

[Figure: mean word length (y-axis) against frequency (x-axis); observed vs. theoretical]
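The theoretical column for Anna Karenina (ch. 1) can be reproduced from the reported parameters, assuming the model y = a · x^(-b) + 1 that the later slides name explicitly. The data below are transcribed from this slide; the function names are illustrative.

```python
# Values transcribed from the Anna Karenina (ch. 1) slide.
frequency = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 13, 19, 20, 37]
observed = [2.92, 2.14, 2.05, 1.50, 1.33, 1.50, 1.67,
            1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00]
A, B = 2.0261, 0.9660  # parameters reported on the slide


def mean_length(x, a=A, b=B):
    """Model value y = a * x**(-b) + 1 (assumed functional form)."""
    return a * x ** (-b) + 1


def r_squared(obs, fit):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - f) ** 2 for o, f in zip(obs, fit))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1 - ss_res / ss_tot


theoretical = [mean_length(x) for x in frequency]
for x, o, t in zip(frequency, observed, theoretical):
    print(f"{x:>3}  {o:.2f}  {t:.2f}")
print(f"R^2 = {r_squared(observed, theoretical):.2f}")
```

Recomputing the curve this way is a quick consistency check on the transcribed table and on the reading of the model equation.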

Page 24: Peter Grzybek

Text                     Language     N      R²     a      b
Anna Karenina (I,1)      Russian      397    0.88   2.03   0.97
Evgenij Onegin (I)       Russian      1871   0.96   1.70   0.79
Na badnjak               Croatian     2450   0.93   1.95   0.51
Zářivé hlubiny           Czech        1363   0.94   1.76   0.59
Hiša M.P. (I)            Slovenian    1147   0.84   1.80   0.40
Zakliata panna           Slovak       926    0.88   1.48   0.69
Hänsel und Gretel        German       803    0.87   1.16   0.51
Fairy Tale by Móra       Hungarian    234    0.96   1.57   0.84
Di lembung kuring        Sundanese    431    0.91   1.86   0.51
Burung api               Indonesian   1393   0.92   2.44   0.26
Portrait of a Lady (I)   English      1104   0.89   1.23   0.83

0.84 ≤ R² ≤ 0.96

Page 25: Peter Grzybek

[Figure: the course of the theoretical curves]

Page 26: Peter Grzybek

The relationship between parameters a and b

[Figure: parameter b (y-axis, 0 to 1.2) against parameter a (x-axis, 1 to 2.5)]

Page 27: Peter Grzybek

The relationship between text length (N) and parameter a

[Figure: parameter a (y-axis, 0 to 3) against text length N (x-axis)]

Page 28: Peter Grzybek

Obvious data inhomogeneity

1. Texts from different languages, authors, and various text types

2. Violation of the ceteris paribus condition

Ergo: The data in this mixture are not adequate for testing the hypothesis at stake

Page 29: Peter Grzybek

Lev N. Tolstoj: Anna Karenina Chap. I,1 vs. I (34 chapters)

[Figure: mean word length (y-axis) against word frequency (x-axis); series: A.K. (I,1) empirical, A.K. (I,1) theoretical, A.K. (I) empirical, A.K. (I) theoretical]

            N (types)   C      a      b
AK (I,1)    397         0.97   2.03   0.97
AK (I)      8661        0.86   2.60   0.27

Page 30: Peter Grzybek

Henry James: Portrait of a Lady Chap. 1 vs. novel (52 chapters)

              N (types)   C      a      b
Chap. 1       1104        0.89   1.23   0.83
Chap. 1-52    10727       0.58   1.84   0.27

[Figure, two panels: observed vs. theoretical mean word length by frequency, for chapter 1 and for the complete novel]

Page 31: Peter Grzybek

Ks.Š. Gjalski: Na badnjak. Narrative vs. dialogical sequences

             N (types)   C      a      b
narrative    1913        0.91   1.93   0.54
dialogues    673         0.96   1.61   0.84

[Figure: mean word length by frequency; series: narration, dialogue]

Page 32: Peter Grzybek

Evgenij Onegin: text cumulation (I – VIII)

Chapter          N (types)   M (tokens)   a      b      R²
I                1871        3209         1.70   0.79   0.96
I+II             2918        5546         1.84   0.69   0.88
I-III            3951        8359         1.92   0.57   0.88
I-IV             4851        10936        1.97   0.53   0.92
I-V              5737        13376        1.95   0.48   0.94
I-VI             6509        15978        1.97   0.52   0.94
I-VII            7476        19061        2.03   0.43   0.86
text (I-VIII)    8329        22482        2.05   0.40   0.88

[Figure: theoretical curves for the cumulations 1, 1-2, 1-3, 1-4, 1-5, 1-6, 1-7 and the complete text]

Results of fitting y = ax^-b + 1 to the cumulative text of Evgenij Onegin

Page 33: Peter Grzybek

Evgenij Onegin – text cumulation (chap. I – VIII)

Fitting y = ax^-b: R² = 0.92

[Figure: parameter b (y-axis, 0.00 to 1.00) against parameter a (x-axis, 1.6 to 2.1)]

Dependence of parameter b on parameter a

Page 34: Peter Grzybek

Evgenij Onegin Text cumulation (I – VIII)

Dependence of a on Text Length (N):

a = 0.6493 · N^0.1286 (R² = 0.96)

[Figure: parameter a (y-axis, 1.50 to 2.10) against text length in word-form types (x-axis)]
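As a rough check, the reported power law for parameter a can be evaluated at the cumulative type counts of Evgenij Onegin. The sketch assumes that N is the number of word-form types, as the figure axis suggests, and uses the type counts and fitted a values from the cumulation table two slides earlier; the names are illustrative.

```python
ALPHA, BETA = 0.6493, 0.1286  # coefficients reported on the slide


def predicted_a(n_types):
    """Power law a = ALPHA * N**BETA, with N taken as word-form types."""
    return ALPHA * n_types ** BETA


# Cumulative type counts and fitted a values (chapters I, I+II, ..., I-VIII).
cumulative_types = [1871, 2918, 3951, 4851, 5737, 6509, 7476, 8329]
fitted_a = [1.70, 1.84, 1.92, 1.97, 1.95, 1.97, 2.03, 2.05]

for n, a in zip(cumulative_types, fitted_a):
    print(f"N = {n:>5}  fitted a = {a:.2f}  power law = {predicted_a(n):.2f}")
```

The agreement is approximate, not exact, which is what an R² of 0.96 would lead one to expect.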

Page 35: Peter Grzybek

Summary & Results (I)

Data corroborate hypothesis: L = f(F), y = ax^-b + 1

There is a specific interrelation of parameters:

a = f (N) b = g(a)

b = h(N)

f, g, h functions of the same type
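One way to see why f, g and h can share a functional type: if f and g are both power functions, as the fits on the preceding slides suggest, then their composition h is a power function too. The symbols α, β, γ, δ below are placeholders introduced for illustration, not values from the study.

```latex
\begin{aligned}
a &= f(N) = \alpha\,N^{\beta}, \qquad b = g(a) = \gamma\,a^{\delta} \\
b &= h(N) = g\bigl(f(N)\bigr)
   = \gamma\,\bigl(\alpha N^{\beta}\bigr)^{\delta}
   = \bigl(\gamma\,\alpha^{\delta}\bigr)\,N^{\beta\delta}
\end{aligned}
```

With β > 0 and δ < 0 the exponent βδ is negative, so b would fall as the text grows, which matches the direction seen in the cumulation table.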

Page 36: Peter Grzybek

Summary & Results (II)

1. Homogeneous texts do not interfere with linguistic laws; inhomogeneous texts can distort the textual reality.

2. Text mixtures can evoke phenomena which do not exist as such in individual texts.

3. Short texts do not allow a property to take appropriate shape; long texts (and corpora) contain mixed generating regimes superposing different layers, which may lead to “artificial” phenomena.

4. With an increase of text size, the resulting curve of the frequency-length relationship is shifted upwards; this is caused by the fact that the number of words occurring only once increases up to a certain text length. If this assumption is correct, then b converges to zero, yielding the limit y = a.

Page 37: Peter Grzybek

F I N I S