Peter Grzybek

http://www-gewi.uni-graz.at/quanta

Austrian Research Fund

Project #15485

From the Economy of Language to the Self-Regulation of Cultural Systems

Corpus Linguistics vs. Text Analysis

Exact Literary Studies:

On the Prose of Karel Čapek

What Do the Words in Verse Do with One Another?

On the Poetry of A.S. Puškin

Corpus Linguistics vs. Text Analysis

Analysis of Letter Frequencies

Methodological Problems in Former Studies

1. Insufficient Data Distinction

(graphemic and phonematic/phonetic data)

2. Insufficient Control of Data Homogeneity

(text / text segments / text mixtures (corpora))

3. Frequency Models: Continuous vs. Discrete

(a) theoretical entropy, repeat rate

(b) Σ p_i = 1

4. Goodness of Fit

Graphics vs. tests, R² vs. χ²


Methodological Decisions

1. Data Distinction

Graphemic data

2. Control of Data Homogeneity

Text vs. text segments vs. text cumulations vs. text mixtures (corpus)

3. Discrete Frequency Models

Test of relevant models

4. Goodness of Fit

χ² test: C = χ² / N (C < 0.02 = * ; C < 0.01 = **)
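The goodness-of-fit criterion can be illustrated with a minimal Python sketch. It assumes the usual Pearson χ² statistic over frequency classes. The counts are hypothetical, and the function names `discrepancy_coefficient` and `rating` are chosen here for illustration; they do not appear in the slides.

```python
def discrepancy_coefficient(observed, expected):
    """Return (chi2, C), where C = chi2 / N and N is the sample size."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    # Pearson chi-square summed over all classes
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    n = sum(observed)
    return chi2, chi2 / n


def rating(c):
    """Map C to the star notation: C < 0.01 -> '**', C < 0.02 -> '*'."""
    if c < 0.01:
        return "**"
    if c < 0.02:
        return "*"
    return ""


# Hypothetical counts for three classes, for illustration only.
observed = [50, 30, 20]
expected = [45.0, 35.0, 20.0]
chi2, c = discrepancy_coefficient(observed, expected)
print(f"chi2 = {chi2:.4f}, C = {c:.4f} {rating(c)}")
```

Dividing by N is useful for large samples such as letter counts, because χ² grows with sample size even when the relative deviations stay the same.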


Slavic Alphabets

inventory size

minimal 25 Slovene

maximal 46 Slovak

medium 32/33 Russian (е / ё)


Russian

А Б В Г Д Е Ё Ж З И Й К Л М Н О П Р С Т У Ф Х Ц Ч Ш Щ Ъ Ы Ь Э Ю Я

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33

No. | Author(s) | Text(s) | Chapter / selection | Abbr. | N
26 | A.S. Puškin | Evgenij Onegin | ch. 1 & 8 | ASP-EO1+8 | 31694
27 | L.N. Tolstoj | Anna Karenina | pt. 8 (ch. 18) & pt. 1 (ch. 1) | LNT-AK8+1 | 7720
28 | F.M. Dostojevskij | Prestuplenie i nakazanie | pt. 1 (ch. 1) & pt. 6 (ch. 8) | FMD-PN1+6 | 29498
29 | A.S. Puškin & L.N. Tolstoj | Evgenij Onegin & Anna Karenina | complete texts | ASP+LNT | 1445733
30 | A.S. Puškin & F.M. Dostojevskij | Evgenij Onegin & Prestuplenie i nakazanie | complete texts | ASP+FMD | 947135
31 | A.S. Puškin & text 24 | Evgenij Onegin & text 24 | complete texts | ASP+UR | 117311
32 | L.N. Tolstoj & text 24 | Anna Karenina & text 24 | complete texts | LNT+UR | 1344544
33 | F.M. Dostojevskij & text 25 | Prestuplenie i nakazanie & text 25 | complete texts | FMD+IN | 856596
34 | M. Gor'kij & text 25 | Na dne & text 25 | complete texts | MG+IN | 95312
35 | A.S. Puškin | Evgenij Onegin | ch. 5, verse 1-5 per ch. | ASP1-5 | 4323
36 | F.M. Dostojevskij | Prestuplenie i nakazanie | epilogue, each alternate line | FMD-2 | 14464
37 | L.N. Tolstoj | Anna Karenina | pt. 4 (ch. 1-5), every 4th line | LNT-4 | 7141
38 | | Complete corpus | | CC | 3328454

No. | Author | Text | Chapter | Abbr. | N
1 | A.S. Puškin | Evgenij Onegin | 1 | ASP-EO 1 | 15830
2 | | | 2 | ASP-EO 2 | 11544
3 | | | 3 | ASP-EO 3 | 13597
4 | | | 4 | ASP-EO 4 | 12475
5 | | | 5 | ASP-EO 5 | 12018
6 | | | 6 | ASP-EO 6 | 12742
7 | | | 7 | ASP-EO 7 | 15180
8 | | | 8 | ASP-EO 8 | 15864
9 | | | 1-2 | ASP-EO 1-2 | 27374
10 | | | 1-3 | ASP-EO 1-3 | 40971
11 | | | 1-4 | ASP-EO 1-4 | 53446
12 | | | 1-5 | ASP-EO 1-5 | 65464
13 | | | 1-6 | ASP-EO 1-6 | 78206
14 | | | 1-7 | ASP-EO 1-7 | 93386
15 | | | complete text | ASP-EO 1-8 | 109250
16 | L.N. Tolstoj | Anna Karenina | complete text | LNT-AK | 1336483
17 | | Otročestvo | complete text | LNT-O | 113954
18 | F.M. Dostojevskij | Prestuplenie i nakazanie | complete text | FMD-PN | 837885
19 | | Zapiski iz podpol'ja | complete text | FMD-ZAP | 188249
20 | A.P. Čechov | Čajka | complete text* | APČ-Č | 145735
21 | | Djadja Vanja | complete text* | APČ-DV | 60871
22 | M. Gor'kij | Mat' | complete text* | MG-MA | 433177
23 | | Na dne | complete text | MG-ND | 76039
24 | www.rusmet.ru | Ural'skij rynok metallov | techn. text | UR | 8061
25 | www.phyton.ru | Instr. sredstva […] | techn. text | IN | 18711

Zipf (Zeta) distribution

Basic assumption:

r · f_r = c, i.e. f_r = c / r

$P_r = \dfrac{c}{r^{a}}, \quad r = 1, 2, 3, \ldots, \quad a > 1, \quad c^{-1} = \sum_{j=1}^{\infty} j^{-a}$

[Figure: observed frequencies f(i) vs. Zeta fit NP(i), by rank]

Zipf-Mandelbrot distribution

Basic assumption:

f_r = c / (r + b)^a

[Figure: observed frequencies f(i) vs. fitted values NP(i), by rank]

$P_r = \dfrac{c}{(b + r)^{a}}, \quad r = 1, 2, 3, \ldots, \quad a > 1, \quad b > -1, \quad c^{-1} = \sum_{j=1}^{\infty} (b + j)^{-a}$
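For a finite alphabet, the Zipf-Mandelbrot probabilities can be sketched numerically as below. This minimal Python sketch assumes right truncation at the inventory size n, so that the normalizing sum runs over the n available ranks only. The parameter values are hypothetical, not fitted values from the study; the sample size is that of the complete Evgenij Onegin text.

```python
def zipf_mandelbrot_probs(a, b, n):
    """Right-truncated Zipf-Mandelbrot probabilities P_r for ranks r = 1..n."""
    if b <= -1:
        raise ValueError("b must be greater than -1")
    weights = [(b + r) ** (-a) for r in range(1, n + 1)]
    total = sum(weights)  # equals 1/c, the normalizing constant
    return [w / total for w in weights]


def expected_frequencies(probs, sample_size):
    """Theoretical frequencies NP(r) for a sample of the given size."""
    return [sample_size * p for p in probs]


# Hypothetical parameters; n = 33 corresponds to the Russian alphabet with ё.
probs = zipf_mandelbrot_probs(a=1.2, b=2.0, n=33)
expected = expected_frequencies(probs, sample_size=109250)
print([round(e) for e in expected[:5]])
```

Setting b = 0 reduces the sketch to the right-truncated Zeta case, so one function serves both models.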

Zipf and Zipf-Mandelbrot Distributions: Goodness of Fit

(38 Russian samples)

[Figure: goodness of fit for samples 1-38; series: rt. Zeta, Zipf-Mandelbrot]

[Figure: goodness of fit for samples 1-38; series: rt. geometric, Good]

Geometric Distribution and Good Distribution

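Geometric: $P_r = p\,q^{r-1}, \quad r = 1, 2, \ldots$

Good: $P_r = c\,\dfrac{a^{r}}{r^{b}}, \quad r = 1, 2, \ldots, n, \quad c^{-1} = \sum_{j=1}^{n} \dfrac{a^{j}}{j^{b}}$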

Negative Hypergeometric Distribution

$P_x = \dfrac{\dbinom{M + x - 2}{x - 1}\dbinom{K - M + n - x}{n - x + 1}}{\dbinom{K + n - 1}{n}}$

n = inventory size, x = class

2 parameters: K and M
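The negative hypergeometric probabilities can be evaluated for non-integer K and M with log-gamma functions. The following minimal Python sketch assumes the 1-displaced form with support x = 1, ..., n + 1 and K > M > 0. Under that convention an alphabet of 33 letters corresponds to n = 32. The parameter values are hypothetical, not the fitted values from the study.

```python
import math


def log_binom(top, k):
    """Log of the generalized binomial coefficient C(top, k) for real top."""
    return math.lgamma(top + 1) - math.lgamma(k + 1) - math.lgamma(top - k + 1)


def nhg_pmf(x, K, M, n):
    """1-displaced negative hypergeometric probability for x = 1..n+1."""
    if not (K > M > 0):
        raise ValueError("requires K > M > 0")
    log_p = (
        log_binom(M + x - 2, x - 1)
        + log_binom(K - M + n - x, n - x + 1)
        - log_binom(K + n - 1, n)
    )
    return math.exp(log_p)


# Hypothetical parameters for a 33-letter inventory (n = 32).
K, M, n = 3.0, 0.8, 32
probs = [nhg_pmf(x, K, M, n) for x in range(1, n + 2)]
print(round(sum(probs), 6))
```

Working in log space avoids overflow in the gamma functions when n or the parameters become large.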

[Figure: observed frequencies f(i) vs. negative hypergeometric fit NP(i), by rank]

Analysis of Russian Letter Frequencies:

Corpus: 37 texts (ca. 8.5 million letters)

Analysis of Russian Letter Frequencies

Comparison of Texts, Text Segments, Text Cumulations, Text Mixtures, and Complete Corpus

[Figure: constancy of goodness of fit (C) across the samples]

[Figure: constancy of parameters (K, M) across the samples; series: Parameter K, Parameter M]


Analysis of Slovene Letter Frequencies

Corpus: ca. 130,000 letters

Goodness of fit

(C = 0.0094)


[Figure: observed Slovene letter frequencies vs. negative hypergeometric fit, by rank]

Analysis of Slovene Letter Frequencies

Comparison of Texts, Text Segments, Text Cumulations, Text Mixtures, and Complete Corpus

[Figure: constancy of parameters (K, M) across samples 1-20]

[Figure: constancy of goodness of fit (C) of the negative hypergeometric distribution (NHG) across samples 1-20]

Analysis of Slovene Letter and Phoneme Frequencies:

Corpus: ca. 130,000

[Figure: Slovene letters, observed frequencies vs. negative hypergeometric fit]

[Figure: Slovene phonemes, observed frequencies vs. negative hypergeometric fit]

First Tentative Results for Slovak Letter Frequencies

[Figure: observed Slovak letter frequencies vs. negative hypergeometric fit, by rank]

Tasks:

1. Interpretation of parameters: the “foreign letters” Q, W, X influence inventory size

2. Exploration of the data basis: texts, text segments, text cumulations, text mixtures

The Question of Data Homogeneity

“[…] the magnitude of words tends, on the whole, to stand in an inverse (not necessarily proportionate) relationship to the number of occurrences”

Zipf (1935: 25)

Four major problems in research

What is the direction of dependence:

Does frequency depend on length or vice versa?

What is the unit of measurement:

Is word length measured in letters, phonemes, syllables, morphemes, ...?

What is frequency:

Absolute occurrence or the rank of words, or of word forms?

What is the text basis:

Corpus data, frequency dictionaries, ..., individual texts?

Assuming that word length is a function of frequency,

measuring word length as the number of syllables per word, and

analyzing the absolute occurrence of words,

the influence of the text basis shall be tested:

Individual texts vs. text cumulations vs. corpus data

DATA HOMOGENEITY

Intertextual Inhomogeneity vs. Intratextual Inhomogeneity

Intertextual: combination (“mixture”) of different texts

• different languages

• different authors

• different text types

Intratextual: a ‘text’ in itself does not consist of homogeneous elements

• complete novel, composed of chapters

• complete book of a novel, consisting of several chapters

• individual chapters

• dialogical vs. narrative sequences within a text

$\dfrac{dy}{y - 1} = -b\,\dfrac{dx}{x}$

$y = a\,x^{-b} + 1$

$x = A\,(y - 1)^{-B}$, with $A = a^{1/b}$ and $B = 1/b$
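The step from the differential equation to the fitted curve and its inverse can be written out as follows. The integration constant is written as ln a, a notational choice made here, and the block assumes the `amsmath` package.

```latex
\begin{align*}
\frac{dy}{y-1} &= -b\,\frac{dx}{x} \\
\int \frac{dy}{y-1} &= -b \int \frac{dx}{x}
  \;\Longrightarrow\; \ln (y-1) = -b \ln x + \ln a \\
y - 1 &= a\,x^{-b}
  \;\Longrightarrow\; y = a\,x^{-b} + 1 \\
x^{-b} &= \frac{y-1}{a}
  \;\Longrightarrow\; x = a^{1/b}\,(y-1)^{-1/b} = A\,(y-1)^{-B},
  \qquad A = a^{1/b},\quad B = \frac{1}{b}
\end{align*}
```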

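Russian: Anna Karenina (ch. 1)

x = frequency, y = mean word length (observed), ŷ = mean word length (theoretical)

x | y | ŷ
1 | 2.92 | 3.03
2 | 2.14 | 2.04
3 | 2.05 | 1.70
4 | 1.50 | 1.53
5 | 1.33 | 1.43
6 | 1.50 | 1.36
7 | 1.67 | 1.31
8 | 1.00 | 1.27
9 | 1.00 | 1.24
10 | 1.00 | 1.22
13 | 1.00 | 1.17
19 | 1.00 | 1.12
20 | 1.00 | 1.11
37 | 1.00 | 1.06

a = 2.0261, b = 0.9660; R² = 0.88, N = 397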
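The fitted curve for Anna Karenina (ch. 1) can be checked numerically. The following minimal Python sketch evaluates y = ax^-b + 1 with the reported parameters at the tabulated frequency classes. It then computes the coefficient of determination in its usual form, 1 − SS_res / SS_tot, which is assumed here to be the definition behind the reported R². The variable and function names are illustrative.

```python
A, B = 2.0261, 0.9660  # reported parameters for Anna Karenina (ch. 1)

frequency = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 13, 19, 20, 37]
observed = [2.92, 2.14, 2.05, 1.50, 1.33, 1.50, 1.67,
            1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00]


def mean_length(x, a=A, b=B):
    """Theoretical mean word length (in syllables) at frequency x."""
    return a * x ** (-b) + 1.0


theoretical = [mean_length(x) for x in frequency]

# Coefficient of determination: 1 - SS_res / SS_tot
mean_obs = sum(observed) / len(observed)
ss_res = sum((o - t) ** 2 for o, t in zip(observed, theoretical))
ss_tot = sum((o - mean_obs) ** 2 for o in observed)
r_squared = 1.0 - ss_res / ss_tot

for x, o, t in zip(frequency, observed, theoretical):
    print(f"{x:>3} {o:.2f} {t:.2f}")
print(f"R^2 = {r_squared:.2f}")
```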

[Figure: mean word length (y) vs. frequency; observed and theoretical values]

Text | Language | N | R² | a | b
Anna Karenina (I,1) | Russian | 397 | 0.88 | 2.03 | 0.97
Evgenij Onegin (I) | Russian | 1871 | 0.96 | 1.70 | 0.79
Na badnjak | Croatian | 2450 | 0.93 | 1.95 | 0.51
Zářivé hlubiny | Czech | 1363 | 0.94 | 1.76 | 0.59
Hiša M.P. (I) | Slovenian | 1147 | 0.84 | 1.80 | 0.40
Zakliata panna | Slovak | 926 | 0.88 | 1.48 | 0.69
Hänsel und Gretel | German | 803 | 0.87 | 1.16 | 0.51
Fairy Tale by Móra | Hungarian | 234 | 0.96 | 1.57 | 0.84
Di lembung kuring | Sundanese | 431 | 0.91 | 1.86 | 0.51
Burung api | Indonesian | 1393 | 0.92 | 2.44 | 0.26
Portrait of a Lady (I) | English | 1104 | 0.89 | 1.23 | 0.83

0.84 ≤ R² ≤ 0.96

The course of the theoretical curves

[Figure: theoretical curves for texts 1-11]

The relationship between parameters a and b

[Figure: parameter b plotted against parameter a]

The relationship between text length (N) and parameter a

[Figure: parameter a plotted against text length (N)]

Obvious data inhomogeneity

1. Texts from different languages, authors, and various text types

2. Violation of the ceteris paribus condition

Ergo: The data in this mixture are not adequate for testing the hypothesis at stake

Lev N. Tolstoj: Anna Karenina, chap. I,1 vs. I (34 chapters)

[Figure: mean word length vs. word frequency; empirical (emp.) and theoretical (th.) curves for A.K. (I,1) and A.K. (I)]

Sample | N (types) | C | a | b
AK (I,1) | 397 | 0.97 | 2.03 | 0.97
AK (I) | 8661 | 0.86 | 2.60 | 0.27

Henry James: Portrait of a Lady, chap. 1 vs. novel (52 chapters)

Sample | N (types) | C | a | b
I | 1104 | 0.89 | 1.23 | 0.83
I-52 | 10727 | 0.58 | 1.84 | 0.27

[Figures: observed vs. theoretical values for the two samples]

Ks.Š. Gjalski: Na badnjak, narrative vs. dialogical sequences

Sample | N (types) | C | a | b
narrative | 1913 | 0.91 | 1.93 | 0.54
dialogues | 673 | 0.96 | 1.61 | 0.84

[Figure: curves for narration and dialogue]

Evgenij Onegin: text cumulation (I – VIII)

Chapters | N (types) | M (tokens) | a | b | R²
I | 1871 | 3209 | 1.70 | 0.79 | 0.96
I+II | 2918 | 5546 | 1.84 | 0.69 | 0.88
I-III | 3951 | 8359 | 1.92 | 0.57 | 0.88
I-IV | 4851 | 10936 | 1.97 | 0.53 | 0.92
I-V | 5737 | 13376 | 1.95 | 0.48 | 0.94
I-VI | 6509 | 15978 | 1.97 | 0.52 | 0.94
I-VII | 7476 | 19061 | 2.03 | 0.43 | 0.86
text (I-VIII) | 8329 | 22482 | 2.05 | 0.40 | 0.88

[Figure: fitted curves for the cumulations 1, 1-2, 1-3, 1-4, 1-5, 1-6, 1-7 and the complete text]

Results of fitting y = ax^-b + 1 to the cumulative text of Evgenij Onegin

Evgenij Onegin – text cumulation (chap. I – VIII)

Dependence of parameter b on parameter a

Fitting y = ax^-b: R² = 0.92

[Figure: parameter b plotted against parameter a]

Evgenij Onegin: text cumulation (I – VIII)

Dependence of a on text length (N):

a = 0.6493 · N^0.1286 (R² = 0.96)

[Figure: parameter a plotted against text length (word-form types)]
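The dependence of a on text length can be re-estimated from the cumulation data (N in word-form types, with the corresponding values of a). The following minimal Python sketch uses ordinary least squares on log-transformed values. This is an assumption, since the slides do not state how the reported curve was fitted, so the estimates need not coincide exactly with the reported ones.

```python
import math

# Evgenij Onegin cumulations I, I+II, ..., I-VIII: types N and parameter a
types_n = [1871, 2918, 3951, 4851, 5737, 6509, 7476, 8329]
a_values = [1.70, 1.84, 1.92, 1.97, 1.95, 1.97, 2.03, 2.05]


def fit_power_law(xs, ys):
    """Fit y = c * x**d by least squares on (log x, log y)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    sxx = sum((u - mx) ** 2 for u in lx)
    d = sxy / sxx
    c = math.exp(my - d * mx)
    return c, d


def reported_a(n):
    """The curve reported on the slide: a = 0.6493 * N**0.1286."""
    return 0.6493 * n ** 0.1286


c, d = fit_power_law(types_n, a_values)
print(f"log-log fit: a = {c:.4f} * N^{d:.4f}")
for n, a in zip(types_n, a_values):
    print(f"N = {n}: tabulated a = {a:.2f}, reported curve = {reported_a(n):.2f}")
```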

Summary & Results (I)

Data corroborate the hypothesis L = f(F): y = ax^-b + 1

There is a specific interrelation of parameters:

a = f (N) b = g(a)

b = h(N)

f, g, h functions of the same type

Summary & Results (II)

1. Homogeneous texts do not interfere with linguistic laws; inhomogeneous texts can distort the textual reality.

2. Text mixtures can evoke phenomena which do not exist as such in individual texts.

3. Short texts do not allow a property to take appropriate shape; long texts (and corpora) contain mixed generating regimes that superpose different layers, which may lead to “artificial” phenomena.

4. With increasing text size, the resulting curve of the frequency-length relationship is shifted upwards; this is because the number of words occurring only once increases up to a certain text length. If this assumption is correct, then b converges to zero, yielding the limit y = a.

F I N I S
