Peter Grzybek
http://www-gewi.uni-graz.at/quanta
Austrian Research Fund
Project #15485
From the Economy of Language to the Self-Regulation of Cultural Systems
Corpus Linguistics vs. Text Analysis
Exact Literary Scholarship:
On the Prose of Karel Čapek
What Do the Words in Verse Do with One Another?
On the Poetry of A.S. Puškin
Corpus Linguistics
vs.
Text Analysis
Analysis of Letter Frequencies
Methodological Problems in Former Studies
1. Insufficient Data Distinction
(graphemic and phonematic/phonetic data)
2. Insufficient Control of Data Homogeneity
(text / text segments / text mixtures (corpora))
3. Frequency Models: Continuous vs. Discrete
(a) theoretical entropy, repeat rate
(b) Σ p_i = 1
4. Goodness of Fit
Graphics vs. tests, R² vs. χ²
Analysis of Letter Frequencies
Methodological Decisions
1. Data Distinction
Graphemic data
2. Control of Data Homogeneity
Text vs. text segments vs. text cumulations vs. text mixtures (corpus)
3. Discrete Frequency Models
Test of relevant models
4. Goodness of Fit
χ² test, discrepancy coefficient C = χ² / N (C < 0.02 = *; C < 0.01 = **)
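The goodness-of-fit criterion can be sketched in a few lines. This is a minimal illustration, assuming that the χ² statistic is the usual Pearson sum over all frequency classes and that N is the total number of observations; the function names are hypothetical and not taken from the slides.

```python
def discrepancy_coefficient(observed, expected):
    """Return (chi2, C), where C = chi2 / N and N is the sample size."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    n_total = sum(observed)
    # Pearson chi-square: squared deviation relative to the expected value.
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, chi2 / n_total


def rating(c):
    """Map C to the star notation of the slide: ** for C < 0.01, * for C < 0.02."""
    if c < 0.01:
        return "**"
    if c < 0.02:
        return "*"
    return ""
```

Dividing χ² by N makes the criterion usable for very large samples, where the plain χ² test tends to reject almost any model.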
Analysis of Letter Frequencies
Slavic Alphabets
inventory size
minimal 25 Slovene
maximal 46 Slovak
medium 32/33 Russian (е / ё)
Analysis of Letter Frequencies
Russian
А Б В Г Д Е Ё Ж З И Й К Л М Н О П Р С Т У Ф Х Ц Ч Ш Щ Ъ Ы Ь Э Ю Я
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
No. | Author | Text | Chapter / segment | Abbr. | N
1 | A.S. Puškin | Evgenij Onegin | 1 | ASP-EO 1 | 15830
2 | A.S. Puškin | Evgenij Onegin | 2 | ASP-EO 2 | 11544
3 | A.S. Puškin | Evgenij Onegin | 3 | ASP-EO 3 | 13597
4 | A.S. Puškin | Evgenij Onegin | 4 | ASP-EO 4 | 12475
5 | A.S. Puškin | Evgenij Onegin | 5 | ASP-EO 5 | 12018
6 | A.S. Puškin | Evgenij Onegin | 6 | ASP-EO 6 | 12742
7 | A.S. Puškin | Evgenij Onegin | 7 | ASP-EO 7 | 15180
8 | A.S. Puškin | Evgenij Onegin | 8 | ASP-EO 8 | 15864
9 | A.S. Puškin | Evgenij Onegin | 1-2 | ASP-EO 1-2 | 27374
10 | A.S. Puškin | Evgenij Onegin | 1-3 | ASP-EO 1-3 | 40971
11 | A.S. Puškin | Evgenij Onegin | 1-4 | ASP-EO 1-4 | 53446
12 | A.S. Puškin | Evgenij Onegin | 1-5 | ASP-EO 1-5 | 65464
13 | A.S. Puškin | Evgenij Onegin | 1-6 | ASP-EO 1-6 | 78206
14 | A.S. Puškin | Evgenij Onegin | 1-7 | ASP-EO 1-7 | 93386
15 | A.S. Puškin | Evgenij Onegin | complete text | ASP-EO 1-8 | 109250
16 | L.N. Tolstoj | Anna Karenina | complete text | LNT-AK | 1336483
17 | L.N. Tolstoj | Otročestvo | complete text | LNT-O | 113954
18 | F.M. Dostojevskij | Prestuplenie i nakazanie | complete text | FMD-PN | 837885
19 | F.M. Dostojevskij | Zapiski iz podpol'ja | complete text | FMD-ZAP | 188249
20 | A.P. Čechov | Čajka | complete text* | APČ-Č | 145735
21 | A.P. Čechov | Djadja Vanja | complete text* | APČ-DV | 60871
22 | M. Gor'kij | Mat' | complete text* | MG-MA | 433177
23 | M. Gor'kij | Na dne | complete text | MG-ND | 76039
24 | www.rusmet.ru | Ural'skij rynok metallov | techn. text | UR | 8061
25 | www.phyton.ru | Instr. sredstva […] | techn. text | IN | 18711
26 | A.S. Puškin | Evgenij Onegin | ch. 1 & 8 | ASP-EO1+8 | 31694
27 | L.N. Tolstoj | Anna Karenina | pt. 8 (ch. 18) & pt. 1 (ch. 1) | LNT-AK8+1 |
28 | F.M. Dostojevskij | Prestuplenie i nakazanie | pt. 1 (ch. 1) & pt. 6 (ch. 8) | FMD-PN1+6 | 29498
29 | A.S. Puškin & L.N. Tolstoj | Evgenij Onegin & Anna Karenina | complete texts | ASP+LNT | 1445733
30 | A.S. Puškin & F.M. Dostojevskij | Evgenij Onegin & Prestuplenie i nakazanie | complete texts | ASP+FMD | 947135
31 | A.S. Puškin & text 24 | Evgenij Onegin & text 24 | complete texts | ASP+UR | 117311
32 | L.N. Tolstoj & text 24 | Anna Karenina & text 24 | complete texts | LNT+UR | 1344544
33 | F.M. Dostojevskij & text 25 | Prestuplenie i nakazanie & text 25 | complete texts | FMD+IN | 856596
34 | M. Gor'kij & text 25 | Na dne & text 25 | complete texts | MG+IN | 95312
35 | A.S. Puškin | Evgenij Onegin | ch. 5, verse 1-5 per ch. | ASP1-5 |
36 | F.M. Dostojevskij | Prestuplenie i nakazanie | epilogue, each alternate line | FMD-2 | 14464
37 | L.N. Tolstoj | Anna Karenina | pt. 4 (ch. 1-5), every 4th line | LNT-4 |
38 | Complete corpus | | | CC | 3328454
The values 7720, 4323 and 7141 are the sample sizes N of samples 27, 35 and 37; the source layout does not show which value belongs to which sample.
Zipf (Zeta) distribution
Basic assumption:
$r \cdot f_r = c \;\Rightarrow\; f_r = c / r$
$P_r = \dfrac{c}{r^{a}}, \quad r = 1, 2, 3, \ldots, \quad a > 1, \quad c^{-1} = \sum_{j=1}^{\infty} j^{-a}$
[Figure: observed frequencies f(i) vs. zeta fit NP(i), by rank i]
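The rank-frequency model can be sketched as follows. This is a minimal illustration, assuming the right-truncated variant (normalization over a finite inventory of n ranks, as suggested by the label "rt. Zeta" in the goodness-of-fit comparison); the function names are hypothetical.

```python
def truncated_zeta_pmf(a, n):
    """P_r = c / r**a for r = 1..n, with c chosen so the n probabilities sum to 1."""
    weights = [r ** -a for r in range(1, n + 1)]
    c = 1.0 / sum(weights)
    return [c * w for w in weights]


def expected_frequencies(a, n, total):
    """Theoretical frequencies NP(r) for a sample of `total` letters."""
    return [total * p for p in truncated_zeta_pmf(a, n)]
```

With a = 1 the product r · P_r is the same for every rank, which is the basic assumption r · f_r = c stated on the slide.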
Zipf-Mandelbrot distribution
Basic assumption:
$f_r = c / (r + b)^{a}$
$P_r = \dfrac{c}{(r + b)^{a}}, \quad r = 1, 2, 3, \ldots, \quad a > 1, \quad b > -1, \quad c^{-1} = \sum_{j=1}^{\infty} (j + b)^{-a}$
[Figure: observed frequencies f(i) vs. Zipf-Mandelbrot fit NP(i), ranks 1-32]
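The effect of the additional parameter b can be seen in a small sketch. It assumes, as a simplification, normalization over a finite inventory of n ranks; the function name is hypothetical.

```python
def zipf_mandelbrot_pmf(a, b, n):
    """P_r = c / (r + b)**a for r = 1..n, normalized over the n ranks."""
    if b <= -1:
        raise ValueError("b must be greater than -1")
    weights = [(r + b) ** -a for r in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]
```

Setting b = 0 gives back the Zipf (zeta) form; a positive b flattens the head of the distribution, that is, it reduces the ratio between the first and the second rank.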
Zipf and Zipf-Mandelbrot Distributions: Goodness of Fit
(38 Russian samples)
[Figure: goodness of fit for samples 1-38 (y-axis 0.00 to 0.20); series: rt. Zeta, Zipf-Mandelbrot]
Geometric Distribution and Good Distribution
[Figure: goodness of fit for samples 1-38 (y-axis 0.00 to 0.20); series: rt. geometric, Good]
$P_r = p \, q^{r-1}, \quad r = 1, 2, \ldots$
$P_r = c \, \dfrac{a^{r}}{r^{b}}, \quad r = 1, \ldots, n, \quad c^{-1} = \sum_{j=1}^{n} \dfrac{a^{j}}{j^{b}}$
Negative Hypergeometric Distribution
n = inventory size, x = class
2 parameters: K and M
$P_x = \dfrac{\dbinom{M + x - 2}{x - 1} \dbinom{K - M + n - x}{n - x + 1}}{\dbinom{K + n - 1}{n}}$
[Figure: observed frequencies f(i) vs. negative hypergeometric fit NP(i), by rank i]
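A minimal sketch of the negative hypergeometric probabilities follows. It assumes the 1-displaced form with classes x = 1, ..., n + 1 and real-valued parameters K > M > 0, so that n in the code is one less than the number of classes; the slide calls n the inventory size, so this indexing is an assumption. Binomial coefficients with non-integer arguments are evaluated through the log-gamma function, and the function names are hypothetical.

```python
from math import exp, lgamma


def log_binom(top, bottom):
    """Logarithm of the generalized binomial coefficient C(top, bottom)."""
    return lgamma(top + 1) - lgamma(bottom + 1) - lgamma(top - bottom + 1)


def neg_hypergeometric_pmf(K, M, n):
    """Return [P_1, ..., P_(n+1)] of the 1-displaced negative hypergeometric."""
    if not K > M > 0:
        raise ValueError("parameters must satisfy K > M > 0")
    log_denominator = log_binom(K + n - 1, n)
    return [
        exp(
            log_binom(M + x - 2, x - 1)
            + log_binom(K - M + n - x, n - x + 1)
            - log_denominator
        )
        for x in range(1, n + 2)
    ]
```

For K = 2 and M = 1 all classes are equally probable, which is a convenient sanity check before fitting K and M to observed letter frequencies.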
Analysis of Russian Letter Frequencies:
Corpus: 37 texts (ca. 8.5 million letters)
Analysis of Russian Letter Frequencies
Comparison of Texts, Text Segments, Text Cumulations, Text Mixtures, and Complete Corpus
Negative Hypergeometric Distribution
[Figure, two panels, x-axis: samples. Panel 1: Constancy of goodness of fit (C), y-axis 0.00 to 0.10. Panel 2: Constancy of parameters (K, M), y-axis 0.00 to 4.00; series: Parameter K, Parameter M]
Analysis of Slovene Letter Frequencies
Corpus: ca. 130,000 letters
Negative Hypergeometric Distribution
Goodness of fit (C = 0.0094)
[Figure: observed frequencies vs. negative hypergeometric fit, ranks 1-29]
Analysis of Slovene Letter Frequencies
Comparison of Texts, Text Segments, Text Cumulations, Text Mixtures, and Complete Corpus
Negative Hypergeometric Distribution
[Figure, two panels, x-axis: samples 1-20. Panel 1: Constancy of goodness of fit (C) for the negative hypergeometric distribution (NHG), y-axis 0.00 to 0.20. Panel 2: Constancy of parameters (K, M), y-axis 0.00 to 3.50; series: K, M]
Analysis of Slovene Letter and Phoneme Frequencies
Corpus: ca. 130,000
[Figure, two panels, ranks 1-29: observed frequencies vs. negative hypergeometric fit. Panel 1: Slovene letters. Panel 2: Slovene phonemes]
First Tentative Results of Slovak Letter Frequencies
[Figure: observed frequencies vs. negative hypergeometric fit, by rank]
Tasks:
1. Interpretation of parameters: the “foreign letters” Q, W, X influence inventory size
2. Exploration of the data basis: texts, text segments, text cumulations, text mixtures
The Question of Data Homogeneity
“[…] the magnitude of words tends, on the whole, to stand in an inverse (not necessarily proportionate) relationship
to the number of occurrences”
Zipf (1935: 25)
Four major problems in research
What is the direction of dependence:
Does frequency depend on length or vice versa?
What is the unit of measurement:
Is word length measured in letters, phonemes, syllables, morphemes, ...?
What is frequency:
Absolute occurrence or the rank of words, or of word forms?
What is the text basis:
Corpus data, frequency dictionaries, ..., individual texts?
Assuming that word length is a function of frequency,
measuring word length in the number of syllables per word,
and analyzing the absolute occurrence of words,
the influence of the text basis shall be tested:
individual texts vs. text cumulations vs. corpus data
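The data preparation implied by these decisions can be sketched as follows: count how often each word form occurs, group the word forms by their absolute frequency, and average the length in syllables within each frequency class. The syllable counter below is only a rough proxy (one syllable per Russian vowel letter) and, like the function names, is an assumption made for illustration.

```python
from collections import Counter, defaultdict

RUSSIAN_VOWELS = set("аеёиоуыэюя")


def syllables(word):
    """Rough proxy (assumption): one syllable per vowel letter."""
    return sum(ch in RUSSIAN_VOWELS for ch in word.lower())


def mean_length_by_frequency(tokens, length=syllables):
    """Map each absolute frequency x to the mean length y of the
    word forms that occur exactly x times in `tokens`."""
    counts = Counter(tokens)
    lengths = defaultdict(list)
    for word, freq in counts.items():
        lengths[freq].append(length(word))
    return {freq: sum(vals) / len(vals) for freq, vals in sorted(lengths.items())}


# Toy input, invented for illustration only.
toy_tokens = "и он и она и дом он".split()
toy_result = mean_length_by_frequency(toy_tokens)
```

Running the same function on a single chapter, on cumulated chapters, and on a mixed corpus yields the (x, y) series whose comparison is the subject of the test.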
DATA HOMOGENEITY
Intertextual inhomogeneity vs. intratextual inhomogeneity
Intertextual inhomogeneity: combination (“mixture”) of different texts
• different languages
• different authors
• different text types
Intratextual inhomogeneity: a ‘text’ in itself does not consist of homogeneous elements
• complete novel, composed of chapters
• complete book of a novel, consisting of several chapters
• individual chapters
• dialogical vs. narrative sequences within a text
$\dfrac{dy}{y - 1} = -b \, \dfrac{dx}{x}$
$y = a x^{-b} + 1$
$x = A (y - 1)^{-B}$, with $A = a^{1/b}$, $B = 1/b$
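The steps between the differential equation and the two closed forms can be written out as a short derivation. Writing the integration constant as ln a is a choice made here so that the result matches the parameter names used for the fitted curve.

```latex
\begin{align*}
\frac{dy}{y - 1} &= -b \, \frac{dx}{x} \\
\ln (y - 1) &= -b \ln x + \ln a \\
y &= a x^{-b} + 1 \\[4pt]
x &= \left( \frac{y - 1}{a} \right)^{-1/b}
   = a^{1/b} (y - 1)^{-1/b}
   = A (y - 1)^{-B},
   \qquad A = a^{1/b}, \quad B = \frac{1}{b}
\end{align*}
```

The first form treats length y as a function of frequency x; the second is the same relation solved for frequency, so both directions of dependence are covered by one pair of parameters.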
Russian: Anna Karenina (ch. 1)
x (frequency) | y (mean length, observed) | ŷ (theoretical)
1 | 2.92 | 3.03
2 | 2.14 | 2.04
3 | 2.05 | 1.70
4 | 1.50 | 1.53
5 | 1.33 | 1.43
6 | 1.50 | 1.36
7 | 1.67 | 1.31
8 | 1.00 | 1.27
9 | 1.00 | 1.24
10 | 1.00 | 1.22
13 | 1.00 | 1.17
19 | 1.00 | 1.12
20 | 1.00 | 1.11
37 | 1.00 | 1.06
a = 2.0261, b = 0.9660, R² = 0.88, N = 397
[Figure: observed vs. theoretical mean word length (y-axis) by frequency (x-axis)]
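The theoretical column can be recomputed from the reported parameters. The sketch below uses the values given for Anna Karenina (ch. 1), evaluates y = a x^(-b) + 1, and computes the coefficient of determination as 1 minus the ratio of residual to total sum of squares; that this is the R² definition used in the study is an assumption.

```python
# Values transcribed from the Anna Karenina (ch. 1) data.
x_values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 13, 19, 20, 37]
y_observed = [2.92, 2.14, 2.05, 1.50, 1.33, 1.50, 1.67,
              1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00]
y_reported = [3.03, 2.04, 1.70, 1.53, 1.43, 1.36, 1.31,
              1.27, 1.24, 1.22, 1.17, 1.12, 1.11, 1.06]
a, b = 2.0261, 0.9660


def model(x, a, b):
    """Mean word length as a function of frequency: y = a * x**(-b) + 1."""
    return a * x ** (-b) + 1


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


y_model = [model(x, a, b) for x in x_values]
r2 = r_squared(y_observed, y_model)
```

Rounded to two decimals, the recomputed values can be compared directly with the theoretical column, and r2 with the reported R².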
Text | Language | N | R² | a | b
Anna Karenina (I,1) | Russian | 397 | 0.88 | 2.03 | 0.97
Evgenij Onegin (I) | Russian | 1871 | 0.96 | 1.70 | 0.79
Na badnjak | Croatian | 2450 | 0.93 | 1.95 | 0.51
Zářivé hlubiny | Czech | 1363 | 0.94 | 1.76 | 0.59
Hiša M.P. (I) | Slovenian | 1147 | 0.84 | 1.80 | 0.40
Zakliata panna | Slovak | 926 | 0.88 | 1.48 | 0.69
Hänsel und Gretel | German | 803 | 0.87 | 1.16 | 0.51
Fairy Tale by Móra | Hungarian | 234 | 0.96 | 1.57 | 0.84
Di lembung kuring | Sundanese | 431 | 0.91 | 1.86 | 0.51
Burung api | Indonesian | 1393 | 0.92 | 2.44 | 0.26
Portrait of a Lady (I) | English | 1104 | 0.89 | 1.23 | 0.83
0.84 ≤ R² ≤ 0.96
[Figure: The course of the theoretical curves for texts 1-11 (x-axis: frequency 1-10; y-axis: mean word length 1 to 4)]
[Figure: The relationship between parameters a and b (x-axis: parameter a, 1 to 2.5; y-axis: parameter b, 0 to 1.2)]
[Figure: The relationship between text length (N) and parameter a (x-axis: text length (N); y-axis: parameter a, 0 to 3)]
Obvious data inhomogeneity
1. Texts from different languages, authors, and various text types
2. Violation of the ceteris paribus condition
Ergo: The data in this mixture are not adequate
for testing the hypothesis at stake
Lev N. Tolstoj: Anna Karenina, Chap. I,1 vs. I (34 chapters)
[Figure: mean word length (y-axis) by word frequency (x-axis); series: A.K. (I,1) empirical, A.K. (I,1) theoretical, A.K. (I) empirical, A.K. (I) theoretical]
Sample | N (Types) | C | a | b
AK (I,1) | 397 | 0.97 | 2.03 | 0.97
AK (I) | 8661 | 0.86 | 2.60 | 0.27
Henry James: Portrait of a Lady, Chap. 1 vs. novel (52 chapters)
Sample | N (Types) | C | a | b
I | 1104 | 0.89 | 1.23 | 0.83
I-52 | 10727 | 0.58 | 1.84 | 0.27
[Figure, two panels: observed vs. theoretical mean word length by word frequency]
Ks.Š. Gjalski: Na badnjak, narrative vs. dialogical sequences
Sample | N (Types) | C | a | b
narrative | 1913 | 0.91 | 1.93 | 0.54
dialogues | 673 | 0.96 | 1.61 | 0.84
[Figure: mean word length (y-axis, 0.00 to 3.50) by word frequency (x-axis); series: Narration, Dialogue]
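Evgenij Onegin: Text cumulation (I – VIII)
Chapter | N (Types) | M (Tokens) | a | b | R²
I | 1871 | 3209 | 1.70 | 0.79 | 0.96
I+II | 2918 | 5546 | 1.84 | 0.69 | 0.88
I-III | 3951 | 8359 | 1.92 | 0.57 | 0.88
I-IV | 4851 | 10936 | 1.97 | 0.53 | 0.92
I-V | 5737 | 13376 | 1.95 | 0.48 | 0.94
I-VI | 6509 | 15978 | 1.97 | 0.52 | 0.94
I-VII | 7476 | 19061 | 2.03 | 0.43 | 0.86
text (I-VIII) | 8329 | 22482 | 2.05 | 0.40 | 0.88
[Figure: theoretical curves for the cumulation stages 1, 1-2, 1-3, 1-4, 1-5, 1-6, 1-7, complete text (x-axis: frequency 1-10; y-axis: mean word length 1.00 to 3.50)]
Results of fitting $y = a x^{-b} + 1$ to the cumulative text of Evgenij Onegin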
Evgenij Onegin – text cumulation (chap. I – VIII)
Dependence of parameter b on parameter a
Fitting $y = a x^{-b}$: R² = 0.92
[Figure: parameter b (y-axis, 0.00 to 1.00) plotted against parameter a (x-axis, 1.6 to 2.1)]
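The dependence of b on a can be explored with the eight (a, b) pairs reported for the cumulation stages of Evgenij Onegin. The sketch below fits a power function by ordinary least squares on the log-log scale. This is a swapped-in, simple technique; the slides do not say how the curve was fitted, so the estimates and the resulting R² need not coincide with the reported value.

```python
from math import exp, log

# (a, b) pairs for the stages I, I+II, ..., I-VIII.
a_values = [1.70, 1.84, 1.92, 1.97, 1.95, 1.97, 2.03, 2.05]
b_values = [0.79, 0.69, 0.57, 0.53, 0.48, 0.52, 0.43, 0.40]


def fit_power_law(xs, ys):
    """Fit y = alpha * x**beta by linear least squares on log x, log y.
    Returns (alpha, beta)."""
    log_x = [log(x) for x in xs]
    log_y = [log(y) for y in ys]
    mean_x = sum(log_x) / len(log_x)
    mean_y = sum(log_y) / len(log_y)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    beta = sxy / sxx
    alpha = exp(mean_y - beta * mean_x)
    return alpha, beta


alpha, beta = fit_power_law(a_values, b_values)
```

A negative exponent beta reflects what the table shows: as a grows with the cumulated text, b shrinks.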
Evgenij Onegin: Text cumulation (I – VIII)
Dependence of a on text length (N):
$a = 0.6493 \, N^{0.1286}$ (R² = 0.96)
[Figure: parameter a (y-axis, 1.50 to 2.10) by text length in word-form types (x-axis)]
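The reported power function can be checked against the fitted values of a for the eight cumulation stages. The sketch assumes that N is the number of word-form types, as the axis label of the figure suggests.

```python
# Word-form types and fitted parameter a for the stages I, I+II, ..., I-VIII.
type_counts = [1871, 2918, 3951, 4851, 5737, 6509, 7476, 8329]
a_fitted = [1.70, 1.84, 1.92, 1.97, 1.95, 1.97, 2.03, 2.05]


def a_from_text_length(n_types):
    """Reported dependence of a on text length: a = 0.6493 * N**0.1286."""
    return 0.6493 * n_types ** 0.1286


a_predicted = [a_from_text_length(n) for n in type_counts]
deviations = [p - f for p, f in zip(a_predicted, a_fitted)]
```

The deviations show how closely the power function tracks the stage-wise fits; they stay small relative to the range of a covered by the eight stages.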
Summary & Results (I)
Data corroborate the hypothesis L = f(F): $y = a x^{-b} + 1$
There is a specific interrelation of parameters:
a = f(N), b = g(a), b = h(N)
f, g, h are functions of the same type
Summary & Results (II)
1. Homogeneous texts do not interfere with linguistic laws; inhomogeneous texts can distort the textual reality.
2. Text mixtures can evoke phenomena which do not exist as such in individual texts.
3. Short texts do not allow a property to take appropriate shape; long texts (and corpora) contain mixed generating regimes superposing different layers, which may lead to “artificial” phenomena.
4. With an increase of text size, the resulting curve of the frequency-length relationship is shifted upwards; this is caused by the fact that the number of words occurring only once increases up to a certain text length. If this assumption is correct, then b converges to zero, yielding the limit y = a.
FINIS