Dynamic Development of Vocabulary Richness of Text
Miroslav Kubát & Radek Čech
University of Ostrava, Czech Republic
Aim
• To analyze the dynamic development of vocabulary richness from a methodological point of view.
• Q1: Do different text segmentations affect the development?
• Q2: Are there significant differences between texts?
• Q3: Are there significant differences between genres?
Example of different vocabulary richness development in two hypothetical texts
[Figure: line chart; x-axis: chapter (1 to 10); y-axis: vocabulary richness (0 to 1); series: text_1, text_2]
Development of two texts, segmentation 300 tokens
Average vocabulary richness: anglické_listy 0.791, výlet_do_španěl 0.789 (non-significant difference according to the u-test, α = 0.05)
[Figure: line chart; x-axis: sequence number (1 to 33); y-axis: vocabulary richness (0.74 to 0.84); series: anglické_listy, výlet_do_španěl]
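Distance (d) measurement

$$d_i = \left[ \left( MATTR_i - MATTR_{i+1} \right)^2 + 1 \right]^{1/2}$$

[Figure: two consecutive points of the MATTR curve; the vertical leg is MATTR_i − MATTR_{i+1}, the horizontal leg is (i+1) − i = 1; x-axis: i (0 to 5); y-axis: MATTR (0 to 0.5)]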
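As a minimal sketch of the distance measure, the following Python functions treat consecutive MATTR values as points one unit apart on the x-axis and return the Euclidean distance between each neighbouring pair. The names `distances` and `average_distance` are illustrative assumptions, not taken from the slides.

```python
import math


def distances(mattr_values):
    """Return d_i = sqrt((MATTR_i - MATTR_{i+1})**2 + 1) for each neighbouring pair."""
    return [
        math.sqrt((current - following) ** 2 + 1)
        for current, following in zip(mattr_values, mattr_values[1:])
    ]


def average_distance(mattr_values):
    """Mean of all d_i; needs at least two MATTR values."""
    d = distances(mattr_values)
    if not d:
        raise ValueError("at least two MATTR values are required")
    return sum(d) / len(d)
```

Because the horizontal step is fixed at 1, every d_i is at least 1, and it equals 1 only when two neighbouring segments have the same MATTR.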
Vocabulary richness MATTR
• A text is divided into overlapping subtexts (windows) of the same length.
• TTR is computed for every window.
• MATTR is defined as the mean of these TTR values.

$$MATTR(L) = \frac{\sum_{i=1}^{N-L+1} V_i}{L(N-L+1)}$$

L … arbitrarily chosen length of a window, L < N
N … text length in tokens
V_i … number of types in an individual window

Vocabulary richness MATTR: example
Text: a, b, c, a, a, d, f (L = 3, N = 7)
Windows: a, b, c | b, c, a | c, a, a | a, a, d | a, d, f

$$MATTR(3) = \frac{\sum_{i=1}^{N-L+1} V_i}{L(N-L+1)} = \frac{3+3+2+2+3}{3(7-3+1)} = \frac{13}{15} \approx 0.87$$
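The MATTR definition can be sketched in a few lines of Python. This is an illustrative implementation under the definition given on the slide (every window of L consecutive tokens, shifted by one token); the function name `mattr` is an assumption.

```python
def mattr(tokens, window):
    """Moving-average type-token ratio: mean TTR over all overlapping windows."""
    n = len(tokens)
    if not 0 < window < n:
        raise ValueError("window length L must satisfy 0 < L < N")
    # V_i is the number of types (distinct tokens) in the i-th window.
    type_counts = [len(set(tokens[i:i + window])) for i in range(n - window + 1)]
    return sum(type_counts) / (window * (n - window + 1))


example = ["a", "b", "c", "a", "a", "d", "f"]
print(round(mattr(example, 3), 2))  # 0.87
```

The printed value reproduces the worked example: the five windows contain 3, 3, 2, 2 and 3 types, so MATTR(3) = 13/15.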
Data
• 16 long Czech texts.
• 3 genres (travel books, novels, scientific texts).
• One author (Karel Čapek).
Three ways of text segmentation
• Chapters
• Paragraphs
• Sequences of 300 tokens
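The third segmentation can be sketched as follows. This is a minimal example, assuming the text is already available as a list of tokens; the function name `fixed_length_sequences` and the naive whitespace tokenization in the usage lines are hypothetical. The incomplete final part is dropped so that all sequences have the same length.

```python
def fixed_length_sequences(tokens, size=300):
    """Split tokens into consecutive sequences of `size` tokens, dropping the remainder."""
    if size <= 0:
        raise ValueError("size must be positive")
    full = len(tokens) // size  # number of complete sequences
    return [tokens[i * size:(i + 1) * size] for i in range(full)]


# Hypothetical usage with a naive whitespace tokenizer (an assumption).
sample_tokens = "a b c a a d f".split()
print(fixed_length_sequences(sample_tokens, size=3))
# [['a', 'b', 'c'], ['a', 'a', 'd']]
```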
Development of two texts, segmentation 300 tokens
Wilcoxon-Mann-Whitney Test p-value: 0.00004
[Figure: line chart; x-axis: 1 to 65; y-axis: d (1 to 1.009); series: výlet do španěl, anglické listy]
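The slides report a Wilcoxon-Mann-Whitney test on the d values of two texts but do not say how it was computed. The following is a minimal sketch, assuming the normal approximation with midranks for ties and no continuity correction; the function name `mann_whitney_u` and the toy input values are hypothetical.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Wilcoxon-Mann-Whitney test (normal approximation).

    Returns (U statistic of sample x, p-value). Tied values receive midranks
    and the variance includes the usual tie correction.
    """
    n1, n2 = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # average of ranks i+1 .. j
        group = j - i
        tie_term += group ** 3 - group
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2
    n = n1 + n2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0
    z = (u - mean_u) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return u, p_value


# Toy d values for two hypothetical texts (not data from the slides).
d_text_a = [1.001, 1.002, 1.003]
d_text_b = [1.004, 1.005, 1.006]
u_stat, p = mann_whitney_u(d_text_a, d_text_b)
print(u_stat, p < 0.05)
```

For samples as small as the toy lists an exact test would be preferable; the approximation is shown only because the segmentations in the slides yield dozens of values per text.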
Average distances (d) in texts
[Figure: y-axis: average d (1 to 1.0035); labels: chapters, paragraphs, sequences_300]
Average distances (d)
[Figure: y-axis: average d (1 to 1.0018); labels: chapters, paragraphs, sequences_300]
Average distances (d)
[Figure: y-axis: average d (1 to 1.001); labels: sequences_300, sequences_500, sequences_1000]
Wilcoxon-Mann-Whitney Test (α = 0.05) | Chapter segmentation
Pairwise p-values; × marks the diagonal. Column numbers refer to the texts as numbered in the rows, with 1 = anglické listy.

text                                1      2      3      4      5      6      7      8      9      10     11     12     13     14     15
2  cesta na sever                   0.385  ×
3  italské listy                    0.469  0.920  ×
4  obrázky z holandska              0.240  0.582  0.539  ×
5  výlet do španěl                  0.688  0.111  0.212  0.116  ×
6  hordubal                         0.468  0.030  0.078  0.034  0.754  ×
7  krakatit                         0.941  0.229  0.340  0.135  0.526  0.304  ×
8  obyčejný život                   0.391  0.941  0.997  0.515  0.174  0.028  0.289  ×
9  povětroň                         0.740  0.101  0.227  0.089  0.877  0.700  0.591  0.117  ×
10 první parta                      0.081  0.000  0.003  0.002  0.154  0.164  0.012  0.000  0.075  ×
11 továrna na absolutno             0.041  0.107  0.123  0.628  0.011  0.003  0.008  0.102  0.009  0.000  ×
12 válka s mloky                    0.335  0.933  0.834  0.548  0.120  0.045  0.223  0.870  0.126  0.001  0.160  ×
13 život a dílo skladatele foltýna  0.721  0.116  0.241  0.164  0.845  0.867  0.553  0.264  0.924  0.224  0.046  0.172  ×
14 objektivní metoda                0.027  0.000  0.001  0.001  0.071  0.087  0.002  0.000  0.034  0.677  0.000  0.000  0.147  ×
15 pragmatismus                     0.684  0.138  0.215  0.184  0.978  0.937  0.525  0.257  0.879  0.202  0.048  0.129  0.979  0.123  ×
16 směry v estetice                 0.529  0.781  0.892  0.558  0.242  0.061  0.329  0.980  0.221  0.004  0.188  0.933  0.315  0.003  0.334

29 significant differences
Wilcoxon-Mann-Whitney Test (α = 0.05) | Paragraph segmentation
Pairwise p-values for the 16 texts (× marks the diagonal): anglické listy, cesta na sever, italské listy, obrázky z holandska, výlet do španěl, hordubal, krakatit, obyčejný život, povětroň, první parta, továrna na absolutno, válka s mloky, život a dílo skladatele foltýna, objektivní metoda, pragmatismus, směry v estetice
0.019 × 0.138
0.454×
0.091
0.000 0.003 ×
0.023 0.035 0.453
0.977 0.483 0.983 0.518 0.036 0.284
0.001 × 0.001 0.951× 0.014 0.059 0.062×
0.003 0.043 0.063
0.645 0.245 0.580 0.748 0.441 0.925
0.000 0.000 0.001
0.654 0.652 0.490
0.699 0.658 0.533
0.003 × 0.066 0.268 × 0.148 0.165 0.716×
0.107
0.000 0.002
0.837
0.000
0.000
0.006
0.000
0.000
0.000 ×
0.012
0.962 0.438
0.000
0.917
0.994
0.015
0.613
0.549
0.402
0.000 ×
0.024
0.972 0.528
0.000
0.969
0.972
0.049
0.643
0.658
0.506
0.000
0.983
0.492
0.056 0.323
0.022
0.082
0.096
0.965
0.008
0.092
0.201
0.011
0.036
0.078 ×
0.003
0.463 0.142
0.000
0.397
0.502
0.005
0.681
0.173
0.115
0.000
0.432
0.463
0.010 ×
0.481
0.073 0.348
0.021
0.108
0.110
0.909
0.014
0.120
0.230
0.013
0.054
0.094
0.993
45 significant differences
×
0.012
Wilcoxon-Mann-Whitney Test (α = 0.05) | 300 Tokens segmentation
Pairwise p-values for the 16 texts (× marks the diagonal): anglické listy, cesta na sever, italské listy, obrázky z holandska, výlet do španěl, hordubal, krakatit, obyčejný život, povětroň, první parta, továrna na absolutno, válka s mloky, život a dílo skladatele foltýna, objektivní metoda, pragmatismus, směry v estetice
0.234× 0.023
0.175×
0.487
0.944 0.228×
0.000 0.001 0.015
0.001 0.120 0.018 0.602 0.320 0.421
0.009× 0.064 0.117× 0.332 0.001 0.063×
0.039 0.185 0.010
0.397 0.456 0.888 0.116 0.166 0.651
0.420 0.852 0.234
0.003 0.000 0.011
0.075 0.008 0.215
0.880× 0.212 0.360× 0.585 0.521 0.106×
0.132
0.753 0.260
0.645
0.002
0.037
0.514
0.605
0.696
0.320×
0.002
0.045 0.955
0.149
0.044
0.555
0.151
0.186
0.015
0.443
0.083×
0.050
0.387 0.526
0.377
0.011
0.155
0.941
0.846
0.348
0.688
0.601
0.311×
0.018
0.150 0.748
0.256
0.022
0.302
0.553
0.519
0.135
0.980
0.331
0.616
0.709×
0.015
0.093 0.775
0.130
0.242
0.840
0.245
0.298
0.078
0.425
0.156
0.828
0.345
0.452×
0.219
0.913 0.115
0.753
0.000
0.012
0.303
0.343
0.957
0.162
0.647
0.034
0.349
0.156
28 significant differences
0.066
Average distances (d) in genres
[Figure: y-axis: average d (1 to 1.0025); segmentations: chapters, paragraphs, sequences_300; genres: travel_book, novel, scientific_text]
u-test, α = 0.05; u ≥ 1.96 means a significant difference

CHAPTERS          travel book   novel
novel             1.008178      ×
scientific text   1.604092      0.807497

PARAGRAPHS        travel book   novel
novel             0.512984      ×
scientific text   0.60703       0.11041

300_TOKENS        travel book   novel
novel             0.482236      ×
scientific text   0.751127      0.578005
There is no significant difference between genres in our corpus.
Chapters
Advantages:
• "Natural" units.
• Relatively long units.
• Appropriate units for linguistic and literary research.
Disadvantages:
• Different lengths.

Paragraphs
Advantages:
• "Natural" units.
• Appropriate units for linguistic and literary research.
Disadvantages:
• Very short units for vocabulary richness measurement.
• Different lengths.

Sequences of text with the same length
Advantages:
• Same length.
• Good length for vocabulary richness measurement.
Disadvantages:
• Artificial units.
• Last part of a text must be removed.
• Problematic linguistic and literary interpretation.
Preliminary Conclusion and Discussion
• The longer the text sequences, the smaller the average distances between subsequent sequences.
• Neither chapters nor paragraphs are suitable for development analysis because their lengths vary.
• Sequences of an arbitrarily chosen length seem to be the best option; however, the artificial character of these units makes the results difficult to interpret.
• The development of vocabulary richness could be used for text analysis.
• Differences among genres were not corroborated in our corpus.
• Is there another unit suitable for this analysis?
Thank You For Your Attention!