Development of a speech recognizer for Castilian
Roselien Bommerez
Promotor: Carmen Garcia-Mateo, Patrick Wambacq
Supervisor: Laura Docio, Jacques Duchateau
May 19, 2006
Copyright

Copyright by K.U.Leuven. Zonder voorafgaande schriftelijke toestemming van zowel de promotoren als de auteur is overnemen, kopiëren, gebruiken of realiseren van deze uitgave of gedeelten ervan verboden. Voor aanvragen tot, of informatie i.v.m. het overnemen en/of gebruik en/of realisatie van gedeelten uit deze publicatie, kunt U zich wenden tot de K.U.Leuven, Departement Elektrotechniek - ESAT, Kasteelpark Arenberg 10, B-3001 Heverlee (België). Tel. +32-16 32 11 30 & Fax. +32-16-32 19 86 of via e-mail: [email protected]

Voorafgaande schriftelijke toestemming van de promotoren is eveneens vereist voor het aanwenden van de in dit afstudeerwerk beschreven (originele) methoden, producten, schakelingen en programma's voor industrieel of commercieel nut en voor de inzending van deze publicatie ter deelname aan wetenschappelijke prijzen of wedstrijden.
Copyright by K.U.Leuven. Without written permission of the promotors and the author it is forbidden to reproduce or adapt in any form or by any means any part of this publication. Requests for obtaining the right to reproduce or utilize parts of this publication should be addressed to K.U.Leuven, Departement Elektrotechniek - ESAT, Kasteelpark Arenberg 10, B-3001 Heverlee (Belgium). Tel. +32-16-32 11 30 & Fax. +32-16-32 19 86 or by e-mail: [email protected]

A written permission of the promotors is also required to use the methods, products, schematics and programs described in this work for industrial or commercial use, and for submitting this publication in scientific contests.
Preface

In the first place I would like to express my gratitude to my promotors and supervisors in Leuven and in Vigo. I am grateful to Prof. Dr. Ir. Carmen García-Mateo and Dra. Ir. Laura Docio Fernández from the University of Vigo for letting me take part in their research and for their expert help. I would like to thank Prof. Dr. Ir. Patrick Wambacq and Dr. Ir. Jacques Duchateau for their frequent and competent advice and their assistance in realizing and finalizing this dissertation.

Secondly I would like to thank Prof. Dr. Ir. Dirk Van Compernolle and Prof. Dr. Ir. Yves Moreau for reading and assessing this dissertation. Furthermore I wish to thank all my professors and research assistants. They have guided me for the past five years and have imparted to me the knowledge necessary to bring this dissertation to a successful end.

Finally my gratitude goes to my family and my closest friends for their unrelenting support.
Summary

This dissertation describes the development and testing of a speech recognizer for Castilian speech. The training and testing are based on a database of 23 radio shows in Castilian. This database was first divided into a training set, a validation set and a test set, using a number of criteria.

Once this division had been made, the language models were developed. The language models are trigram models trained on the training set of the radio shows. Next to the universal model, five adapted models were also estimated. Three of them were adapted to the three main topics of the radio shows: politics, culture and news. The other two were adapted to the style of speech: spontaneous or planned.

The last step to take before recognition could be done was the building of the acoustic models. This was done starting from a seed model, which was then adapted by means of a subset of the training set containing only clean speech. The adaptation was done using the MLLR and MAP techniques. Two sex-dependent models were developed in addition to a universal model.

Based on the developed models, recognition experiments were done. Using the universal language and acoustic models a WER of 61.28% was obtained. Several attempts were made to improve this result. Some of them didn't have the desired effect, e.g. varying the global pruning factor and disabling voice activity detection. Topic adaptation didn't help either. Three big improvements were made. The first one was the use of the adapted acoustic models, which caused the WER to drop by 3%. A second one was obtained when the word insertion penalty was altered: a word insertion penalty of -5 reduced the WER by 1.3%. The third and biggest WER reduction was reached when 13 minutes of corrupted speech in the test set were removed. This resulted in a WER of 53.84%. Combining this with the first improvement led to a WER of 50.97%. The experiment in which this is combined with the second improvement was not done, but extrapolation indicates that in this way a WER of 50.21% would have been reached.
Contents

1 Introduction
2 Theoretical background
  2.1 Introduction
  2.2 Language Model
    2.2.1 N-grams
    2.2.2 Smoothing techniques
    2.2.3 Evaluation
  2.3 Acoustic Model
    2.3.1 The observation vector
    2.3.2 Continuous Hidden Markov Models
  2.4 Speech Recognition
  2.5 Conclusion
3 Experimental framework
  3.1 Introduction
  3.2 Recognizer
  3.3 Database
    3.3.1 Radio shows
    3.3.2 El Correo Gallego
  3.4 Conclusion
4 Division of the database
  4.1 Introduction
  4.2 Topic classification
  4.3 Statistics of the database
  4.4 Division of the database
  4.5 Conclusion
5 Language Model
  5.1 Introduction
  5.2 Language models based on the radio shows
  5.3 El Correo Gallego
  5.4 Evaluation of the simple language models
  5.5 Interpolation of the two language models
  5.6 Evaluation of the interpolated language models
  5.7 Additional language models
  5.8 Preparation for recognition use
  5.9 Conclusion
6 Acoustic Model
  6.1 Introduction
  6.2 Filtering of useless speech
  6.3 Acoustic Model adaptation
  6.4 Conclusion
7 Recognition experiments
  7.1 Introduction
  7.2 Generation of the Master Label Files
  7.3 Recognition parameters and evaluation
  7.4 Universal model
  7.5 Sex-dependent acoustic models
  7.6 Topic adapted language models
  7.7 Change of some recognition parameters
    7.7.1 Global beam width pruning factor
    7.7.2 Voice Activity Detection
    7.7.3 Word Insertion Penalty
  7.8 Ceiling model
  7.9 Unigram and bigram language models
  7.10 Analysis of the results
    7.10.1 Spontaneous speech
    7.10.2 Heterogeneity of the database
  7.11 Combining the improvements
  7.12 Comparison with the Galician recognizer
  7.13 Conclusion
8 Conclusion
List of Tables

3.1 General information about the radio shows
4.1 The topics present in the database
4.2 The three chosen topics
4.3 Statistics of the radio shows
4.4 Statistics of the chosen sets
5.1 Vocabulary sizes and number of words of the training sets of the language models mXdXtX
5.2 Evaluation of the simple LMs
5.3 The computed lambdas
5.4 Evaluation of the interpolated LMs, voc size 20 Kwords
5.5 Evaluation of the interpolated LMs, voc size 10 Kwords
5.6 The effect of pruning
5.7 Evaluation of the unigram and bigram LMs (for m0d0t0)
5.8 Chart of phonetic transcriptions
6.1 Statistics of speakers appearing in more than one radio show
7.1 Recognition results when using the universal models
7.2 Recognition of female speakers when using the female acoustic model
7.3 Recognition of male speakers when using the male acoustic model
7.4 Recognition results when using the sex-dependent acoustic models
7.5 Recognition of male speakers when using the female acoustic model
7.6 Recognition of female speakers when using the male acoustic model
7.7 Recognition of culture when using the LM adapted to the topic 'culture'
7.8 Recognition of politics when using the LM adapted to the topic 'politics'
7.9 Recognition of news when using the LM adapted to the topic 'news'
7.10 Recognition results when using the topic adapted LMs
7.11 Recognition results for various pruning factors
7.12 Recognition results when disabling voice activity detection, for various pruning factors
7.13 Recognition results for various word insertion penalties
7.14 Recognition results when using the ceiling language model
7.15 Recognition results when using the unigram language model
7.16 Recognition results when using the bigram language model
7.17 Recognition of spontaneous speech when using the universal and the matched language model
7.18 Recognition of planned speech when using the universal and the matched language model
7.19 Recognition results of show 5
7.20 Recognition results of show 10
7.21 Recognition results of show 21
7.22 Statistics of the test shows
7.23 Recognition results of show 5 when deleting the corrupted turns
7.24 Recognition results when using the universal models and deleting the corrupted turns
7.25 Recognition of female speakers when using the female acoustic model and deleting the corrupted turns
7.26 Recognition of male speakers when using the male acoustic model and deleting the corrupted turns
7.27 Recognition results when using the sex-dependent acoustic models and deleting the corrupted turns
7.28 Recognition results for word insertion penalty -5 when deleting the corrupted turns
7.29 Amount of speech available for adapting the specific acoustic models of the Transcrigal project
7.30 OOV rate and perplexity of the LMs of the Transcrigal project
7.31 Results of the recognition experiments of the Transcrigal project
List of Figures

3.1 The tree-based recognition system [5]
3.2 Example of an info file
7.1 Example of a Master Label File
7.2 Parameter settings
Chapter 1 Introduction

Automatic speech recognition is a topic that has kept researchers busy for many years now. The first attempts to develop a speech recognition engine go all the way back to the fifties. Since then many new techniques, strategies and algorithms have been developed. It is in this domain that this dissertation has to be situated.

The research was done during the first term at the Signal Processing Group of the University of Vigo in Spain. Within their Transcrigal project, a recognition system had already been developed for the transcription of television news shows in Galician, a language spoken in Galicia in the north-west of Spain. My job was to develop new language and acoustic models for Castilian, the standard Spanish language, using the same approach and aiming for comparable recognition results.

The database at my disposal consisted of 23 radio news shows in Castilian. This database had already been manually transcribed. The language and acoustic models were developed based on these shows. A point of special interest in the research was the adaptation of the models to the topic and the style of speech and to the sex of the speaker.

N-grams are used for the estimation of the language models. Continuous Hidden Markov Models are used for the development of the acoustic models. These two concepts will be explained in the theoretical background discussed in Chapter 2. The way they can be combined, and the way they are combined specifically in this research, will also be discussed in this chapter.

In Chapter 3 the experimental framework will be explained. The recognition engine available for the testing phase and, more importantly, the database of training and testing material will be described.

In Chapter 4 the division of this database will be considered. With the help of certain criteria, the material from the radio news shows was divided into a training
set, a validation set and a test set.

In Chapter 5 the development and the evaluation of the language models will be reported on. Not only a universal model, but also several adapted models will be described.

In Chapter 6 the development of the acoustic models will be described. However, this development was mainly done by my supervisor in Vigo, because there was not enough time for me to become sufficiently familiar with this subject.

In Chapter 7 the final recognition experiments will be reported on in detail. The results will be described and analyzed. Finally a comparison with the recognition results of the Transcrigal project will be made.

In Chapter 8 a final conclusion will be drawn about the results and possible future work regarding this subject.

In Appendix A a part of a transcription output file is annexed. At the end of this dissertation a summary of the research, written in Dutch, is added.
Chapter 2 Theoretical background

2.1 Introduction

In this chapter some theoretical background information about speech recognition will be given.

The goal of speech recognition is to determine which word sequence $\hat{W} = W_1 W_2 \ldots W_n$ is the most probable when an acoustic sequence $O = O_1 O_2 \ldots O_n$ is observed. The word sequence with the maximum posterior probability $P(W|O)$ is searched for, using Bayes' rule:

$$P(W|O) = \frac{P(W) \cdot P(O|W)}{P(O)}$$

Since $O$ is fixed for the search, $P(O)$ always remains the same. This means that the above maximization comes down to a maximization of $P(W) \cdot P(O|W)$. The probability $P(W)$ is represented in a language model, while the probability $P(O|W)$ is represented in an acoustic model [1]. These two models and their use in recognition engines will be discussed in this chapter.
2.2 Language Model

2.2.1 N-grams
A language model (LM) can be either statistical or deterministic. Deterministic models either accept or reject a sequence of words. Statistical models, however, estimate a probability $P(W)$ for each sequence of words. $P(W)$ can be decomposed into a
product of conditional probabilities:

$$P(W) = P(w_1) \cdot P(w_2|w_1) \cdot P(w_3|w_1, w_2) \cdots P(w_n|w_1, \ldots, w_{n-1}) = \prod_{i=1}^{n} P(w_i|w_1, w_2, \ldots, w_{i-1})$$
Taking into account the whole sentence history of a word would however lead to far too many probabilities that have to be estimated. With a vocabulary size $V$ and a sentence of length $N$, $V^N$ parameters would have to be estimated. A solution to this complexity problem is to assume that the probability of a word, given its history, only depends on a fixed number of previous words. This leads to n-gram language models, where $n-1$ is the number of previous words the probability depends on:

$$P(w_i|w_1, \ldots, w_{i-1}) = P(w_i|w_{i-n+1}, \ldots, w_{i-1})$$

This type of language model, and more specifically the trigram model, is used in this dissertation. In a trigram model the conditional probability of a word $P(w_i|w_1, \ldots, w_{i-1})$ is estimated by $P(w_i|w_{i-2}, w_{i-1})$. In this way the number of parameters that have to be estimated is at most $V^3$, since not all trigrams exist. Similarly we can use unigram probabilities $P(w_i)$ or bigram probabilities $P(w_i|w_{i-1})$.

A training set is needed to estimate these conditional probabilities. The parameters are estimated in such a way that this training set has the highest total probability of occurring. This is done by Maximum Likelihood Estimation (MLE). This estimator turns out to be the number of times that the word preceded by its history occurs in the training set, scaled by the number of times the history itself occurs in the set:

$$P(w_i|w_{i-2}, w_{i-1}) = \frac{C(w_{i-2}, w_{i-1}, w_i)}{C(w_{i-2}, w_{i-1})}$$

These counts $C$ are gathered from the training set. This makes n-gram language models quite easy to generate, since no manual work is involved.

A disadvantage however is that not every possible trigram occurs in the training set. Few of them occur often and some occur only once, but most of them never occur. Trigrams that never occur have zero probability. For some trigrams this is desirable, because they indeed never occur in the modeled language, but for others it is very undesirable. If one trigram of a sentence has zero probability, the whole sentence has zero probability, since the probabilities of all the trigrams of the sentence are multiplied. This means that such a sentence can never be recognized correctly. To prevent these situations from happening, smoothing is done.
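To make the estimation concrete, the following Python sketch (my own illustration; the thesis itself relies on the SRILM toolkit introduced in Chapter 5) gathers trigram counts from a tokenized training corpus and converts them into maximum-likelihood conditional probabilities. The sentence markers <s> and </s> are an assumption for handling sentence boundaries.

```python
from collections import defaultdict

def train_trigram_mle(sentences):
    """Estimate P(w_i | w_{i-2}, w_{i-1}) by maximum likelihood from raw counts."""
    tri_counts = defaultdict(int)   # C(w_{i-2}, w_{i-1}, w_i)
    hist_counts = defaultdict(int)  # C(w_{i-2}, w_{i-1})
    for sentence in sentences:
        words = ["<s>", "<s>"] + sentence + ["</s>"]
        for i in range(2, len(words)):
            history = (words[i - 2], words[i - 1])
            tri_counts[history + (words[i],)] += 1
            hist_counts[history] += 1
    # P(w | h) = C(h, w) / C(h); trigrams that were never seen implicitly get
    # probability zero, which is exactly the sparsity problem smoothing repairs.
    return {trigram: count / hist_counts[trigram[:2]]
            for trigram, count in tri_counts.items()}

# Tiny usage example on a toy corpus.
probs = train_trigram_mle([["el", "presidente", "habla"], ["el", "presidente", "llega"]])
print(probs[("<s>", "el", "presidente")])  # 1.0
```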
2.2.2 Smoothing techniques
The purpose of smoothing techniques is to solve this data sparsity by reserving a part of the probability mass of the occurring n-grams for the n-grams with zero count. This
can be done in many ways. Two of them are used in this research.

Good-Turing discounting

A rather easy smoothing method is Good-Turing discounting. This method multiplies each observed n-gram count, and consequently also the probability of this n-gram, by a discount factor $d_a$. This factor depends on the number of times the n-gram occurs in the set. The more often the n-gram occurs, the more important it is and, consequently, the smaller the discount should be. The discount factor of an n-gram occurring $a$ times is defined as

$$d_a = (a+1) \cdot \frac{c_{a+1}}{a \cdot c_a}$$

where $c_a$ is the number of n-grams occurring exactly $a$ times and $c_{a+1}$ the number of n-grams occurring $a+1$ times. A problem with this formula is that it fails when $c_a = 0$ while there is a count $c_b > 0$ for some $b > a$ [2].

Katz backoff smoothing

The second method makes use of the Good-Turing discount factor, but changes it to some extent to prevent the problem mentioned above. The formula is changed and a cutoff value $k$ is defined; counts $a > k$ are not discounted. This means that high-frequency n-grams are not discounted, which is desirable because they are important. The revised discount factor is defined as

$$d_a = \begin{cases} \dfrac{(a+1)\frac{c_{a+1}}{a \cdot c_a} - (k+1)\frac{c_{k+1}}{c_1}}{1 - (k+1)\frac{c_{k+1}}{c_1}} & \text{if } 1 \le a \le k \\[2ex] 1 & \text{if } a > k \end{cases}$$

Moreover, Katz smoothing uses a backoff scheme. This means that lower-order models are used to determine the probability assigned to the n-grams with zero count. This is of course better than distributing the reserved probability mass, removed from the n-grams with nonzero count, equally over all zero-count n-grams. The probabilities are thus calculated as follows:

$$P(w_i|w_{i-n+1}, \ldots, w_{i-1}) = \begin{cases} \alpha(w_{i-n+1}, \ldots, w_{i-1}) \cdot P(w_i|w_{i-n+2}, \ldots, w_{i-1}) & \text{if } c(w_{i-n+1}, \ldots, w_i) = 0 \\[1ex] d_{c(w_{i-n+1}, \ldots, w_i)} \cdot \dfrac{c(w_{i-n+1}, \ldots, w_i)}{c(w_{i-n+1}, \ldots, w_{i-1})} & \text{if } 1 \le c(w_{i-n+1}, \ldots, w_i) \le k \\[2ex] \dfrac{c(w_{i-n+1}, \ldots, w_i)}{c(w_{i-n+1}, \ldots, w_{i-1})} & \text{if } c(w_{i-n+1}, \ldots, w_i) > k \end{cases}$$

with

$$\alpha(w_{i-n+1}, \ldots, w_{i-1}) = \frac{\beta(w_{i-n+1}, \ldots, w_{i-1})}{\sum_{w_i : c(w_{i-n+1}, \ldots, w_i) = 0} P(w_i|w_{i-n+2}, \ldots, w_{i-1})}$$
The function $\beta(w_{i-n+1}, \ldots, w_{i-1})$ is defined as the total probability mass of all the unseen words following the context $w_{i-n+1}, \ldots, w_{i-1}$ [2].
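As an illustration of how the discount factors above could be computed from count-of-counts statistics, here is a small Python sketch. It is not taken from the thesis (SRILM performs this step internally), and the cutoff k = 7 is only a common default assumption.

```python
from collections import Counter

def katz_discounts(ngram_counts, k=7):
    """Compute Katz-style discount factors d_a for counts 1..k from count-of-counts.

    ngram_counts: dict mapping an n-gram to its count in the training set.
    Counts above the cutoff k are left undiscounted (d_a = 1), as in Katz smoothing.
    Assumes at least one n-gram occurs exactly once (c_1 > 0).
    """
    count_of_counts = Counter(ngram_counts.values())   # c_a = number of n-grams seen a times
    c1, ck1 = count_of_counts[1], count_of_counts[k + 1]
    discounts = {}
    for a in range(1, k + 1):
        ca, ca1 = count_of_counts[a], count_of_counts[a + 1]
        if ca == 0:                                     # the failure case mentioned in the text
            discounts[a] = 1.0
            continue
        gt = (a + 1) * ca1 / (a * ca)                   # plain Good-Turing discount
        discounts[a] = (gt - (k + 1) * ck1 / c1) / (1 - (k + 1) * ck1 / c1)
    return discounts
```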
2.2.3 Evaluation
Language models can be evaluated before they are actually used for recognition experiments. To this purpose a test set is used, which is text that hasn't been used for training and, consequently, has never before been seen by the language model. Evaluation happens by perplexity (ppl) calculation and measurement of the out-of-vocabulary (OOV) words of the test set.

The OOV words of a test set are the words that occur in the test set but not in the vocabulary list of the language model. This number is usually expressed as a percentage of the total number of words of the test set. Every word that doesn't occur in the vocabulary list is counted as an OOV word, even if an identical word has already been registered as such. So if a word occurs five times in the test set and doesn't occur in the vocabulary list, this counts for five OOV words.

The perplexity of a test set over the language model is calculated via the entropy of the set. Formally this entropy is calculated with the formula

$$H = -E\{\log P(w|h)\}$$

where $E$ is the expected value. This value however cannot be calculated, since the real distribution is unknown. The solution is to measure $\log P(w|h)$ for all the words of a corpus, in this case the test set, and then to take the average. This leads to the formula

$$H_{test} = -\frac{1}{L} \sum_{k=1}^{L} \log P(w_k|h_k)$$

where $L$ is the number of words in the test set. The OOV words are not included in this sum. The perplexity is a derived measure of this entropy, namely

$$PP = 10^{H_{test}}$$

The perplexity is a measure of the average number of words branching from a previous word. The bigger the perplexity, the more branches the speech recognizer has to consider. A big perplexity indicates that either the language is very complex or the language model is bad. There is however no analytical relation between the perplexity over a language model and the word error rate the model will yield when used for recognition, because the perplexity doesn't take into account acoustic confusability [3].
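The evaluation procedure can be summarised in a few lines of Python. This is an illustrative sketch rather than the evaluation tool actually used in the thesis; lm_prob is a hypothetical callable standing in for whatever language model is being evaluated, and the base-10 logarithm matches the definition PP = 10^H above.

```python
import math

def evaluate_lm(test_words, lm_prob, vocabulary):
    """Compute OOV rate (%) and perplexity of a test set, mirroring the formulas above.

    lm_prob(word, history) is assumed to return P(word | history); OOV words are
    counted but excluded from the entropy sum, as described in the text.
    """
    oov, log_sum, counted = 0, 0.0, 0
    history = []
    for word in test_words:
        if word not in vocabulary:
            oov += 1
        else:
            log_sum += math.log10(lm_prob(word, tuple(history[-2:])))
            counted += 1
        history.append(word)
    h_test = -log_sum / counted               # H_test = -(1/L) sum log10 P(w_k | h_k)
    return {"oov_rate": 100.0 * oov / len(test_words), "perplexity": 10 ** h_test}
```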
2.3 Acoustic Model

2.3.1 The observation vector
The probability $P(O|W)$ is calculated by means of the acoustic model. A first question to be solved is: what exactly are these observations $O$? The observations are obtained through preprocessing of the incoming speech. First a Short Term Fourier Transform (STFT) of the speech is done. This means that the speech is divided into overlapping frames, typically of 10-20 ms, and on each frame a zero-padded Fast Fourier Transform (FFT) is performed. Then the log of the resulting spectrum is expanded as a series of cosines:

$$\log |S(k, t)| \approx \sum_{i=0}^{M} c_{i,t} \cdot \cos\left(2\pi \frac{k i}{N_{fft}}\right), \qquad k = 0 \ldots \frac{N_{fft}}{2}$$
The $c_{i,t}$ are the cepstra of frame $t$, and they form the vector of observations used for recognition. The series is truncated at a certain order, in this dissertation at 12. The first-order and second-order time derivatives of the cepstra, $\Delta c$ and $\Delta\Delta c$, representing the velocity and the acceleration, are also included in the vector of observations. The normalized log energy and its derivatives are included as well. This results in an observation vector of dimension 39 for each frame.

Some extra preprocessing is done for robustness: mel filtering to take into account the logarithmic behaviour of the frequency resolution, pre-emphasis to equalize the spectrum, spectral subtraction to reduce noise and cepstral mean normalisation to reduce the distortion caused by the transmission channel.
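A rough sketch of such a front-end, assuming the librosa library is available, is shown below. It is not the preprocessing chain actually used in the thesis (pre-emphasis and spectral subtraction are omitted, and c0 is used as a stand-in for the energy term); it merely illustrates how cepstra plus their first and second derivatives are stacked into a 39-dimensional observation vector per frame.

```python
import numpy as np
import librosa  # assumed available; any MFCC implementation would do

def observation_vectors(wav_path, sr=16000):
    """Build 39-dimensional observation vectors per frame: 13 cepstra plus their
    first ("velocity") and second ("acceleration") time derivatives."""
    y, sr = librosa.load(wav_path, sr=sr)
    # 25 ms windows with a 10 ms hop at 16 kHz; the thesis' exact framing may differ.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)
    mfcc -= mfcc.mean(axis=1, keepdims=True)          # cepstral mean normalisation
    delta = librosa.feature.delta(mfcc)               # first-order derivatives
    delta2 = librosa.feature.delta(mfcc, order=2)     # second-order derivatives
    return np.vstack([mfcc, delta, delta2]).T         # shape: (frames, 39)
```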
2.3.2 Continuous Hidden Markov Models
Continuous Hidden Markov Models (CHMM) are used to model these speech observations. Such a CHMM consists of a number of states, each state emitting a random variable $X$ according to an output probabilistic function. In frame $t$ the state is $q_t$. A CHMM has a lot of parameters:

• initial distribution: $\pi_i = P(q_1 = i)$
• transition probabilities: $a_{ij} = P(q_t = j|q_{t-1} = i)$
• output probability density functions: $b_i(x) = \sum_{c=1}^{C_i} w_{ic} \cdot N(x, \mu_{ic}, \Sigma_{ic})$, where

$$N(x, \mu, \Sigma) = \frac{1}{(2\pi)^{D/2} \det(\Sigma)^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^t \Sigma^{-1} (x - \mu)\right)$$
This is a Gaussian density function with mean vector $\mu$ and covariance matrix $\Sigma$ for state $i$. $D$ is the dimension of the observation vector $x$, $w_{ic}$ are the mixture weights and $C_i$ is the number of Gaussians in this Gaussian mixture density function. The model is hidden because knowing an observation vector doesn't mean knowing the state that emitted it, since the emission is a statistical process.

Words or sentences can be modeled by a CHMM. The job of the acoustic model is to provide the probability $P(O|\Phi)$ of any observation given the model $\Phi$:

$$P(O|\Phi) = \sum_{q} P(O, q|\Phi) = \sum_{q} \pi_{q_1} b_{q_1}(o_1) \prod_{t=2}^{T} b_{q_t}(o_t) a_{q_{t-1} q_t} \qquad (2.1)$$
where $q$ is a state sequence that could have generated the observation sequence $O$ of length $T$. The word or sentence associated with the model having the highest probability given the observations is chosen as the word or sentence that was probably pronounced.

In a model consisting of $N$ states, $N^T$ sequences have to be examined. This means that of the order of $T \cdot N^T$ operations have to be done for each model. More efficient are the forward and the backward procedure, both having a complexity of only $T \cdot N^2$.

Before such CHMMs can be used for speech recognition, their parameters $A$, $B$ and $\pi$ have to be determined. This is done using a training set. The parameters are determined following the Expectation Maximization (EM) principle: the likelihood of the training data given the developed models is maximized. This can be done using the iterative Baum-Welch algorithm, also known as the forward-backward algorithm. However, the algorithm generally used for training is the Viterbi algorithm, which demands fewer calculations than the full Baum-Welch algorithm and gives more or less equivalent results.
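For illustration, a minimal numpy implementation of the forward procedure in the log domain is sketched below; it evaluates the sum of equation 2.1 in O(T·N²) operations instead of enumerating all N^T state sequences. This is my own sketch, not code from the recognizer used in this work.

```python
import numpy as np
from scipy.special import logsumexp  # assumed available

def forward_log_likelihood(log_pi, log_A, log_B):
    """Forward procedure: compute log P(O | model).

    log_pi: (N,) initial state log probabilities
    log_A:  (N, N) transition log probabilities, log_A[i, j] = log a_ij
    log_B:  (T, N) emission log probabilities, log_B[t, i] = log b_i(o_t)
    """
    alpha = log_pi + log_B[0]                          # initialisation
    for t in range(1, log_B.shape[0]):                 # induction over frames
        alpha = logsumexp(alpha[:, None] + log_A, axis=0) + log_B[t]
    return logsumexp(alpha)                            # termination: log P(O | model)
```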
2.4 Speech Recognition
Speech recognition is done using the language and acoustic models described above. There are many ways of performing recognition. A huge network can be composed. Each stage of the network contains the models of all the words in the vocabulary list. Transitions within a word are determined by the acoustic model, while transitions between words are determined by the language model. In this way each possible sentence is a path through this network and the path with the highest probability is the recognized sentence.
Because this method generates a huge network, an approach used more often is to use only one stage. This stage contains the models of all the words in the vocabulary. At the end of the stage, the system returns to the starting point.

In this research a tree-based recognition system is used. This means that the words are not placed vertically underneath one another, but in a tree: words having the same first letter are placed together in one tree, the words of this tree having the same second letter are put in a subtree of this tree, and so on. In this way duplicate work is avoided as much as possible.

In each of these cases the Viterbi algorithm is a very popular algorithm for finding the most likely sequence of hidden states that results in a sequence of observed events. This algorithm is a dynamic programming algorithm that picks and remembers the best path, instead of summing up the probabilities of different paths coming to the same destination state (approximating the sum of equation 2.1 by the $q$ that maximizes the probability of the observations) [4]. In each state two things are remembered: the best score up to that state, and the best scoring path (backtrace) over $t = 1 \ldots x$, in case the best path goes through that state at time $x$.
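A bare-bones Viterbi decoder over a fully specified HMM, again as an illustrative sketch rather than the actual search engine of the recognizer, could look as follows. Per frame and per state it keeps the best score and a backpointer, exactly as described above.

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Find the most likely state sequence and its log score (max instead of sum)."""
    T, N = log_B.shape
    score = log_pi + log_B[0]                 # best log score ending in each state at t = 0
    backptr = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_A         # cand[i, j]: best path into state j via state i
        backptr[t] = np.argmax(cand, axis=0)
        score = np.max(cand, axis=0) + log_B[t]
    # Recover the best state sequence by following the backpointers.
    path = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return list(reversed(path)), float(np.max(score))
```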
2.5 Conclusion
In this chapter the theory about speech recognition has been briefly explained. Language modeling based on n-grams and acoustic modeling based on Continuous Hidden Markov Models were discussed. The preprocessing of the speech signals was also described. Finally the recognition system and the accompanying Viterbi algorithm were explained.
Chapter 3 Experimental framework

3.1 Introduction
The experimental framework consists in the first place of a recognition engine, and in the second place of data for training and testing of the language and acoustic models. Both will be described in more detail in this chapter.
3.2 Recognizer
The recognizer used for the experiments is an automatic speech recognizer (ASR) for large vocabulary continuous speech recognition, based on CHMMs. It is a two-pass recognizer. The first pass is basically a time-synchronous Viterbi beam search, implemented using the token passing paradigm [5]. One of the basic properties of the token passing algorithm is that if two or more tokens land in the same node in the graph, only the best scoring token is retained. In this way inferior paths are removed automatically from the search beam [6]. A single lexical tree, as described in Section 2.4, is used as the recognition system, with the language model separated from the network [5]. A representation of this kind of system can be seen in Figure 3.1.

In the second pass all words extracted in the first pass are organized in a word graph, which is rescored using an A* algorithm. The goal of this pass is to include in the search the words not propagated in the previous pass [5]. The A* algorithm finds the best path through a network from an initial node to a goal node. It does so by using a heuristic estimate for each node, estimating the best route that goes through that node, and then visiting the nodes in order of this estimate [7].
[Figure: diagram of the separated knowledge sources (pronunciation information, language model information), the lexical tree, and the search algorithm with conditioned tokens, token propagation, token recombination and pruning.]

Figure 3.1: The tree-based recognition system [5]
3.3 Database

3.3.1 Radio shows
A first part of the database consists of recordings of 23 radio news shows in Castilian. Each show lasts about half an hour, which results in 12 hours and 47 minutes of data. The duration of each show can be seen in the second column of Table 3.1. For each show there is an audio file with extension .WAV. The encoding is 16 bits at 16 kHz, single channel.

Each show is accompanied by its transcription in XML format, generated using the Transcriber software. Each transcription is divided into sections. These sections divide the show into untranscribed sections, fillers (introductions, credits, ...) and reports, which are single stories identified by a specific topic. Each section is in turn divided into speaker turns, which contain speech of a single speaker [8]. In all, there are 216 different speakers and 2794 turns, but these numbers differ a lot from show to show, as can be seen in the third and fourth column of Table 3.1. What strikes one most in this table is that the per-show speaker counts add up to 221, of which 216 speakers are different from one another. This means that very few speakers appear in more than one show.

In the transcriptions some symbols are used for special cases.
show      duration   #speakers  #turns
show 1    38:34      4          337
show 2    36:25      4          52
show 3    30:59      9          201
show 4    32:35      5          136
show 5    31:33      6          87
show 6    30:01      41         131
show 7    27:26      13         206
show 8    30:02      3          61
show 9    31:24      7          41
show 10   30:19      26         66
show 11   41:31      8          126
show 12   32:47      4          25
show 13   31:47      8          62
show 14   29:59      9          69
show 15   30:02      7          79
show 16   30:41      7          245
show 17   32:33      5          76
show 18   30:38      2          96
show 19   35:27      11         87
show 20   31:02      5          74
show 21   1:01:00    16         196
show 22   46:24      18         280
show 23   13:56      3          61
total     12:47:05   221        2794
average   33:21      9.6        121.5

Table 3.1: General information about the radio shows
Mispronounced words that are still intelligible are marked with one star immediately attached to the mispronounced word. For example "*transportation" is the transcription of the mispronounced "transportetation". A double star "**" is used for words that are completely unintelligible. Speaker noises are transcribed by [fil] or [spk], where the first is used for filled pauses like uh, um, ah, er, mm, and the second for speaker noises like a sigh, a smack, a laugh, a loud breath, ... [8]

In addition to these transcriptions, there is an extra file with additional information for each turn. This file includes information about the speaker: his name, his sex and his accent (native or nonnative), as well as information about the type of speech (spontaneous or planned), the topic, the duration, ... The total number of different topics mentioned in these info files is 48. An example of such an info file can be seen in Figure 3.2.
[SRA] N1 0300E 002t401s0146MN.raw
[SRI] N1 0300E 002t401s0146MN.info
[SRC] N1 0300E.wav
[SRL] N1 0300E.TRS
[DRA] /home/gts/voz/laura/BDCastell/Audio
[DRI] /home/gts/voz/laura/BDCastell/Info.v4
[DRC] /home/gts/bdd/Transcrigal/BDCastell/BDCastell/BCAST1ES/DATA
[DRL] /home/gts/voz/laura/BDCastell/TRS
[TBE] 59.509
[TEN] 61.271
[FMT] raw
[TCD] 401
[TOP] General Estafas
[BLK] News
[SCD] 0146
[SPK] Manuel
[SEX] male
[DIA] native
[MOD] planned
[FID] high
[CHA] studio
[BCK] clean
[LB0] qué haría yo sin usted sin estos pañuelos

Figure 3.2: Example of an info file
Each turn is also accompanied by its audio file in .RAW format. By means of a script, these files are extracted from the audio file of the whole show. These are the files that will be used in the recognition experiments.
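As an illustration, such info files can be loaded with a few lines of Python, assuming every line has the "[TAG] value" form shown in Figure 3.2; the tag names and the file encoding are assumptions based on that example, not a documented specification.

```python
def parse_info_file(path):
    """Parse one turn's .info file into a dictionary, assuming '[TAG] value' lines."""
    info = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line.startswith("[") and "]" in line:
                tag, _, value = line.partition("]")
                info[tag.lstrip("[")] = value.strip()
    return info

# e.g. turns with info["MOD"] == "spontaneous" feed the spontaneous-speech language
# model, and info["SEX"] selects the material for the sex-dependent acoustic models.
```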
3.3.2 El Correo Gallego
In addition to this database, there is an already developed language model based on the Castilian portion of "El Correo Gallego", a Galician newspaper. The text corpus on which the model is based consists of approximately 64 Mwords. This language model will be used for interpolation with the language model trained on the Castilian radio shows mentioned earlier.
3.4 Conclusion
The recognizer at our disposal is an automatic speech recognizer based on Continuous Hidden Markov Models. It is a two-pass recognizer, consisting of a time-synchronous Viterbi beam search followed by an A* rescoring pass. The database consists of transcribed material from radio shows and an already generated language model based on the Castilian part of "El Correo Gallego", a Galician newspaper. The 23 radio shows amount to 12.5 hours of Castilian speech and will be used to develop and test other language models and to train acoustic models.
Chapter 4 Division of the database

4.1 Introduction
For the development of the language and acoustic models, it is necessary to divide the radio shows into a training set, a validation set and a test set. The training set will be used to train the language model as well as the acoustic model. The validation set will serve to determine the mixture weights of the radio show model and the newspaper model in the interpolated language model. Finally, the test set will be given as input to the recognizer to test the quality of the developed models.

This division needs to be made evenly with regard to the sex of the speakers, the mode of the speech (spontaneous or planned) and the topic of speech. To this purpose it is first necessary to group the topics into a number of larger main topics, so that there are enough data to make separate language models adapted to these topics. After that, a number of statistics of the database have to be gathered, for each show separately and for the database in its entirety. Based on these statistics a decision can then be made about the final division of the database.
4.2 Topic classification
First the topic labeling in the info files needs to be revised, because certain reports about e.g. politics had been labeled 'general'. This causes some topics to lack data that are in fact present. Then statistics are made about all the topics appearing in the info files. The results can be seen in Table 4.1.
Table 4.1: The topics present in the database

Topic                          #turns  #words
Agricultura Campo              34      1973
Agricultura Mar                49      2860
Agricultura Mujer              49      2081
Comunicación                   2       103
Cultura                        89      4114
Cultura Arte                   8       908
Cultura Cine                   103     2757
Cultura Espías                 15      2987
Cultura Héroes                 3       938
Cultura Libros                 21      1094
Cultura Literatura             64      6087
Cultura Música                 226     5505
Cultura Poesía                 16      2208
Cultura Premios                4       333
Cultura Teatro                 51      1716
Cultura Relatos                1       462
Debate                         248     8930
Deporte Fútbol                 1       178
Deporte Tenis                  1       322
Deporte Toros                  100     2899
Economía                       50      1233
Entrevista                     380     22911
Entrevista Fútbol              94      5588
General                        215     6229
General Estafas                53      2448
Historia                       168     11523
Humor                          113     1427
Literatura Libros              25      2535
Literatura Poesía              42      2127
Noticias                       184     6696
Política                       176     10415
Política Alava                 2       72
Política Aragón                2       50
Política Barcelona             1       46
Política Chipre                31      1405
Política Europa                2       202
Política Euskadi               12      1326
Política Guipúzcoa             2       109
Política Internacional         26      1965
Política Internacional Cuba    4       133
Política Internacional Irak    2       65
Política Internacional Israel  2       92
Política Madrid                6       355
Política Nacional              1       123
Política Navarra               3       118
Prensa                         78      2957
Sociedad                       35      2379
Total                          2794    132984
Starting from this table, two main topics can be distinguished which have enough data to be treated separately: politics (all the topics in Table 4.1 containing the word "Política") and culture (all the topics containing the word "Cultura" or "Literatura"). The remaining turns are united in the topic 'news'. The resulting amount of data for each topic can be found in Table 4.2.

Topic     #turns  #words
Culture   668     33771
Politics  272     16476
News      1854    82737

Table 4.2: The three chosen topics
4.3 Statistics of the database
Now statistics can be made about every show, representing the time of male and female speakers, the time of spontaneous and planned speech and the number of turns of the different topics. The results can be seen in Table 4.3.
Show  #fem(s)  #male(s)  #spont(s)  #plan(s)  #cult  #pol  #news
1     1053     801       1854       0         20     0     317
2     223      419       0          2064      0      0     52
3     54       1771      725        1100      0      0     201
4     275      1432      0          1707      0      0     136
5     1314     561       106        1769      2      46    39
6     432      1352      1421       363       0      2     129
7     671      475       0          1181      48     0     158
8     692      616       1308       0         0      0     61
9     131      1582      0          1713      40     0     1
10    813      821       0          1634      15     35    16
11    859      1528      0          2387      1      0     125
12    0        1776      0          1776      18     0     7
13    589      1068      743        914       18     0     44
14    343      1260      0          1603      0      48    21
15    14       1762      0          1776      31     40    8
16    937      625       751        811       245    0     0
17    0        1476      0          1476      76     0     0
18    0        1794      788        1006      0      0     96
19    704      1287      479        1512      71     0     16
20    0        1718      1718       0         5      0     69
21    629      1618      1704       543       78     101   17
22    413      2294      866        1841      0      0     280
23    95       742       0          837       0      0     61
TOT   10239    28779     12464      28012     668    272   1854

Table 4.3: Statistics of the radio shows

4.4 Division of the database

Finally the database can be divided into a training, a validation and a test set. The division should be made in such a way that the training and the test set contain enough
data of all the topics (for the language models adapted to the topic) and enough female as well as male data. As validation set the shows 4, 15 and 19 are chosen. As test set the shows 5, 10 and 21 are taken. All the other shows are used as training material. The final statistics about these sets can be found in Table 4.4. This table shows that the conditions mentioned earlier are reasonably met. A lot of other divisions were tried, but gave a less satisfactory result. The only other division that will be used later on is the one where the validation set is used for testing and the test set for validation.
Set         #fem(s)  #male(s)  #spont(s)  #planned(s)  #cult  #pol  #news
Training    6490     21298     10175      19071        471    50    1622
Validation  993      4481      479        4995         102    40    160
Test        2756     3000      1811       3945         95     182   72

Table 4.4: Statistics of the chosen sets
4.5 Conclusion
After investigation of the different topics, a topic classification was made into three main topics: politics, culture and the more general news. Then statistics were made about each radio show and, based on those, a division was made in such a way that the training and the test set contain enough data of all the topics and enough female as well as male data.
Chapter 5 Language Model

5.1 Introduction
When the preliminary investigation is finished, the model development can start. The first thing to do is to make and evaluate appropriate language models. Based on the training set of the radio shows several language models will be developed. These will then be linearly interpolated with the already developed language model based on the newspaper data. After the development, all the language models will be evaluated by perplexity and OOV rate computations. Next to the universal and adapted models, some extra models will be developed, by way of comparison. Finally all the models will be prepared for use in the recognition engine.
5.2 Language models based on the radio shows
The training set of the radio shows defined in the previous chapter is now used to make a first series of language models. These models are created using the SRILM toolkit. With the command ngram-count, standard n-gram language models with Good-Turing discounting and Katz backoff smoothing are estimated [9]. For each combination of mode (spontaneous or planned), dialect (native or nonnative) and topic (culture, politics or news) containing enough data, noted as mXdXtX¹, such a language model and a corresponding vocabulary list are created, using only the part of the training set that contains speech of that particular mode, dialect and topic. For this research trigram models are used. Later on, unigram and bigram models will also be estimated and evaluated by way of comparison.
¹ m: 0 (everything), 1 (planned) or 2 (spontaneous); d: 0 (everything), 1 (native) or 2 (nonnative); t: 0 (everything), 1 (culture), 2 (politics) or 3 (news)
Combination  voc size  #words
m0d0t0       10408     87287
m0d0t1       4412      22367
m0d0t2       1103      3421
m0d0t3       7855      61499
m0d1t0       10333     86073
m0d1t1       4412      22367
m0d1t2       1103      3421
m0d1t3       7768      60285
m0d2t0       422       1214
m0d2t1       0         0
m0d2t2       0         0
m0d2t3       422       1214
m1d0t0       6560      43400
m1d0t1       2873      13612
m1d0t2       1097      3402
m1d0t3       4551      26386
m1d1t0       6504      42708
m1d1t1       2873      13612
m1d1t2       1097      3402
m1d1t3       4480      25694
m1d2t0       289       692
m1d2t1       0         0
m1d2t2       0         0
m1d2t3       289       692
m2d0t0       4596      29949
m2d0t1       1190      4833
m2d0t2       16        19
m2d0t3       4095      25097
m2d1t0       4548      29427
m2d1t1       1190      4833
m2d1t2       16        19
m2d1t3       4043      24575
m2d2t0       212       522
m2d2t1       0         0
m2d2t2       0         0
m2d2t3       212       522

Table 5.1: Vocabulary sizes and number of words of the training sets of the language models mXdXtX
Table 5.1 shows the vocabulary size and the number of words of the training set of each combination. As can be seen, not all the combinations generate a language model, because for some combinations there are no training data. This is especially the case for combinations involving nonnative speakers, of which there are only very few. This means that, for the remainder of the experiments, no difference will be made between native and nonnative speakers. Moreover, the difference between topics will only be made if the mode is undetermined (m0). Thus the combinations that will be used are m0d0t0, m0d0t1, m0d0t2, m0d0t3, m1d0t0 and m2d0t0.
5.3 El Correo Gallego
The second language model is based on the newspaper "El Correo Gallego". A big part of this newspaper is written in Castilian and only this part has been used for the training of the language model. There is only one universal model; the corpus on which it is trained consists of nearly 64 Mwords and has a vocabulary size of 371 Kwords. This is a much bigger language model, but it is of course less adapted to the speech used in the radio shows, since it is a newspaper and thus contains written language.
           lm m0d0t0       matched lm mXdXtX   lm CorGal
test set   ppl     OOV%    ppl     OOV%        ppl     OOV%
m0d0t0     210.9   9.73    210.9   9.73        185.1   5.07
m0d0t1     201.1   9.85    171.3   18.27       219.5   5.82
m0d0t2     205.3   9.21    192.6   15.73       154.6   4.10
m0d0t3     220.7   10.06   214.5   11.26       194.4   5.41
m1d0t0     234.1   10.45   208.7   14.48       185.1   5.39
m2d0t0     173.8   8.33    146.6   15.19       185.0   4.45

Table 5.2: Evaluation of the simple LMs
Moreover, although the language is Castilian, most of the topics are still typical of Galicia, e.g. Galician politics. So this language model on its own will not be sufficient to recognize the spoken Castilian of the radio shows.
5.4 Evaluation of the simple language models
These language models can be evaluated through perplexity calculation and measurement of the out-of-vocabulary words of the test set. These two quality-indicating numbers are calculated for the test set of all the previously mentioned mXdXtX combinations, over three different language models: the universal m0d0t0 language model, the matched mXdXtX model and the model based on "El Correo Gallego" (CorGal). The results can be seen in Table 5.2.

Obviously the OOV rate when using the newspaper model is lower than when using the radio show model, since the vocabulary list of the former is much bigger (371 Kwords versus 10 Kwords). For exactly the same reason, the OOV rate when using the specific language model is higher than when using the universal language model trained on the radio shows.

The perplexities when using the adapted models are lower than those when using the universal model. This demonstrates that the specific models can be useful for better recognition results. They are however not always smaller than the perplexities when using the newspaper model; especially for politics the difference is big. This indicates that, for the topic 'politics', the newspaper model is a better model than the model based on the radio shows. But still the perplexities are quite high. One way to try to reduce them is to interpolate the radio show models and the newspaper model.
             partition A          partition B
combination  λradio   λCorGal     λradio   λCorGal
m0d0t0       0.348    0.652       0.359    0.641
m0d0t1       0.369    0.631       0.429    0.571
m0d0t2       0.106    0.894       0.319    0.681
m0d0t3       0.420    0.580       0.383    0.617
m1d0t0       0.281    0.719       0.326    0.674
m2d0t0       0.509    0.491       0.424    0.576

Table 5.3: The computed lambdas
5.5 Interpolation of the two language models
With the mXdXtX radio show models and the newspaper model, mixed models can be made through linear interpolation, with the formula

$$P(w) = \lambda_1 P_1(w) + \lambda_2 P_2(w)$$

where the lambdas are the interpolation weights. These lambdas can be determined by means of perplexity calculations on the validation set. The perplexities of the validation set over the first language model, based on the radio shows, and over the second language model, based on the newspaper, are calculated. These perplexities are given as input to the SRILM command compute-best-mix. This command uses the EM algorithm to calculate the interpolation weights: through an iterative computation the interpolation weights are adapted until the perplexity of the validation set over the interpolated model is minimized. The algorithm stops when the interpolation weights change by less than a certain threshold value (default 0.001) [10].

The lambda computations are made for two different partitions of the database. The first one is the division described in Chapter 4, while the other is the first one with validation and test set switched. To make further references easier, these partitions will be named partition A and partition B respectively. The computed lambdas can be seen in Table 5.3.

These lambdas indicate that, as already stated in Section 5.3, especially for the topic 'politics' the newspaper-based language model is far more suitable than the radio-show-based model. Furthermore this table shows that the lambdas for partition A and B are quite different from one another, although their validation sets come from the same source, namely the radio show data. This indicates that the data of the validation sets of the two partitions don't bear much resemblance. Since the validation set of B is the test set of A and vice versa, this proves that, whichever partition is finally used, the validation and test set don't have much in common. This means that the lambdas are calculated to minimize the perplexity of the validation set, but will probably not
minimize the perplexity of the test set, which is what is desired for optimal recognition of this set.

The vocabulary lists also have to be mixed. Two different resulting vocabulary sizes, 10 Kwords and 20 Kwords, are used. This mixing is also done using the previously calculated lambdas: the most frequent 10K and 20K words are selected, considering the weights given to the two mixed vocabulary lists. Then the interpolated models can be estimated, using the two different language models, the computed lambdas, the mixed vocabulary list and the SRILM toolkit.

Finally the interpolated models are pruned with a threshold of 2.5 · 10⁻⁸. Pruning means that those n-gram probabilities whose removal causes the perplexity of the training set to increase by less than that threshold value are removed. The retained explicit probabilities are not changed; the implicit (backed-off) probability estimates however are changed by a recalculation of the backoff weights [11]. This pruning operation causes the resulting n-gram model to be reduced by more or less 1/4 of its original size.
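The EM re-estimation performed by compute-best-mix can be sketched as follows for the two-model case. This is an illustrative reimplementation, not the SRILM code itself, and it assumes that the per-word probabilities assigned by both models to the validation set (with OOV words left out) are available as lists.

```python
def estimate_mixture_weights(p_radio, p_corgal, iterations=100, tol=1e-3):
    """EM re-estimation of the interpolation weights of two language models.

    p_radio[k] and p_corgal[k] are the probabilities each model assigns to the
    k-th word of the validation set; tol mirrors the 0.001 stopping threshold.
    """
    lam = 0.5                                          # initial weight of the radio-show LM
    for _ in range(iterations):
        # E-step: posterior probability that each word was generated by the radio-show LM.
        post = [lam * pr / (lam * pr + (1 - lam) * pc)
                for pr, pc in zip(p_radio, p_corgal)]
        new_lam = sum(post) / len(post)                # M-step: average responsibility
        if abs(new_lam - lam) < tol:                   # stop once the change is small enough
            break
        lam = new_lam
    return lam, 1 - lam
```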
5.6 Evaluation of the interpolated language models
The same quality-indicating numbers that were used in Section 5.4 can now be used to evaluate the composed interpolated language models. In Table 5.4 and 5.5 the OOV rate and perplexity of the test set, but also of the validation set, over the resulting pruned models can be seen. As additional information the vocabulary sizes of the testing sets are also shown. The results are shown for the two vocabulary sizes, 10 Kwords and 20 Kwords, and for the two different partitions, partition A and partition B. Several conclusions can be drawn and decisions can be made based on these tables:

• Except for politics, partition A shows better results for the test set. The better results for partition B in the case of politics are a consequence of the very small test set (vocabulary size only 851). For the validation set, however, partition B shows better results. From this we can conclude that the test set of partition A, which is at the same time the validation set of partition B, is the least complex and probably the most easily recognizable set. Since it is the test set that is going to be recognized, partition A is finally chosen as the definitive partition.

• The number of OOV words when using the smaller vocabulary size of 10 Kwords is of course bigger. The perplexity, by contrast, is smaller. A reason for this can be found in the fact that the OOV words are not included in the calculation of the perplexity. This causes language models with a higher OOV rate to have a
partition A
combination  voc size test  ppl test  OOV% test  ppl valid  OOV% valid
m0d0t0       3520           148.6     4.81       159.8      5.88
m0d0t1       1243           159.5     5.59       174.7      7.64
m0d0t2       2279           151.8     4.28       117.2      3.85
m0d0t3       1133           156.5     4.78       161.4      4.97
m1d0t0       2799           155.2     5.14       158.5      6.00
m2d0t0       1416           137.4     4.19       149.0      7.40

partition B
combination  voc size test  ppl test  OOV% test  ppl valid  OOV% valid
m0d0t0       3279           159.9     5.88       148.6      4.81
m0d0t1       1618           175.2     7.64       158.8      5.59
m0d0t2       851            120.9     4.21       141.0      4.47
m0d0t3       1704           161.8     4.97       155.8      4.78
m1d0t0       2629           158.3     6.00       154.8      5.14
m2d0t0       498            150.0     7.40       136.4      4.19

Table 5.4: Evaluation of the interpolated LMs, voc size 20 Kwords
lower perplexity. However, the difference in perplexity is quite big, so a part of this difference is still due to an actually lower perplexity of the models with a vocabulary size of 10 Kwords. But since a lower OOV rate is seen as a bigger advantage than a smaller perplexity, the models generated with a vocabulary size of 20 Kwords are used in further experiments.

• When looking at the results for the chosen partition A and the vocabulary size of 20 Kwords, conclusions can be drawn about the efficiency of the topic or mode adapted models. The perplexity and the OOV rate for these models are hardly better than for the universal model. Only for spontaneous speech is the result for the specific model somewhat better. The topic 'politics' gives the best result of the three topics. Those two models are thus the most likely candidates for generating better recognition results.

The effect of pruning can be seen in Table 5.6, where the perplexities of the test set over the models are compared for the pruned and unpruned case. Pruning has no effect on the OOV rate. Only the results for partition A and the vocabulary size of 20 Kwords are shown. They prove that pruning increases the perplexity of the test set as well as the perplexity of the validation set only slightly (2-3%).
partition A
combination  voc size test  ppl test  OOV% test  ppl valid  OOV% valid
m0d0t0       3520           126.7     7.63       135.5      8.98
m0d0t1       1243           134.4     8.29       139.8      11.07
m0d0t2       2279           126.8     7.53       104.1      7.01
m0d0t3       1133           134.2     7.50       139.7      7.68
m1d0t0       2799           130.7     8.20       133.4      9.16
m2d0t0       1416           117.2     6.73       127.4      10.27

partition B
combination  voc size test  ppl test  OOV% test  ppl valid  OOV% valid
m0d0t0       3279           135.4     8.99       126.6      7.64
m0d0t1       1618           142.2     10.91      133.8      8.33
m0d0t2       851            110.4     7.37       120.7      7.34
m0d0t3       1704           139.8     7.70       133.7      7.55
m1d0t0       2629           133.3     9.20       130.6      8.21
m2d0t0       498            126.6     10.40      118.4      6.54

Table 5.5: Evaluation of the interpolated LMs, voc size 10 Kwords

             pruned               unpruned
combination  ppl test  ppl valid  ppl test  ppl valid
m0d0t0       148.6     159.8      144.8     155.4
m0d0t1       159.5     174.7      155.9     169.6
m0d0t2       151.8     117.2      149.0     114.0
m0d0t3       156.5     161.4      151.0     156.7
m1d0t0       155.2     158.5      152.0     154.2
m2d0t0       137.4     149.0      132.4     145.8

Table 5.6: The effect of pruning
lm       ppl test  ppl valid
unigram  637.3     626.1
bigram   189.5     196.8

Table 5.7: Evaluation of the unigram and bigram LMs (for m0d0t0)
5.7 Additional language models
By way of comparison, some additional language models are developed for later investigation of the “ceiling” and “floor” of recognition. These extra models are only developed for the universal case m0d0t0. The “ceiling” model is a model that should give much better recognition results than the previously developed models. As ceiling model, a simple non-interpolated language model has been chosen that is trained not only on the training set but on the whole database of radio shows, including the test set. This should improve the recognition of this test set considerably, as will be shown in Chapter 7. The “floor” model is a model that should give a much worse result than the previously developed models. As “floor” models, a bigram and even a unigram interpolated model have been chosen. Table 5.7 already shows that their perplexities are much higher, which will also show in the recognition results. Of course the OOV rate remains unchanged.
5.8 Preparation for recognition use
The final step consists of preparing the previously developed language models for use in the recognition experiments. First the lexical tree discussed in Section 2.4 has to be generated. For this purpose, the vocabulary list is converted into a list of phonetically transcribed words, using the SAMPA (Speech Assessment Methods Phonetic Alphabet) chart of Table 5.8 [12]. “ll” and “y” are both transcribed as “Z” because of their resemblance in pronunciation. After this principal transcription, the resulting list is passed through a series of exceptions and rules. This operation results in a list that contains each word, followed by several alternative transcriptions of that word. Finally the lexical tree is generated with the parser of the recognizer and some extra commands for converting the list of transcriptions into the tree. There are three resulting files, all in binary format: the lexical tree, the vocabulary and the phonetic transcription. These files will all be needed when the language models are used in the recognition experiments. Ultimately the interpolated language models also have to be converted into a binary format, resulting in language models of the type “minimal perfect hash”.
SAMPA   description                                 example    transcription
p       voiceless bilabial plosive                  padre      padre
b       voiced bilabial plosive                     barba      barba
t       voiceless dental plosive                    tasa       tasa
d       voiced dental plosive                       dar        dar
k       voiceless velar plosive                     cada       kada
g       voiced velar plosive                        gala       gala
m       voiced bilabial nasal                       mala       mala
n       voiced alveolar nasal                       nada       nada
J       voiced palatal nasal                        caña       kaJa
C       voiceless palatal/postalveolar affricate    chico      Ciko
f       voiceless labiodental fricative             falso      falso
T       voiceless (inter)dental fricative           zona       Tona
s       voiceless alveolar fricative                sala       sala
Z       voiced palatal fricative                    ayer       aZer
D       voiced dental approximant                   cada       kaDa
x       voiceless velar fricative                   jamón      xamon
l       voiced alveolar lateral                     lata       lata
Z       voiced palatal lateral                      lleno      Zeno
R       voiced alveolar trill                       carro      kaRo
b       voiced bilabial approximant                 lava       laba
r       voiced alveolar tap/flap                    caro       karo
a       central open vowel                          tal        tal
e       front mid vowel                             tela       tela
i       front closed vowel                          tila       tila
o       back mid rounded vowel                      todo       todo
u       back closed rounded vowel                   tul        tul
N       voiced velar nasal                          hongo      oNgo
S       voiceless postalveolar fricative            xeneral    Seneral

Table 5.8: Chart of phonetic transcriptions
This means that the language models are represented compactly by a table of compressed numbers and by a number of perfect hash automata that convert words into the corresponding hash keys [13].
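As an illustration of the kind of rule-based transcription used to build the lexicon, the sketch below applies a small, simplified subset of Castilian grapheme-to-SAMPA rules consistent with Table 5.8. The rule set and function are invented for this example; the real script additionally handles exception lists and produces several alternative transcriptions per word.

```python
# Ordered rewrite rules (multi-letter patterns first), a simplified subset
# of the Castilian grapheme-to-SAMPA mapping of Table 5.8.
RULES = [
    ("ch", "C"), ("ll", "Z"), ("rr", "R"), ("qu", "k"),
    ("ñ", "J"), ("y", "Z"), ("j", "x"), ("z", "T"),
    ("v", "b"), ("h", ""),
    ("ce", "Te"), ("ci", "Ti"),   # 'c' before e/i -> T
    ("c", "k"),                   # any remaining 'c' -> k
]

def to_sampa(word):
    """Very rough grapheme-to-SAMPA conversion for a Castilian word."""
    word = word.lower()
    for graph, phone in RULES:
        word = word.replace(graph, phone)
    return word

# e.g. to_sampa("lleno") -> "Zeno", to_sampa("chico") -> "Ciko"
```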
5.9 Conclusion
In this chapter the development of the language models has been described. Based on the material from the radio shows, a first series of models was developed and evaluated. Such a model was developed for each topic and each mode (spontaneous or planned) of speech. To obtain better results, these models were interpolated with the language model based on the newspapers. The resulting models were also evaluated and proved to be of better quality. After this, some additional special models were developed by way of comparison: a unigram and bigram model of lower quality, and a model of higher quality, trained not only on the training set but also on the test set that will be used in the recognition experiments. Finally all the developed models were prepared for the recognizer.
Chapter 6 Acoustic Model

6.1 Introduction
After the language models, the acoustic models have to be developed. Not only a universal model, but also separate models for male and for female speakers have to be made. The speech used to develop these models has to be clean, which means that it may not contain any unintelligible words or speaker noises. For this purpose the clean data will first be filtered out of the training material of the radio shows. The remaining clean material will then be used to adapt a seed model by means of the MLLR (Maximum Likelihood Linear Regression) and MAP (Maximum a Posteriori) techniques, using the HTK toolkit.
6.2 Filtering of useless speech
First the non-clean speech has to be filtered out, since it is useless for the adaptation of the acoustic models. Background noises, like jingles and music, have already been filtered out. This can also be seen in Figure 3.1, where the line “[BCK] clean” means that there is no background noise left in the audio files that will be used for adaptation of the acoustic models and for testing. There are however still unintelligible words and speaker noises, as described in Section 3.3.1. They are marked by stars and by [fil] or [spk] insertions. All the turns containing any of these impurities are filtered out of the audio that will be used for the acoustic model development. Only 45953 of the 87287 originally spoken words then remain. This is more or less 53% of the original training set and amounts to roughly 5 hours of speech.
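A minimal sketch of this filtering step is shown below. The marker strings follow the transcription conventions of Section 3.3.1; the input data structure (a list of turn-id/transcription pairs) is assumed purely for illustration.

```python
# Markers that make a turn unusable for acoustic model adaptation:
# '*' flags unintelligible words, [fil] and [spk] flag speaker noises.
IMPURITY_MARKERS = ("*", "[fil]", "[spk]")

def clean_turns(turns):
    """Keep only turns whose transcription contains no impurity marker.

    turns: iterable of (turn_id, transcription) pairs; the exact source
    format of the transcriptions is an assumption for this sketch.
    """
    kept = []
    for turn_id, text in turns:
        if any(marker in text for marker in IMPURITY_MARKERS):
            continue
        kept.append((turn_id, text))
    return kept
```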
6.3 Acoustic model adaptation
Since there is little training material available, acoustic model development starts from a speaker-independent seed model, trained on the Spanish SpeechDAT database, which consists of 25 hours of Castilian speech. This speech has been recorded through the public fixed telephone network, sampled at 8 kHz and coded with the A-law using 8 bits per sample [14]. This 8 kHz is only half of the 16 kHz at which the radio data are sampled. This difference is handled by using more channels in the filter bank to cover the frequency content of the 16 kHz data: 24 channels are used for the 8 kHz data and 27 channels for the 16 kHz data. The acoustic units of the seed model are demiphones, each of which consists of a 2-state HMM. Only 640 of the 784 (28x28) possible demiphones are modeled. The demiphones are context-dependent: the left demiphone of a phoneme depends on its left neighbour, the right one on its right neighbour. So the acoustic model consists of 1280 states. Each state is modeled by a mixture of 4 to 8 Gaussian distributions with a 39-dimensional feature space: 12 mel frequency cepstrum coefficients, the normalized log energy, and their first-order and second-order time derivatives [14]. The clean training material prepared earlier is then used to adapt this seed model to the radio-show data by means of the MLLR and MAP techniques. Normally this strategy could also be used to adapt the speaker-independent models to individual speakers, for instance anchor speakers appearing in several shows. In this case that is not applicable, since there are only four speakers appearing in more than one radio show. Three of them appear in two different shows, one of them in three shows. Table 6.1 shows their time of appearance in the training set and the test set. Only three speakers are present in the training set as well as in the test set, and except for Gemma Nierga the time they speak is very short, so speaker adaptation wouldn't really help here. MLLR is a technique that computes a set of linear transformations for the mean and variance parameters of the Gaussian mixture components used in the HMM system of the seed model. When these transformations are applied, the component means shift and the variances change so that each state of the HMM system is more likely to produce the adaptation data. Global MLLR computes one transformation that is applied to every mixture component. To improve flexibility, however, different transformations can be calculated for separate groups of mixture components. The components could be grouped into the broad phone classes (silence, vowels, plosives, glides, nasals, fricatives, ...), but a more robust and dynamic grouping is obtained when a regression class tree is used. Such a tree clusters components that are close in acoustic space. The tree is constructed with a centroid splitting algorithm based on a Euclidean distance measure, using the HHEd command of the HTK toolkit.
Speaker             time in training set (s)   time in test set (s)
José María Aznar    38.5                       29.8
Ana Palacio         116.1                      22.1
Eduardo Larrocha    59.6                       0
Gemma Nierga        1671.4                     545.1

Table 6.1: Statistics of speakers appearing in more than one radio show
The terminal nodes of the tree define how the mixture components are finally divided. After the generation of this tree, a transformation matrix is estimated for each terminal node that has enough adaptation data associated with it. When a terminal node doesn't have enough data, a transformation matrix is calculated for the parent of that node. This algorithm is implemented with a top-down approach: the search starts at the root node and progresses downwards, generating transformation matrices for each node that has sufficient data and that is either a terminal node or has children without sufficient data [15]. MAP is a process that is based on prior knowledge about the model parameters. Based on that knowledge, the update formula for the parameters can be applied to each mixture component separately. Since MAP adapts at the component level, a lot of adaptation data are required [15]. All these techniques are combined in an adaptation process consisting of three passes. In the first pass a global speech MLLR adaptation is performed. The second pass uses this global transformation and a regression class tree to estimate a set of more specific transformations; sixteen regression classes are used. Ultimately these MLLR adapted models are used as prior knowledge in the final application of the MAP technique [14]. This adaptation process is done three times, each time based on a different training set: the entire training set for the universal acoustic model, the subset of the training set containing only male speakers for the male model and the subset containing only female speakers for the female model.
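Conceptually, the MLLR step replaces every Gaussian mean μ by a transformed mean Aμ + b, with one (A, b) pair shared by all the components of a regression class. The sketch below shows only this final transformation step; the estimation of the transforms and the MAP update are performed by the HTK tools and are not reproduced here.

```python
import numpy as np

def apply_mllr_means(means, class_of_component, transforms):
    """Apply per-class MLLR transforms to Gaussian mean vectors.

    means:              array of shape (n_components, dim)
    class_of_component: array of shape (n_components,) giving the index of
                        the regression class of each mixture component
    transforms:         dict class_index -> (A, b), with A of shape (dim, dim)
                        and b of shape (dim,), as estimated from the
                        adaptation data
    """
    means = np.asarray(means, dtype=float)
    adapted = np.empty_like(means)
    for i, mu in enumerate(means):
        A, b = transforms[class_of_component[i]]
        adapted[i] = A @ mu + b   # shifted mean, more likely to emit the adaptation data
    return adapted
```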
6.4 Conclusion
The training adaptation material was cleaned by removing unintelligible speech and speaker noises. Then this material was used to adapt a speaker independent seed model by use of the MLLR and MAP techniques, available in the HTK toolkit. A universal model as well as models for female and male speakers were developed.
Chapter 7 Recognition experiments

7.1 Introduction
Once all the necessary models have been developed, the recognition experiments can start. They are the ultimate test of the estimated models. To analyze the recognition output files, the manually transcribed test set corpus will first have to be converted into Master Label Files (MLFs), a format similar to that of the files generated by the recognition engine. Then the recognition output MLF and the previously generated reference MLF will be compared, and statistics like the word error rate, the number of insertions, etc. will be computed and analyzed. After the recognition has been executed and the results have been analyzed using the universal m0d0t0 language model and the universal acoustic model, several other specific experiments that may improve the recognition results will be done and evaluated.
7.2 Generation of the Master Label Files
For the evaluation of the recognition results, label files have to be compared. To avoid the use of thousands of label files, each with very few label entries, Master Label Files can be used. Such an MLF collects all the needed label definitions in one file. It always starts with the line “#!MLF!#”, identifying it as an MLF. This line is followed by an enumeration of all the label files that would otherwise have been generated for each turn separately. The name of each label file is followed by its label entries (words), one per line, and the entry ends with a final point on the last line. An example of an MLF can be seen in Figure 7.1.
#!MLF!#
"*/N1 03002 000t166s0177MN.lab"
en
radio
nacional
de
españa
.
"*/N1 03002 001t166s0178FN.lab"
siete
días
.
"*/N1 03002 002t166s0177MN.lab"
con
guillermo
orduña
.
...

Figure 7.1: Example of a Master Label File
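A sketch of how such a reference MLF can be assembled from the manual transcriptions is given below. The input data structure is an assumption made for this example; the file layout follows the description above and Figure 7.1.

```python
def write_mlf(turn_words, path):
    """Write a Master Label File.

    turn_words: dict mapping a turn identifier (the .lab file name without
                extension) to the list of reference words of that turn.
    """
    with open(path, "w", encoding="utf-8") as f:
        f.write("#!MLF!#\n")
        for turn_id, words in turn_words.items():
            f.write('"*/{}.lab"\n'.format(turn_id))
            for word in words:          # one label entry per line
                f.write(word + "\n")
            f.write(".\n")              # final point closes the turn
```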
7.3 Recognition parameters and evaluation
Several files have to be fed into the recognition engine: the language model, the three binary files described in Section 5.8 (the lexical tree, the vocabulary and the phonetic transcription), a list of the turns that have to be recognized, the acoustic model and a file with parameters. Some of the parameter settings that matter for further experiments can be seen in Figure 7.2. There are three parameters concerning pruning during recognition (different from the pruning of the language models). “P” is the global beam width pruning value and is temporarily fixed at 180. “Pp” is the word-end pruning value, fixed at 70. “Ph” is the maximum number of active hypotheses during recognition and is fixed at 50000. The reason why three different pruning methods are used is that pruning has a great influence on the computational complexity and, consequently, on the speed of the recognition process. However, pruning has the disadvantage of affecting the quality of the word lattices, thus also lowering the quality of the recognition output. Global beam width pruning limits the total number of active hypotheses during the search.
-P 180 -Pp 70 -peso lm 12 -vad 1 -LMpp 0 -Ph 50000

Figure 7.2: Parameter settings
This is done by not taking into consideration those hypotheses whose log probability is lower than the current maximum log probability minus a threshold value, namely the beam width pruning value, for the moment set at 180. Word-end pruning works in the same way as beam width pruning, but only considers the word-end nodes during the beam search. This means that pruning is done at the word level instead of at the demiphone model level. Maximum active hypotheses pruning is again pruning at the demiphone model level. This method puts a limit on the number of active hypotheses during recognition; the maximum is set at 50000 [16]. “Peso lm” is the weight of the language model and is always equal to 12. Voice activity detection is used when “vad” is 1 and not used when it is 0. For the moment this parameter is set to 1. Finally, “LMpp” is the word insertion penalty: the higher its value, the bigger the penalty on insertions, so the fewer insertions will be made. Normally this value is set to 0. When all the parameters have been set and the input files have been given to the recognizer, the recognition can start and an output MLF is generated. After recognition, the output has to be evaluated. This is again done with the HTK toolkit. With the HResults command each pair of recognized and reference strings is matched by performing an optimal string match, using dynamic programming. This method calculates a score for each possible match: identical labels get a score of 0, a label insertion carries a score of 7, a deletion a score of 7 and a substitution a score of 10. The optimal string match is the label alignment with the lowest score [14]. When this alignment is made, the number of correctly recognized words (H), substitution errors (S), deletion errors (D) and insertion errors (I) can be calculated. From these numbers, the word error rate (WER) is calculated as

WER = (D + S + I) / N · 100%

where N is the total number of labels in the reference transcriptions [17]. The percentage of correctness is calculated as

%Corr = H / N · 100% = (N − D − S) / N · 100%
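The sketch below reproduces this evaluation procedure: a dynamic programming alignment with the costs mentioned above (7 for an insertion, 7 for a deletion, 10 for a substitution), followed by a backtrace that counts H, D, S and I and computes the WER and %Corr as just defined. It mirrors what HResults does only in outline, not in implementation detail.

```python
def align_and_score(ref, hyp, ins_cost=7, del_cost=7, sub_cost=10):
    """Optimal string match between reference and hypothesis word lists;
    returns (H, D, S, I, WER%, Corr%)."""
    n, m = len(ref), len(hyp)
    # cost[i][j]: minimal score of aligning ref[:i] with hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * del_cost
    for j in range(1, m + 1):
        cost[0][j] = j * ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 0 if ref[i - 1] == hyp[j - 1] else sub_cost
            cost[i][j] = min(cost[i - 1][j - 1] + match,   # hit or substitution
                             cost[i - 1][j] + del_cost,    # deletion
                             cost[i][j - 1] + ins_cost)    # insertion
    # Backtrace to count hits, deletions, substitutions and insertions.
    H = D = S = I = 0
    i, j = n, m
    while i > 0 or j > 0:
        diag = cost[i - 1][j - 1] + (0 if ref[i - 1] == hyp[j - 1] else sub_cost) if i > 0 and j > 0 else None
        if diag is not None and cost[i][j] == diag:
            if ref[i - 1] == hyp[j - 1]:
                H += 1
            else:
                S += 1
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + del_cost:
            D += 1
            i -= 1
        else:
            I += 1
            j -= 1
    N = len(ref)
    wer = 100.0 * (D + S + I) / N
    corr = 100.0 * H / N
    return H, D, S, I, wer, corr
```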
7.4 Universal model
A first experiment consists of recognizing all the available test material using the universal models, for language as well as acoustics. By universal language model is meant the model generated by interpolation of the m0d0t0 radio show model and the newspaper model. A part of the transcription produced by the recognizer can be seen in Appendix A. For each small part of a sentence the recognized text is shown, followed by the acoustic, language and total log probability of this optimal transcription. The resulting word error rate and other statistics obtained when using these universal models can be seen in Table 7.1. For the sake of completeness the statistics about the recognition of the separate turns are also shown. A word error rate of 61.28% is a rather bad result. The turn percentage correct is even worse, since only very few turns are recognized completely correctly. In the next sections attempts will be made to lower the word error rate.

Unit   %Corr   H      D      S      I      N       WER
Turn   0.57    2      /      347    /      349     /
Word   44.48   8076   2442   7640   1045   18158   61.28

Table 7.1: Recognition results when using the universal models
7.5 Sex-dependent acoustic models
One technique that will certainly result in lower word error rates is the use of the sex-dependent acoustic models. The model for female speakers is used to recognize all the test material containing female speech; the model for male speakers is used for the recognition of the male test material. The results are shown in Tables 7.2 and 7.3. The average result for the whole test set can be seen in Table 7.4. The word error rate has decreased by a reasonable amount.

Unit       %Corr   H      D     S      I     N      WER
Sentence   1.68    3      /     176    /     179    /
Word       47.54   4107   975   3557   519   8639   58.47

Table 7.2: Recognition of female speakers when using the female acoustic model
This is normal, since men and women have a completely different timbre and pitch of voice. This difference is also confirmed by the results shown in Tables 7.5 and 7.6, which are obtained when the acoustic model for female speech is used for the recognition of male speakers and vice versa.
Unit       %Corr   H      D      S      I     N      WER
Sentence   1.76    3      /      167    /     170    /
Word       47.29   4502   1397   3620   534   9519   58.31

Table 7.3: Recognition of male speakers when using the male acoustic model
Unit       %Corr   H      D      S      I      N       WER
Sentence   1.72    6      /      343    /      349     /
Word       47.41   8609   2372   7177   1053   18158   58.39

Table 7.4: Recognition results when using the sex-dependent acoustic models
Unit   %Corr   H      D      S      I     N      WER
Turn   0.59    1      /      169    /     170    /
Word   23.27   2215   1861   5443   445   9519   81.41

Table 7.5: Recognition of male speakers when using the female acoustic model
Unit   %Corr   H      D      S      I     N      WER
Turn   0.00    0      /      179    /     179    /
Word   39.22   3388   1102   4149   616   8639   67.91

Table 7.6: Recognition of female speakers when using the male acoustic model
7.6 Topic adapted language models
Not only specific acoustic models, but also specific language models can be used. The language models adapted to the three topics can be used to recognize the subsets of the test set containing the data of the topic each model is adapted to. The results can be seen in Tables 7.7 to 7.9. These tables show the results not only for the adapted models but also for the universal model. The results for the complete test set can be seen in Table 7.10. Tables 7.7 to 7.9 show that the topic adapted models don't have any noticeable effect on the results. This could already be expected from the perplexity measurements over these models made in Section 5.6: the perplexities over the adapted models didn't decrease in comparison with the perplexity over the universal model. The reason for this poor adaptation is that the topics are not homogeneous enough. One topic gathers a lot of very different stories, and these stories don't bear much resemblance to one another when it comes to the type of words that is used or the structure.
LM          Unit   %Corr   H      D     S      I     N      WER
Universal   Turn   1.05    1      /     94     /     95     /
            Word   42.85   1959   820   1793   263   4572   62.90
Adapted     Turn   1.05    1      /     94     /     95     /
            Word   42.80   1957   822   1793   272   4572   63.15

Table 7.7: Recognition of culture when using the LM adapted to the topic ‘culture’
LM          Unit   %Corr   H      D      S      I     N      WER
Universal   Turn   0.55    1      /      181    /     182    /
            Word   48.83   4873   1257   3850   517   9980   56.35
Adapted     Turn   0.55    1      /      181    /     182    /
            Word   48.64   4854   1298   3828   526   9980   56.63

Table 7.8: Recognition of politics when using the LM adapted to the topic ‘politics’
LM          Unit   %Corr   H      D     S      I     N      WER
Universal   Turn   0.00    0      /     72     /     72     /
            Word   34.44   1242   367   1997   265   3606   72.91
Adapted     Turn   0.00    0      /     72     /     72     /
            Word   35.00   1262   380   1964   278   3606   72.71

Table 7.9: Recognition of news when using the LM adapted to the topic ‘news’
Unit   %Corr   H      D      S      I      N       WER
Turn   0.57    2      /      347    /      349     /
Word   44.46   8073   2500   7585   1076   18158   61.47

Table 7.10: Recognition results when using the topic adapted LMs
For the topic ‘news’ there is a lot of heterogeneity because all the topics that don't belong to either politics or culture have been grouped into this general topic. This means that the types of words used in the different stories don't bear a strong similarity to one another. But in the other two topics there is also a lot of heterogeneity. For instance, the topic ‘culture’ gathers stories about the life of an actor, about a father telling his son a story containing a moral, about a photographer and her work, about espionage, etc. The topic would be more homogeneous if, for example, in every show the life and work of a writer were discussed. Since this is not the case, the topic adaptation doesn't result in better recognition. Based on these tables the recognition of the three topics can also be compared.
The topic ‘politics’ is recognized best of the three. The reason is that the data about politics appear to bear more resemblance to the language model based on the newspapers, as has already been concluded in Sections 5.4 and 5.5. Since this model is bigger and its weight, both in the universal model and in the model adapted to politics, is the biggest, the politics-related speech is recognized best. Politics is also the topic that is most strongly represented in the test set: from Table 4.4 it can be calculated that 52% of the test data is about politics. This is why the resulting word error rate for the whole test set, reported in Table 7.10, is still fairly good compared to the word error rates of the culture and news subsets. There is however no improvement compared to the 61.28% obtained when the universal language model is used.
7.7 Change of some recognition parameters
Another way to try to improve the recognition results is to change the values of some of the parameters described in Section 7.3.
7.7.1 Global beam width pruning factor
A first parameter that can be changed is the beam width pruning factor. This factor determines from which score difference on an improbable path is no longer taken into consideration. The bigger this value, the better the recognition should be, since fewer errors are introduced by erasing paths that could be the right ones. So widening the search beam normally reduces pruning errors. The experiment is done for pruning factors 180 to 250, in steps of 10. The results can be seen in Table 7.11. The word error rate barely changes, so the change of this parameter doesn't have the desired effect.
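For illustration, global beam width pruning can be sketched as follows; the data structures are invented for the example, and the engine's actual bookkeeping is more involved. A larger beam width discards fewer paths, so fewer pruning errors are made at the cost of more computation.

```python
def beam_prune(hypotheses, beam_width):
    """Keep only hypotheses whose log probability is within `beam_width`
    of the current best hypothesis.

    hypotheses: list of (partial_word_sequence, log_probability) pairs.
    """
    if not hypotheses:
        return hypotheses
    best = max(logp for _, logp in hypotheses)
    return [(words, logp) for words, logp in hypotheses
            if logp >= best - beam_width]
```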
7.7.2 Voice Activity Detection
Another parameter that can be changed is the one that determines whether voice activity detection is used or not. All the previous results have been generated using voice activity detection. The results obtained when this detection is disabled can be seen in Table 7.12. By way of comparison the results are generated for each beam width pruning factor that was also used in the previous section. The obtained results turn out to remain unchanged or to have worsened, so changing this parameter doesn't help to improve the recognition results either.
Pruning   Unit   %Corr   H      D      S      I      N       WER
180       Turn   0.57    2      /      347    /      349     /
          Word   44.48   8076   2442   7640   1045   18158   61.28
190       Turn   0.57    2      /      347    /      349     /
          Word   44.50   8081   2451   7626   1034   18158   61.19
200       Turn   0.57    2      /      347    /      349     /
          Word   44.45   8072   2455   7631   1020   18158   61.16
210       Turn   0.57    2      /      347    /      349     /
          Word   44.37   8056   2451   7651   1025   18158   61.28
220       Turn   0.57    2      /      347    /      349     /
          Word   44.36   8054   2452   7652   1023   18158   61.28
230       Turn   0.57    2      /      347    /      349     /
          Word   44.41   8064   2453   7641   1027   18158   61.25
240       Turn   0.57    2      /      347    /      349     /
          Word   44.40   8063   2458   7637   1026   18158   61.25
250       Turn   0.57    2      /      347    /      349     /
          Word   44.36   8054   2443   7661   1025   18158   61.29

Table 7.11: Recognition results for various pruning factors
7.7.3 Word Insertion Penalty
A last parameter to experiment with is the word insertion penalty. As already mentioned, this parameter determines the relative levels of insertion and deletion errors. The more positive its value, the bigger the penalty for insertions, which means that there will be fewer insertions but more deletions. The more negative its value, the bigger the reward for insertions; in that case there will be many more insertions and far fewer deletions. Intuition and experience tell us that better recognition is obtained when the number of deletions is of the same order as the number of insertions. Table 7.1 shows that the number of deletions in the universal experiment is 2442, while the number of insertions is 1045. This means that the biggest improvement in recognition results is to be expected by making the word insertion penalty negative. The results for penalties varying from -25 to 25 in steps of 5 can be seen in Table 7.13. This table shows that the results meet the expectations. The more negative the parameter is, the more insertion errors and the fewer deletion errors are made. When the penalty is increased in the positive direction, the opposite effect takes place. The optimal value proves to be -5: in this case the number of deletion errors is more or less the same as the number of insertion errors, thus yielding the best word error rate.
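One common way such a penalty enters the scoring is sketched below: every hypothesized word contributes a fixed cost to the total score, so a positive penalty disfavours longer hypotheses and a negative one rewards them. Whether the engine used here combines the terms in exactly this form is an assumption; the sketch only illustrates the mechanism.

```python
def total_score(acoustic_logprob, lm_logprob, n_words,
                lm_weight=12.0, word_insertion_penalty=0.0):
    """Combined hypothesis score with a per-word insertion penalty.

    A positive penalty lowers the score of longer hypotheses (fewer
    insertions, more deletions); a negative penalty does the opposite.
    The lm_weight default of 12 matches the "peso lm" setting above.
    """
    return acoustic_logprob + lm_weight * lm_logprob \
           - word_insertion_penalty * n_words

# With penalty -5, a hypothesis that is ten words longer than a rival
# with otherwise equal acoustic and LM scores gains 50 on its total score.
```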
Pruning   Unit   %Corr   H      D      S      I      N       WER
180       Turn   0.57    2      /      347    /      349     /
          Word   44.26   8037   2425   7696   1001   18158   61.25
190       Turn   0.57    2      /      347    /      349     /
          Word   44.23   8031   2442   7685   983    18158   61.19
200       Turn   0.57    2      /      347    /      349     /
          Word   44.21   8027   2448   7683   986    18158   61.22
210       Turn   0.57    2      /      347    /      349     /
          Word   44.18   8022   2448   7688   991    18158   61.28
220       Turn   0.57    2      /      347    /      349     /
          Word   44.16   8018   2435   7705   990    18158   61.30
230       Turn   0.57    2      /      347    /      349     /
          Word   44.19   8024   2452   7682   986    18158   61.24
240       Turn   0.57    2      /      347    /      349     /
          Word   44.09   8005   2450   7703   979    18158   61.31
250       Turn   0.57    2      /      347    /      349     /
          Word   44.02   7994   2449   7715   987    18158   61.41

Table 7.12: Recognition results when disabling voice activity detection, for various pruning factors
The result of some extreme values of this parameter can be illustrated by an example. The original manually transcribed text of a short fragment is:

unos nombres algunos desconocidos incluso unos hechos inesperados temidos otros alimentan cada semana los sumarios de la actualidad nosotros nos fijamos especialmente en los sonidos y en las voces los recogemos y los guardamos con cuidado porque no es fácil tenerlo luego todo a mano para poner a su disposición este mosaico radiofónico de cada semana

The transcription obtained with insertion penalty 0 is:

uno dos nombres algunos desconocidos incluso unos hechos y esperamos heridos otros alimentar en cada semana do sumario la actualidad nosotros nos fijamos especialmente de los unidos y las voces de los recoge museo guardamos con cuidado un tema fácil perder uno coma no para poner a su disposición estemos ahí como radiofónico de cada semana

The transcription obtained when insertion penalty 20 is used clearly has fewer words (54 instead of 63 with penalty 0):

los nombres algunos ríos incluso unos hechos y esperamos heridos otros alicante semana do sumario la corrida nosotros somos m de los unidos y las voces los cogemos vamos un uno lo fácil pero como no ha por su disposición estemos ahí corrió fui con asma
Penalty   Unit   %Corr   H      D      S       I      N       WER
-25       Turn   0.00    0      /      349     /      349     /
          Word   34.28   6224   734    11200   8530   18158   112.70
-20       Turn   0.00    0      /      349     /      349     /
          Word   41.04   7452   883    9823    6728   18158   96.01
-15       Turn   0.00    0      /      349     /      349     /
          Word   45.63   8286   1110   8762    3253   18158   72.28
-10       Turn   0.29    1      /      348     /      349     /
          Word   48.39   8786   1398   7974    2122   18158   63.30
-5        Turn   0.29    1      /      348     /      349     /
          Word   47.47   8619   1860   7679    1351   18158   59.97
0         Turn   0.57    2      /      347     /      349     /
          Word   44.48   8076   2442   7640    1045   18158   61.28
5         Turn   0.29    1      /      348     /      349     /
          Word   39.38   7150   3239   7769    852    18158   65.32
10        Turn   0.29    1      /      348     /      349     /
          Word   32.55   5910   4218   8030    689    18158   71.25
15        Turn   0.00    0      /      349     /      349     /
          Word   25.32   4598   5412   8148    490    18158   77.38
20        Turn   0.00    0      /      349     /      349     /
          Word   18.88   3428   6964   7766    368    18158   83.15
25        Turn   0.00    0      /      349     /      349     /
          Word   14.41   2617   8229   7312    258    18158   87.01

Table 7.13: Recognition results for various word insertion penalties
The transcription obtained when insertion penalty -20 is used clearly has more words (83 instead of 63 with penalty 0):

el mundo agosto el nombre del ex claro que unos desconocidos incluso un los hechos trimestre de los lobos el ministro su entorno dos rally mediterráneo cada semana del humor su marido de la actualidad a nosotros nos fijamos especialmente del mundo de los unidos y en las mujeres de lugo que se recogen musulmán guardamos continuidad algunos porque nuestra civil y un incremento del humor con un abogado manuel para poder asumir disposiciones tenemos obélix como o cabaleiro fue muy concentrados semana
7.8 Ceiling model
After the attempts to improve the recognition results, in this and in the next section the “ceiling” and “floor” situations mentioned in Section 5.7 are investigated. The “ceiling” situation is recognition with a language model that is trained not only on the training set, but also on the test and validation sets. The model is not interpolated with the newspaper model. The results can be seen in Table 7.14: the word error rate decreases considerably. This might seem an unrealistic situation, since the transcription of speech data that have to be recognized is normally not known in advance. But more or less the same could be done by using the transcriptions of radio news shows to train language models for the recognition of the television news show covering the same news later in the day.

Unit   %Corr   H       D      S      I     N       WER
Turn   2.87    10      /      339    /     349     /
Word   55.96   10161   2194   5803   854   18158   48.74

Table 7.14: Recognition results when using the ceiling language model
7.9 Unigram and bigram language models
Since the recognition engine used in all the other experiments doesn't seem to work with unigram and bigram language models, the “floor” experiment has to be done with the recognizer of the HTK toolkit. Since there was no time left for me to learn how to use this recognizer, this recognition experiment was done by my supervisor in Vigo. Starting from the unigram or bigram model described in Section 5.7, the word network is generated using the HBuild command. This network contains a list of nodes representing words and a list of arcs representing the transitions between words. These arcs have probabilities attached to them, representing the unigram or bigram probabilities of the language model. Then a dictionary supplying pronunciations for each word in the network and a set of acoustic HMM demiphone models are given as input to the HNet command, together with the previously made word network. This command generates the equivalent network of HMMs. Finally recognition can be executed with the HRec command, with this network and the speech to be recognized as input [18]. The results obtained with this recognition method and the unigram and bigram language models can be seen in Tables 7.15 and 7.16. The results are quite disastrous: the smaller the word context that is taken into account, the worse the language is modeled and therefore the worse the recognition results are.
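A toy version of such a word network is sketched below: nodes are words, arcs carry bigram log probabilities, and a word sequence is scored by summing the arc scores along its path. The actual HTK network additionally expands every word into its demiphone HMMs; the functions and the absence of smoothing here are purely illustrative.

```python
import math

def build_bigram_network(bigram_counts, unigram_counts):
    """Arcs of a word network from raw counts (no smoothing, for illustration).

    bigram_counts:  dict (w1, w2) -> count
    unigram_counts: dict w -> count
    Returns a dict (w1, w2) -> log P(w2 | w1).
    """
    return {(w1, w2): math.log(c / unigram_counts[w1])
            for (w1, w2), c in bigram_counts.items()}

def score_path(words, arcs, floor=-20.0):
    """Log probability of a word sequence along the network arcs."""
    return sum(arcs.get((w1, w2), floor)
               for w1, w2 in zip(words, words[1:]))
```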
Unit   %Corr   H      D      S       I      N       WER
Turn   0.29    1      /      348     /      349     /
Word   23.82   4326   1318   12514   3630   18158   96.17

Table 7.15: Recognition results when using the unigram language model
Unit   %Corr   H      D      S       I      N       WER
Turn   0.29    1      /      348     /      349     /
Word   25.41   4614   1273   12271   3500   18158   93.84

Table 7.16: Recognition results when using the bigram language model
7.10 Analysis of the results
The relatively bad quality of the results can be explained by a number of new experiments and by a closer examination of the recognition output.
7.10.1 Spontaneous speech
As can be seen in Table 4.4, one third of the speech in the test set is spontaneous. Spontaneous speech is more difficult to recognize: the sentence constructions are sometimes odd, words are repeated and there are a lot of speaker noises like uhm and ah. Tables 7.17 and 7.18 show the recognition results of the two speaking styles separately. Table 7.17 shows the results for spontaneous speech using the universal and the matched m2d0t0 language model; Table 7.18 shows the results for planned speech using the universal and the matched m1d0t0 language model. Two things can be concluded from these tables. Firstly, planned speech is indeed recognized better than spontaneous speech: the word error rate is 3% lower. Secondly, the matched models don't make much difference.

LM          Unit   %Corr   H      D      S      I     N      WER
universal   Turn   0.00    0      /      132    /     132    /
            Word   41.76   2589   1175   2436   299   6200   63.06
matched     Turn   0.00    0      /      132    /     132    /
            Word   41.26   2558   1204   2438   291   6200   63.44

Table 7.17: Recognition of spontaneous speech when using the universal and the matched language model
LM          Unit   %Corr   H      D      S      I     N       WER
universal   Turn   0.92    2      /      215    /     217     /
            Word   45.89   5487   1267   5204   746   11958   60.35
matched     Turn   0.92    2      /      215    /     217     /
            Word   45.69   5464   1246   5248   739   11958   60.49

Table 7.18: Recognition of planned speech when using the universal and the matched language model
7.10.2 Heterogeneity of the database
A second reason for the bad results is to be found in the heterogeneity of the database. As already mentioned, the shows are all about very different topics and they don't have a fixed pattern such as news, sports, weather. When the three shows of the test set are recognized separately, a big difference in recognition quality between the three can be noticed. This can be seen in Tables 7.19 to 7.21.

Unit   %Corr   H      D     S      I     N      WER
Turn   0.00    0      /     87     /     87     /
Word   34.57   2086   509   3439   427   6034   72.51

Table 7.19: Recognition results of show 5
Unit   %Corr   H      D     S      I     N      WER
Turn   3.03    2      /     64     /     66     /
Word   58.80   2693   482   1405   240   4580   46.44

Table 7.20: Recognition results of show 10
Unit   %Corr   H      D      S      I     N      WER
Turn   0.00    0      /      196    /     196    /
Word   43.69   3297   1451   2798   376   7546   61.29

Table 7.21: Recognition results of show 21
The word error rate of show 10 is the smallest. Two reasons for this can be deduced from Table 7.22. The main reason is that this show doesn't contain any spontaneous speech; since planned speech is recognized much better, the recognition results of this show are better. Moreover, half of the show is about politics, the topic that was recognized best in Section 7.6. Next in line is the word error rate of show 21, which is quite high: more or less equal to the average of the entire test set. The reasons can again easily be seen in Table 7.22.
Show   #fem(s)   #male(s)   #spont(s)   #plan(s)   #cult   #pol   #news
5      1314      561        106         1769       2       46     39
10     813       821        0           1634       15      35     16
21     629       1618       1704        543        78      101    17

Table 7.22: Statistics of the test shows
This table shows that show 21 contains 1704 seconds of spontaneous speech and only 543 seconds of planned speech. Moreover, this show contains much more speech related to culture than show 10, a topic that, as concluded in Section 7.6, is recognized less well than politics. These two statistical facts explain the worse word error rate of show 21. Explaining the very bad word error rate of show 5 is more difficult. Table 7.22 shows that there is some spontaneous speech, but not that much. There is also a lot of speech about politics, which is recognized best. It could be argued that the bad word error rate is due to the equally high percentage of the topic ‘news’, but this alone cannot explain a word error rate of 72.51%. When the recognition output file of this show is examined more closely, something strange can be noticed from halfway through the show onwards: the same words are repeated over and over again until the end of the show. It turns out that the original audio file of this show is corrupted: it only contains the first 17.5 minutes of the 31.5-minute show. This has caused the script that generates the .RAW files for recognition to keep repeating the last words of the original audio file, while the manual transcription of this show contains the text of the entire show. This of course causes the word error rate to be much higher than it would normally have been. The result for this show when the corrupted data aren't taken into account can be seen in Table 7.23. This result is more or less the same as for show 10 and more in line with what would be expected given the statistics of show 5.

Unit   %Corr   H      D     S      I     N      WER
Turn   0.00    0      /     41     /     41     /
Word   57.78   1990   429   1025   177   3444   47.36

Table 7.23: Recognition results of show 5 when deleting the corrupted turns
From that perspective, the results of the recognition experiments on the separate topics can also be better explained and can be recalculated without the corrupted turns. The high word error rate of the topic ‘news’ is caused by the high percentage of news in the corrupted part of show 5: 24 of the 46 corrupted speaker turns belong to the topic ‘news’. When these turns are deleted and the results are recalculated, a word error rate of 55.12% instead of 72.91% is obtained for the topic ‘news’.
The remaining 22 corrupted turns are about politics, which makes the word error rate of this topic even lower when these turns are not considered: 48.71% instead of 56.35%. The word error rate of culture-related speech doesn't change, because none of the corrupted turns are about culture. The recalculated results for the entire test set using the universal language and acoustic models can be seen in Table 7.24. The error rate decreases by more than 7%.

Unit   %Corr   H      D      S      I     N       WER
Turn   0.66    2      /      301    /     303     /
Word   51.26   7980   2362   5227   794   15569   53.84

Table 7.24: Recognition results when using the universal models and deleting the corrupted turns
7.11 Combining the improvements
Three improvements have been made in the preceding experiments: using the sex-dependent acoustic models, changing the word insertion penalty and deleting the corrupted data. These improvements can be combined to obtain an even lower word error rate. The results of combining the use of the adapted acoustic models with the deletion of the corrupted turns can be seen in Tables 7.25 and 7.26. Apparently, female speakers are recognized better than male speakers; this was hidden in the results of Section 7.5 because of the higher percentage of female speech in the corrupted turns. The results for the entire test set minus the corrupted data, when using these adapted acoustic models, can be seen in Table 7.27. The combination of these two improvements results in a decrease of the word error rate by more than 11%.

Unit   %Corr   H      D     S      I     N      WER
Turn   1.88    3      /     157    /     160    /
Word   56.62   4040   873   2222   435   7135   49.47

Table 7.25: Recognition of female speakers when using the female acoustic model and deleting the corrupted turns
The improvement obtained by changing the word insertion penalty can also be combined with the deletion of the corrupted data. After recalculating the results for all the word insertion penalties tried in Section 7.7.3, a word insertion penalty of -5 still gives the best results. The results of the combination of these two improvements can be seen in Table 7.28.
Unit   %Corr   H      D      S      I     N      WER
Turn   2.10    3      /      140    /     143    /
Word   52.64   4440   1355   2639   412   8434   52.24

Table 7.26: Recognition of male speakers when using the male acoustic model and deleting the corrupted turns
Unit   %Corr   H      D      S      I     N       WER
Turn   1.72    6      /      297    /     303     /
Word   54.47   8480   2228   4861   847   15569   50.97

Table 7.27: Recognition results when using the sex-dependent acoustic models and deleting the corrupted turns
Unit   %Corr   H      D      S      I      N       WER
Turn   0.33    1      /      302    /      303     /
Word   54.94   8553   1792   5224   1129   15569   52.32

Table 7.28: Recognition results for word insertion penalty -5 when deleting the corrupted turns
The experiment in which the three improvements are combined has unfortunately not been done. The results can however be extrapolated. The 52.32% of Table 7.28 is an improvement of 1.52% with respect to the 53.84% that was obtained in Section 7.10.2 under the same circumstances, but with a word insertion penalty of 0. It cannot however be assumed that the effect of changing the word insertion penalty and the effect of using the sex-dependent acoustic models simply add up. This means that this 1.52% can't just be subtracted from the 50.97% of Table 7.27, obtained when using the sex-dependent acoustic models and deleting the corrupted data. But half of this 1.52% can be expected to remain valid when the adapted acoustic models are used. Extrapolation of the results thus indicates that a word error rate of 50.97% - 0.76% = 50.21% would have been obtained if the three improvements had been combined.
7.12 Comparison with the Galician recognizer
In this section the obtained results are compared with the results of the recognizer for mainly Galician speech, which had already been developed in Vigo within the Transcrigal project. The experimental framework of Transcrigal consists of 14 television news shows in
Galician and to some extent in Castilian. Each of the shows takes approximately one hour and consists of three blocks: news, sports and weather. Each block has one and the same anchorperson in every show. Only 10% of the speech is Castilian, spoken by non-reporters. Nine shows were chosen for the training set, two for the validation set and three for the test set. This material was complemented by a number of language resources: 40 hours of speech for the training of the acoustic models and 553 MBytes of text for the training of the language models (12 MBytes of Galician captions, 271 MBytes of Galician newspaper material and 270 MBytes of Castilian newspaper material). The recognition engine that was used is the same as the one used in the research described in this dissertation. The development of the acoustic models was also exactly the same. Six different models were developed: a universal one, one for each anchor, one for male reporters and one for female reporters. The amount of clean data for the adaptation of each specific acoustic model can be seen in Table 7.29.

Type               Time (min)
Universal          27.0
News anchor        7.1
Sports anchor      4.6
Weather anchor     3.9
Male reporters     5.2
Female reporters   6.2

Table 7.29: Amount of speech available for adapting the specific acoustic models of the Transcrigal project
The strategy for the estimation of the language models as described in this dissertation is the same as the one used within the Transcrigal project: interpolation of all the available sources, a vocabulary size of 20 Kwords and pruning with a threshold of 2.5 · 10^-8. Five different language models were interpolated: one based on the training material of the television shows, one based on the Castilian newspaper, one based on the Galician newspaper and one based on the Galician captions. The fifth language model was based on a subset of the training set, containing only the material of the block and style of the interpolated model that was being estimated. Six interpolated models, adapted to the different blocks and styles, were developed. Table 7.30 shows the perplexity and OOV rate of their test set over the universal model and over the adapted models (GA = Galician, CA = Castilian). This table shows that the adapted models are much more effective than they proved to be in the research described in this dissertation. The reason is that there are well-separated blocks in the television shows, which is not the case in the radio shows used for this dissertation.
Block     Style          #words   OOV% universal   OOV% adapted   ppl universal   ppl adapted
news      planned GA     106304   4.68             4.33           118.4           110.5
          spont GA       11346    5.62             5.54           231.7           207.8
          spont CA       12656    6.89             4.20           381.5           140.1
sports    planned GA     34036    5.36             3.43           198.5           139.3
          spont GA+CA    9795     5.17             3.18           403.8           129.3
weather   planned GA     11965    1.19             0.87           77.7            27.3
average                           4.81             3.95           151.8           111.8

Table 7.30: OOV rate and perplexity of the LMs of the Transcrigal project
Especially the models adapted to spontaneous Castilian speech and the weather model are much better than the universal model, since these models have a very specific vocabulary. The results obtained when these models are used for recognition can be seen in Table 7.31. The results are shown for four different experiment setups: using the universal language and acoustic models, using the universal language model with the adapted acoustic model and vice versa, and finally using the adapted language and acoustic models.

Acoustic model   Language model   WER
universal        universal        41.72
universal        topic adapted    40.32
adapted          universal        36.62
adapted          topic adapted    35.31

Table 7.31: Results of the recognition experiments of the Transcrigal project
The word error rates obtained when the universal language model and the adapted acoustic models are used can be compared between this research and the Transcrigal project. The 36.62% word error rate obtained within the Transcrigal project is approximately 14% lower than the 50.97% that has been reached in this research. Several reasons for this difference can be given. The first one is that in the television shows used in the Transcrigal project there are anchor speakers appearing in each show. This makes it possible to adapt acoustic models to these speakers, which can't be done in this research because there are no anchor speakers in the radio shows. As a result, the acoustic models of the television shows are a lot better adapted and give better recognition. Moreover, the radio shows used in this research are very heterogeneous when it comes to topics and structure: there is no fixed structure. The television shows of
Transcrigal, by contrast, have a fixed structure consisting of three blocks: news, sports and weather. Because of this, the language models trained on these television shows bear more resemblance to the topics and language of the television test set, even when the models are not adapted to the topic or style. The results are even better when the models are adapted, because of the bigger resemblance between the topics within each block; the use of adapted language models therefore yields an improvement of approximately 1.3% in the Transcrigal project, whereas in this research the language model adaptation doesn't have any noticeable effect on the recognition results. Another reason is that more data are used to train the language models of the Transcrigal project, and the more data are used, the better the model represents the language. A final reason is that more time was available for the Transcrigal project than for my research, which left more time for a thorough investigation of the data and for the optimization of the recognition results. More information about the Transcrigal project can be found in [14].
7.13 Conclusion
In this chapter the actual recognition experiments have been described. Because the word error rate when using the universal models is rather high, several attempts were made to improve this result. The female and male acoustic models turned out to give better results, while the topic adapted language models didn't have any noticeable effect on the recognition results. Changing some recognition parameters didn't turn out to be very effective either, except for the word insertion penalty, whose alteration improved the recognition results substantially. The additional language models of Section 5.7 were also used and gave the expected better or worse results. Then some reasons for the bad quality of recognition were stated and supported by a number of extra experiments: the bad recognition of spontaneous speech and the heterogeneity of the radio shows. A final reason was discovered later: a part of the audio file of one of the test shows turned out to be corrupted. Not taking these corrupted turns into consideration also resulted in a big decrease of the word error rate. Finally the improvements that were made were combined, which resulted in a decrease of approximately 12% compared to the initially obtained word error rate.
Chapter 8 Conclusion

In this dissertation a speech recognition system for Castilian speech has been developed. A database of 23 radio shows was used for the development of the language and acoustic models. This database was first divided into a training set, a validation set and a test set, using criteria concerning the amount of female/male, spontaneous/planned and politics/culture/news related speech each set should contain. The training set was then used to estimate language models. Next to the universal model, models adapted to the three different topics (politics, culture and news in general) were also developed, as well as models adapted to spontaneous and planned speech. Afterwards these models were interpolated with a universal model based on the Castilian written part of “El Correo Gallego”, a Galician newspaper. All the resulting models were then evaluated by means of perplexity (ppl) and out-of-vocabulary (OOV) rate measurements. The ppl of the topic and mode adapted models didn't turn out to be better than the ppl of the universal model, except in the case of politics and spontaneous speech. The acoustic models were adapted from already existing acoustic models, trained on telephone data. Only clean speech without speaker noises was used for this adaptation. A universal model as well as sex-dependent models were developed. Finally these models were used for a series of recognition experiments. Recognition of the complete test set with the universal language and acoustic models resulted in a WER of 61.28%. Several attempts were made to lower this WER. First the sex-dependent acoustic models were used, which lowered the WER by nearly 3%. Afterwards recognition experiments were done with the topic adapted
language models. This didn't improve the results; the WER remained around 61%. Some parameters of the recognition system were then varied to see their effect on the results. Varying the global pruning factor and disabling voice activity detection didn't make a significant difference compared to the 61.28% obtained with the original parameters. Varying the word insertion penalty, however, did have a positive effect: a penalty of -5 turned out to be optimal and reduced the WER by 1.3%. Subsequently two extreme recognition experiments were done. The first one used a language model that was trained not only on the training set but also on the test set, which resulted in a WER of 48.74%. This might seem an unrealistic situation, but more or less the same could be done by using the transcriptions of radio news shows to train language models for the recognition of the television news show covering the same news later in the day. The second extreme situation is the one where unigram or bigram language models were used; these models represent the language less accurately and thus result in a much bigger WER. A final improvement of the WER was obtained by not taking into consideration a piece of corrupted data in which - because of missing data - the last sentence of a turn is repeated for 15 minutes in the audio files used for recognition, while the reference transcription contains different text. Not taking the corrupted turns into consideration decreased the WER by 7.44%. Extrapolation indicates that integration of the different improvements that were obtained (sex-dependent acoustic models, word insertion penalty change and deletion of the corrupted data) would result in a WER of about 50.21%. This is approximately 14% more than the WER obtained when adapted acoustic models and a universal language model were used within the Transcrigal project. This difference is due to several reasons. A first one is the heterogeneity of the radio shows: the shows are very different when it comes to topics and structure, whereas the television shows used as training and testing material within the Transcrigal project have a fixed structure of news, sports and weather. Consequently, the Transcrigal training material bears more resemblance to its test set than is the case for the radio show material. Another reason is that there are no anchor speakers in the radio shows, while in the television shows there is one anchor speaker for each block, which made the acoustic model adaptation more effective. A third reason is the bigger amount of training data used for the estimation of the language models of the Transcrigal project. Further improvements could be made by using more training data. More attention could also be paid to finding a solution for the difficult recognition of spontaneous speech.
References

1. X. Huang, A. Acero, H. Hon: Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, Prentice Hall, 2001, pp. 413-414.
2. Steve Young, Gunnar Evermann, et al.: The HTK Book (for HTK Version 3.3), Cambridge University Engineering Department, 2005, pp. 196-198.
3. X. Huang, A. Acero, H. Hon: Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, Prentice Hall, 2001, pp. 554-556.
4. X. Huang, A. Acero, H. Hon: Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, Prentice Hall, 2001, pp. 385-386.
5. A. Cardenal-Lopez, F. J. Dieguez-Tirado, C. Garcia-Mateo: Fast LM look-ahead for large vocabulary continuous speech recognition using perfect hashing, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Orlando, FL, May 2002, vol. 1, pp. 705-708.
6. K. Demuynck, J. Duchateau, D. Van Compernolle, P. Wambacq: An efficient search space representation for large vocabulary continuous speech recognition, Speech Communication, vol. 30, no. 1, Jan. 2000, pp. 37-53.
7. Wikipedia, A* search algorithm.
8. Antonio Rincón, Asunción Moreno: Broadcast news speech DB for Spanish, Applied Technologies on Language and Speech S.L. (ATLAS), July 2003.
9. Andreas Stolcke: SRILM - an extensible language modeling toolkit, in Proc. Int. Conf. on Spoken Language Processing, Denver, CO, 2002, pp. 901-904.
10. SRILM Manual Pages.
11. A. Stolcke: Entropy-based Pruning of Backoff Language Models, in Proc. DARPA News Transcription and Understanding Workshop, Lansdowne, VA, 1998, pp. 270-274.
12. Wikipedia, SAMPA chart.
13. Jan Daciuk, Gertjan van Noord: Finite Automata for Compact Representation of Language Models in NLP.
14. J. Dieguez-Tirado, C. Garcia-Mateo, et al.: Adaptation strategies for the acoustic and language models in bilingual speech transcription, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), March 2005, pp. 833-836.
15. Steve Young, Gunnar Evermann, et al.: The HTK Book (for HTK Version 3.3), Cambridge University Engineering Department, 2005, pp. 133-140.
16. Yang Liu, Mary P. Harper, et al.: The Effect of Pruning and Compression on Graphical Representations of the Output of a Speech Recognizer, Computer Speech and Language, Vol. 7, No. 4, October 2003, pp. 329-356.
17. Steve Young, Gunnar Evermann, et al.: The HTK Book (for HTK Version 3.3), Cambridge University Engineering Department, 2005, pp. 183-184.
18. Steve Young, Gunnar Evermann, et al.: The HTK Book (for HTK Version 3.3), Cambridge University Engineering Department, 2005, pp. 161-163.
APPENDIX A: Recognition output

——N1 03002 000t166s0177MN.raw——
en radio nacional de españa
Acustica -1.512367e+04 LM -1.929854e+02 Total -1.531666e+04
———————–
——N1 03002 001t166s0178FN.raw——
así puntillas
Acustica -6.574663e+03 LM -9.509721e+01 Total -6.669760e+03
———————–
——N1 03002 002t166s0177MN.raw——
ya confirmó el
Acustica -1.601067e+04 LM -2.493336e+02 Total -1.626000e+04
———————–
——N1 03002 003t166s0177MN.raw——
uno dos nombres algunos desconocidos incluso
Acustica -2.270888e+04 LM -6.865100e+02 Total -2.339539e+04
unos hechos y esperamos
Acustica -1.280838e+04 LM -3.591810e+02 Total -1.316756e+04
heridos otros alimentar en cada semana do sumario la actualidad
Acustica -2.921332e+04 LM -9.063204e+02 Total -3.011964e+04
nosotros nos fijamos especialmente de los unidos y las voces de los recoge museo guardamos con cuidado
Acustica -4.307638e+04 LM -1.380901e+03 Total -4.445728e+04
un tema fácil perder uno coma no para poner a su disposición estemos ahí como radiofónico de cada semana
Acustica -4.398372e+04 LM -1.385938e+03 Total -4.536966e+04
———————–
——N1 03002 004t166s0177MN.raw——
en esta ocasión vamos a ofrecerle no solamente las voces que componen el alemán días político en la comunidad de madrid con la recepción de los fondos bruto socialistas
Acustica -5.960181e+04 LM -1.842916e+03 Total -6.144473e+04
y que ha sido devuelto su la semana
Acustica -1.661781e+04 LM -5.529256e+02 Total -1.717073e+04
como reportajes tenemos uno desde bruselas explicando los el proyecto de constitución europea ha elaborado durante un año y medio y que se someterá ahora la próxima cumbre de los quince
Acustica -7.319959e+04 LM -1.936708e+03 Total -7.513630e+04
los veinticinco años de los dioses el grupo especial de la policía detiene en su haber numerosas y arriesga las misiones contra el terrorismo y la droga
Acustica -6.157213e+04 LM -1.666265e+03 Total -6.323840e+04
y precisamente sobre el drama de la droga aprovechando la presentación de la nueva campaña contra los estupefacientes futuro trabajo no se va a permitir escuchar los dos armadores testimonios de quienes se han visto atrapados en ella salud
Acustica -8.986489e+04 LM -2.583776e+03 Total -9.244867e+04
el retrato de mi que lo hacemos y grandiosa de muy pérez
Acustica -2.481714e+04 LM -8.889479e+02 Total -2.570609e+04
componían esta edición de siete días fácil encajar sólidos de recorremos la época dorada muy dura y
Acustica -4.503377e+04 LM -1.374696e+03 Total -4.640847e+04
———————–
——N1 03002 005t810s0177MN.raw——
grupos grandeza tren talgo subido si somos ofrecer ya nuestra carta semanal coloma sus socios o porque les un un podido estar expresamente muy al tanto de las noticias
Acustica -7.835289e+04 LM -2.686667e+03 Total -8.103956e+04
escuchen a la hora lópez y las voces de esta semana que ha sido una semana desde luego polifónica
Acustica -4.053003e+04 LM -1.244040e+03 Total -4.177407e+04
———————–
——N1 03002 006t810s0179FN.raw——
herida ya que siempre ha ocupado
Acustica -1.437718e+04 LM -4.823441e+02 Total -1.485953e+04
vaya típica te político que si armador casi nadie y otras muchas semanas que le la comunidad de madrid
Acustica -4.654516e+04 LM -1.444571e+03 Total -4.798973e+04
intentar explicártelo sí
Acustica -1.236631e+04 LM -3.970040e+02 Total -1.276332e+04
seguramente ya sabes que pesa en el que la vida en esta comunidad no tuvieron juntos el voto más que pedir
Acustica -4.591700e+04 LM -1.362346e+03 Total -4.727935e+04
osea firmantes estará izquierda y firmar y cinco para los populares
Acustica -2.606586e+04 LM -9.556840e+02 Total -2.702155e+04
sí de esta manera y en coalición post electoral podría gobernar aunque el partido más votado fue el pp
Acustica -4.647459e+04 LM -1.315658e+03 Total -4.779024e+04
de las artes estalló el martes cuando los diputados del psoe y hasta ahora desconocidas eduardo tamayo y maría teresa sáez
Acustica -5.608239e+04 LM -1.249682e+03 Total -5.733208e+04
sí a sentar a mí la votación para elegir que sienta en la cámara autonómica
Acustica -3.018616e+04 LM -9.279412e+02 Total -3.111410e+04
convulsionó y así millas muchos nervios
Acustica -2.215405e+04 LM -6.696196e+02 Total -2.282367e+04
sus razones jornales aunque no aceptan el pacto con izquierda unida aunque hay sospechas de que pueda cross
Acustica -4.724457e+04 LM -1.432839e+03 Total -4.867741e+04
total que salió elegida la candidata del pp para presidir la asamblea de las es la peor para el psoe sino que también se ponía en peligro la presidencia de la comunidad
Acustica -6.720702e+04 LM -1.571434e+03 Total -6.877846e+04
y ya segura para el candidato socialista rafael simancas con apoyo de izquierda unida a partir de ahí puede pasar de todo y si empiezas titular de todo pisar una mina sector descontento persona y ambos militantes que significa mayo hay presiones económicas catedrático para creían país llamado a su partido
Acustica -1.323498e+05 LM -3.664390e+03 Total -1.360141e+05
como descubrió y el mar y frases con las que podríamos pasar la curia y
Acustica -3.698377e+04 LM -1.115679e+03 Total -3.809944e+04
Development of a speech recognizer for Castilian

1. Introduction
Speech recognition is a field that has attracted interest, and therefore research, for a very long time. The first attempts to develop automatic speech recognition systems date from the 1950s, and since then a great many new techniques have been discovered and developed. It is within this field of speech recognition that my thesis is situated. My research was carried out during the first semester at the Signal Processing Centre of the University of Vigo, in Spain. Within their Transcrigal project a speech recognizer had already been developed, based on and tested with bilingual television broadcasts (90% Galician, 10% Castilian). My task was to develop new models based on a new, entirely Castilian database of radio broadcasts and to use them for recognition experiments, approaching the quality of the previously developed models and recognition as closely as possible. The developed models comprise both language models, which model the language to be recognized, and acoustic models, which model the pronunciation. The language models are based on n-grams; this type is most commonly used in speech recognition because of its simplicity. Besides a universal model, several adapted models were also estimated, adapted either to the topic or to the speaking style (spontaneous or planned speech). The acoustic models are based on Continuous Hidden Markov Models (CHMMs). The development of these models was largely carried out by my supervisor in Spain, as there was not enough time for me to become fully acquainted with this part. Here too adapted models were developed, depending on the sex of the speaker.
2. Experimental framework
The recognition engine used is an automatic speech recognizer based on CHMMs. This recognizer works in two passes. In a first pass, the Viterbi algorithm is applied in combination with a beam search. The second pass is a rescoring by means of the A* algorithm. The database consists of a collection of 23 radio broadcasts in Castilian. These broadcasts have a total duration of 12 hours and 47 minutes. For each broadcast a manual transcription is available in addition to the audio recording. In total there are 216 different speakers and 2794 turns. Of these 216 speakers only 4 appear in more than one broadcast, and even then in only 2 or at most 3 broadcasts. Speaker adaptation is therefore not very effective and will not be applied. For each turn there is also a manually created file containing information about the sex and the dialect (native or non-native) of the speaker, the style, the duration, the topic, and so on. Besides this database of radio broadcasts, an existing language model is also available, trained on the Castilian written part of "El Correo Gallego", a Galician newspaper. This text contains about 64 Mwords. This model will be used for interpolation with the language models developed from the radio broadcasts.
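To illustrate how the beam search in the first pass limits the number of hypotheses, the following Python sketch prunes against a beam width and a maximum number of active hypotheses. It is only an illustration of the principle; the function and variable names are invented here and do not come from the recognizer used in this work.

def prune(hypotheses, beam_width, max_active):
    # Keep only hypotheses whose log score is within beam_width of the best one,
    # and never more than max_active of them (illustrative sketch only).
    if not hypotheses:
        return []
    best = max(h["score"] for h in hypotheses)
    survivors = [h for h in hypotheses if h["score"] >= best - beam_width]
    survivors.sort(key=lambda h: h["score"], reverse=True)
    return survivors[:max_active]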
3. Division of the database
For the development of the language and acoustic models, the database mentioned above first has to be divided into a training set, a validation set and a test set. The training set will be used to train the models. The validation set will serve to determine the interpolation weights. The test set, finally, will be used for the evaluation of the language models and as test data in the recognition experiments carried out at the end. This division is made on the basis of a number of criteria. All topics must be present in the three sets, and both male and female speakers must occur in each of them. This requires a classification of the 48 different topics that are covered. First the topic labelling in the information files has to be revised, since it was not always done correctly. Then the 48 topics are grouped into 3 overall topics: culture, politics and general news. Next, the broadcasts are divided: three broadcasts are chosen as test set and three as validation set. The remaining 17 broadcasts are left for the training set.
4. Language modelling
For the language modelling, models first have to be estimated on the basis of the database of radio broadcasts. These can then be interpolated with the existing model based on the newspaper texts. Six different language models are estimated: a universal model, one for each topic and one for each speaking style (spontaneous or planned). For this purpose the training set is split into subsets containing only the speech of that particular topic or with that particular style. The estimation is done, as mentioned, on the basis of n-grams, more specifically trigrams. These trigrams are counted in the subset corresponding to the model to be developed. Many trigrams, however, never occur in the training set, while they may well occur in Castilian and thus in the test set. Therefore smoothing techniques such as Good-Turing discounting and Katz backoff are applied. Subsequently the developed models are evaluated by means of perplexity (ppl) calculations and the percentage of out-of-vocabulary (OOV) words. These figures are computed for the test set and are examined for the universal radio model, the adapted radio models and the newspaper model. The perplexity of the test set is a measure of the number of words that can follow a given word and thus gives an indication of the complexity of the language and/or the quality of the model. If the language is not excessively complex, as is the case for Spanish, but the perplexity is nevertheless very high, this means that the models do not represent the modelled language well. OOV words are words that occur in the test set but not in the word list of the language model. If this percentage is very high, recognition will be poor, since many words cannot be recognized correctly under any circumstances. The results show that the perplexity of the test set over the adapted models is lower than that over the universal model. The OOV percentage, however, is larger for these adapted models, since they have a smaller vocabulary and the topics and styles are apparently not so uniform that they use the same type of words. For the model based on the newspaper texts, the perplexity is even lower than for the adapted radio models. This is especially pronounced for the topic 'politics', which already indicates that this model is better suited to politics than the radio model.
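The two evaluation measures can be made concrete with a small sketch. It assumes that the per-word log10 probabilities assigned by a language model are already available (in practice they were obtained with the SRILM toolkit); the function names are chosen here for illustration only.

def oov_rate(test_words, vocabulary):
    # Percentage of test words not covered by the language model's vocabulary.
    oov = sum(1 for w in test_words if w not in vocabulary)
    return 100.0 * oov / len(test_words)

def perplexity(log10_probs):
    # log10_probs: per-word log10 probabilities assigned by the language model
    # (OOV words are assumed to have been excluded beforehand).
    avg = sum(log10_probs) / len(log10_probs)
    return 10.0 ** (-avg)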
SAMENVATTING
62
After this evaluation, the six developed radio models are interpolated with the newspaper model. The interpolation weights are determined by means of the EM algorithm, which means that the perplexity of the validation set over the interpolated model is minimized iteratively. Among the resulting weights it is striking that for the politics model the largest weight is assigned to the newspaper model, which once more shows that this model is better suited to the topic 'politics'. The interpolation is carried out for two different set divisions. The first is the one discussed in the previous section; the second is the same but with the validation and test sets exchanged. The vocabularies of the two models to be interpolated also have to be combined into a single vocabulary, which is likewise done on the basis of the interpolation weights. Two different resulting vocabulary sizes are tried: 10 Kwords and 20 Kwords. Using the resulting vocabulary and the computed weights, the two models can then be interpolated. After interpolation the six models are pruned with a factor of 2.5 · 10^-8, which means that the very improbable trigrams are removed; this considerably reduces the size of the language models. These interpolated models are also evaluated by means of ppl and OOV calculations. The perplexities are considerably lower than those of the simple models, which proves that the interpolation is effective. As for the two set divisions, the first one gives the best results for the test set and is therefore used for the remainder of the research. As for the vocabulary size, 20 Kwords is chosen because it gives a much lower OOV rate and a ppl that, although higher than for 10 Kwords, is still very acceptable. The ppl of the models adapted to topic and style turns out to be higher than that of the universal model, which casts doubt on their usefulness for recognition. Only the politics model and the spontaneous-speech model have a slightly lower ppl. After this, 3 extra language models are created that will be used for comparison in the experiments. The first is a model trained not only on the training set but also on the test and validation sets. This normally yields better recognition results, since the model has already been trained on the text that has to be recognized afterwards. The second and third are a unigram and a bigram model, which model the language much less accurately and therefore usually result in worse recognition.
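Returning to the interpolation step described above: a minimal sketch of how such interpolation weights can be estimated with EM, for the simple case of two models, is given below. It assumes that the probabilities each model assigns to the validation-set words have already been computed; in the actual work this estimation was carried out with the SRILM tools, so names and defaults here are illustrative.

def em_interpolation_weight(p_radio, p_news, iterations=20):
    # p_radio, p_news: probabilities each model assigns to the validation words.
    lam = 0.5  # initial weight of the radio model
    for _ in range(iterations):
        # E-step: posterior probability that each word was generated by the radio model
        posteriors = [lam * pr / (lam * pr + (1.0 - lam) * pn)
                      for pr, pn in zip(p_radio, p_news)]
        # M-step: the new weight is the average posterior
        lam = sum(posteriors) / len(posteriors)
    return lam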
Finally, all developed models are prepared for use in the recognition engine. The words of the vocabulary are replaced by their phonetic transcriptions, and on this basis the lexical tree is constructed, which together with the acoustic model forms the basis of the recognizer.
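The idea of a lexical tree can be illustrated with a short sketch: words whose phonetic transcriptions share a prefix share a branch of the tree, so their hypotheses can share computation during the search. The dictionary entries shown are invented examples, not entries from the actual lexicon.

def build_lexical_tree(lexicon):
    # lexicon: mapping from word to its phonetic transcription (list of phones).
    # Words sharing a pronunciation prefix share the corresponding branch of the tree.
    root = {}
    for word, phones in lexicon.items():
        node = root
        for phone in phones:
            node = node.setdefault(phone, {})
        node["#word"] = word  # mark the leaf with the word identity
    return root

# Illustrative entries (SAMPA-like transcriptions, chosen here only as an example):
tree = build_lexical_tree({"casa": ["k", "a", "s", "a"], "caso": ["k", "a", "s", "o"]})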
5. Acoustic modelling
After the language models, the acoustic models also have to be created. Speaker adaptation is not very useful here; however, besides the universal acoustic model, sex-dependent models are also built. A first step is the filtering of the training set: only clean speech, without unwanted sounds from the speaker or from the environment, may be used to train these models. After filtering, about five hours of clean speech remain. Because this is not enough to train a model from scratch, an existing seed model is used, trained on 25 hours of Castilian speech. Demiphones (half phonemes) were chosen as the acoustic units for this model; there are 640 of them. They are context dependent: the left demiphone of a phoneme depends on its left neighbour, while the right demiphone depends on its right neighbour. Each demiphone is represented by a two-state HMM. Each state is modelled by a mixture of 4 to 8 Gaussians over a 39-dimensional feature vector: 12 mel-frequency cepstral coefficients, their first and second derivatives, the normalized log energy and its derivatives. This seed model is then adapted using the previously filtered training set of the radio broadcast database. The adaptation is done using the MLLR and MAP techniques and consists of three steps. First a global MLLR adaptation is performed. Then, using a regression class tree, a set of more specific MLLR adaptations is applied. Finally, the resulting models are used as prior knowledge in the execution of the MAP algorithm.
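As an illustration of the final MAP step, the sketch below shows the standard MAP update of a single Gaussian mean, interpolating between the prior (MLLR-adapted) mean and the data observed in the adaptation set. The weighting constant tau and all names are illustrative; the actual adaptation was performed with existing tools, not with this code.

import numpy as np

def map_mean_update(prior_mean, frames, posteriors, tau=10.0):
    # prior_mean: mean of the (MLLR-adapted) seed model for one Gaussian component.
    # frames: feature vectors assigned to that component; posteriors: their occupation counts.
    occ = float(np.sum(posteriors))
    weighted_sum = np.sum(np.asarray(posteriors)[:, None] * np.asarray(frames), axis=0)
    # MAP interpolation between the prior mean and the mean of the adaptation data
    return (tau * np.asarray(prior_mean) + weighted_sum) / (tau + occ)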
6. Recognition experiments
Using the models and the lexical tree that have been constructed, the recognition experiments can finally be started. First a few parameters have to be set, which can later be changed in an attempt to obtain better results. Three important parameters are the pruning parameters: the global beam width pruning factor, the word-end pruning factor and the maximum number of active hypotheses during recognition. These are initially set to 180, 70 and 50000 respectively. Another parameter that can be experimented with is the word insertion penalty. This is the 'penalty' associated with the transition from one word to the next, and it thus determines whether there will be many or few insertions. Finally, the voice activity detection, which is applied by default, can be disabled. After recognition, the results are evaluated by counting the numbers of insertions, deletions and substitutions. From these, the word error rate (WER) is computed as

WER = (D + S + I) / N · 100%

where D, S and I are the numbers of deletions, substitutions and insertions and N is the number of words in the reference.
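The WER can be obtained from a standard edit-distance alignment between the reference and the hypothesis. The sketch below is an illustrative implementation of that calculation, not the scoring tool actually used in the experiments.

def wer(reference, hypothesis):
    # reference, hypothesis: lists of words; returns the WER in percent.
    # Standard dynamic-programming edit distance (substitutions, deletions, insertions).
    n, m = len(reference), len(hypothesis)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i          # i deletions
    for j in range(m + 1):
        d[0][j] = j          # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return 100.0 * d[n][m] / n  # (D + S + I) / N in percent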
6.1 Universal model
The first experiment is carried out with the universal models, both the language model and the acoustic model. This yields a WER of 61.28%. Only 2 of the 349 sentences are recognized entirely correctly. Since this WER is rather high, several attempts are then made to reduce it.
6.2 Adapted acoustic models
A first logical attempt is to no longer use the universal acoustic model but the sex-dependent models instead: the female model is used to recognize all female speech and the male model to recognize all male speech. This results in a notable improvement of the WER: a drop from 61.28% to 58.39%. The improvement occurs because men and women have a very different pitch and a different voice timbre. The sex-dependent models adapt to this and are therefore much better able to recognize male or female speech.
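Selecting the appropriate model is straightforward given the per-turn information files described in section 2. A possible sketch, with invented field names, is:

def pick_acoustic_model(turn_info, models):
    # turn_info: metadata for one turn, e.g. {"speaker_sex": "F", ...}
    # models: {"F": female_model, "M": male_model, "universal": fallback_model}
    sex = turn_info.get("speaker_sex")
    return models.get(sex, models["universal"])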
6.3 Adapted language models
Another possibility is the use of the adapted language models instead of the adapted acoustic models. For this purpose the three models adapted to the topics culture, politics and general news are used; each of them serves to recognize the speech related to the topic it is adapted to. The WER achieved in this way over the whole test set is 61.47%, hardly any different from the result obtained with the universal language model. For the three topics considered separately, too, there is no more than 0.2 to 0.3% difference between the results obtained with the universal model and those obtained with the adapted model. This means that the adaptation of the language models has no positive effect on the recognition. The reason is that the topics are far too heterogeneous: each topic combines a large number of different subtopics and stories, and these stories show little similarity in word type or structure. For the topic 'general news' this is to be expected, since it collects everything that is neither culture nor politics, but it is also the case for the two other topics. In both cases the topic 'politics' scores best with a WER of 56%, against 72% for general news and 63% for culture. This could to some extent be anticipated from the conclusions drawn during the evaluation of the language models: the perplexity of the model adapted to politics was also the lowest there. The reason for that was that the newspaper model fits the topic 'politics' better, and since more training data were available for that model, the recognition also goes somewhat better.
6.4 Changing some recognition parameters
As already mentioned, the influence of some recognition parameters on the recognition result can be tested by varying these parameters.

Global beam width pruning factor
A first parameter that is experimented with is the global pruning factor. Its value determines from which point on an improbable path is no longer taken into consideration. The larger this value, the better the recognition should be, since there is less chance that the correct path is discarded. However, when this parameter is varied from 180 to 250 in steps of 10, little or no difference can be observed. The WERs range from a minimum of 61.16% to a maximum of 61.29%, which for a test set of 18158 words cannot be called a significant difference.

Voice activity detection
A next parameter that can be changed is the one that determines whether or not voice activity detection is performed during recognition. By default this detection is applied. When it is disabled, the results again hardly change. The experiment is also repeated for every global beam width pruning factor used before. The WERs increase slightly and this time vary between 61.19% and 61.41%.

Word insertion penalty
Finally, the effect of the word insertion penalty can be investigated. This is normally set to zero. The more positive it is, the larger the penalty, meaning fewer insertions but more deletions; the more negative it is, the more insertions and the fewer deletions there will be. In the result obtained with a word insertion penalty of zero, there are 2442 deletions and 1045 insertions. Intuitively, recognition can be expected to be optimal when the number of deletions is of roughly the same order of magnitude as the number of insertions, so a more negative word insertion penalty should in this case normally lead to better recognition. When this parameter is varied between -25 and 25 in steps of 5, the best result turns out to be obtained with a penalty of -5. In this case there are 1860 deletions and 1351 insertions, and the WER drops to 59.97%. When the penalty is made even more negative, too many insertions are added and the WER deteriorates again.
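The role of the word insertion penalty can be summarized in a single scoring expression: the penalty is charged once for every word in a hypothesis, so a positive value discourages insertions and a negative value encourages them. The sketch below is purely illustrative; the language model scale factor shown is an invented example, not the value used in the experiments.

def hypothesis_score(acoustic_score, lm_log_prob, n_words,
                     lm_scale=10.0, insertion_penalty=0.0):
    # Combined log score of a hypothesis. A more positive insertion_penalty makes
    # every extra word more costly (fewer insertions, more deletions); a negative
    # one has the opposite effect.
    return acoustic_score + lm_scale * lm_log_prob - insertion_penalty * n_words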
6.5 Extra language models
In a next experiment the three extra language models created earlier are used. First the model trained not only on the training set but also on the test set is used; this model is not interpolated with the newspaper model. Logically, the use of this model yields a considerably lower WER: 48.74%. This may seem an unrealistic situation, since the transcription of the speech to be recognized is usually not known in advance. A similar situation does occur, however, when the newspapers of a given day are used as training material for a language model that is then used to recognize the television broadcast covering the same items later that day. Next, the unigram and bigram language models are used. As expected, these give much worse results: the bigram model results in a WER of 93.84%, and the unigram model even reaches 96.17%.
6.6 Analysis of the results
Finally, two experiments are carried out to provide a further explanation of the results. A first test compares the results for spontaneous speech with those for planned speech, both when the universal language model is used and when the language model adapted to the speaking style is used. For spontaneous speech a WER of 63.06% is obtained with the universal model and of 63.44% with the adapted model. For planned speech the WER is 60.51% with the universal model and 60.64% with the adapted model. Two conclusions can be drawn from this. First, the language models adapted to the speaking style are not effective. Second, spontaneous speech is clearly recognized less well than planned speech; the reason is that spontaneous speech contains more hesitations ('euh', 'uhm'), word repetitions, poor sentence constructions, and so on.

A second test examines the recognition results for each radio broadcast separately, and a large difference can be observed. The best recognized broadcast contains no spontaneous speech and deals mainly with politics; its WER is only 46.44%. The second best broadcast contains a great deal of spontaneous speech and culture-related stories, which results in a WER of 61.29%. The worst recognized show has a WER of 72.51%. The explanation is discovered afterwards, during a further analysis of the recognition output: half of this broadcast consists of corrupted data. As a result, from a certain point in the broadcast onwards, the same handful of words is repeated over and over until the end of the broadcast, whereas in the manual transcription the text simply continues. When this 13-minute piece is left out of the analysis of the results, a WER of 47.36% is obtained for this show and a WER of 53.84% for the whole test set. This is a remarkable improvement compared to the 61.28% that was initially obtained when the piece was included.
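Recomputing the WER without the corrupted turns amounts to skipping those turns when accumulating errors and reference words, as in the following sketch. The turn identifiers and the edit-distance helper are assumptions made for illustration; edit_distance(ref, hyp) should return D + S + I, for example the unnormalized distance from the WER sketch in section 6.

def score_without_corrupted(turns, corrupted_ids, edit_distance):
    # turns: list of (turn_id, reference_words, hypothesis_words);
    # corrupted_ids: identifiers of the turns known to contain corrupted audio.
    errors, n_ref = 0, 0
    for turn_id, ref, hyp in turns:
        if turn_id in corrupted_ids:
            continue  # skip turns whose audio does not match the reference
        errors += edit_distance(ref, hyp)  # D + S + I for this turn
        n_ref += len(ref)
    return 100.0 * errors / n_ref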
6.7 Combination of the improvements
In total three improvements have been made: leaving out the corrupted data, using the sex-dependent acoustic models and changing the word insertion penalty. These three can now be combined. The combination of leaving out the corrupted piece and changing the word insertion penalty results in a WER of 52.32%. When the corrupted piece is left out and the sex-dependent acoustic models are used as well, a WER of 50.97% is obtained. The experiment in which this is then also combined with a word insertion penalty of -5 was unfortunately not carried out. In the universal case this adjustment reduces the WER by 53.84% - 52.32% = 1.52%. This reduction cannot simply be subtracted from the 50.97% obtained by combining the first two improvements, but it may be assumed that half of it still holds when the adapted acoustic models are used. Careful extrapolation of the results therefore indicates that a combination of all the improvements investigated would lead to a WER of 50.97% - 0.76% = 50.21%.
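The extrapolation amounts to the following small calculation. The assumption that half of the penalty gain carries over to the sex-dependent models comes from the reasoning above, not from an actual experiment.

wer_no_corrupt = 53.84   # corrupted piece removed, universal models
wer_penalty = 52.32      # additionally, word insertion penalty of -5
wer_sex_dep = 50.97      # corrupted piece removed + sex-dependent acoustic models

penalty_gain = wer_no_corrupt - wer_penalty      # 1.52 percentage points
estimated = wer_sex_dep - penalty_gain / 2.0     # 50.97 - 0.76 = 50.21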
7. Conclusion
To conclude, it is examined to what extent the objective of the research has been achieved. The goal was to develop a speech recognition system for Castilian speech that would achieve a WER comparable to the WER obtained with the Galician speech recognizer developed in Vigo within the Transcrigal project. Within that project, a WER of 36.62% was achieved using a universal language model and adapted acoustic models. This is 14.35 percentage points lower than 50.97%, the best result obtained in this research. There are several reasons for this difference. The first is that the radio broadcasts are very heterogeneous in terms of topics and structure: they have no fixed structure and the topics are very diverse. This is not the case within the Transcrigal project, where the television broadcasts used have a three-part structure of news, sports and weather. As a result, the language model trained on those broadcasts shows more resemblance to the language and topics occurring in the broadcasts used as test data, which leads to better recognition. A second reason is that the television broadcasts have three fixed anchors, one per block, who return in every broadcast. In that case, not only sex-dependent acoustic models were built for the remaining journalists, but also three models adapted to each of these anchors individually. In the radio broadcasts used in this research, however, hardly any speakers return and there are certainly no fixed anchors, so the acoustic models are much less well adapted, which has a large influence on the WER. A third reason is that the language models of the Transcrigal project were trained on much more data than the models developed in this research. The result obtained could be improved further in the future if more, and more homogeneous, training data were available. More attention could also be paid to finding a solution for the inherently difficult recognition of spontaneous speech.