A2M31RAT - Speech Applications in Telecommunications. Robust speech parametrizations

Doc. Ing. Petr Pollák, CSc.

23 March 2011 - 12:34

Lecture contents

Features for speech recognition
- Properties of features based on the DFT, LPC and the cepstrum
- MFCC and PLP
- Static and dynamic features

Robust parametrization techniques (features for recognition)
- Spectral subtraction
- CMS
- VTLN
- Compensation of the Lombard effect

Part I: Features for speech recognition

Parametrization of the speech signal

parametrization = extraction of features that describe the signal for subsequent classification

Spectral (cepstral) features in use:
- DFT spectrum - used less often because it carries a lot of redundant information
- AR coefficients - unsuitable (coefficients of a polynomial)
- AR model, reflection coefficients - usable, more stable
- Cepstrum - forms clusters in n-dimensional space - OK
- LPC cepstrum - used less often because of its low robustness
- MFCC = mel-cepstrum - the most widely used features (they model the nonlinearity of human hearing)
- PLP - an alternative way of modelling the nonlinearity of hearing
- dynamic and acceleration parameters (delta, delta-delta)
- energy parameters (E, ln E, c0)

Other parametrizations:
- computation using neural networks
- TRAPs - temporal trajectories of features over longer contexts

MFCC - Mel-frequency cepstral coefficients

Block diagram of the mel-cepstral coefficient computation:

s[n] -> DFT -> X[k] -> Mel-BF -> m[k] -> ln(.) -> ln(m[k]) -> IDCT -> c[n]

Energy in one band:

$$ g_j = \ln \sum_{k=0}^{N/2} |S[k]|^2 \, H_{\mathrm{mel},j}[k] $$

Cepstrum computed with the DCT:

$$ c_i = \sqrt{\frac{2}{P}} \sum_{j=1}^{P} g_j \cos\left(\frac{\pi i}{P}\,(j - 0.5)\right) $$
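The two formulas above can be sketched in a few lines of plain Python. This is only an illustration: the function names `band_log_energy` and `mel_cepstrum` are hypothetical, the mel filter weights are assumed to be given, and returning coefficients for i = 0..num_ceps is an assumption, since the slide does not state the index range.

```python
import math

def band_log_energy(power_spectrum, mel_filter):
    """g_j = ln( sum_k |S[k]|^2 * H_mel,j[k] ) for one mel band.

    power_spectrum: |S[k]|^2 for k = 0..N/2
    mel_filter:     weights H_mel,j[k] of band j on the same bins (assumed given)
    """
    return math.log(sum(p * h for p, h in zip(power_spectrum, mel_filter)))

def mel_cepstrum(log_energies, num_ceps):
    """c_i = sqrt(2/P) * sum_{j=1..P} g_j * cos(pi*i/P * (j - 0.5)).

    Returns c_0 .. c_num_ceps (the index range is an assumption).
    """
    P = len(log_energies)
    scale = math.sqrt(2.0 / P)
    return [
        scale * sum(g * math.cos(math.pi * i / P * (j - 0.5))
                    for j, g in enumerate(log_energies, start=1))
        for i in range(num_ceps + 1)
    ]
```

A flat log-spectrum puts all of its energy into c_0 and leaves the higher coefficients at zero, which makes a quick sanity check of the transform.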

PLP cepstral coefficients

Block diagram of the PLP cepstral coefficient computation:

s[n] -> DFT -> X[k] -> CBA -> B[j] -> E(f) -> (.)^0.3 -> IDFT -> R[k] -> LPC -> a[k] -> a2c -> c[n]

- the PLP filter bank discussed earlier is applied
- the cepstrum is computed on the basis of linear prediction
- different noise properties

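Static and differential parameters

$$ \mathbf{c}[i] = \begin{bmatrix} c_1[i] \\ c_2[i] \\ \vdots \\ c_p[i] \end{bmatrix}, \quad
\boldsymbol{\Delta}[i] = \begin{bmatrix} \Delta_1[i] \\ \Delta_2[i] \\ \vdots \\ \Delta_n[i] \end{bmatrix}, \quad
\boldsymbol{\delta}[i] = \begin{bmatrix} \delta_1[i] \\ \delta_2[i] \\ \vdots \\ \delta_n[i] \end{bmatrix} $$

Dynamic parameters (estimate of the first derivative of the basic features):

$$ \Delta_k[i] = \frac{\sum_{m=1}^{M} m \,\bigl(c_k[i+m] - c_k[i-m]\bigr)}{\sum_{m=1}^{M} m^2} \quad \text{for } 1 \le k \le n $$

Acceleration parameters (estimate of the second derivative of the basic features):

$$ \delta_k[i] = \frac{\sum_{m=1}^{M} m \,\bigl(\Delta_k[i+m] - \Delta_k[i-m]\bigr)}{\sum_{m=1}^{M} m^2} \quad \text{for } 1 \le k \le n $$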
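A minimal sketch of the delta computation, using the normalisation by the sum of m^2 as given above. The function name `delta` is hypothetical, and repeating the first and last frame at the edges is an assumption: the slides do not say how the boundaries are handled.

```python
def delta(frames, M=2):
    """Delta_k[i] = sum_{m=1..M} m*(c_k[i+m] - c_k[i-m]) / sum_{m=1..M} m^2.

    frames: list of feature vectors, one per segment i.
    Edge frames are replicated (an assumption, not taken from the slides).
    """
    L = len(frames)
    norm = sum(m * m for m in range(1, M + 1))
    out = []
    for i in range(L):
        row = []
        for k in range(len(frames[i])):
            acc = 0.0
            for m in range(1, M + 1):
                ahead = frames[min(i + m, L - 1)][k]   # c_k[i+m], clamped at the end
                behind = frames[max(i - m, 0)][k]      # c_k[i-m], clamped at the start
                acc += m * (ahead - behind)
            row.append(acc / norm)
        out.append(row)
    return out

# Acceleration (delta-delta) parameters reuse the same operator on the deltas.
static = [[float(i)] for i in range(7)]
deltas = delta(static, M=2)
accelerations = delta(deltas, M=2)
```

With this normalisation a unit-slope ramp gives an interior delta of 2, so the output is proportional to the slope rather than equal to it.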

Part II: Basic robust parametrization techniques

Robust speech parametrization for additive noise compensation

MFCC - mel-frequency cepstral coefficients (DCT), the standard speech features:

x[i] -> DFT -> |X_i| -> MBF -> ln -> DCT -> c_x,i

Robust mel-cepstrum with elimination of the additive background noise:

x[i] -> DFT -> |X_i| -> MSO -> |S_i| -> MBF -> ln -> DCT -> c_s,i

- Modified spectral subtraction (MSO) is a method for suppressing additive noise. The MSO block keeps a recursively smoothed noise magnitude estimate |N_i| (smoothing constant p, unit delay z^-1) and subtracts it from |X_i|.
- The MSO block can also be placed after the MBF or after the logarithm (from the smoothing point of view, placement directly after the DFT is preferable).

[Figure: block diagrams of the standard and the robust MFCC computation]

More detailed information: Bořil, H., Pollák, P., ASIDE 2005, COST27… (references truncated in the source).

Speech recognition results with the standard parametrization

- Isolated-word recognizer, standard MFCC parametrization
- Training and testing on noisy speech with various noise levels

[Results table: values not recoverable from the source]

Speech recognition results with preprocessing of the noisy speech

- Isolated-word recognizer, standard MFCC parametrization
- Training and testing on noisy speech with spectral subtraction

[Results table: values not recoverable from the source]

Robust speech parametrization for additive noise compensation

Spectral subtraction
- various speech enhancement techniques can be used (usually single-channel systems)
- the noise is removed in the frequency domain only → no conversion back to the time domain is needed
- open question: the effect of speech signal distortion (suppression of information)

Options for inserting additive-noise suppression in the frequency domain
- standard MFCC, PLP
- the effect differs between the individual techniques (LPC vs. DCT)
- TRAPs (also based on a filter bank) - promising, so far mostly in basic research
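As a rough illustration of subtraction carried out purely in the frequency domain, the sketch below works on magnitude spectra that would then go straight into the filter bank. It is a generic textbook variant, not the modified method used in the lecture. The function names, the spectral floor `floor`, and estimating the noise from leading non-speech frames are all assumptions.

```python
def estimate_noise(mag_frames, num_noise_frames):
    """Average magnitude spectrum of the first frames, assumed to contain noise only."""
    head = mag_frames[:num_noise_frames]
    bins = len(head[0])
    return [sum(frame[k] for frame in head) / len(head) for k in range(bins)]

def spectral_subtract(mag_frames, noise_mag, floor=0.01):
    """|S_i[k]| = max(|X_i[k]| - |N[k]|, floor * |X_i[k]|).

    The floor keeps magnitudes positive so that the logarithm in the
    MFCC chain stays defined (the floor value is a hypothetical choice).
    """
    return [
        [max(x - n, floor * x) for x, n in zip(frame, noise_mag)]
        for frame in mag_frames
    ]

# Usage: two noise-only frames followed by one frame containing speech.
frames = [[0.4, 1.0], [0.6, 1.0], [3.0, 0.5]]
noise = estimate_noise(frames, 2)
cleaned = spectral_subtract(frames, noise)
```

Because the result stays in the magnitude domain, no inverse transform is needed before the mel filter bank.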

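CMS - Cepstral Mean Subtraction

Suppression of convolutional distortion (mainly the effect of the channel).

Input assumption:

$$ x[n] = s[n] * h[n] \quad \Rightarrow \quad \mathbf{c}_x = \mathbf{c}_s + \mathbf{c}_h $$

The mean cepstrum over all L segments of the signal is given by

$$ \bar{\mathbf{c}}_x = \frac{1}{L} \sum_{i=0}^{L-1} \mathbf{c}_s[i] + \mathbf{c}_h = \bar{\mathbf{c}}_s + \mathbf{c}_h $$

For a sufficient number of segments, $\bar{\mathbf{c}}_s \to 0$, and therefore

$$ \bar{\mathbf{c}}_x = \hat{\mathbf{c}}_h $$

$$ \hat{\mathbf{c}}_s[i] = \mathbf{c}_x[i] - \bar{\mathbf{c}}_x = \mathbf{c}_s[i] + \mathbf{c}_h - \hat{\mathbf{c}}_h \approx \mathbf{c}_s[i] $$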
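The derivation above amounts to removing the per-coefficient mean over all segments. A minimal sketch, assuming the cepstra of one utterance are available as a list of vectors (the function name is hypothetical):

```python
def cepstral_mean_subtraction(cepstra):
    """Return c_x[i] minus the mean cepstrum over all L segments.

    cepstra: list of L cepstral vectors c_x[i] of equal dimension.
    """
    L = len(cepstra)
    dim = len(cepstra[0])
    # Mean cepstrum, which serves as the channel estimate when mean(c_s) -> 0.
    mean = [sum(frame[k] for frame in cepstra) / L for k in range(dim)]
    return [[frame[k] - mean[k] for k in range(dim)] for frame in cepstra]

# Usage: zero-mean "speech" cepstra plus a constant channel offset c_h.
c_s = [[1.0, -2.0], [-1.0, 2.0]]
c_h = [0.5, 0.25]
c_x = [[s + h for s, h in zip(frame, c_h)] for frame in c_s]
c_s_hat = cepstral_mean_subtraction(c_x)
```

In this toy case the speech cepstra average exactly to zero, so the channel term is removed completely; on real data the removal is only approximate and improves with the number of segments.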

VTLN - Vocal Tract Length Normalization

Robustness is increased by compensating for the variability between speakers.

Starting assumption: the vocal tract length is inversely proportional to the formant frequencies

$$ VTL = \frac{(2i - 1) \cdot c}{4 F_i} $$

Solution: a warping function and a warping factor that transform the frequency axis.
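A small numeric illustration of the relation above. The slide does not define c; the sketch assumes it is the speed of sound, taken as roughly 340 m/s, and the formant values are hypothetical round numbers.

```python
def vocal_tract_length(formant_hz, i, c=340.0):
    """VTL = (2i - 1) * c / (4 * F_i), with F_i the i-th formant frequency in Hz.

    c is assumed to be the speed of sound in m/s (not defined on the slide).
    """
    return (2 * i - 1) * c / (4.0 * formant_hz)

# Hypothetical formants of a neutral vowel: F1 = 500 Hz, F2 = 1500 Hz.
vtl_from_f1 = vocal_tract_length(500.0, 1)
vtl_from_f2 = vocal_tract_length(1500.0, 2)
```

Both hypothetical formants point to the same length of about 0.17 m, and a speaker whose formants are all 10 % higher would come out with a tract about 10 % shorter, which is the variability the warping factor is meant to absorb.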

[Figure: illustration of the effect of warping the frequency axis]

Taken from: David Suendermann, Guntram Strecha, Antonio Bonafonte, Harald Hoege, Hermann Ney: Evaluation of VTLN-Based Voice Conversion for Embedded Speech Synthesis. In Interspeech 2005.

[Figure: illustration of the principle of warping functions]

Taken from: Xiaodong Cui and Abeer Alwan: MLLR-Like Speaker Adaptation Based on Linearization of VTLN with MFCC Features. In Interspeech 2005.

VTLN - nonlinear warping

Warping factor α: 0.88 ≤ α ≤ 1.12

$$ \eta_\alpha(f) = f \cdot \left( 1 + \arctan \frac{(1 - \alpha) \sin\left(2\pi \frac{f}{f_{mez}}\right)}{1 - (1 - \alpha) \cos\left(2\pi \frac{f}{f_{mez}}\right)} \right) $$

[Figure: warped frequency vs. linear frequency, 0-4000 Hz]

- often only the band edge frequencies of the filter bank in use (MFCC, PLP) are warped
- the parameter α is estimated by likelihood maximization (analogous to training)

VTLN - linear warping

Warping factor α: 0.88 ≤ α ≤ 1.12

$$ \eta_\alpha(f) = \begin{cases}
\alpha f & \text{for } 0 \le f < f_o \\[4pt]
\alpha f_o + \dfrac{\frac{f_{mez}}{2} - \alpha f_o}{\frac{f_{mez}}{2} - f_o}\,(f - f_o) & \text{for } f_o \le f < f_{mez}
\end{cases} $$

[Figure: warped frequency vs. linear frequency, 0-4000 Hz]

- an approximation of the nonlinear warping function
- the parameter α is again estimated by training
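The piecewise-linear warping function can be sketched as follows. The function name and the example values (f_mez = 8000 Hz, f_o = 3000 Hz) are hypothetical, and the sketch assumes that f_mez/2 is the upper edge of the band, so that the second branch ends where warped and linear frequency meet again.

```python
def warp_piecewise_linear(f, alpha, f_o, f_mez):
    """Piecewise-linear VTLN warping of a frequency f in Hz.

    Below the break frequency f_o the axis is scaled by alpha; above it a
    second linear segment brings the warped frequency back to f_mez / 2
    (treating f_mez / 2 as the band edge is an assumption).
    """
    half = f_mez / 2.0
    if f < f_o:
        return alpha * f
    slope = (half - alpha * f_o) / (half - f_o)
    return alpha * f_o + slope * (f - f_o)

# Usage: warp the edge frequencies of a hypothetical filter bank.
edges = [0.0, 1000.0, 2000.0, 3000.0, 4000.0]
warped = [warp_piecewise_linear(f, 1.1, 3000.0, 8000.0) for f in edges]
```

With α = 1 the function reduces to the identity, and for any α in the stated range the two segments join continuously at f_o.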

Lombard effect (LE)

Changes in speech production under the influence of noise:
- increased intensity of the utterance
- shift of the fundamental frequency of speech (smaller effect on recognition)
- shift of the formant frequencies - STRONG EFFECT ON RECOGNITION
- stress and other emotions have a similar effect (emotional speech recognition)

Presence of the Lombard effect in the available databases

The segmental SNR, Eq. (1), is built from the per-frame ratios of the estimated speech and noise powers, where j is the index of frames with speech activity, since the SNR is evaluated only for short-time frames containing speech (VAD_j = 1 for each j). For each short-time frame it is possible to evaluate only the power of the mixture σ²_{x,i}, as the signal is supposed to contain noise all the time. Powers of speech and noise (σ̂²_{s,i} and σ̂²_{n,i}) have to be estimated. Noise power was estimated in speech pauses by standard exponential estimation:

$$ \hat{\sigma}^2_{n,i} = p \cdot \hat{\sigma}^2_{n,i-1} + (1 - p) \cdot \sigma^2_{x,i} \quad \text{for } VAD_i = 0, \tag{2} $$

$$ \hat{\sigma}^2_{n,i} = \hat{\sigma}^2_{n,i-1} \quad \text{for } VAD_i = 1. \tag{3} $$

If the speech and noise signals can be considered to be uncorrelated, speech power can be estimated by subtraction of noise power from the mixture power:

$$ \hat{\sigma}^2_{s,i} = \sigma^2_{x,i} - \hat{\sigma}^2_{n,i}. \tag{4} $$

In principle, the algorithm estimates standard global SNR evaluated over speech activity regions only. The segmental approach and the averaging of linear power ratios give lower estimation error [8]. Precision of the estimation is very sensitive to correct VAD classification. A detector based on differential cepstral analysis was used. The details are described in [9].

[Figure 1: SPEECON and CZKCC channel SNRs. Panels: SPEECON Channel SNR Histograms (Close-talk Office, Hands-free Office, Close-talk Car, Hands-free Car); CZKCC Channel SNR Histograms (Close-talk and Distant Mike, Engine Off and Engine On). Axes: number of utterances vs. SNR (dB)]

[Figure 2: CLSD channel SNRs]

Since CZKCC and CLSD were recorded by two microphones, SPEECON SNR distributions are also depicted only for the first two channels. In CZKCC a directional microphone was used in the distant position, which explains the higher average SNR in the distant microphone 'engine on' channel than in SPEECON.

Sometimes it is necessary to modify the gain of the microphone preamplifier during the recording session to avoid signal clipping when the speaker changes voice intensity. In consequence, it becomes impossible to evaluate voice intensity changes directly from the amplitude of the recorded speech signal. In case the ambient noise can be considered stationary, relative voice intensity changes can be estimated from the SNR even with the gain being changed during the session. Moreover, if the absolute level of the ambient noise was known, the absolute level of vocal intensity could be estimated (but it was not our case).

In SPEECON and CZKCC environmental characteristics changed significantly when comparing office and car, or standing car with engine off and moving car scenarios, but in case of CLSD ambient noise can be considered stationary and thus SNR histograms relate to overall vocal intensity changes in neutral and Lombard speech. It is obvious that voice intensity rises significantly for the Lombard speech, see Fig. 2.

3.2. Fundamental frequency

f0 was tracked in the WaveSurfer [10]. Tracking was performed in voiced parts of all neutral and noisy speech utterances. In the graph descriptions, letters 'F' and 'M' represent female and male data respectively.

[Figure 3: SPEECON f0 distribution (Office_F, Car_F, Office_M, Car_M). Axes: number of frames vs. frequency (Hz)]

[Figure 4: CZKCC and CLSD f0 distribution (CZKCC: Off_F, Drv_F, Off_M, Drv_M; CLSD: Clean_F, LE_F, Clean_M, LE_M)]

In case of SPEECON, see Fig. 3, and CZKCC, Fig. 4, shifts in f0 distribution are observable but not significant. In case of CLSD, Fig. 4, the maximum of the LE male f0 distribution appears at a higher frequency than the maximum of the neutral female distribution, while the female maximum moves to the location of typical first formant appearance of certain phonemes in neutral speech. During the recognition, the f0 component may be wrongly interpreted as F1.

3.3. Formants

Formant analysis was performed on utterances containing digits. Monophone HTK [11] recognizer trained on 70 SPEECON office sessions was used for the forced alignment. 12th order LPC was chosen for formant tracking performed by the WaveSurfer. Information about first four formant frequencies and bandwidths was assigned to corresponding phonemes. In the following figures, positions of the first two female formants of the selected vowels appearing in Czech digits are presented.

[Figure 5: Positions of female F1, F2 - SPEECON (Office vs. Car). Axes: F2 (Hz) vs. F1 (Hz)]

[Figure 6: Female F1, F2 - CZKCC (Engine Off vs. Engine Drive) and CLSD (Neutral vs. LE)]

Changes in the first two formant F1, F2 locations can be observed for SPEECON, Fig. 5, and CZKCC, Fig. 6. Formant bandwidths did not display any systematic changes in different scenarios. Significant formant shifts can be observed in the CLSD, Fig. 6. Also significant narrowing of the first two formant bandwidths has been observed.

3.4. Phoneme and word durations

Average phoneme durations were evaluated for utterances containing digits. Difference in phoneme duration in the same word uttered in two different scenarios was evaluated as shown in Eq. 5.

$$ \Delta = \frac{T_{C2} - T_{C1}}{T_{C1}} \cdot 100 \; (\%), \tag{5} $$

T_{Cx} represents average phoneme duration in scenario x. In SPEECON, phoneme duration differences did not exceed 38 %. In case of CZKCC, greatest duration changes were observed in the word 'štiri' (phoneme /r/ - 79 %) and in the word 'sedm' (phoneme /e/ - 73 %). Most significant differences in phoneme duration were observed in the CLSD database, e.g. in word 'jedna' (/e/ - 161 %), 'pjet' (/e/ - 174 %), 'devjet' (2nd /e/ - 177 %). No systematic changes in word duration were observed in SPEECON.

[Figure 7: CZKCC - word duration changes. Table columns: Word, #N, T_N (s), σ_Tn (%), #LE, T_LE (s), σ_Tle (%), Δ (%); rows include Nula, Jedna, Dvje; cell values not recoverable from the source]
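The recursive noise-power tracking of Eqs. (2)-(4) in the excerpt above can be sketched as below. The function name, the smoothing constant p = 0.9, and the initialisation of the noise estimate are assumptions; the excerpt does not give them. The speech power is left unclamped, exactly as Eq. (4) is written, so it can become negative in frames where the noise estimate exceeds the mixture power.

```python
def estimate_powers(mix_power, vad, p=0.9, init_noise=None):
    """Track noise power and derive speech power per short-time frame.

    mix_power:  mixture powers sigma^2_x,i per frame
    vad:        voice activity flags per frame (0 = pause, 1 = speech)
    p:          smoothing constant (hypothetical default)
    init_noise: starting noise estimate; defaults to the first mixture
                power (an assumption)
    Returns a list of (speech_power, noise_power) tuples.
    """
    noise = mix_power[0] if init_noise is None else init_noise
    estimates = []
    for sx, active in zip(mix_power, vad):
        if active == 0:
            # Eq. (2): exponential update during speech pauses
            noise = p * noise + (1.0 - p) * sx
        # Eq. (3): during speech the previous noise estimate is held
        speech = sx - noise  # Eq. (4): uncorrelated speech and noise
        estimates.append((speech, noise))
    return estimates

# Usage: two pause frames followed by two speech frames.
est = estimate_powers([1.0, 1.0, 5.0, 5.0], [0, 0, 1, 1], p=0.5, init_noise=1.0)
```

Because the update is frozen whenever the detector reports speech, a VAD error feeds speech energy into the noise estimate, which fits the excerpt's remark that the precision is very sensitive to correct VAD classification.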

⇓ Contrary to expectation, the LE is present to a lesser extent (read utterances, no feedback).

Analysis of LE presence in CLSD05

[Figure: CLSD Channel SNR Histograms (Close-talk Clean, Hands-free Clean, Close-talk LE, Hands-free LE). Axes: number of utterances vs. SNR (dB)]

[Figure: CLSD - Fundamental Frequency Distribution (Clean_F, LE_F, Clean_M, LE_M). Axes: number of frames (x 10 000) vs. frequency (Hz)]

[Figure: CLSD - Female Vowel Formants, Neutral vs. LE. Axes: F2 (Hz) vs. F1 (Hz)]

Options for LE compensation

- Voice conversion - modification of f0 and possibly other parameters
- Special parametrization - data-driven design of a filter bank with an optimized layout of bands
- Frequency transformation of the formant frequencies - analogous to VTLN, computed from the intensity of the LE

Thank you for your attention!
