Machine Translation at translate.google.com
Slav Petrov on behalf of many people from Google Research (slides borrowed in part from Dan Klein, Thorsten Brants and Peng Xu)
Machine Translation (French)
Machine Translation (Japanese)
General Approaches
• Rule-based approaches (outdated?)
  • Expert-system-like rewrite systems
  • Interlingua methods (analyze and generate)
  • Lexicons come from humans
  • Can be very fast, and can accumulate a lot of knowledge over time (e.g. Systran)
• Statistical approaches (topic of this talk)
  • Word-to-word translation
  • Phrase-based translation
  • Syntax-based translation (tree-to-tree, tree-to-string)
  • Trained on parallel corpora
  • Usually noisy-channel (at least in spirit)
Levels of Transfer
Corpus-Based MT
Modeling correspondences between languages.
Sentence-aligned parallel corpus:
  Yo lo haré mañana → I will do it tomorrow
  Hasta pronto → See you soon
  Hasta pronto → See you around
Machine translation system (model of translation), given a novel sentence:
  Yo lo haré pronto → I will do it soon / I will do it around / See you tomorrow
The Noisy-Channel Model
We want to predict a sentence e given its translation f.
The noisy channel approach: argmax_e P(e|f) = argmax_e P(f|e) P(e)
Translation model P(f|e): a table with ‘phrase’ translations and their probabilities
Language model P(e): distributions over sequences of words (sentences)
MT System Components (Noisy-Channel Model)
• Language model (source): P(e)
• Translation model (channel): P(f|e)
• Decoder: given the observed f, find the best e
• argmax_e P(e|f) = argmax_e P(f|e) P(e)
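To make this decomposition concrete, here is a minimal toy sketch (not the production decoder) that scores candidate translations e of an observed f by P(f|e)·P(e) in log space; the tiny phrase-table and language-model scores are invented for illustration.

```python
def tm_log_prob(f, e):
    """Toy translation-model score log P(f|e) (assumed lookup table)."""
    table = {("Yo lo haré pronto", "I will do it soon"): -1.0,
             ("Yo lo haré pronto", "See you tomorrow"): -4.0}
    return table.get((f, e), -10.0)

def lm_log_prob(e):
    """Toy language-model score log P(e) (assumed lookup table)."""
    table = {"I will do it soon": -5.0, "See you tomorrow": -4.5}
    return table.get(e, -20.0)

def best_translation(f, candidates):
    # argmax_e P(e|f) = argmax_e P(f|e) * P(e), computed in log space.
    return max(candidates, key=lambda e: tm_log_prob(f, e) + lm_log_prob(e))

print(best_translation("Yo lo haré pronto",
                       ["I will do it soon", "See you tomorrow"]))
```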
Statistical MT: Translation as a search problem
Unsupervised Word Alignment
• Input: a bitext: pairs of translated sentences
  nous acceptons votre opinion . / we accept your view .
• Output: alignments: pairs of translated words
• When words have unique sources, we can represent the alignment as a (forward) alignment function a from French positions to English positions
Word Alignment
• x (sentence pair):
  What is the anticipated cost of collecting fees under the new proposal?
  En vertu des nouvelles propositions, quel est le coût prévu de perception des droits?
• z (tokenized, with word-level alignment links shown on the slide):
  What is the anticipated cost of collecting fees under the new proposal ?
  En vertu de les nouvelles propositions , quel est le coût prévu de perception de les droits ?
Example from Kevin Knight
Translate: farok crrrok hihok yorok clok kantok ok-yurp
Parallel corpus of sentence pairs:
1c. ok-voon ororok sprok .                        1a. at-voon bichat dat .
2c. ok-drubel ok-voon anok plok sprok .           2a. at-drubel at-voon pippat rrat dat .
3c. erok sprok izok hihok ghirok .                3a. totat dat arrat vat hilat .
4c. ok-voon anok drok brok jok .                  4a. at-voon krat pippat sat lat .
5c. wiwok farok izok stok .                       5a. totat jjat quat cat .
6c. lalok sprok izok jok stok .                   6a. wat dat krat quat cat .
7c. lalok farok ororok lalok sprok izok enemok .  7a. wat jjat bichat wat dat vat eneat .
8c. lalok brok anok plok nok .                    8a. iat lat pippat rrat nnat .
9c. wiwok nok izok kantok ok-yurp .               9a. totat nnat quat oloat at-yurp .
10c. lalok mok nok yorok ghirok clok .            10a. wat nnat gat mat bat hilat .
11c. lalok nok crrrok hihok yorok zanzanok .      11a. wat nnat arrat mat zanzanat .
12c. lalok rarok nok izok hihok mok .             12a. wat nnat forat arrat vat gat .
Bidirectional Alignment
Alignment Heuristics
Extracting Phrases
From “Intro to Stat MT” slides by Chris Callison-Burch and Philipp Koehn
Probabilistic Language Models
• Goal: assign useful probabilities P(x) to sentences x
  • Input: many observations of training sentences x
  • Output: a system capable of computing P(x)
• Probabilities should broadly indicate plausibility of sentences
  • P(I saw a van) >> P(eyes awe of an)
  • Not grammaticality: P(artichokes intimidate zippers) ≈ 0
  • In principle, “plausible” depends on the domain, context, speaker…
• One option: the empirical distribution over training sentences?
  • Problem: doesn’t generalize (at all)
• Two aspects of generalization
  • Decomposition: break sentences into small pieces which can be recombined in new ways (conditional independence)
  • Smoothing: allow for the possibility of unseen pieces
N-Gram Model Decomposition
• Chain rule: break the sentence probability down:
  P(w_1 w_2 … w_L) = ∏_i P(w_i | w_1 … w_{i-1})
• Impractical to condition on everything before:
  P(??? | Turn to page 134 and look at the picture of the)
• N-gram models: assume each word depends only on a short linear history:
  P(w_i | w_1 … w_{i-1}) ≈ P(w_i | w_{i-n+1} … w_{i-1})
• Example: see the bigram sketch below
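As a concrete illustration of the decomposition, the sketch below estimates an unsmoothed bigram model by relative frequency from a toy corpus and scores a sentence with the chain rule; real systems use higher orders and the smoothing discussed later.

```python
import math
from collections import defaultdict

def train_bigram_lm(sentences):
    """Relative-frequency bigram estimates P(w_i | w_{i-1}) from a toy corpus."""
    unigram, bigram = defaultdict(int), defaultdict(int)
    for s in sentences:
        words = ["<s>"] + s.split() + ["</s>"]
        for prev, cur in zip(words, words[1:]):
            unigram[prev] += 1
            bigram[(prev, cur)] += 1
    return unigram, bigram

def sentence_log_prob(s, unigram, bigram):
    """Chain rule with the bigram (n=2) Markov assumption; unsmoothed."""
    words = ["<s>"] + s.split() + ["</s>"]
    logp = 0.0
    for prev, cur in zip(words, words[1:]):
        if bigram[(prev, cur)] == 0:
            return float("-inf")      # unseen bigram: this is what smoothing fixes
        logp += math.log(bigram[(prev, cur)] / unigram[prev])
    return logp

uni, bi = train_bigram_lm(["I saw a van", "I saw a cat"])
print(sentence_log_prob("I saw a van", uni, bi))
```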
More (parallel) data is better data…
[Chart: test-set BLEU vs. parallel corpus size (1M to 1B) for pt-en, fr-en, es-en, fi-en, de-en, and zh-en; BLEU rises as the parallel corpus grows.]
Improvement over time: data & algorithms
[Chart: test-set BLEU for Hindi-English, Thai-English, and Hungarian-English from Q4'07 through Q1'09.]
Motivation
• Train 5-gram language models
• Vary the amount of training data from 13 million to 2 trillion tokens
• The data is divided into four sets
• Train a separate language model for each set
• Use up to four language models during decoding
Motivation
[Chart: test-set BLEU vs. LM training-data size for the four data sets (target, +ldcnews, +webnews, +web); green = Kneser-Ney, blue = Stupid Backoff.]
• Growth is nearly linear within domains
• The improvement is smaller for web data, but doesn’t seem to level off
• Stupid Backoff has higher BLEU than Kneser-Ney with >8G tokens of training data
MapReduce Recap
• Automatic parallelization & distribution
• Fault-tolerant
• Provides status and monitoring tools
• Clean abstraction for programmers
MapReduce Recap
1. Backup map workers compensate for the failure of a map task.
2. The sorting phase requires all map jobs to complete before the reduce phase starts.
Language Models
• n-gram language models: P(w_1 … w_L) ≈ ∏_i P(w_i | w_{i-n+1} … w_{i-1})
• n-gram statistics, relative frequency: P(w_i | w_{i-n+1} … w_{i-1}) = count(w_{i-n+1} … w_i) / count(w_{i-n+1} … w_{i-1})
Stupid Backoff Smoothing
• S(w_i | w_{i-k+1} … w_{i-1}) = count(w_{i-k+1} … w_i) / count(w_{i-k+1} … w_{i-1}) if the numerator count is > 0, otherwise α · S(w_i | w_{i-k+2} … w_{i-1}), with a fixed backoff factor α (0.4 in Brants et al., 2007)
• Statistics required: counts only (no normalized probabilities or precomputed backoff weights); see the sketch below
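A minimal sketch of the Stupid Backoff score S computed from raw counts only, following the recursion above; the toy count table is an assumption, and 0.4 is the backoff factor reported in Brants et al. (2007).

```python
ALPHA = 0.4  # fixed backoff factor from Brants et al. (2007)

def stupid_backoff(words, counts, total_tokens):
    """Score S(w_n | w_1 ... w_{n-1}) from raw counts, recursing to shorter histories.

    `counts` maps an n-gram tuple to its corpus count and `total_tokens` is the
    corpus size; note that S is a score, not a normalized probability."""
    if len(words) == 1:
        return counts.get(words, 0) / total_tokens
    full, history = tuple(words), tuple(words[:-1])
    if counts.get(full, 0) > 0 and counts.get(history, 0) > 0:
        return counts[full] / counts[history]
    return ALPHA * stupid_backoff(words[1:], counts, total_tokens)

# Toy counts, e.g. produced by the n-gram counting pipeline described next.
counts = {("see",): 7, ("you",): 9, ("see", "you"): 5, ("soon",): 3}
print(stupid_backoff(("see", "you", "soon"), counts, total_tokens=100))
```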
Stupid Backoff Smoothing
• Traditional LM tools:
  • Scan the text and keep a trie data structure in memory
  • Direct (or simple/fast) access to all necessary counts
• Scaling to >1 trillion words: MapReduce, no in-memory access to all counts
• Steps:
  • Collect unigram counts
  • Determine the vocabulary
  • Collect n-gram counts
Collect Unigram Counts
• Mapper: input={Document, }, output={word, count} (see the sketch below)
• Reducer: input={word, {count}}, output={word, count}
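A minimal in-process sketch of the mapper/reducer pair above; the function names are illustrative and the shuffle phase is simulated locally rather than run on the MapReduce framework.

```python
from collections import defaultdict

def unigram_map(document):
    """Mapper: emit (word, 1) for every token in the document."""
    for word in document.split():
        yield word, 1

def unigram_reduce(word, counts):
    """Reducer: sum all partial counts for one word."""
    return word, sum(counts)

# Simulate the shuffle phase locally for illustration.
docs = ["we accept your view", "we accept the proposal"]
shuffled = defaultdict(list)
for doc in docs:
    for word, count in unigram_map(doc):
        shuffled[word].append(count)
print(dict(unigram_reduce(w, c) for w, c in shuffled.items()))
```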
Determine Vocabulary
• Crucial to reduce the amount of intermediate data storage
• Simple count cut-off: words below the cutoff are mapped to a special unknown-word token
• Frequent words get small IDs: good for compression (see the sketch below)
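A sketch of the vocabulary step under the scheme above: apply a count cutoff, map everything below it to an unknown-word token, and give frequent words small IDs; the cutoff value and token name are assumptions.

```python
def build_vocab(unigram_counts, cutoff=2, unk="<UNK>"):
    """Map words below the count cutoff to an unknown token and give
    frequent words small integer IDs (good for later compression)."""
    kept = [w for w, c in unigram_counts.items() if c >= cutoff]
    kept.sort(key=lambda w: -unigram_counts[w])   # most frequent words first
    vocab = {unk: 0}
    for word in kept:
        vocab[word] = len(vocab)
    return vocab

counts = {"the": 120, "view": 7, "accept": 5, "zanzanok": 1}
vocab = build_vocab(counts)
print(vocab)   # rare words fall back to vocab["<UNK>"] during lookup
```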
Collect N-Gram Counts
• In a straightforward way:
  • Similar to unigram counting: count n-grams for all orders 1, …, n
  • For each word in the corpus, O(n) copies of it are generated in the intermediate data
  • Unnecessary overhead for shuffling, sorting, etc. in MapReduce
• Better way:
  • Only emit the highest-order n-grams
  • Split the training corpus into smaller, more manageable chunks and combine the counts later
Collect N-Gram Counts
• Mapper: input={Document, }, output={reversed_ngram, count} (see the sketch below)
• Reducer: input={reversed_ngram, {count}}, output={reversed_ngram, count}
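A sketch of the mapper that emits only the highest-order n-grams with their words reversed (so that related n-grams sort together for the later combining step); the 5-gram order and function names are assumptions.

```python
from collections import defaultdict

def ngram_map(document, order=5):
    """Mapper: emit (reversed n-gram, 1) for each highest-order n-gram."""
    words = document.split()
    for i in range(len(words) - order + 1):
        ngram = words[i:i + order]
        yield " ".join(reversed(ngram)), 1

def ngram_reduce(reversed_ngram, counts):
    """Reducer: sum partial counts for one reversed n-gram."""
    return reversed_ngram, sum(counts)

# Simulate the shuffle phase locally for illustration.
shuffled = defaultdict(list)
for doc in ["a b c d e f", "x a b c d e"]:
    for key, count in ngram_map(doc):
        shuffled[key].append(count)
print(sorted(ngram_reduce(k, v) for k, v in shuffled.items()))
```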
Combine N-Gram Counts
• Split the input into several MapReduce jobs for scalability
• Combine the separate n-gram statistics
• Catch:
  • N-grams are processed in sequential order
  • No access to other n-grams in general
  • Possibly a large amount of intermediate data
• Trick:
  • Input the n-grams in reverse word order
  • Restore the normal n-gram order in Map
  • Use a cache to accumulate counts in Map
• Example input (reversed n-grams with counts): E D 2, E D C 3, E D C B 1, E D C A 1, F 2, …
  Book-keeping in the cache:
  • after "E D 2": E=2, ED=2
  • after "E D C 3": E=5, ED=5, EDC=3
  • after "E D C B 1": E=6, ED=6, EDC=4, EDCB=1
  • after "E D C A 1": E=7, ED=7, EDC=5, EDCB=1, EDCA=1; Output("BCDE", 1)
  • after "F 2": E=7, ED=7, EDC=5, EDCA=1, F=2; Output("ACDE", 1); Output("CDE", 5); Output("DE", 7); Output("E", 7)
• The size of the cache is only the highest n-gram order n (see the sketch below)
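A sketch of the cache trick above: reversed n-grams arrive grouped by shared prefix, a small stack accumulates counts for every prefix, and each prefix's total is emitted in normal word order once the input moves past it. The function and variable names are assumptions; the example input reproduces the book-keeping shown above.

```python
def combine_reversed_ngram_counts(sorted_items):
    """Stream over (reversed n-gram, count) pairs, grouped by shared prefix,
    and emit total counts for every n-gram order in normal word order.

    The cache is a stack with one (word, accumulated count) entry per prefix
    word, so its size is bounded by the highest n-gram order."""
    stack = []        # stack[i] = [word, count] for the prefix of length i+1
    results = []

    def flush(down_to):
        # Emit every cached prefix longer than `down_to`, deepest first.
        while len(stack) > down_to:
            words = [w for w, _ in stack]
            results.append((" ".join(reversed(words)), stack[-1][1]))
            stack.pop()

    for reversed_ngram, count in sorted_items:
        words = reversed_ngram.split()
        # Longest shared prefix between the cache and the new reversed n-gram.
        common = 0
        while (common < len(stack) and common < len(words)
               and stack[common][0] == words[common]):
            common += 1
        flush(common)
        for i in range(common):          # add to the counts of shared prefixes
            stack[i][1] += count
        for w in words[common:]:         # extend the cache with the new suffix
            stack.append([w, count])
    flush(0)
    return results

items = [("E D", 2), ("E D C", 3), ("E D C B", 1), ("E D C A", 1), ("F", 2)]
for ngram, count in combine_reversed_ngram_counts(items):
    print(ngram, count)   # reproduces the outputs in the example above
```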
Generate Final Distributed LM
• So far, we have:
  • Stupid Backoff: {ngram, count}
  • Katz/Kneser-Ney: {ngram, prob/backoff}
• Need to pack them into the final distributed LM
• Depends on the serving architecture in a client/server approach
Distributed LM Architecture
• The final LM data is served on multiple servers
• Each server performs lookups for multiple clients
• Saves memory: if the model is 8 GB, 100 clients would need 800 GB of memory with local copies, but only 80 GB if 10 servers each serve one copy
Distributed LM Architecture
[Diagram: Server 0 … Server M-1 plus a Vocab Server, each with replica servers for redundancy and higher load; load balancers distribute the load between replicas; Client 0, Client 1, … each keep a cache and use a user-defined hash function to decide which server to contact.]
Distributed LM Latency
• Network latency:
  • In-memory access: X hundred nanoseconds
  • Network access: Y milliseconds
• A typical sentence of 25 words in machine translation:
  • Roughly 100K n-grams are looked up during search
  • Requires >100 seconds to access all probabilities one request at a time
• Batch processing:
  • One batch for each word position: 25 batches in ~100 milliseconds
Distributed LM
• For each n-gram A B C D E, we may need:
  • Stupid Backoff: P(E|ABCD), P(E|BCD), P(E|CD), P(E|D), P(E)
  • Katz/Kneser-Ney: P(E|ABCD), α(ABCD), P(E|BCD), α(BCD), P(E|CD), α(CD), P(E|D), α(D), P(E)
• Ideally these entries should be located on the same server; in practice this is not possible for Katz/Kneser-Ney
• Hashing n-grams by the last word (E) could work well:
  • One network request per n-gram for Stupid Backoff
  • Two network requests per n-gram for Katz/Kneser-Ney
• In practice, we hash n-grams by the last two words (D E) and duplicate unigram entries on every server for better-balanced shards (see the sketch below)
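A sketch of this sharding rule: hash each n-gram by its last two words to pick a server; since unigram entries are duplicated everywhere, any shard can answer them. The hash choice and names are assumptions.

```python
import hashlib

def shard_for(ngram_words, num_servers):
    """Pick the server holding this n-gram: hash the last two words (D E).

    Unigrams are duplicated on every server, so any shard works for them."""
    key = " ".join(ngram_words[-2:])
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_servers

# All n-grams needed to score "A B C D E" under Stupid Backoff share the last
# two words "D E", so they land on the same shard: one request per n-gram.
needed = [("A", "B", "C", "D", "E"), ("B", "C", "D", "E"), ("C", "D", "E"), ("D", "E")]
print({shard_for(ng, num_servers=10) for ng in needed})   # a single shard id
```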
Client for Distributed LM
• Typical client usage (sketched below):
  • From a given set of histories, tentatively expand to the possible next word (or words): dummy lookups
  • Batch all required n-grams (and histories too, in the case of Katz/Kneser-Ney)
  • Send all batched entries to the servers
  • Wait for the servers to return probabilities (and possibly backoff weights)
  • Revisit all expansions with the real probabilities and perform pruning: true lookups
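A sketch of that dummy-lookup/true-lookup pattern: record the n-grams each tentative expansion needs, batch them per shard, fetch each batch in one round trip, then rescore; `fetch_from_server` stands in for the real RPC and is an assumption.

```python
from collections import defaultdict

def fetch_from_server(server_id, ngrams):
    """Placeholder for one batched RPC to an LM server (assumed interface)."""
    return {ng: -2.0 for ng in ngrams}     # fake log-probabilities

def score_expansions(histories, next_words, num_servers, shard_for):
    # 1) Dummy lookups: record which n-grams each expansion needs.
    batches, needed = defaultdict(set), []
    for hist in histories:
        for word in next_words:
            ngram = tuple(hist) + (word,)
            needed.append(ngram)
            batches[shard_for(ngram, num_servers)].add(ngram)
    # 2) One batched request per server, then wait for all replies.
    probs = {}
    for server_id, ngrams in batches.items():
        probs.update(fetch_from_server(server_id, ngrams))
    # 3) True lookups: revisit expansions with real scores (pruning would go here).
    return {ngram: probs[ngram] for ngram in needed}

print(score_expansions([("see", "you")], ["soon", "around"], num_servers=10,
                       shard_for=lambda ng, n: hash(" ".join(ng[-2:])) % n))
```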
Typical Decoder as Client
• Repeatedly switch between dummy lookups and true lookups
[Diagram: decoder search graph with active nodes and pruned nodes.]
Distributed LM Data Structure
• The served model data structure can be flexible:
  • Compact trie
  • Block encoding
  • Perfect hashing
• Different applications may choose different data structures
• One MapReduce job generates the served models
Compressing Language Models
• Data structures with a tradeoff between:
  • Space (compression)
  • Runtime (n-gram lookups)
  • Accuracy (quantization, lossy compression)
  • Functionality (enumeration, next word)

                  Space      Runtime    Accuracy   Functionality
  Trie            baseline   baseline   baseline   baseline
  Distributed LM  better     worse      same       same
  Quantization    better     same       worse      same
  Randomized      better     better     worse      worse
Value Quantization
• Storing probabilities as floating-point numbers is expensive
• Only a small number of bits is needed to achieve the same quality
[Chart: BLEU vs. number of quantization bits (3 to 8).]
• Store the negative log probability with 2^#bits different values (see the sketch below)
• The experiment above uses simple, uniformly sized bins
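A sketch of the uniform-bin scheme: store each negative log probability as a #bits-wide integer index into 2^#bits equally sized bins; the value range used here is an assumption.

```python
def quantize(neg_log_prob, bits=5, lo=0.0, hi=20.0):
    """Map a negative log probability into one of 2**bits uniform bins."""
    num_bins = 2 ** bits
    clipped = min(max(neg_log_prob, lo), hi)
    return min(int((clipped - lo) / (hi - lo) * num_bins), num_bins - 1)

def dequantize(bin_id, bits=5, lo=0.0, hi=20.0):
    """Return the center of the bin as the reconstructed value."""
    num_bins = 2 ** bits
    return lo + (bin_id + 0.5) * (hi - lo) / num_bins

value = 7.31                      # e.g. -log P(w | history)
b = quantize(value)
print(b, dequantize(b))           # 5 bits: reconstruction error <= half the bin width
```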
Machine Translation Results with Bloomier Filters
• Vary the number of value bits and error bits
• Results with 7 or more value bits are equivalent to full floating point
• Results with 12 or more error bits are equivalent to the lossless model
Conclusions
• Google uses Statistical Machine Translation
• There is no data like more data
• (But smarter algorithms help, too)
• Challenges in a distributed environment are different than on a single core
• Approximate solutions work well in practice (especially in the presence of massive data)
• Think about the massive computation that happens the next time you use: translate.google.com