
Active Learning for Statistical Machine Translation

Chris Callison-Burch

PhD Proposal
Institute for Communicating and Collaborative Systems
School of Informatics
University of Edinburgh
2003
Abstract
For my PhD I propose to apply active learning to statistical machine translation. Statistical machine translation is a data-intensive way of producing translation systems. It uses machine learning algorithms to automatically create translation models from bilingual training data. Statistical translation can be used for any language provided that there is sufficient training data. However, when only small amounts of training data are available, statistical translation fails to produce good translations. This is a problem for so-called low density languages, which do not have extensive resources. I will examine the problem of using statistical translation for such languages by focusing on efficient ways of creating training data through active learning. Active learning is a way to reduce the cost of creating a corpus of labeled training examples. Most machine learning takes place passively, because the statistical learner has no input on which examples it is trained on. By contrast, active learning gives the statistical learner the power to query a human annotator to label the examples that will be most informative to its learning. Active learning thus minimizes the amount of training data required to achieve a certain performance level by selectively sampling the data that needs to be annotated. This in turn reduces the amount of human effort required to create a training corpus, and reduces the associated cost of its creation. In this report I describe the theory and practice of active learning, review the relevant details of how statistical translation is learned from data, and propose a number of ways in which active learning could be applied to translation. The significance of my thesis will be to reduce the main cost associated with creating statistical translation systems.
This will mean that statistical translation may be efficiently applied to a much wider range of languages than it can currently be applied to, and that it may be more effectively transferred to new domains of language use.


Table of Contents

1 Motivation
    1.1 Motivation for using active learning
    1.2 Active learning applied to machine translation

2 Active Learning
    2.1 Theory
    2.2 Sample selection
    2.3 Committee-based selection methods
        2.3.1 Applied to NLP
        2.3.2 Related topics
    2.4 Certainty-based selection methods
        2.4.1 An example from NLP

3 Statistical Translation
    3.1 As a supervised learning problem
    3.2 As an unsupervised learning problem
        3.2.1 Parameters of word-level alignments
        3.2.2 Parameter estimation
        3.2.3 Word alignment accuracy
    3.3 Phrase-based translation models

4 Research Goals
    4.1 Selective sampling for translation
        4.1.1 Committee-based selection
        4.1.2 Certainty-based selection
        4.1.3 Other selection methods
        4.1.4 Evaluating selection methods
    4.2 Selective sampling for word-level alignments
    4.3 Considerations for non-simulated settings

Bibliography

Chapter 1 Motivation
Statistical machine translation (as formulated by Brown et al. (1993)) is a data-driven way of producing translation systems. Statistical translation differs from rule-based approaches to machine translation: whereas rule-based approaches rely on morphological, syntactic, and/or semantic analysis of languages, statistical translation requires no linguistic information beyond that which is present in bilingual training data. Because of this, statistical machine translation is purportedly language-independent. The supposition is that it can be used to produce a translation system for any language. Since the original Brown et al. (1993) formulation of statistical translation focused on French-English, one might assume that it was somehow biased towards that language pair. However, statistical translation has been successfully applied to all the European Union languages, as well as to translation from Arabic, Chinese, and Japanese into English. Training data in the form of sentence-aligned bilingual corpora (also known as parallel corpora) exists in abundance for each of these language pairs. The LDC currently has 65 million words worth of French-English data, 90 million of Arabic-English, and 120 million of Chinese-English, and there are pre-processed, publicly accessible EU corpora with 20 million words in each of the eleven languages (with gigabytes more raw data accessible through the EU site itself). This suggests that the application of statistical machine translation to a particular language is greatly facilitated by the availability of a large training corpus. The performance of a statistical translation system is directly related to the amount of training material available for a language pair. This can be empirically verified by graphing the performance of translation models against the amount of training data used to produce them. Figure 1.1 graphs the learning curve for statistical translation.
Notice that the accuracy of translation increases as more data is used, but that the


[Figure 1.1 plots learning curves for German-to-English, French-to-English, and Spanish-to-English translation: accuracy (100 − word error rate, roughly 44 to 64) on the y-axis against training corpus size in sentence pairs (10,000 to 100,000, log scale) on the x-axis.]

Figure 1.1: The learning rate of statistical machine translation.

convergence rate of the system is very slow. That is, accuracy only increases by a few percent even as 100,000 training examples are added. While the graph in Figure 1.1 indicates that translation quality increases with the amount of training data, it does not bode well for so-called low density languages. Low density languages are languages for which extensive linguistic resources are not available. Since the learning rate for statistical translation is so slow, we may need to generate training data in amounts approaching the size of the LDC's corpora before converging on an acceptable level of translation quality. The question that I will address in my thesis is whether it is possible to apply statistical translation to low density languages by devising more effective methods for creating training corpora. This question is not only of academic interest; it is also a political and commercial concern. The US Defense Advanced Research Projects Agency (DARPA) has recently introduced a "surprise language" exercise into its annual machine translation evaluation competition. The surprise language exercise was developed to test how effectively machine translation could be employed in a situation where a low density language suddenly became significant. To that end, the DARPA evaluation is a feasibility study to see whether a translation system could be created quickly, given no preexisting resources. Teams participating in the surprise language event this year were given one month to create an English-Hindi translation system from scratch. Oard et al. (2003) describe the various strategies that teams employed. These included trying to


find training data online, scanning in bilingual dictionaries from books and performing optical character recognition, and hiring native speakers to translate sentences to create a parallel corpus manually. The question of whether it is possible to create a translation system from scratch is also a commercial concern. If a company were deciding whether to create a statistical translation system for a new language pair or for a new domain (i.e. a domain outside of government documents), an important consideration would be the cost of assembling the training corpus. Germann (2001) describes the cost of building a Tamil-English parallel corpus entirely from scratch. Professional translators in the US charge rates of approximately 36 cents per Tamil word for translations from Tamil into English. Non-professional translators (in this case engineering graduate students who were native Tamil speakers) translated at a cost of about 10.8 cents per Tamil word. Extrapolating from these figures, we can estimate the cost of building a corpus the size of the Canadian Hansards training data used in the original Brown et al. (1993) experiments as being somewhere in the range of $6–20 million. This cost is prohibitively high, so any way of reducing the amount of data that needs to be created would increase the feasibility of using statistical machine translation.

1.1 Motivation for using active learning

The problem with applying statistical machine translation to a low density language such as Tamil is:

- A substantial amount of training data must be created in order to achieve an acceptable level of quality, but
- There are usually limited resources (time or money) for creating the data.

The problem of training data being scarce and costly to produce is not limited to machine translation, but affects statistical natural language processing and machine learning in general. Recent research on the topic of active learning seeks to ameliorate these problems. Most machine learning falls into the category of supervised learning, meaning that learning algorithms (which are referred to informally as learners and models in this report) require labeled training data. Examples of labeled training data in statistical natural language processing include words labeled with their parts-of-speech, sentences


labeled with their parse trees, and documents labeled with their categories. The process of assembling an annotated corpus involves starting with a collection of unlabeled examples and then manually applying labels to them.1 Manually assembling a corpus can be costly and time consuming (see for example Marcus et al. (1993), which details the costs involved with creating the Penn Treebank). In most machine learning setups, the process of selecting which examples are to be labeled is independent of the learner. In such cases learning is said to be passive, because the learner has no influence over which examples constitute its training set. In contrast, active learning gives the learner the power to select which examples should be included in the training set. Active learning reduces the amount of labeling that needs to be done through selective sampling2 of the unlabeled data. In active learning the learner examines a collection of unlabeled examples and selects those examples which will be most informative to it, queries the human annotator to label the selected examples, and iteratively re-trains on the augmented set of labeled training examples. This avoids annotating examples that contribute redundant information to the learner. Active learning thus reduces the number of examples that need to be hand-labeled, and thereby reduces the amount of human effort needed to create a training corpus.
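The select, query, and retrain cycle just described can be sketched in a few lines of Python. This is an illustrative sketch rather than an implementation from the literature: `train`, `informativeness`, and `query_human` are hypothetical placeholders for a concrete learner, a selection criterion, and the human annotator.

```python
import random

def active_learning_loop(unlabeled, train, informativeness, query_human,
                         batch_size=10, rounds=5):
    """Generic pool-based active learning loop (illustrative sketch).

    unlabeled:       list of unlabeled examples
    train:           maps a list of (example, label) pairs to a model
    informativeness: maps (model, example) to a score; higher = more informative
    query_human:     maps an example to its label (stands in for the annotator)
    """
    labeled, model = [], None
    for _ in range(rounds):
        if not unlabeled:
            break
        if model is None:
            # Seed with a random batch before a model exists to guide selection.
            batch = random.sample(unlabeled, min(batch_size, len(unlabeled)))
        else:
            # Selective sampling: query the most informative examples.
            batch = sorted(unlabeled,
                           key=lambda e: informativeness(model, e),
                           reverse=True)[:batch_size]
        for e in batch:
            unlabeled.remove(e)
            labeled.append((e, query_human(e)))
        model = train(labeled)   # retrain on the augmented labeled set
    return model, labeled
```

A passive learner would draw every batch with `random.sample`; the only change active learning makes is the `sorted(...)` selection step.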

1.2 Active learning applied to machine translation

For my PhD I propose to examine the problem of using active learning to improve the quality of statistical translation. While active learning has been successfully applied to a number of other natural language processing tasks, it has not yet been applied to translation. The primary contribution of my work will be in showing how statistical translation techniques can be applied to languages whose lack of data normally puts them out of range. My focus will be on the efficient creation of training data for languages that have scarce resources (either because they are low density languages, or because parallel corpora are not publicly available, as is currently the case with Japanese). My work may also investigate the feasibility of applying active learning when transferring from one domain of language use to another.
1 An alternate way of creating an annotated corpus is to automatically label the pool of unlabeled examples and then have a human correct the machine-labeled examples. Active learning fits naturally into this approach, since it usually generates predictions for the unlabeled examples in choosing which examples to have a person label.
2 Data selection in passive learning is known as random sampling, since the examples are arbitrarily selected, at least from the point of view of the learner.


The simplest way of imagining how active learning can be applied to statistical translation is:

- Rather than creating a parallel corpus by randomly selecting sentences to be translated, a translator will instead be selectively queried to translate the sentences that will be the most informative to the statistical translation model.
- This should reduce the cost of achieving translation of a certain quality for a new language pair, or for a new domain.

However, there may be a number of subtler ways that active learning can be applied to translation; for instance, in the explicit annotation of word-to-word mappings between a pair of parallel sentences. The remainder of this report is organized as follows: Chapter 2 reviews the theory of active learning, giving examples of how it has been applied to other NLP tasks. Chapter 3 gives an overview of statistical models of translation, describing how they are trained from data. Chapter 4 contains my proposals for a number of ways to apply active learning to statistical translation.

Chapter 2 Active Learning
Active learning is a machine learning paradigm wherein a statistical learner is used to iteratively select its own training examples. The learner chooses which examples will be labeled by a human annotator. Through selective sampling of the pool of unlabeled training examples, a learner is able to choose those examples which will be the most informative. Active learning therefore reduces the amount of training data required to achieve a certain level of performance, when compared to randomly selecting the examples that comprise the training material. In this chapter I review the theoretical motivation of sample selection for active learning, discuss types of sample selection, and show how they have been applied to two NLP applications.

2.1 Theory

Cohn et al. (1996) show that there is a statistically optimal way to select training data: systematically reducing the expected error rate of a statistical learner. The expected error rate over unseen data can be written as

$$\int_X E\big[(\hat{y}(x; \mathcal{D}) - y(x))^2 \mid x\big] \, P(x) \, dx$$

where $\mathcal{D}$ is the set of labeled training examples; $x$ is an unlabeled example and $y(x)$ is its label; and $\hat{y}(x; \mathcal{D})$ is the predicted label for $x$ given $\mathcal{D}$. The expectation is integrated over the set $X$ of unlabeled examples, and weighted by the probability $P(x)$ of each example. The equation therefore computes the average expected error over the unlabeled input data for a model trained on a particular set of data. The expected error rate may be decomposed into three terms:



$$E\big[(\hat{y}(x; \mathcal{D}) - y(x))^2 \mid x\big] = E\big[(y(x) - E[y \mid x])^2\big] \qquad (2.1)$$
$$\qquad\qquad + \big(E[\hat{y}(x; \mathcal{D})] - E[y \mid x]\big)^2 \qquad (2.2)$$
$$\qquad\qquad + E\big[(\hat{y}(x; \mathcal{D}) - E[\hat{y}(x; \mathcal{D})])^2\big] \qquad (2.3)$$

The first term (2.1) represents the noise that is inherent in the distribution that we are trying to model. Noise is the difference between a label and its expectation given the rest of the data. The second term (2.2) is the bias of the learner, measured as the difference between the learner's expected predictions and the (unbiased) expected label for an example. The final term (2.3) is the variance of the learner. Variance is an average that indicates how much a learner's predictions vary from its expected prediction for examples. Reducing the error contributed by any of these three terms will reduce the overall expected error rate of a learner. Principled ways of doing active learning select examples in order to reduce the expected error rate of a system. Noise in the training data doesn't depend on the learner, and so cannot be reduced through active learning. Selection techniques which minimize bias or variance can be formulated, however. Cohn et al. (1996) show how to formulate variance-reducing selection techniques in order to train neural networks, mixtures of Gaussians, and locally weighted regression learning algorithms. Note that these methods assume that the learner's bias is negligible. Cohn (1995) shows ways of using queries in order to minimize bias. Ideally bias and variance could be jointly minimized.
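The decomposition can be checked numerically. The following Monte-Carlo sketch is my own toy example, not from the proposal: the "learner" simply predicts the mean label of its training set regardless of the input, so at the chosen query point the bias term dominates and the variance term is small.

```python
import random
import statistics

# Toy Monte-Carlo illustration of the noise / bias / variance decomposition.
random.seed(0)
f = lambda x: 2.0 * x          # true regression function, E[y|x] = f(x)
noise_sd = 0.5                 # standard deviation of the label noise
x0 = 0.5                       # query point at which we decompose the error

def train_and_predict(n=20):
    # A deliberately biased learner: it predicts the mean label of its
    # training set at every x, so it has high bias but low variance at x0.
    labels = [f(random.uniform(0, 2)) + random.gauss(0, noise_sd)
              for _ in range(n)]
    return statistics.mean(labels)

predictions = [train_and_predict() for _ in range(2000)]

noise = noise_sd ** 2                                   # term (2.1)
bias_sq = (statistics.mean(predictions) - f(x0)) ** 2   # term (2.2)
variance = statistics.pvariance(predictions)            # term (2.3)
print(f"noise={noise:.3f} bias^2={bias_sq:.3f} variance={variance:.3f}")
```

Swapping in a learner that actually uses x would shrink the bias term; active learning, by contrast, attacks the variance term through the choice of training data.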

2.2 Sample selection

The primary focus of active learning is to reduce the number of training examples that need to be manually labeled. This is accomplished by selectively sampling a pool of unlabeled data. The goal of selective sampling is to label only the most informative examples, and to save time and effort by neglecting examples which provide redundant information. Cohn et al.'s (1996) formulation of how to reduce the expected error rate provides a statistical grounding for selective sampling: the informativeness of unlabeled examples is measured by how much they reduce the bias or the variance of a learner. By and large, variance-reduction techniques are used for selective sampling in active learning. Two types of variance-reducing techniques are:


- Committee-based methods – wherein an ensemble of different learners all posit labels for an example, and the human annotator is queried on examples that they disagree on. The disagreement between the different learners is caused by variance in how they predict the label for an example.

- Certainty-based methods – which select examples for the human to label when the model has low confidence in its own labeling of the examples. Certainty is generally quantified by measuring the variance in the n-best labels predicted by a learner.

There may be ways of reducing a learner's variance that do not fall into these categories, and there are approaches that attempt to reduce a learner's bias, but committee-based and certainty-based sampling are the two main selection methods practiced in active learning.

2.3 Committee-based selection methods

Committee-based selection techniques measure the informativeness of unlabeled examples by having a number of learners posit labels for them, and then measuring the variance, or disagreement, in those labels. When the learners agree on the label for an example they are more likely to be correct, and therefore having a human annotator label that example would provide redundant information. Having a human annotator label examples which the learners disagree on will resolve the disagreement and add useful information. The idea of using multiple learners in order to select examples for labeling is known as query by committee, and was formulated in Seung et al. (1992). Seung et al. decide which example to query the annotator with based on the principle of maximal disagreement, which selects the examples that the committee members disagree about the most. Seung et al. consider the toy binary classification problem of a high-low game (where a classifier tries to guess a number and receives answers like "higher", "lower", or "that's it"). Their query by committee algorithm uses a set containing an even number of learners that all apply labels to an unlabeled example. By selecting inputs that are classified as positive by half of the committee and negative by the other half, the error rate decreases exponentially with the number of queries. This is in marked contrast to randomly chosen inputs, for which the error rate decreases at a comparatively slow rate. Subsequent work shows how the


query by committee algorithm can be used in order to perform selective sampling with the express purpose of accelerating the rate of learning. Freund et al. (1997) prove that the exponential decrease in error rate is guaranteed for a general class of learning problems. Freund et al. further prove, more generally, that error rates are guaranteed to decrease rapidly with the number of queries performed if the examples selected by the query by committee algorithm have a high expected information gain. Entropy can be used to quantify the potential for information gain by measuring the disagreement about the labeling of an example (as described in Equation 2.4). Engelson and Dagan (1996) describe a general procedure that one can use to perform committee-based sampling:

1. For each example e in a batch B of n unlabeled examples:
   (a) Use k models trained from the previously labeled examples to classify e, giving labels {l1, l2, ..., lk}.
   (b) Measure the disagreement De for e over {l1, ..., lk}.
2. Select for annotation the m examples from B with the highest De.
3. Add the human-labeled examples to the pool of training examples.
4. Retrain the k models on the augmented set of training examples.

This procedure is repeated sequentially for successive batches of n examples, returning to the start of the corpus at the end. If n is equal to the size of the corpus, the procedure selects the m globally best examples in the corpus of unlabeled examples at each stage. On the other hand, as n gets closer to one, the processing happens in a more sequential fashion. An important consideration that is neglected in the above procedure is how disagreement should be measured. Engelson and Dagan (1996) use entropy as a measure of disagreement for committees containing more than two learners. Vote entropy is the entropy of the distribution of the labels assigned to an example ('voted for') by the k committee members. Vote entropy (VE) can be calculated as
$$VE = -\frac{1}{\log k} \sum_{l} \frac{V(l, e)}{k} \log \frac{V(l, e)}{k} \qquad (2.4)$$

where l is a label, e is an unlabeled example, and V (l, e) is the number of votes for a label on an example by the committee members. Dividing by log k normalizes the scale for the number of committee members. Vote entropy is maximized when all committee members disagree, and is zero when they all agree.
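Equation 2.4 and the selection step of the Engelson and Dagan procedure translate directly into code. This is an illustrative sketch (assuming k ≥ 2 committee members); `committee` stands in for the k trained models:

```python
import math
from collections import Counter

def vote_entropy(labels):
    """Normalized vote entropy of the labels assigned to one example by the
    k committee members (Equation 2.4). Returns 0 when all members agree
    and 1 at maximal disagreement. Assumes k >= 2."""
    k = len(labels)
    ve = -sum((v / k) * math.log(v / k) for v in Counter(labels).values())
    return ve / math.log(k)   # dividing by log k normalizes for committee size

def select_for_annotation(batch, committee, m):
    """Pick the m examples the committee disagrees about most (step 2 of
    the procedure). `committee` is a list of classifiers, each mapping an
    example to a label."""
    scored = [(vote_entropy([clf(e) for clf in committee]), e) for e in batch]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [e for _, e in scored[:m]]
```

For example, a committee of two threshold classifiers that disagree only on a narrow band of inputs would cause exactly that band to be sent to the annotator.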


2.3.1 Applied to NLP

Committee-based active learning has recently been applied to the NLP task of parse selection (Baldridge and Osborne 2003). Baldridge and Osborne use two methods to create the learners in their committee – they train a single type of learning algorithm on two different feature sets extracted from the training data, and they train two different types of learning algorithms on the data as a whole. Baldridge and Osborne achieve a 60% reduction in the amount of training material without any degradation in the overall performance of the system.

2.3.2 Related topics

Committee-based methods for sample selection are related to ensemble learning (Dietterich 2000) and co-training (Blum and Mitchell 1998; Abney 2002). Concerns that come into play for these other topics, such as the degree of independence between models and their accuracy rates, are also significant to the effectiveness of active learning.

2.4 Certainty-based selection methods

Whereas committee-based methods use disagreement among a number of learners as a measurement of the potential informativeness of an example, certainty-based methods choose examples based on the "confidence" of a single learner. If a learner can confidently predict whether its predicted label is correct, then we can limit the amount of work that a human annotator has to do by ignoring those examples that the learner has high confidence about. By instead focusing on the low-confidence examples, we would be able to add informative new training examples. The challenge in using a certainty-based selection method is to develop a measure of a learner's confidence in its predictions.

2.4.1 An example from NLP

Hwa (2000) uses certainty-based sample selection for learning grammars from the Penn Treebank data set. Hwa equates a learner’s uncertainty about its labeling of an example with the training utility value of that example. If the learner can identify a subset of examples with a high training utility value from the pool of unlabeled data,


then the human annotator would not need to spend time labeling uninformative examples. A general procedure for performing confidence-based active learning would be:

1. Train a model from the previously labeled examples.
2. For each example in the batch of N unlabeled examples:
   (a) Have the model generate a label (or set of labels) for the example.
   (b) Estimate the model's confidence in its labeling.
3. Select the m examples with the lowest confidence from the batch to be annotated.
4. Add the newly labeled examples to the training data.
5. Repeat until the unlabeled examples are exhausted or the annotator stops.

For statistical grammars the labels that the model applies are parse trees. When a statistical grammar parses a sentence it generates a set of possible trees and associates a likelihood value with each tree. Typically the most likely tree is taken to be the best parse, but Hwa uses the distribution of probabilities over all possible trees in order to measure the model's uncertainty about its labeling. A uniform distribution would signify that the model is assigning equal weight to each of the possible trees, and is therefore maximally uncertain about which tree is the best parse. A spiked distribution would indicate that the model is certain that one particular tree is the correct parse. Hwa measures the certainty of a distribution over trees using entropy. Tree entropy (TE) is calculated as

$$TE(s, G) = -\sum_{t \in T} \frac{P(t \mid G)}{P(s \mid G)} \log \frac{P(t \mid G)}{P(s \mid G)}$$

where s is a sentence, G is the model of the grammar, and T is the set of parse trees assigned to s by G. Sample selection using tree entropy reduces the number of training examples that are needed to achieve a certain level of parse accuracy by 36% when compared to randomly sampled training examples.

Active learning has not been applied to statistical machine translation to date. In the next chapter I review statistical machine translation, showing how it can be thought of as either a supervised or an unsupervised learning task. In Chapter 4 I discuss my proposals on how to apply active learning to the task.
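Hwa's measure is simple to compute once a parser exposes the probabilities of its candidate trees. A minimal sketch (the probability lists below are invented for illustration):

```python
import math

def tree_entropy(parse_probs):
    """Tree entropy of a sentence (Hwa 2000). `parse_probs` holds the
    likelihoods P(t|G) of the candidate parse trees; the sentence
    probability P(s|G) is their sum, so the entropy is taken over the
    normalized distribution P(t|G) / P(s|G)."""
    p_s = sum(parse_probs)
    return -sum((p / p_s) * math.log(p / p_s) for p in parse_probs if p > 0)

# A near-uniform distribution over parses signals maximal uncertainty,
# while a spiked one signals confidence:
uncertain = tree_entropy([0.25, 0.25, 0.25, 0.25])   # = log 4, the maximum
confident = tree_entropy([0.97, 0.01, 0.01, 0.01])
```

Sentences whose `tree_entropy` is highest would be the ones sent to the annotator in step 3 of the procedure above.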

Chapter 3 Statistical Translation
In the early 1990s the increasing availability of machine-readable bilingual corpora, such as the proceedings of the Canadian Parliament, which are published in both French and English, led to the investigation of ways of extracting useful linguistic information from them. Building on research into automatically aligning sentences with their translations in a bilingual corpus (such as Gale and Church (1993)), IBM researchers developed a statistical technique for automatically learning translation models using parallel corpora as training data (Brown et al. 1993). In statistical translation every English string e is viewed as a possible translation of a French string f. A statistical translation model can assign a probability P(e|f) that indicates the likelihood that e is a translation of f.

3.1 As a supervised learning problem

Statistical translation uses machine learning algorithms to construct statistical models of translation. Machine learning can be divided into two types: supervised learning and unsupervised learning. The distinction is that supervised learning problems require labeled training data, and unsupervised learning problems do not. Unsupervised learning tasks in NLP can use raw text to achieve their goals. For example, word clustering groups words with similar meanings by looking at the contexts that they appear in. Supervised learning tasks in NLP require annotated corpora rather than raw text. Tasks such as part-of-speech tagging and document classification require that words or texts be labeled with a category. Other supervised learning tasks require more complex labels. For example, statistical parsing requires sentences labeled with their parse trees. Hwa (2000) points out that for most NLP classification problems


Examples | Labels
Eaux de la Communauté européenne | European Community waters
Les dépenses par élève | Expenditure per pupil
Données provisoires | Provisional figures
Le recours est rejeté comme manifestement irrecevable | The action is dismissed as manifestly inadmissible
La République française supportera ses propres dépens | France was ordered to bear its own costs
Production domestique exprimée en pourcentage de l'utilisation domestique | Domestic output as a % of domestic use
Seulement l'industrie | Industry only
... | ...

Figure 3.1: A parallel corpus can be thought of as labeled training data.

the number of possible labels is relatively small, whereas in parsing the number of potential parse trees is exponential with respect to the length of the sentence. Viewed in that light, the type of training data used in statistical machine translation can be thought of as labeled training data (see Figure 3.1). A sentence in the source language is labeled with a sentence in the target language. The labels in this case are even more complex than the trees used in statistical parsing, because the number of possible translations may be a function of the size of the vocabulary rather than of the length of a sentence. If we think of parallel corpora as the labeled training data for statistical machine translation, then the process of active learning could consist of selecting 'unlabeled' sentences in the source language for a translator to apply labels to. We could use statistical translation models to measure the variance in the predicted translations for a set of sentences in monolingual corpora, and try to apply various active learning techniques to select which translations would be most informative to our model.

3.2 As an unsupervised learning problem

Statistical translation might also be thought of as an unsupervised learning problem, because the likelihood of a translation is calculated in terms of hidden word-level alignments between two sentences. Brown et al. (1993) calculate the conditional probability


that an English sentence is the translation of a French sentence by examining the possible alignments between the words in the two sentences, as shown in Figure 3.2. The probability of an alignment a given an English string e and a French string f is written as P(a|e, f), and can be related to the conditional probability P(f|e):

$$P(a \mid e, f) = \frac{P(a, e, f)}{P(e, f)} = \frac{P(a, f \mid e) \cdot P(e)}{P(f \mid e) \cdot P(e)} = \frac{P(a, f \mid e)}{P(f \mid e)}$$

Summing both sides over all alignments, and noting that the probabilities P(a|e, f) must sum to one, gives

$$P(f \mid e) = \sum_a P(a, f \mid e) \qquad (3.1)$$

Thus the probability that e is a translation of f is equivalent to the sum of the probabilities of all possible word-level alignments between f and e. Since word-level alignments are not given in the bilingual training data used in statistical translation, it can be thought of as an unsupervised task. The task is to learn the word-level alignments for a parallel corpus, using no explicit labels.
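Equation 3.1 can be checked by brute-force enumeration on a toy sentence pair. The sketch below uses the simplest possible alignment probability, a bare product of word-translation probabilities with no fertility or distortion terms (and no NULL word), and the table `t` is invented for the example:

```python
from itertools import product

def p_f_given_e(f_words, e_words, t):
    """Marginalize P(a, f | e) over every word-level alignment a, as in
    Equation 3.1. Here P(a, f | e) is simplified to a product of
    word-translation probabilities t(f_j | e_{a_j}); real models add
    fertility and distortion terms and avoid explicit enumeration."""
    total = 0.0
    # Each French position j aligns to some English position a_j.
    for a in product(range(len(e_words)), repeat=len(f_words)):
        p = 1.0
        for j, i in enumerate(a):
            p *= t.get((f_words[j], e_words[i]), 0.0)
        total += p
    return total

t = {("la", "the"): 0.9, ("la", "house"): 0.1,
     ("maison", "house"): 0.8, ("maison", "the"): 0.2}
prob = p_f_given_e(["la", "maison"], ["the", "house"], t)
```

The enumeration visits all four alignments of the two French words to the two English words; for this simplified model the sum factorizes, which makes the toy result easy to verify by hand.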

3.2.1 Parameters of word-level alignments

In order to learn the word-level alignments between sentences in a parallel corpus, Brown et al. (1993) parameterize them into several components, and estimate the probabilities for each of the components. The probability of a word-level alignment is computed by decomposing it into three smaller pieces:

- The translation probability t(f_j | e_i) is the probability that a French word f_j is the translation of an English word e_i.
- The fertility probability n(φ_i | e_i) is the probability that a word e_i will expand into φ_i words in the target language.
- The distortion probability d(p_i | i, l, m) is the probability that a target position p_i will be chosen for a word, given the index i of the English word that it was translated from, the length l of the English source string, and the length m of the French target string.


Figure 3.2: A word-level alignment between a French and an English sentence. (The figure aligns the English sentence "Those people have grown up, lived and worked many years in a farming district." with the French sentence "Ces gens ont grandi, vécu et oeuvré des dizaines d'années dans le domaine agricole.")

The probability of an alignment of one sentence with another is:

P(a, f|e) = ∏_{i=1}^{l} n(φ_i|e_i) · ∏_{j=1}^{m} t(f_j|e_{a_j}) · ∏_{j=1}^{m} d(j|a_j, l, m)    (3.2)

Having defined the probability of an alignment allows the probability of a translation to be calculated, as shown in Equation 3.1. Note that Equation 3.1 allows the probability of an alignment to be formulated in any way, and that Equation 3.2 is only one possible way of doing it.
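To make the decomposition concrete, here is a minimal sketch of scoring one alignment under Equation 3.2. The toy `t`, `n`, and `d` tables and all of their values are invented for illustration, not estimated from data:

```python
# Illustrative toy parameter tables (hypothetical values, not from a real model).
t = {("maison", "house"): 0.8, ("bleue", "blue"): 0.7}   # translation probabilities
n = {(1, "house"): 0.9, (1, "blue"): 0.85}               # fertility probabilities
d = {}                                                    # distortion, defaulting to uniform

def p_alignment(f_words, e_words, a):
    """Score P(a, f | e) as in Equation 3.2.

    a[j] gives the index of the English word that generated French word j."""
    l, m = len(e_words), len(f_words)
    fert = [a.count(i) for i in range(l)]                 # fertility of each English word
    p = 1.0
    for i, e in enumerate(e_words):                       # fertility terms
        p *= n.get((fert[i], e), 0.1)
    for j, f in enumerate(f_words):                       # translation and distortion terms
        p *= t.get((f, e_words[a[j]]), 0.01)
        p *= d.get((j, a[j], l, m), 1.0 / m)
    return p

# "maison bleue" aligned word-for-word with "blue house" (a crossing alignment).
print(p_alignment(["maison", "bleue"], ["blue", "house"], a=[1, 0]))
```

The back-off values used when a parameter is missing from a table (0.1, 0.01, and uniform distortion) are arbitrary choices for the sketch; a real model would have estimates for every parameter.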

3.2.2 Parameter estimation

The t, n, and d probabilities would be simple to calculate if there were a bilingual corpus which was labeled with word-level alignments like the one in Figure 3.2. The number of times the French word "maison" was aligned with "house", divided by the total number of times "house" occurred, would give t(maison | house). Similar estimates could be made for the other parameters. However, since bilingual corpora aligned on the word level do not exist, such supervised learning is not possible. The convention is instead to use expectation maximization (EM) to address the problem of recovering word-level alignments from sentence-aligned corpora. A variety of EM algorithms are commonly used for estimating hidden variables, such as word-level alignments.

EM searches for the maximally likely parameter assignments by trying to minimize the perplexity of a model. Perplexity is a measure of the "goodness" of a model: the assumption is that a good translation model will assign a high P(f|e) to all sentence pairs in some held-out bilingual test data. We can measure the cumulative probability assigned by any given model by taking the product of the probabilities that it assigns to all of the sentence pairs in the test data. Comparing models is then simply a matter of comparing the resulting products: a model which assigns a higher probability to the test data is better than a model which assigns a lower probability.

However, despite the fact that EM is guaranteed to improve the likelihood of a model on each iteration, the algorithm is not guaranteed to find a globally optimal solution. Since there are so many factors contributing towards P(a, f|e), and because those factors are equally weighted, it would be easy for EM to work towards a suboptimal local maximum. For example, EM could work towards optimizing an imagined correspondence of d positions in source and target sentences rather than actually optimizing the translation probabilities for words in the t table. A search path that optimized some aspect of d while neglecting t would reach a local optimum, but would not reach a suitable parameter set for translation. In situations where there is a large search space, such as this one, the EM algorithm is greatly affected by its initial starting parameters. To address this search problem Brown et al. first train a simpler model to find sensible estimates for the t table, and then use those values to prime the parameters of incrementally more complex models which estimate d and n.
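To illustrate the EM procedure itself, the following is a hedged sketch of parameter estimation for the simplest of the Brown et al. models (IBM Model 1, which has only the t table). The two-sentence corpus and the uniform initialisation are invented for the example:

```python
from collections import defaultdict

# A toy sentence-aligned corpus (invented for illustration).
corpus = [("la maison".split(), "the house".split()),
          ("la fleur".split(), "the flower".split())]

# Initialise t(f|e) uniformly.
t = defaultdict(lambda: 0.25)

for iteration in range(10):
    count = defaultdict(float)   # expected counts of (f, e) links
    total = defaultdict(float)   # expected counts of e
    # E-step: collect expected counts under the current t table.
    for f_sent, e_sent in corpus:
        for f in f_sent:
            norm = sum(t[(f, e)] for e in e_sent)
            for e in e_sent:
                p = t[(f, e)] / norm
                count[(f, e)] += p
                total[e] += p
    # M-step: re-estimate t from the expected counts.
    for (f, e), c in count.items():
        t[(f, e)] = c / total[e]

# Over the iterations "la" gravitates towards "the", because it co-occurs
# with "the" in both sentence pairs while "maison" and "fleur" do not.
print(round(t[("la", "the")], 3))
```

Model 1 has a single global optimum, so this loop is well behaved; the local-maxima problem described above arises once fertility and distortion parameters are added.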
An alternative to this approach might be to use an EM variant which combines unlabeled and labeled data, and explicitly label a number of word-level alignments in the training set. It has been shown that labeled training data provides exponentially more information than unlabeled data in certain settings (Castelli and Cover 1995). This may explain the slow convergence rate of the translation quality graphed in Figure 1.1. If we are combining explicitly labeled data with unlabeled data for word-level alignments, then we could employ active learning to select which sentences would be most valuable to have a human annotator align. We would not need to align everything by hand; we could use EM to fill in the alignments for the remaining sentences in the parallel corpus. McCallum and Nigam (1998) show this to be an effective technique for active learning for text classification.
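A minimal sketch of how such a combination might look, extending the Model 1 style of estimation: hand-aligned pairs contribute exact counts, while unlabeled pairs contribute expected counts. The corpus and the gold alignment are invented for the example:

```python
from collections import defaultdict

# Unlabeled pairs get expected counts; labeled pairs contribute exact counts.
unlabeled = [("la fleur".split(), "the flower".split())]
# A hand-aligned pair: a[j] = index of the English word for French word j.
labeled = [("la maison".split(), "the house".split(), [0, 1])]

t = defaultdict(lambda: 0.5)
for _ in range(5):
    count, total = defaultdict(float), defaultdict(float)
    for f_sent, e_sent, a in labeled:           # supervised counts (weight 1)
        for j, f in enumerate(f_sent):
            count[(f, e_sent[a[j]])] += 1.0
            total[e_sent[a[j]]] += 1.0
    for f_sent, e_sent in unlabeled:            # expected counts from the model
        for f in f_sent:
            norm = sum(t[(f, e)] for e in e_sent)
            for e in e_sent:
                count[(f, e)] += t[(f, e)] / norm
                total[e] += t[(f, e)] / norm
    for (f, e), c in count.items():
        t[(f, e)] = c / total[e]

print(round(t[("la", "the")], 2))
```

The equal weighting of labeled and unlabeled counts here is the simplest choice; a weighting scheme between the two sources of evidence would be one of the design questions for the real system.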


3.2.3 Word alignment accuracy

A workshop at the NAACL conference this year had a shared task in which participants tried to develop systems which would produce the best word-alignment accuracies. Mihalcea and Pedersen (2003) give an overview of the task:

• Two language pairs were evaluated, French-English and Romanian-English.

• Parallel corpora were provided so that systems could be compared directly. The French-English corpus was a section of the Canadian Hansards consisting of approximately 20 million words in both languages. The Romanian-English corpus was drawn from various sources and consisted of only one million words.

• A separate evaluation was performed which allowed teams to use resources beyond those that were provided.

• A dozen systems were evaluated for each task on their precision and recall against human-annotated alignments, and were given f-scores and alignment error rate scores.

The alignment accuracy for the French-English data was significantly higher than for the Romanian-English data: the highest f-score for French-English was 80.5%, whereas for Romanian-English it was 71.1%. Mihalcea and Pedersen attribute the performance gap to the difference in the amount of data available for the language pairs. None of the teams attempted to enhance their systems with hand-aligned material.
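The shared-task metrics can be computed directly from sets of predicted and gold alignment links. A sketch using the standard definitions, simplified by collapsing the sure/possible distinction of the full evaluation into a single gold set:

```python
def alignment_scores(predicted, gold):
    """Precision, recall, f-score and alignment error rate over sets of (j, i) links."""
    correct = predicted & gold
    precision = len(correct) / len(predicted)
    recall = len(correct) / len(gold)
    f = 2 * precision * recall / (precision + recall)
    # With a single gold set, AER reduces to 1 - F.
    aer = 1 - 2 * len(correct) / (len(predicted) + len(gold))
    return precision, recall, f, aer

# Invented example: three predicted links, two of which match the gold standard.
pred = {(0, 0), (1, 2), (2, 1)}
gold = {(0, 0), (1, 1), (2, 1)}
p, r, f, aer = alignment_scores(pred, gold)
print(p, r, f, aer)   # 2 of 3 links correct: precision = recall = f ≈ 0.667, AER ≈ 0.333
```

In the full metric the gold standard distinguishes sure from possible links, and precision is computed against the possible set while recall is computed against the sure set.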

3.3 Phrase-based translation models

Almost all statistical translation approaches rely on the Brown et al. (1993) word-level alignments as a starting point. More sophisticated models such as the alignment-template method (Och et al. 1999) and phrase-based translation (Koehn et al. 2003) use the Brown et al. word-level alignments to bootstrap more accurate phrase-level alignments. For an example of how phrases are extracted from word-level alignments see Knight and Koehn (2003). Since phrase-based translation models rely on word-based predictions, improving word-level alignments ought to in turn improve the more sophisticated translation models.
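The core of phrase extraction is a consistency check: a candidate phrase pair may be extracted only if no alignment link crosses its boundaries. A simplified sketch of that check (the example sentences and links are invented):

```python
def consistent(alignment, f_span, e_span):
    """A candidate phrase pair (f_span, e_span) is consistent with the word
    alignment if every link touching either span falls entirely inside both."""
    f_lo, f_hi = f_span
    e_lo, e_hi = e_span
    inside = False
    for (j, i) in alignment:          # links as (French index, English index)
        f_in = f_lo <= j <= f_hi
        e_in = e_lo <= i <= e_hi
        if f_in != e_in:              # a link crosses the phrase boundary
            return False
        inside = inside or (f_in and e_in)
    return inside                     # require at least one link inside

# "maison bleue" / "blue house" with crossing links (0,1) and (1,0).
links = {(0, 1), (1, 0)}
print(consistent(links, (0, 1), (0, 1)))  # the whole pair is consistent: True
print(consistent(links, (0, 0), (0, 0)))  # "maison"/"blue": False, a link goes outside
```

A full extractor would enumerate all span pairs and keep the consistent ones; this sketch shows only the criterion that makes extraction sensitive to word-alignment quality.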

Chapter 4 Research Goals
The aim of my research is to develop and test methods for reducing the amount of training data that is required for statistical machine translation to achieve a certain level of translation quality. My work will have implications for the practicality of using statistical translation with language pairs for which extensive parallel corpora are not readily available. Previous work (Germann 2001) has examined the issue of manually creating parallel corpora for low density languages, and found the cost to be prohibitively high if a large corpus is required.

My work will investigate ways of employing active learning to reduce the cost of manually creating training data for statistical translation. Rather than randomly selecting sentences for a translator to translate, I will test the hypothesis that selective sampling of a monolingual corpus can lead to a faster rate of learning, and therefore will allow statistical translation to be applied more cheaply. I will further investigate the feasibility of creating other types of training data for statistical translation, expanding beyond the realm of sentence-aligned parallel corpora into the area of word-aligned parallel corpora. The application of active learning to translation, and the treatment of word alignment as a supervised learning task, are both novel pursuits.

In this chapter I detail my plan for investigating the topic of applying active learning to statistical translation. Section 4.1 describes the ways that I will approach using active learning to do selective sampling for translation. Section 4.1.4 describes how I will go about evaluating these methods. Section 4.2 describes how I will approach selective sampling for word-level alignments.


Figure 4.1: Co-training using German, French, and Spanish sources to produce English machine translations. (The figure illustrates four steps: (1) translation models are trained from French-English, German-English, and Spanish-English bilingual corpora; (2) each model translates its side of a German-French-Spanish parallel corpus into English, e.g. "Maison bleu" → "Blue maison", "blaues Haus" → "blaues House", "Casa azul" → "Blue house"; (3) the best candidate translation, "Blue house", is selected; (4) the selected sentence pair is added to each of the bilingual corpora and the models are retrained.)

4.1 Selective sampling for translation

Selective sampling for translation chooses sentences from a monolingual corpus that should be translated in order to create a bilingual training corpus. Rather than randomly selecting which sentences a human translator ought to translate, this method will choose the most informative sentences. I will investigate this using both committee-based selection and confidence-based selection. Committee-based selection will approximate the informativeness of sentences by measuring the similarity between their proposed translations; dissimilar translations will be taken as an indicator of informativeness. Confidence-based selection will approximate the informativeness of sentences by creating a list of n-best translations using a single translation model; n-best lists with a high entropy will be taken to be informative.

4.1.1 Committee-based selection

In order to form a committee of translation models, I will update the approach that I used in co-training for statistical machine translation (Callison-Burch 2002; Callison-Burch and Osborne 2003). Co-training uses multiple translation models to translate a bi- or multi-lingual corpus. For example, translation models could be trained for German→English, French→English and Spanish→English from appropriate bilingual corpora, and then used to translate a German-French-Spanish parallel corpus into English. Since there are three candidate English translations for each sentence alignment, the three translation models form a committee. In co-training the best translation out of the three can be selected and used to retrain the models. The process is illustrated in Figure 4.1.

In active learning we will instead measure how much the committee members disagree about what the English translation should be. I will devise an appropriate measure of disagreement between the learners; it will probably be based on current measures of translation quality such as word error rate, position-independent word error rate, or Bleu score.

In my master's thesis I observed that misalignments between the sentences in the multi-lingual corpus that I assembled caused problems. I am now assembling a new multi-lingual corpus that contains fewer errors. The corpus will be assembled using a variant of the Gale and Church (1993) algorithm for bilingual sentence alignment that was proposed in Simard (1999). The Simard method was developed for multi-lingual alignment, so fewer errors ought to be introduced.
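As one possible candidate for such a measure, the sketch below computes the average pairwise position-independent word error between committee members' translations. This is an illustrative stand-in, not the measure I will finally adopt:

```python
from collections import Counter
from itertools import combinations

def per(hyp, ref):
    """Position-independent word error: fraction of words unmatched, ignoring order."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    matched = sum((h & r).values())
    return 1 - matched / max(len(hyp.split()), len(ref.split()))

def disagreement(candidates):
    """Average pairwise PER over all candidate translations of one source sentence."""
    pairs = list(combinations(candidates, 2))
    return sum(per(a, b) for a, b in pairs) / len(pairs)

# Three committee members translate the same (invented) source sentence.
low = disagreement(["blue house", "blue house", "the blue house"])
high = disagreement(["blue house", "house of blue", "a maison that is blue"])
print(low, high)   # more disagreement suggests a more informative sentence
```

Sentences whose committee translations diverge most would be ranked first for the human translator.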

4.1.2 Confidence-based selection

I will perform confidence-based selection for statistical translation similar to the way that Hwa (2000) does for grammar induction. Rather than measuring entropy over the distribution of possible parse trees for a sentence, I will instead measure entropy over the n-best translations of a sentence. The intuition is that uninformative sentences will have spiked distributions; that is, the likelihood of the first translation will be much higher than that of the next n translations. The sentences which will be useful to the statistical translation model are the ones where the first few translations are roughly equiprobable, and having a human translation will help disambiguate which translation is better.
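The entropy of an n-best list can be computed by normalising the model scores into a distribution; a sketch with invented scores:

```python
import math

def nbest_entropy(scores):
    """Entropy (in bits) of the distribution obtained by normalising n-best scores."""
    total = sum(scores)
    probs = [s / total for s in scores]
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A "spiked" list: the top translation dominates, so low entropy (uninformative).
print(nbest_entropy([0.90, 0.05, 0.03, 0.02]))
# A flat list: candidates are roughly equiprobable, so high entropy (informative).
print(nbest_entropy([0.28, 0.26, 0.24, 0.22]))
```

Under this measure, the sentences with the highest n-best entropy would be passed to the human translator first.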

4.1.3 Other selection methods

Selection methods need not be limited to using translation models themselves; the goal of selective sampling for translation is simply to predict which sentences might be usefully translated. It could be that a much simpler model is able to make those predictions as well as the translation model can. For instance, a language model trained on the source sentences used for a statistical translation model might be able to predict which sentences the translation model will fail on, simply by detecting the sentences which are the most unlike the ones that it has encountered in the past. The advantage of this method, if it worked, is that it is much simpler to retrain language models than it is to retrain translation models. Selection could therefore be done incrementally with a human translator, without long lags between choosing one example and the next.
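As a sketch of this idea, a simple add-one-smoothed unigram language model could rank candidate sentences by their average per-word surprisal under the existing source-side data. All sentences and data here are invented for illustration:

```python
import math
from collections import Counter

# Source-side sentences the translation model was trained on (toy data).
seen = ["the house is blue", "the flower is red"]
counts = Counter(w for s in seen for w in s.split())
total = sum(counts.values())
vocab = len(counts) + 1   # +1 for the unknown word

def surprise(sentence):
    """Average negative log-probability per word under an add-one unigram model."""
    words = sentence.split()
    return sum(-math.log2((counts[w] + 1) / (total + vocab)) for w in words) / len(words)

pool = ["the house is red", "parliament adjourned the debate"]
# The most "surprising" sentence is selected for translation first.
ranked = sorted(pool, key=surprise, reverse=True)
print(ranked[0])
```

A real implementation would use a higher-order n-gram model, but the ranking-by-novelty principle is the same, and the unigram counts can be updated instantly as new translations arrive.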


4.1.4 Evaluating selection methods

I will conduct simulated active learning experiments for the translation selection algorithms. I will use a parallel corpus and treat the source sentences as unlabeled by hiding the target sentences. Revealing the existing translation will simulate a human translator translating the selected sentences.

In order to gauge whether the active learning techniques are an improvement over other methods for collecting training data for machine translation, it is important to formulate an evaluation metric. To determine whether the selective sampling methods for translation are effective, I will compare the performance of translation models created from the actively selected examples against models trained on examples selected in baseline experiments. This comparison will be graphed in a learning curve similar to Figure 1.1. If the rate of learning is better on the translations selected through active learning, then we will have achieved the same performance with fewer translations.

The baseline selection methods that I will use are random selection, length-based selection and vocabulary-based selection. Each of these represents a simple way of selecting data for machine translation. Random selection is what is currently done. Vocabulary-based selection will seek to choose examples which contain unknown vocabulary and therefore expand a system's word coverage. Length-based selection may serve to enhance a model's word-ordering predictions.
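The simulation can be organised as a loop in which the hidden reference translations stand in for the human translator. A schematic sketch, in which `train`, `score`, and the dummy stand-ins are placeholders for a real SMT system and evaluation metric:

```python
import random

def simulate(parallel_corpus, select, batch_size=100, rounds=10):
    """Simulated active learning: 'translate' selected sentences by revealing
    their hidden references, retrain, and record a learning curve."""
    pool = list(parallel_corpus)      # (source, hidden reference) pairs
    labeled, curve = [], []
    for _ in range(rounds):
        model = train(labeled)                        # retrain on translated data
        batch = select(model, [s for s, _ in pool])   # choose informative sources
        chosen = [(s, r) for s, r in pool if s in set(batch)][:batch_size]
        labeled.extend(chosen)                        # reveal references = "translate"
        pool = [p for p in pool if p not in set(chosen)]
        curve.append(score(model))                    # e.g. Bleu on held-out data
    return curve

# Dummy stand-ins so the sketch runs; real experiments would use a real SMT system.
def train(labeled):
    return {"size": len(labeled)}
def score(model):
    return model["size"]            # pretend quality grows with training data
def random_select(model, sources):  # the random-selection baseline
    return random.sample(sources, min(2, len(sources)))

corpus = [(f"src {i}", f"ref {i}") for i in range(10)]
print(simulate(corpus, random_select, batch_size=2, rounds=3))  # → [0, 2, 4]
```

Each selection strategy plugs in as a different `select` function, and the resulting curves are compared against the random baseline.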

4.2 Selective sampling for word-level alignments

Selective sampling for word-level alignments chooses examples from a bilingual corpus that should be hand aligned. The idea behind using active learning for this task is to treat learning the word-level alignments of MT as a partially supervised task. Parallel corpora aligned on the word level do not currently exist because they would (seemingly) be extremely time consuming to create. I will try to reduce the amount of human effort that has to be spent hand aligning a parallel corpus. I have already built an annotation tool for word-level alignments which allows a person to manually correct the alignments made by a translation model.1 In order to adapt this tool so that the user is only presented with those examples which would be the most informative to have annotated, I will need to develop a selection method. The method that I will try is a confidence-based measure which examines the variation in the n-best predictions of the translation model for the word-level alignments.

1 I have built this tool in conjunction with Joshua Schroeder of Linear B Ltd.

The evaluation of selective sampling methods for word-level alignments will be more complicated than the evaluation of selective sampling methods for translation. This is because only very small corpora of word-level alignments currently exist, so performing simulation experiments will be difficult.

4.3 Considerations for non-simulated settings

Active learning has generally been applied in simulated settings. Because at least one of my tasks will be a non-simulated experiment, I will consider the differences between active learning conducted in simulated and real-world settings. A number of new factors would come into play:

• The time it takes to re-estimate parameters is significant. If we need to spend hours or days re-estimating the model parameters which will be used to select the next batch of examples for a person to translate, we would need to incorporate that fact into our procedure.

• If the examples selected by active learning are somehow inherently harder for a human to translate, that might diminish the returns of selecting only those examples.

• Selection methods for active learning often make predictions about what the labels for examples ought to be. These predictions could be integrated into the process by having the human annotator edit the model's predicted translations. It would be interesting to see whether editing translations is faster than producing new ones, and whether the examples selected through active learning were of lesser value than average.

• In non-simulated settings human factors, like the difficulty of the task and the usability of the interface, are important.

Bibliography
Abney, Steve. 2002. Bootstrapping. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.

Baldridge, Jason, and Miles Osborne. 2003. Active learning for HPSG parse selection. In Proceedings of the 7th Conference on Natural Language Learning (CoNLL).

Blum, Avrim, and Tom Mitchell. 1998. Combining labeled and unlabeled data with co-training. In Proceedings of the Workshop on Computational Learning Theory. Morgan Kaufmann.

Brown, Peter, Stephen Della Pietra, Vincent Della Pietra, and Robert Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics 19.263–311.

Callison-Burch, Chris. 2002. Co-training for statistical machine translation. Master's thesis, University of Edinburgh.

——, and Miles Osborne. 2003. Bootstrapping parallel corpora. In HLT-NAACL 2003 Workshop: Building and Using Parallel Texts.

Castelli, Vittorio, and Thomas M. Cover. 1995. On the exponential value of labeled samples. Pattern Recognition Letters 16.105–111.

Cohn, David. 1995. Minimizing statistical bias with queries. Technical report, Massachusetts Institute of Technology. AI Lab memo 1552.

——, Zoubin Ghahramani, and Michael I. Jordan. 1996. Active learning with statistical models. In Advances in Neural Information Processing Systems, ed. by G. Tesauro, D. Touretzky, and T. Leen, volume 7, 705–712. The MIT Press.


Dietterich, Thomas G. 2000. Ensemble methods in machine learning. Lecture Notes in Computer Science 1857.

Engelson, Sean, and Ido Dagan. 1996. Minimizing manual annotation cost in supervised training from corpora. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, 319–326, San Francisco.

Freund, Yoav, H. Sebastian Seung, Eli Shamir, and Naftali Tishby. 1997. Selective sampling using the query by committee algorithm. Machine Learning 28.133–168.

Gale, William, and Kenneth Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics 19.75–90.

Germann, Ulrich. 2001. Building a statistical machine translation system from scratch: How much bang for the buck can we expect? In ACL 2001 Workshop on Data-Driven Machine Translation, Toulouse, France.

Hwa, Rebecca. 2000. Sample selection for statistical grammar induction. In Proceedings of EMNLP/VLC-2000, 45–52.

Knight, Kevin, and Philipp Koehn. 2003. What's new in statistical machine translation. Tutorial at HLT/NAACL 2003.

Koehn, Philipp, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of HLT/NAACL.

Marcus, Mitchell P., Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics 19.

McCallum, Andrew, and Kamal Nigam. 1998. Employing EM and pool-based active learning for text classification. In Proceedings of the International Conference on Machine Learning.

Mihalcea, Rada, and Ted Pedersen. 2003. An evaluation exercise for word alignment. In HLT-NAACL 2003 Workshop: Building and Using Parallel Texts: Data Driven Machine Translation and Beyond, ed. by Rada Mihalcea and Ted Pedersen, 1–10, Edmonton, Alberta, Canada. Association for Computational Linguistics.


Oard, Doug, David Doermann, Bonnie Dorr, Daqing He, Phillip Resnik, William Byrne, Sanjeev Khudanpur, David Yarowsky, Anton Leuski, Philipp Koehn, and Kevin Knight. 2003. Desperately seeking Cebuano. In Proceedings of HLT-NAACL.

Och, Franz Josef, Christoph Tillmann, and Hermann Ney. 1999. Improved alignment models for statistical machine translation. In Joint Conference of Empirical Methods in Natural Language Processing and Very Large Corpora.

Seung, H. S., Manfred Opper, and Haim Sompolinsky. 1992. Query by committee. In Computational Learning Theory, 287–294.

Simard, Michel. 1999. Text-translation alignment: Aligning three or more versions of a text. In Parallel Text Processing, ed. by Jean Véronis. Kluwer Academic.

