PapersFromAvenue
http://www.cs.cmu.edu/~kathrin/Publications/amta02CarbonellEtAl.pdf
http://www.cs.cmu.edu/~kathrin/Publications/tmi02ProbstLevin.pdf
http://www.cs.cmu.edu/~kathrin/Publications/tmi04LavieEtAl.pdf
http://www.cs.cmu.edu/~aria/Papers/FontLlitjosRidmannFinal.pdf
http://www.cs.cmu.edu/~aria/Papers/FontLlitjosAPEW06Final.pdf
http://www.cs.cmu.edu/~aria/Papers/EAMT-2005-Font-Llitjos.pdf
http://www.cs.cmu.edu/~aria/Papers/FontAranovich_CILLA2_mapuche_quechua(2).pdf
Kathrin's technique for learning translation rules from the elicitation data is called Seeded Version Space Learning.
Is that somewhat plausibly adaptable to XDG?
The tool (Probst et al. 2001) gets alignments, it turns out. Oh snap. (Is that in the elicitation data? ...)
"From this data, the learning module of our system automatically infers hierarchical syntactic transfer rules, which encode how constituent structures in the source language (SL) transfer to the target language (TL)."
They use a statistical decoder to actually produce the translations in the rule-based systems.
They've got quite a lot of parallel Mapudungun/Spanish text, it sounds like: 150 hours of transcribed speech?
Lexical frequency seems pretty important.
Haha, awesome! Spellchecker for !OpenOffice for Mapudungun!
The rule-based MT for Mapudungun only goes ''into'' Spanish? Interesting. The system works by doing a chart parse of the input sentence, then applying all the rules to produce a chart-like thing for the target language, then handing that off to the statistical decoder.
What is this !TechBridgeWorld thing?
Functional Elicitation Corpus: tries to understand how Quechua does number, person, tense, gender. 1700 sentences.
222 different basic structures in the Structural Elicitation Corpus. "The final Structural Elicitation Corpus which was translated into Quechua had 146 Spanish sentences."
360 pages of Quechua text, scanned in.
Also, seem to have been going into Spanish for the Quechua/Spanish system.
Design and Implementation of Controlled Elicitation for Machine Translation of Low-density Languages
http://www.cs.cmu.edu/~kathrin/Publications/mts01ProbstEtAl.pdf
The idea of doing elicitation here is attributed to Sergei Nirenburg, interesting!
Early use of the word "omnivorous" to describe the system.
"In addition, we plan to develop statistical techniques for robust MT with sparse data using exponential models and joint source-channel modeling."
Instructable Rule-Based MT (iRBMT).
Elicitation: Bilingual user is asked to translate sentences ''and'' give word alignments. The learning process wants to match every pair in the set of translation examples with a transfer rule that accounts for it '''and is of an appropriate level of abstraction'''.
Automatically infer transfer rules with Version Spaces. "It assumes a hypothesis space with a partial order relation between the hypotheses."
"locally constrained seeded version space" ... "enabling active learning to explore the lattice around the seed."
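A toy sketch of what "the lattice around the seed" might look like: hypotheses ordered by generalization, where one step up replaces a single lexical constant with a variable. The tuple-of-strings encoding and `"X"` variable marker are my own illustration; the real SVS hypotheses are transfer rules with feature constraints.

```python
def generalizations(rule):
    """One-step generalizations of a seed rule: replace one lexical
    constant with the variable "X".  Toy encoding (tuple of strings);
    the real SVS operates over transfer-rule constraints."""
    out = []
    for i, tok in enumerate(rule):
        if tok != "X":
            # Each abstraction step moves one node up the lattice.
            out.append(rule[:i] + ("X",) + rule[i + 1:])
    return out
```

Active learning would then pick elicitation sentences that distinguish between neighboring hypotheses in this lattice.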
"The input to SVS learning is a controlled corpus, which we call the elicitation corpus."
"Like Boas (Nirenburg, 1998), NICE emulates the work of field linguists. However, in contrast to the Boas project, we do not expect the user to learn linguistic terminology."
"... this follows the tradition of the Swadesh List (named after the linguist Morris Swadesh)."
Dependencies in orders of picking elicitation sentences: "if a language does not mark plural, then it will also not mark dual or paucal..." (wonder if that's For Sure true? Is that just for human languages that we've seen?)
Eliciting Minimal Pairs of sentences: "the rock fell" vs "the rocks fell". We know that they're different in English, so we take a look at which word is aligned with "rock" and which with "rocks" -- are they different? In what way?
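The minimal-pair check could look something like this sketch. The data format (token lists plus a list of (source index, target index) alignment links) is my assumption, not the actual elicitation-corpus format.

```python
def minimal_pair_diff(pair_a, pair_b):
    """Compare the target words aligned to the one source word that
    differs between two minimal-pair sentences ("rock" vs "rocks").

    Each pair is (source_tokens, target_tokens, alignments), where
    alignments is a list of (src_index, tgt_index) links.
    Returns the two aligned-word lists and whether they differ,
    i.e. whether the target language appears to mark the feature."""
    src_a, tgt_a, align_a = pair_a
    src_b, tgt_b, align_b = pair_b
    # Find the single source position where the two sentences differ.
    diffs = [i for i, (a, b) in enumerate(zip(src_a, src_b)) if a != b]
    assert len(diffs) == 1, "expected a minimal pair (exactly one change)"
    i = diffs[0]
    # Collect the target words each variant aligns to at that position.
    words_a = [tgt_a[j] for (s, j) in align_a if s == i]
    words_b = [tgt_b[j] for (s, j) in align_b if s == i]
    return words_a, words_b, words_a != words_b
```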
Future work: tag each sentence with a feature vector showing what it exemplifies and have the SVS learning mechanism use this to find sentences that bear on a given hypothesis. (active learning with version spaces, really)
http://www.cs.cmu.edu/~aria/Papers/LREC-2004-Llitjos.pdf
http://www.cs.cmu.edu/~aria/Papers/LREC-2006-Monson.pdf
One of the questions we're most interested in, right now, is how to sensibly combine information from different sources into one MT system. It might turn out that AVENUE has already addressed these issues; if I recall correctly, they said in at least one paper that they have an "omnivorous" system. This paper definitely uses that word, so let me take a look at what they might mean by that...
Interestingly, they've got automatically derived rule-based systems for Hindi and Hebrew to English...
"For Quechua, we have created parallel and aligned data as well as a basic bilingual lexicon with morphological information"
"All the unique words in the spoken corpus were extracted and then ordered by frequency."
EBMT system for Mapudungun: grabs fragments and recombines them like you might expect. "For unmatched portions of the input, EBMT falls back on a probabilistic lexicon trained from the corpus to produce word-for-word translations."
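The grab-fragments-and-recombine behavior, with the lexicon fallback, could be sketched like this (greedy longest match; the phrase-table and lexicon formats are assumptions, not AVENUE's actual data structures):

```python
def ebmt_translate(tokens, phrase_table, lexicon):
    """Greedy longest-match EBMT sketch: reuse stored fragments, and
    fall back on a word-for-word probabilistic lexicon for the rest.

    phrase_table maps source-token tuples to target strings; lexicon
    maps single source words to {translation: probability}."""
    out, i = [], 0
    while i < len(tokens):
        # Try the longest stored fragment starting at position i.
        for j in range(len(tokens), i, -1):
            frag = tuple(tokens[i:j])
            if frag in phrase_table:
                out.append(phrase_table[frag])
                i = j
                break
        else:
            # No fragment matched: most probable word-for-word translation
            # (pass the word through unchanged if it's unknown).
            choices = lexicon.get(tokens[i], {tokens[i]: 1.0})
            out.append(max(choices, key=choices.get))
            i += 1
    return " ".join(out)
```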
"The MT system consists of three main components: the Mapudungun morphological analyzer discussed in section 3.2.1, the transfer system, and the Spanish morphological analyzer. Each of these programs makes use of different data bases (lexicons or grammars). The transfer system makes use of a transfer grammar and a transfer lexicon, which contain syntactic and lexical rules in order to map Mapudungun expressions into Spanish expressions."
Synchronous chart parsing and generation: "A chart is populated with all constituent structures that were created in the course of parsing the SL input with the source-side portion of the transfer grammar. Transfer and generation are performed in an integrated second stage. A dual TL chart is constructed by applying transfer and generation operations on each and every constituent entry in the SL parse chart."
The output of their system is a lattice, which is passed on to a decoder. They're not doing the generation all in one go. (see Peterson 2002)
So it sounds like their Mapudungun system is kind of like what we're building...
"The first version that was used in Peru had 1700 sentences."
"The second part, the Structural Elicitation Corpus, is a smaller corpus designed to cover the major structures present in the Penn Treebank..."
"The final Structural Elicitation Corpus which was translated into Quechua has 146 Spanish sentences."
31986 different Quechua words used in the three books. They picked out the most common 10k. 1666 types from the Elicitation Corpus were grabbed as well.
HEY There are some really nice dictionaries stored in those Excel files: VocabularioCuzquen~o.xls VocabularioCuzquenyo.xls ... but what's the difference between them?
Only 25 different translation rules? Is that enough? Probably not? ...
http://www.cs.cmu.edu/~aria/Papers/FontLlitjosVogel-SSST07.pdf
Do we take the structure and put it into a stat-mt system, or do we take something that's not stat-mt and add stats? This one does the latter, starting with Transfer-Based MT. (how does this relate to Alon's recent work on STAT-XFER?)
"The Transfer-Based system is a hybrid translation system that has integrated n-gram language model, and while its core remains untouched, additional features are used to re-score the n-best list generated by the transfer engine. Statistical alignment techniques are used to lower the burden in building a lexicon for a new domain."
... so they have a rule-based MT system, and an Automatic Rule Refiner (Font Llitjós and Ridmann 2007)...
Xfer Engine is the synchronous chart-parser thing. "Often, no parse for the entire source sentence can be found. Partial parses are concatenated sequentially to generate complete translations." (That's really clever; kind of like the CCG generation stuff by Mike White...)
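The concatenation of partial parses could be something like this dynamic program over chart spans: pick the best-scoring sequence of adjacent fragments covering the input. The (start, end, translation, score) representation is my assumption about what comes out of the chart.

```python
def concatenate_partials(n, chart_entries):
    """When no full parse covers positions 0..n, stitch partial parses
    together left to right.  chart_entries is a list of
    (start, end, translation, score) spans.  A simple DP keeps the
    best-scoring path of adjacent spans reaching each position."""
    best = {0: (0.0, [])}  # position -> (score so far, translations)
    # Spans ending at position p all start before p, so processing by
    # ascending start guarantees best[start] is final when we read it.
    for start, end, translation, score in sorted(chart_entries, key=lambda x: x[0]):
        if start in best:
            cand = (best[start][0] + score, best[start][1] + [translation])
            if end not in best or cand[0] > best[end][0]:
                best[end] = cand
    if n not in best:
        return None  # coverage gap: no sequence of partials spans the input
    return " ".join(best[n][1])
```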
"The alternatives arise from '''lexical ambiguity''' and multiple synonymous choices for lexical items in the dictionary, but also from syntactic ambiguity and multiple competing hypotheses from the grammar"
ways that they rank translations:
- fragmentation penalty
- ngram language model on the target language, considered during generation (but they're doing the generation with a lattice, I guess?)
- length model: what's the expected output length, given the input sentence length -- ratio trained from the training bitext.
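Those three signals could be combined into one log-linear rescoring function, roughly like this. The particular length-model form (absolute deviation from a trained length ratio) and the weights are illustrative assumptions; per the paper, such weights would be tuned rather than hand-set.

```python
def rescore(candidate, src_len, lm_logprob, weights, expected_ratio=1.1):
    """Sketch of combining the three ranking signals into one score.
    candidate is (translation_tokens, n_fragments); lm_logprob is the
    target-language n-gram log probability, supplied externally.
    expected_ratio stands in for the length ratio trained from bitext."""
    tokens, n_fragments = candidate
    # Fragmentation penalty: fewer, larger fragments are better.
    frag = -weights["frag"] * (n_fragments - 1)
    # Target-language n-gram model score.
    lm = weights["lm"] * lm_logprob
    # Length model: penalize deviation from the expected output length.
    expected = expected_ratio * src_len
    length = -weights["len"] * abs(len(tokens) - expected)
    return frag + lm + length
```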
"Semi-Automatic Generation of the Transfer Lexicon"
- train IBM Model 1 for English/Spanish and Spanish/English
- phrase alignment techniques to find two-word phrases.
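For reference, IBM Model 1 itself is just EM over word-translation probabilities. A minimal sketch (no NULL word, uniform initialization) of the kind of model used to seed a transfer lexicon:

```python
from collections import defaultdict

def ibm_model1(bitext, iterations=10):
    """Minimal IBM Model 1 EM.  bitext is a list of
    (source_tokens, target_tokens) pairs; returns a dict mapping
    (target_word, source_word) -> t(target | source)."""
    src_vocab = {w for s, _ in bitext for w in s}
    uniform = 1.0 / len(src_vocab)
    t = defaultdict(lambda: uniform)  # t[(tgt, src)]
    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(tgt, src)
        total = defaultdict(float)   # marginals per source word
        for src, tgt in bitext:
            for f in tgt:
                # E-step: distribute f's count over candidate alignments.
                z = sum(t[(f, e)] for e in src)
                for e in src:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalize.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return dict(t)
```

Run in both directions and intersect or grow the resulting alignments to pull out the two-word phrases mentioned above.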
"Adding Statistical Components to a Re-Ranker" You could additionally add...
- word-to-word probabilities (ie, a translation model.)
- rule probabilities (but these have to be estimated from noisy data; unless you had an aligned treebank...)
Then, you've got to do parameter tweaking: MERT (minimum error rate training), as in Och 2003.
The upshot of all of this: they have some training data, and they used it to train some sensible models, then use the models for reranking. But the core of the system seems to remain rule-based.
http://www.cs.cmu.edu/~alavie/papers/TALIP-SLE-03.pdf
CategoryPapers