PapersFromAvenue
http://www.cs.cmu.edu/~kathrin/Publications/amta02CarbonellEtAl.pdf
http://www.cs.cmu.edu/~kathrin/Publications/tmi02ProbstLevin.pdf
http://www.cs.cmu.edu/~kathrin/Publications/tmi04LavieEtAl.pdf
http://www.cs.cmu.edu/~aria/Papers/FontLlitjosRidmannFinal.pdf
http://www.cs.cmu.edu/~aria/Papers/FontLlitjosAPEW06Final.pdf
http://www.cs.cmu.edu/~aria/Papers/EAMT-2005-Font-Llitjos.pdf
http://www.cs.cmu.edu/~aria/Papers/FontAranovich_CILLA2_mapuche_quechua(2).pdf
Kathrin's technique for learning translation rules from the elicitation data is called Seeded Version Space Learning.
Is that somewhat plausibly adaptable to XDG?
The tool (Probst et al. 2001) gets alignments, it turns out. Oh snap. (Is that in the elicitation data? ...)
"From this data, the learning module of our system automatically infers hierarchical syntactic transfer rules, which encode how constituent structures in the source language (SL) transfer to the target language (TL)."
They use a statistical decoder to actually produce the translations in the rule-based systems.
They've got quite a lot of parallel Mapudungun/Spanish text, it sounds like: 150 hours of transcribed speech?
Lexical frequency seems pretty important.
Haha, awesome! Spellchecker for !OpenOffice for Mapudungun!
The rule-based MT for Mapudungun only goes ''into'' Spanish? Interesting. The system works by doing a chart parse of the input sentence, then applying all the rules to produce a chart-like thing for the target language, then handing that off to the statistical decoder.
What is this !TechBridgeWorld thing?
Functional Elicitation Corpus: tries to understand how Quechua does number, person, tense, gender. 1700 sentences.
222 different basic structures in the Structural Elicitation Corpus. "The final Structural Elicitation Corpus which was translated into Quechua had 146 Spanish sentences."
360 pages of Quechua text, scanned in.
Also, seem to have been going into Spanish for the Quechua/Spanish system.
Design and Implementation of Controlled Elicitation for Machine Translation of Low-density Languages
http://www.cs.cmu.edu/~kathrin/Publications/mts01ProbstEtAl.pdf
The idea of doing elicitation here is attributed to Sergei Nirenburg, interesting!
Early use of the word "omnivorous" to describe the system.
"In addition, we plan to develop statistical techniques for robust MT with sparse data using exponential models and joint source-channel modeling."
Instructable Rule-Based MT (iRBMT).
Elicitation: Bilingual user is asked to translate sentences ''and'' give word alignments. The learning process wants to match every pair in the set of translation examples with a transfer rule that accounts for it '''and is of an appropriate level of abstraction'''.
Automatically infer transfer rules with Version Spaces. "It assumes a hypothesis space with a partial order relation between the hypotheses."
"locally constrained seeded version space" ... "enabling active learning to explore the lattice around the seed."
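A toy sketch of what "the lattice around the seed" might look like: hypotheses ordered by generalization, where one step up replaces a single lexical constant with a variable. The tuple-of-strings encoding and `"X"` variable marker are my own illustration; the real SVS hypotheses are transfer rules with feature constraints.

```python
def generalizations(rule):
    """One-step generalizations of a seed rule: replace one lexical
    constant with the variable "X".  Toy encoding (tuple of strings);
    the real SVS operates over transfer-rule constraints."""
    out = []
    for i, tok in enumerate(rule):
        if tok != "X":
            # Each abstraction step moves one node up the lattice.
            out.append(rule[:i] + ("X",) + rule[i + 1:])
    return out
```

Active learning would then pick elicitation sentences that distinguish between neighboring hypotheses in this lattice.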
"The input to SVS learning is a controlled corpus, which we call the elicitation corpus."
"Like Boas (Nirenburg, 1998), NICE emulates the work of field linguists. However, in contrast to the Boas project, we do not expect the user to learn linguistic terminology."
"... this follows the tradition of the Swadesh List (named after the linguist Morris Swadesh)."
Dependencies in orders of picking elicitation sentences: "if a language does not mark plural, then it will also not mark dual or paucal..." (wonder if that's For Sure true? Is that just for human languages that we've seen?)
Eliciting Minimal Pairs of sentences: "the rock fell" vs "the rocks fell". We know that they're different in English, so we take a look at which word is aligned with "rock" and which with "rocks" -- are they different? In what way?
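The minimal-pair check could look something like this sketch. The data format (token lists plus a list of (source index, target index) alignment links) is my assumption, not the actual elicitation-corpus format.

```python
def minimal_pair_diff(pair_a, pair_b):
    """Compare the target words aligned to the one source word that
    differs between two minimal-pair sentences ("rock" vs "rocks").

    Each pair is (source_tokens, target_tokens, alignments), where
    alignments is a list of (src_index, tgt_index) links.
    Returns the two aligned-word lists and whether they differ,
    i.e. whether the target language appears to mark the feature."""
    src_a, tgt_a, align_a = pair_a
    src_b, tgt_b, align_b = pair_b
    # Find the single source position where the two sentences differ.
    diffs = [i for i, (a, b) in enumerate(zip(src_a, src_b)) if a != b]
    assert len(diffs) == 1, "expected a minimal pair (exactly one change)"
    i = diffs[0]
    # Collect the target words each variant aligns to at that position.
    words_a = [tgt_a[j] for (s, j) in align_a if s == i]
    words_b = [tgt_b[j] for (s, j) in align_b if s == i]
    return words_a, words_b, words_a != words_b
```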
Future work: tag each sentence with a feature vector showing what it exemplifies and have the SVS learning mechanism use this to find sentences that bear on a given hypothesis. (active learning with version spaces, really)
http://www.cs.cmu.edu/~aria/Papers/LREC-2004-Llitjos.pdf
http://www.cs.cmu.edu/~aria/Papers/LREC-2006-Monson.pdf
One of the questions we're most interested in, right now, is how to sensibly combine information from different sources into one MT system. It might turn out that AVENUE has already addressed these issues; if I recall correctly, they said in at least one paper that they have an "omnivorous" system. This paper definitely uses that word, so let me take a look at what they might mean by that...
Interestingly, they've got automatically derived rule-based systems for Hindi and Hebrew to English...
"For Quechua, we have created parallel and aligned data as well as a basic bilingual lexicon with morphological information"
"All the unique words in the spoken corpus were extracted and then ordered by frequency."
EBMT system for Mapudungun: grabs fragments and recombines them like you might expect. "For unmatched portions of the input, EBMT falls back on a probabilistic lexicon trained from the corpus to produce word-for-word translations."
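The grab-fragments-and-recombine behavior, with the lexicon fallback, could be sketched like this (greedy longest match; the phrase-table and lexicon formats are assumptions, not AVENUE's actual data structures):

```python
def ebmt_translate(tokens, phrase_table, lexicon):
    """Greedy longest-match EBMT sketch: reuse stored fragments, and
    fall back on a word-for-word probabilistic lexicon for the rest.

    phrase_table maps source-token tuples to target strings; lexicon
    maps single source words to {translation: probability}."""
    out, i = [], 0
    while i < len(tokens):
        # Try the longest stored fragment starting at position i.
        for j in range(len(tokens), i, -1):
            frag = tuple(tokens[i:j])
            if frag in phrase_table:
                out.append(phrase_table[frag])
                i = j
                break
        else:
            # No fragment matched: most probable word-for-word translation
            # (pass the word through unchanged if it's unknown).
            choices = lexicon.get(tokens[i], {tokens[i]: 1.0})
            out.append(max(choices, key=choices.get))
            i += 1
    return " ".join(out)
```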
"The MT system consists of three main components: the Mapudungun morphological analyzer discussed in section 3.2.1, the transfer system, and the Spanish morphological analyzer. Each of these programs makes use of different data bases (lexicons or grammars). The transfer system makes use of a transfer grammar and a transfer lexicon, which contain syntactic and lexical rules in order to map Mapudungun expressions into Spanish expressions."
Synchronous chart parsing and generation: "A chart is populated with all constituent structures that were created in the course of parsing the SL input with the source-side portion of the transfer grammar. Transfer and generation are performed in an integrated second stage. A dual TL chart is constructed by applying transfer and generation operations on each and every constituent entry in the SL parse chart."
The output of their system is a lattice, which is passed on to a decoder. They're not doing the generation all in one go. (see Peterson 2002)
So it sounds like their Mapudungun system is kind of like what we're building...
"The first version that was used in Peru had 1700 sentences."
"The second part, the Structural Elicitation Corpus, is a smaller corpus designed to cover the major structures present in the Penn Treebank..."
"The final Structural Elicitation Corpus which was translated into Quechua has 146 Spanish sentences."
31986 different Quechua words used in the three books. They picked out the most common 10k. 1666 types from the Elicitation Corpus were grabbed as well.
HEY There are some really nice dictionaries stored in those Excel files: VocabularioCuzquen~o.xls VocabularioCuzquenyo.xls ... but what's the difference between them?
Only 25 different translation rules? Is that enough? Probably not? ...
http://www.cs.cmu.edu/~aria/Papers/FontLlitjosVogel-SSST07.pdf
Do we take the structure and put it into a stat-mt system, or do we take something that's not stat-mt and add stats? This one does the latter, starting with Transfer-Based MT. (how does this relate to Alon's recent work on STAT-XFER?)
"The Transfer-Based system is a hybrid translation system that has integrated n-gram language model, and while its core remains untouched, additional features are used to re-score the n-best list generated by the transfer engine. Statistical alignment techniques are used to lower the burden in building a lexicon for a new domain."
... so they have a rule-based MT system, and an Automatic Rule Refiner (Font Llitjós and Ridmann 2007)...
Xfer Engine is the synchronous chart-parser thing. "Often, no parse for the entire source sentence can be found. Partial parses are concatenated sequentially to generate complete translations." (That's really clever; kind of like the CCG generation stuff by Mike White...)
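The concatenation of partial parses could be something like this dynamic program over chart spans: pick the best-scoring sequence of adjacent fragments covering the input. The (start, end, translation, score) representation is my assumption about what comes out of the chart.

```python
def concatenate_partials(n, chart_entries):
    """When no full parse covers positions 0..n, stitch partial parses
    together left to right.  chart_entries is a list of
    (start, end, translation, score) spans.  A simple DP keeps the
    best-scoring path of adjacent spans reaching each position."""
    best = {0: (0.0, [])}  # position -> (score so far, translations)
    # Spans ending at position p all start before p, so processing by
    # ascending start guarantees best[start] is final when we read it.
    for start, end, translation, score in sorted(chart_entries, key=lambda x: x[0]):
        if start in best:
            cand = (best[start][0] + score, best[start][1] + [translation])
            if end not in best or cand[0] > best[end][0]:
                best[end] = cand
    if n not in best:
        return None  # coverage gap: no sequence of partials spans the input
    return " ".join(best[n][1])
```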
"The alternatives arise from '''lexical ambiguity''' and multiple synonymous choices for lexical items in the dictionary, but also from syntactic ambiguity and multiple competing hypotheses from the grammar"
ways that they rank translations:
- fragmentation penalty
- ngram language model on the target language, considered during generation (but they're doing the generation with a lattice, I guess?)
- length model: what's the expected output length, given the input sentence length -- ratio trained from the training bitext.
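Those three signals could be combined into one log-linear rescoring function, roughly like this. The particular length-model form (absolute deviation from a trained length ratio) and the weights are illustrative assumptions; per the paper, such weights would be tuned rather than hand-set.

```python
def rescore(candidate, src_len, lm_logprob, weights, expected_ratio=1.1):
    """Sketch of combining the three ranking signals into one score.
    candidate is (translation_tokens, n_fragments); lm_logprob is the
    target-language n-gram log probability, supplied externally.
    expected_ratio stands in for the length ratio trained from bitext."""
    tokens, n_fragments = candidate
    # Fragmentation penalty: fewer, larger fragments are better.
    frag = -weights["frag"] * (n_fragments - 1)
    # Target-language n-gram model score.
    lm = weights["lm"] * lm_logprob
    # Length model: penalize deviation from the expected output length.
    expected = expected_ratio * src_len
    length = -weights["len"] * abs(len(tokens) - expected)
    return frag + lm + length
```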
"Semi-Automatic Generation of the Transfer Lexicon"
- train IBM Model 1 for English/Spanish and Spanish/English
- phrase alignment techniques to find two-word phrases.
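For reference, IBM Model 1 itself is just EM over word-translation probabilities. A minimal sketch (no NULL word, uniform initialization) of the kind of model used to seed a transfer lexicon:

```python
from collections import defaultdict

def ibm_model1(bitext, iterations=10):
    """Minimal IBM Model 1 EM.  bitext is a list of
    (source_tokens, target_tokens) pairs; returns a dict mapping
    (target_word, source_word) -> t(target | source)."""
    src_vocab = {w for s, _ in bitext for w in s}
    uniform = 1.0 / len(src_vocab)
    t = defaultdict(lambda: uniform)  # t[(tgt, src)]
    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(tgt, src)
        total = defaultdict(float)   # marginals per source word
        for src, tgt in bitext:
            for f in tgt:
                # E-step: distribute f's count over candidate alignments.
                z = sum(t[(f, e)] for e in src)
                for e in src:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalize.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return dict(t)
```

Run in both directions and intersect or grow the resulting alignments to pull out the two-word phrases mentioned above.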
"Adding Statistical Components to a Re-Ranker" You could additionally add...
- word-to-word probabilities (ie, a translation model.)
- rule probabilities (but these have to be estimated from noisy data; unless you had an aligned treebank...)
Then, you've got to do parameter tweaking: MERT (minimum error rate training), as in Och 2003.
The upshot of all of this: they have some training data, and they used it to train some sensible models, then use the models for reranking. But the core of the system seems to remain rule-based.
http://www.cs.cmu.edu/~alavie/papers/TALIP-SLE-03.pdf
CategoryPapers