A statistical view on bilingual lexicon extraction

Pascale Fung


Abstract
We address two problems in statistically extracting bilingual lexicons: (1) How can noisy parallel corpora be used? (2) How can non-parallel yet comparable corpora be used? We describe our own work and contributions in relaxing the constraint of using only clean parallel corpora. DKvec is a method for extracting bilingual lexicons from noisy parallel corpora, based on the arrival distances of words. Using DKvec on noisy parallel corpora in English/Japanese and English/Chinese, our evaluations show 55.35% precision from a small corpus and 89.93% precision from a larger corpus. Our major contribution is the extraction of bilingual lexicons from non-parallel corpora. We present a first such result in this area, obtained with a new method, Convec. Convec is based on the context information of a word to be translated.
Anthology ID:
1998.amta-papers.1
Volume:
Proceedings of the Third Conference of the Association for Machine Translation in the Americas: Technical Papers
Month:
October 28-31
Year:
1998
Address:
Langhorne, PA, USA
Editors:
David Farwell, Laurie Gerber, Eduard Hovy
Venue:
AMTA
Publisher:
Springer
Pages:
1–17
URL:
https://link.springer.com/chapter/10.1007/3-540-49478-2_1
Cite (ACL):
Pascale Fung. 1998. A statistical view on bilingual lexicon extraction. In Proceedings of the Third Conference of the Association for Machine Translation in the Americas: Technical Papers, pages 1–17, Langhorne, PA, USA. Springer.
Cite (Informal):
A statistical view on bilingual lexicon extraction (Fung, AMTA 1998)