2022
You’ve translated it, now what?
Michael Maxwell | Shabnam Tafreshi | Aquia Richburg | Balaji Kodali | Kymani Brown
Proceedings of the 15th Biennial Conference of the Association for Machine Translation in the Americas (Volume 2: Users and Providers Track and Government Track)
Humans use document formatting to discover document and section titles, and important phrases. But when machines process a document, especially one OCRed from an image, these cues are often invisible to downstream processes: words in footnotes or body text are treated as just as important as words in titles. It would be better for indexing and summarization tools to be guided by implicit document structure. In an ODNI-sponsored project, ARLIS looked at discovering formatting in OCRed text as a way to infer document structure. Most OCR engines output results as hOCR (an XML format), giving bounding boxes around characters. In theory, this also provides style information such as bolding and italicization, but in practice, this capability is limited. For example, the Tesseract OCR tool provides bounding boxes, but does not attempt to detect bold text (relevant to author emphasis and to specialized fields in, e.g., print dictionaries), and its discrimination of italicization is poor. Our project inferred font size from hOCR bounding boxes, and using that and other cues (e.g. the fact that titles tend to be short) determined which text constituted section titles; from this, a document outline can be created. We also experimented with algorithms for detecting bold text. Our best algorithm has much improved recall and precision, although the exact numbers are font-dependent. The next step is to incorporate inferred structure into the output of machine translation. One way is to embed XML tags for inferred structure into the text extracted from the imaged document, and to either pass the strings enclosed by XML tags to the MT engine individually, or pass the tags through the MT engine without modification. This structural information can guide downstream bulk processing tasks such as summarization and search, and also enables building tables of contents for human users examining individual documents.
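The idea of inferring font size from hOCR bounding boxes and combining it with title length can be illustrated with a minimal sketch. This is not the project's implementation: the function names, the sample markup, and the thresholds (`scale=1.2`, `max_words=8`) are assumptions chosen for illustration, and the input is assumed to be well-formed XML in which each line is an element with `class="ocr_line"` and a `bbox x0 y0 x1 y1` entry in its `title` attribute.

```python
import re
from statistics import median
from xml.etree import ElementTree as ET

# hOCR stores geometry in the title attribute, e.g. "bbox 100 50 600 98".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")

# Hypothetical sample; real hOCR output is a full XHTML document.
SAMPLE = """<div class="ocr_page">
<span class="ocr_line" title="bbox 100 50 600 98"><span class="ocrx_word" title="bbox 100 50 300 98">Introduction</span></span>
<span class="ocr_line" title="bbox 100 120 900 150">Humans use formatting to find titles and key phrases.</span>
<span class="ocr_line" title="bbox 100 160 900 190">Machines reading OCR output often lose these cues.</span>
<span class="ocr_line" title="bbox 100 200 900 231">Bounding boxes give a rough proxy for font size.</span>
</div>"""


def line_heights(hocr):
    """Return (text, height_in_pixels) for every ocr_line element."""
    root = ET.fromstring(hocr)
    lines = []
    for element in root.iter():
        if element.get("class") != "ocr_line":
            continue
        match = BBOX_RE.search(element.get("title", ""))
        if match is None:
            continue
        _x0, y0, _x1, y1 = map(int, match.groups())
        # Collapse whitespace across nested word elements.
        text = " ".join("".join(element.itertext()).split())
        lines.append((text, y1 - y0))
    return lines


def title_candidates(lines, scale=1.2, max_words=8):
    """Flag short lines that are noticeably taller than the median line.

    Box height is only a proxy for font size, and both thresholds are
    illustrative assumptions, not values from the project.
    """
    if not lines:
        return []
    typical_height = median(height for _, height in lines)
    return [
        text
        for text, height in lines
        if height >= scale * typical_height and len(text.split()) <= max_words
    ]


print(title_candidates(line_heights(SAMPLE)))
```

Line height depends on which ascenders and descenders happen to occur in a line, so a practical system would likely need a more robust size estimate and additional cues before building an outline.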
2017
STREAMLInED Challenges: Aligning Research Interests with Shared Tasks
Gina-Anne Levow | Emily M. Bender | Patrick Littell | Kristen Howell | Shobhana Chelliah | Joshua Crowgey | Dan Garrette | Jeff Good | Sharon Hargus | David Inman | Michael Maxwell | Michael Tjalve | Fei Xia
Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages
Endangered Data for Endangered Languages: Digitizing Print dictionaries
Michael Maxwell | Aric Bills
Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages
2016
Did You Mean...? and Dictionary Repair: from Science to Engineering
Michael Maxwell | Petra Bradley
Conferences of the Association for Machine Translation in the Americas: MT Users' Track
2015
Accounting for Allomorphy in Finite-state Transducers
Michael Maxwell
Proceedings of the 12th International Conference on Finite-State Methods and Natural Language Processing 2015 (FSMNLP 2015 Düsseldorf)
2008
Lexicon Schemas and Related Data Models: when Standards Meet Users
Thorsten Trippel | Michael Maxwell | Greville Corbett | Cambell Prince | Christopher Manning | Stephen Grimes | Steve Moran
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
Lexicon schemas and their use are discussed in this paper from the perspective of lexicographers and field linguists. A variety of lexicon schemas have been developed, with goals ranging from computational lexicography (DATR) through archiving (LIFT, TEI) to standardization (LMF, FSR). A number of requirements for lexicon schemas are given. The lexicon schemas are introduced and compared to each other in terms of conversion and usability for this particular user group, using a common lexicon entry and providing examples for each schema under consideration. The formats are assessed and the final recommendation is given for the potential users, namely to request standard compliance from the developers of the tools used. This paper should foster a discussion between authors of standards, lexicographers and field linguists.
Joint Grammar Development by Linguists and Computer Scientists
Michael Maxwell | Anne David
Proceedings of the IJCNLP-08 Workshop on NLP for Less Privileged Languages
2004
Morphological Interfaces to Dictionaries
Michael Maxwell | William Poser
Proceedings of the Workshop on Enhancing and Using Electronic Dictionaries
2000
Book Reviews: A Grammar Writer’s Cookbook
Michael Maxwell
Computational Linguistics, Volume 26, Number 2, June 2000
1994
Parsing Using Linearly Ordered Phonological Rules
Michael Maxwell
Computational Phonology
1991
Phonological Analysis and Opaque Rule Orders
Michael Maxwell
Proceedings of the Second International Workshop on Parsing Technologies
General morphological/phonological analysis using ordered phonological rules has appeared to be computationally expensive, because ambiguities in feature values arising when phonological rules are “un-applied” multiply with additional rules. But in fact those ambiguities can be largely ignored until lexical lookup, since the underlying values of altered features are needed only in the case of rare opaque rule orderings, and not always then.
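The point about deferring ambiguity until lexical lookup can be illustrated with a toy sketch. This is not the paper's algorithm: it assumes a single hypothetical rule (word-final obstruent devoicing), a made-up four-word lexicon, and segments written as single characters. Rather than enumerating every possible underlying form when the rule is un-applied, the altered position is kept as a set of possible segments and resolved only when matched against the lexicon.

```python
# Hypothetical rule: word-final /b d g/ surface as [p t k].
VOICED_COUNTERPART = {"p": "b", "t": "d", "k": "g"}


def unapply_final_devoicing(surface):
    """Return one set of possible underlying segments per surface segment.

    The final obstruent is left underspecified for voicing instead of
    being expanded into separate candidate strings.
    """
    slots = [{segment} for segment in surface]
    if surface and surface[-1] in VOICED_COUNTERPART:
        slots[-1].add(VOICED_COUNTERPART[surface[-1]])
    return slots


def lexical_lookup(slots, lexicon):
    """Resolve the underspecified form against the lexicon in one pass."""
    return sorted(
        entry
        for entry in lexicon
        if len(entry) == len(slots)
        and all(segment in slot for segment, slot in zip(entry, slots))
    )


# Toy lexicon of underlying forms (an assumption for illustration).
LEXICON = ["rad", "rat", "bund", "bunt"]

print(lexical_lookup(unapply_final_devoicing("rat"), LEXICON))
```

With several rules, each un-application would only widen the sets at the affected positions, so the number of stored analyses stays at one per surface form until the lexicon is consulted. The sketch does not model opaque rule orderings, which the abstract identifies as the case where underlying feature values are actually needed.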