2019
Text Similarity Estimation Based on Word Embeddings and Matrix Norms for Targeted Marketing
Tim vor der Brück | Marc Pouly
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
The prevalent way to estimate the similarity of two documents based on word embeddings is to apply the cosine similarity measure to the two centroids obtained from the embedding vectors associated with the words in each document. Motivated by an industrial application from the domain of youth marketing, where this approach produced only mediocre results, we propose an alternative way of combining the word vectors using matrix norms. The evaluation shows superior results for most of the investigated matrix norms in comparison to both the classical cosine measure and several other document similarity estimates.
2016
TLT-CRF: A Lexicon-supported Morphological Tagger for Latin Based on Conditional Random Fields
Tim vor der Brück | Alexander Mehler
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
We present a morphological tagger for Latin, called TTLab Latin Tagger based on Conditional Random Fields (TLT-CRF), which uses a large Latin lexicon. Beyond Part of Speech (PoS), TLT-CRF tags eight inflectional categories of verbs, adjectives or nouns. It utilizes a statistical model based on CRFs together with a rule interpreter that addresses scenarios of sparse training data. We present results of evaluating TLT-CRF to answer the question of what can be learnt following the paradigm of first-order CRFs in conjunction with a large lexical resource and a rule interpreter. Furthermore, we investigate the contingency of representational features and targeted parts of speech to learn about selective features.
2015
Lexicon-assisted tagging and lemmatization in Latin: A comparison of six taggers and two lemmatization methods
Steffen Eger | Tim vor der Brück | Alexander Mehler
Proceedings of the 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH)
2014
ColLex.en: Automatically Generating and Evaluating a Full-form Lexicon for English
Tim vor der Brück | Alexander Mehler | Zahurul Islam
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
The paper describes a procedure for the automatic generation of a large full-form lexicon of English. We put emphasis on two statistical methods for lexicon extension and adjustment: a letter-based HMM and a detector of spelling variants and misspellings. The resulting resource, ColLex.en, is evaluated with respect to two tasks: text categorization and lexical coverage, using the SUSANNE corpus and the Open ANC as examples.
2010
Learning Semantic Network Patterns for Hypernymy Extraction
Tim vor der Brück
Proceedings of the 6th Workshop on Ontologies and Lexical Resources