Anders Johannsen

Also published as: Anders Johanssen


2024

Hierarchical and Dynamic Prompt Compression for Efficient Zero-shot API Usage
Yichen Jiang | Marco Vecchio | Mohit Bansal | Anders Johannsen
Findings of the Association for Computational Linguistics: EACL 2024

Long prompts present a significant challenge for practical LLM-based systems that need to operate with low latency and limited resources. We investigate prompt compression for zero-shot dialogue systems that learn to use unseen APIs directly in-context from their documentation, which may take up hundreds of prompt tokens per API. We start from a recently introduced approach (Mu et al., 2023) that learns to compress the prompt into a few “gist token” activations during finetuning. However, this simple idea is ineffective in compressing API documentation, resulting in low accuracy compared to the baseline using an uncompressed prompt. In this work, we introduce two major improvements. First, we specialize gist tokens for different hierarchies within an API: we use one Gist_arg token for compressing an argument and one Gist_value token for compressing an acceptable value of a categorical argument. We then dynamically reveal Gist_value tokens only when they are needed. Second, we add a reconstruction loss to predict the API documentation from the gist tokens. On multiple API-calling tasks, our proposed system keeps the simplicity, efficiency, and large compression factor (20x on SGD) of the gist token approach while achieving significantly better accuracy.

Effective and Efficient Conversation Retrieval for Dialogue State Tracking with Implicit Text Summaries
Seanie Lee | Jianpeng Cheng | Joris Driesen | Alexandru Coca | Anders Johannsen
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Few-shot dialogue state tracking (DST) with Large Language Models (LLMs) relies on an effective and efficient conversation retriever to find similar in-context examples for prompt learning. Previous works use raw dialogue context as search keys and queries, and a retriever is fine-tuned with annotated dialogues to achieve superior performance. However, the approach is less suited for scaling to new domains or new annotation languages, where fine-tuning data is unavailable. To address this problem, we handle the task of conversation retrieval based on text summaries of the conversations. An LLM-based conversation summarizer is adopted for query and key generation, which enables effective maximum inner product search. To avoid the extra inference cost brought by LLM-based conversation summarization, we further distill a light-weight conversation encoder which produces query embeddings without decoding summaries for test conversations. We validate our retrieval approach on MultiWOZ datasets with GPT-Neo-2.7B and LLaMA-7B/30B. The experimental results show a significant improvement over relevant baselines in real few-shot DST settings.

LUCID: LLM-Generated Utterances for Complex and Interesting Dialogues
Joe Stacey | Jianpeng Cheng | John Torr | Tristan Guigue | Joris Driesen | Alexandru Coca | Mark Gaynor | Anders Johannsen
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)

Spurred by recent advances in Large Language Models (LLMs), virtual assistants are poised to take a leap forward in terms of their dialogue capabilities. Yet a major bottleneck to achieving genuinely transformative task-oriented dialogue capabilities remains the scarcity of high quality data. Existing datasets, while impressive in scale, have limited domain coverage and contain few genuinely challenging conversational phenomena; those which are present are typically unlabelled, making it difficult to assess the strengths and weaknesses of models without time-consuming and costly human evaluation. Moreover, creating high quality dialogue data has until now required considerable human input, limiting both the scale of these datasets and the ability to rapidly bootstrap data for a new target domain. We aim to overcome these issues with LUCID, a modularised and highly automated LLM-driven data generation system that produces realistic, diverse and challenging dialogues. We use LUCID to generate a seed dataset of 4,277 conversations across 100 intents to demonstrate its capabilities, with a human review finding consistently high quality labels in the generated data.

2020

Conversational Semantic Parsing for Dialog State Tracking
Jianpeng Cheng | Devang Agrawal | Héctor Martínez Alonso | Shruti Bhargava | Joris Driesen | Federico Flego | Dain Kaplan | Dimitri Kartsaklis | Lin Li | Dhivya Piraviperumal | Jason D. Williams | Hong Yu | Diarmuid Ó Séaghdha | Anders Johannsen
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We consider a new perspective on dialog state tracking (DST), the task of estimating a user’s goal through the course of a dialog. By formulating DST as a semantic parsing task over hierarchical representations, we can incorporate semantic compositionality, cross-domain knowledge sharing and co-reference. We present TreeDST, a dataset of 27k conversations annotated with tree-structured dialog states and system acts. We describe an encoder-decoder framework for DST with hierarchical representations, which leads to ~20% improvement over state-of-the-art DST approaches that operate on a flat meaning space of slot-value pairs.

2016

An empirically grounded expansion of the supersense inventory
Hector Martinez Alonso | Anders Johannsen | Sanni Nimb | Sussi Olsen | Bolette Pedersen
Proceedings of the 8th Global WordNet Conference (GWC)

In this article we present an expansion of the supersense inventory. All new supersenses are extensions of members of the current inventory, which we postulate by identifying semantically coherent groups of synsets. We cover the expansion of the already-established supersense inventory for nouns and verbs, the addition of coarse supersenses for adjectives in the absence of a canonical supersense inventory, and supersenses for verbal satellites. We evaluate the viability of the new senses by examining annotation agreement, frequency, and co-occurrence patterns.

Multilingual Projection for Parsing Truly Low-Resource Languages
Željko Agić | Anders Johannsen | Barbara Plank | Héctor Martínez Alonso | Natalie Schluter | Anders Søgaard
Transactions of the Association for Computational Linguistics, Volume 4

We propose a novel approach to cross-lingual part-of-speech tagging and dependency parsing for truly low-resource languages. Our annotation projection-based approach yields tagging and parsing models for over 100 languages. All that is needed are freely available parallel texts, and taggers and parsers for resource-rich languages. The empirical evaluation across 30 test languages shows that our method consistently provides top-level accuracies, close to established upper bounds, and outperforms several competitive baselines.

SemEval-2016 Task 10: Detecting Minimal Semantic Units and their Meanings (DiMSUM)
Nathan Schneider | Dirk Hovy | Anders Johannsen | Marine Carpuat
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

Supersense tagging with inter-annotator disagreement
Héctor Martínez Alonso | Anders Johannsen | Barbara Plank
Proceedings of the 10th Linguistic Annotation Workshop held in conjunction with ACL 2016 (LAW-X 2016)

The SemDaX Corpus ― Sense Annotations with Scalable Sense Inventories
Bolette Pedersen | Anna Braasch | Anders Johannsen | Héctor Martínez Alonso | Sanni Nimb | Sussi Olsen | Anders Søgaard | Nicolai Hartvig Sørensen
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We launch the SemDaX corpus, which is a recently completed Danish human-annotated corpus available through a CLARIN academic license. The corpus includes approx. 90,000 words, comprises six textual domains, and is annotated with sense inventories of different granularity. The aim of the developed corpus is twofold: i) to assess the reliability of the different sense annotation schemes for Danish measured by qualitative analyses and annotation agreement scores, and ii) to serve as training and test data for machine learning algorithms with the practical purpose of developing sense taggers for Danish. To these aims, we take a new approach to human-annotated corpus resources by double annotating a much larger part of the corpus than what is normally seen: for the all-words task we double annotated 60% of the material and for the lexical sample task 100%. We include in the corpus not only the adjudicated files, but also the diverging annotations. In other words, we do not consider all disagreement to be noise, but rather to contain valuable linguistic information that can help us improve our annotation schemes and our learning algorithms.

Exploring Language Variation Across Europe - A Web-based Tool for Computational Sociolinguistics
Dirk Hovy | Anders Johannsen
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Language varies not only between countries, but also along regional and socio-demographic lines. This variation is one of the driving factors behind language change. However, investigating language variation is a complex undertaking: the more factors we want to consider, the more data we need. Traditional qualitative methods are not well-suited to do this, and are therefore restricted to isolated factors. This reduction limits the potential insights, and risks attributing undue importance to easily observed factors. While there is a large interest in linguistics to increase the quantitative aspect of such studies, it requires training in both variational linguistics and computational methods, a combination that is still not common. We take a first step here toward alleviating the problem by providing an interface, www.languagevariation.com, to explore large-scale language variation along multiple socio-demographic factors – without programming knowledge. It makes use of large amounts of data and provides statistical analyses, maps, and interactive features that will enable scholars to explore language variation in a data-driven way.

Joint part-of-speech and dependency projection from multiple sources
Anders Johannsen | Željko Agić | Anders Søgaard
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2015

Supersense tagging for Danish
Héctor Martínez Alonso | Anders Johannsen | Sussi Olsen | Sanni Nimb | Nicolai Hartvig Sørensen | Anna Braasch | Anders Søgaard | Bolette Sandford Pedersen
Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015)

Active learning for sense annotation
Héctor Martínez Alonso | Barbara Plank | Anders Johannsen | Anders Søgaard
Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015)

Coarse-grained sense annotation of Danish across textual domains
Sussi Olsen | Bolette S. Pedersen | Héctor Martínez Alonso | Anders Johannsen
Proceedings of the workshop on Semantic resources and semantic annotation for Natural Language Processing and the Digital Humanities at NODALIDA 2015

Predicting word sense annotation agreement
Héctor Martínez Alonso | Anders Johannsen | Oier Lopez de Lacalle | Eneko Agirre
Proceedings of the First Workshop on Linking Computational Models of Lexical, Sentential and Discourse-level Semantics

Cross-lingual syntactic variation over age and gender
Anders Johannsen | Dirk Hovy | Anders Søgaard
Proceedings of the Nineteenth Conference on Computational Natural Language Learning

Any-language frame-semantic parsing
Anders Johannsen | Héctor Martínez Alonso | Anders Søgaard
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Inverted indexing for cross-lingual NLP
Anders Søgaard | Željko Agić | Héctor Martínez Alonso | Barbara Plank | Bernd Bohnet | Anders Johannsen
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

2014

What’s in a p-value in NLP?
Anders Søgaard | Anders Johannsen | Barbara Plank | Dirk Hovy | Hector Martínez Alonso
Proceedings of the Eighteenth Conference on Computational Natural Language Learning

More or less supervised supersense tagging of Twitter
Anders Johannsen | Dirk Hovy | Héctor Martínez Alonso | Barbara Plank | Anders Søgaard
Proceedings of the Third Joint Conference on Lexical and Computational Semantics (*SEM 2014)

Copenhagen-Malmö: Tree Approximations of Semantic Parsing Problems
Natalie Schluter | Anders Søgaard | Jakob Elming | Dirk Hovy | Barbara Plank | Héctor Martínez Alonso | Anders Johanssen | Sigrid Klerke
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

Importance weighting and unsupervised domain adaptation of POS taggers: a negative result
Barbara Plank | Anders Johannsen | Anders Søgaard
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

2013

Cross-Domain Answer Ranking using Importance Sampling
Anders Johannsen | Anders Søgaard
Proceedings of the Sixth International Joint Conference on Natural Language Processing

Disambiguating Explicit Discourse Connectives without Oracles
Anders Johannsen | Anders Søgaard
Proceedings of the Sixth International Joint Conference on Natural Language Processing

Using Crowdsourcing to get Representations based on Regular Expressions
Anders Søgaard | Hector Martinez | Jakob Elming | Anders Johannsen
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

Down-stream effects of tree-to-dependency conversions
Jakob Elming | Anders Johannsen | Sigrid Klerke | Emanuele Lapponi | Hector Martinez Alonso | Anders Søgaard
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2012

EMNLP@CPH: Is frequency all there is to simplicity?
Anders Johannsen | Héctor Martínez | Sigrid Klerke | Anders Søgaard
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

Creation and use of Language Resources in a Question-Answering eHealth System
Ulrich Andersen | Anna Braasch | Lina Henriksen | Csaba Huszka | Anders Johannsen | Lars Kayser | Bente Maegaard | Ole Norgaard | Stefan Schulz | Jürgen Wedekind
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

ESICT (Experience-oriented Sharing of health knowledge via Information and Communication Technology) is an ongoing research project funded by the Danish Council for Strategic Research. It aims at developing a health/disease related information system based on information technology, language technology, and formalized medical knowledge. The formalized medical knowledge consists partly of the terminology database SNOMED CT and partly of authorized medical texts on the domain. The system will allow users to ask questions in Danish and will provide natural language answers. Currently, the project is pursuing three fundamentally different methods for question answering, and they are all described to some extent in this paper. A system prototype will handle questions related to diabetes and heart diseases. This paper concentrates on the methods employed for question answering and the language resources that are utilized. Some resources, such as SNOMED CT, already existed; others, such as a corpus of sample questions, had to be created or constructed.

Robust Learning in Random Subspaces: Equipping NLP for OOV Effects
Anders Søgaard | Anders Johannsen
Proceedings of COLING 2012: Posters

2011

Shared Task System Description: Frustratingly Hard Compositionality Prediction
Anders Johannsen | Hector Martinez | Christian Rishøj | Anders Søgaard
Proceedings of the Workshop on Distributional Semantics and Compositionality

“Andre ord” – a wordnet browser for the Danish wordnet, DanNet
Anders Johannsen | Bolette Sandford Pedersen
Proceedings of the 18th Nordic Conference of Computational Linguistics (NODALIDA 2011)