Stephen Wan


2024

pdf bib
What Causes the Failure of Explicit to Implicit Discourse Relation Recognition?
Wei Liu | Stephen Wan | Michael Strube
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

We consider an unanswered question in the discourse processing community: why do relation classifiers trained on explicit examples (with connectives removed) perform poorly in real implicit scenarios? Prior work claimed this is due to linguistic dissimilarity between explicit and implicit examples but provided no empirical evidence. In this study, we show that one cause for such failure is a label shift after connectives are eliminated. Specifically, we find that the discourse relations expressed by some explicit instances will change when connectives disappear. Unlike previous work manually analyzing a few examples, we present empirical evidence at the corpus level to prove the existence of such shift. Then, we analyze why label shift occurs by considering factors such as the syntactic role played by connectives, ambiguity of connectives, and more. Finally, we investigate two strategies to mitigate the label shift: filtering out noisy data and joint learning with connectives. Experiments on PDTB 2.0, PDTB 3.0, and the GUM dataset demonstrate that classifiers trained with our strategies outperform strong baselines.

pdf bib
CSIRO at Context24: Contextualising Scientific Figures and Tables in Scientific Literature
Necva Bölücü | Vincent Nguyen | Roelien Timmer | Huichen Yang | Maciej Rybinski | Stephen Wan | Sarvnaz Karimi
Proceedings of the Fourth Workshop on Scholarly Document Processing (SDP 2024)

Finding evidence for claims from content presented in experimental results of scientific articles is difficult. The evidence is often presented in the form of tables and figures, and correctly matching it to scientific claims presents automation challenges. The Context24 shared task is launched to support the development of systems able to verify claims by extracting supporting evidence from articles. We explore different facets of this shared task modelled as a search problem and as an information extraction task. We experiment with a range of methods in each of these categories for the two sub-tasks of evidence identification and grounding context identification in the Context24 shared task.

2023

pdf bib
impact of sample selection on in-context learning for entity extraction from scientific writing
Necva Bölücü | Maciej Rybinski | Stephen Wan
Findings of the Association for Computational Linguistics: EMNLP 2023

Prompt-based usage of Large Language Models (LLMs) is an increasingly popular way to tackle many well-known natural language problems. This trend is due, in part, to the appeal of the In-Context Learning (ICL) prompt set-up, in which a few selected training examples are provided along with the inference request. ICL, a type of few-shot learning, is especially attractive for natural language processing (NLP) tasks defined for specialised domains, such as entity extraction from scientific documents, where the annotation is very costly due to expertise requirements for the annotators. In this paper, we present a comprehensive analysis of in-context sample selection methods for entity extraction from scientific documents using GPT-3.5 and compare these results against a fully supervised transformer-based baseline. Our results indicate that the effectiveness of the in-context sample selection methods is heavily domain-dependent, but the improvements are more notable for problems with a larger number of entity types. More in-depth analysis shows that ICL is more effective for low-resource set-ups of scientific information extraction

pdf bib
Investigating the Impact of Syntax-Enriched Transformers on Quantity Extraction in Scientific Texts
Necva Bölücü | Maciej Rybinski | Stephen Wan
Proceedings of the Second Workshop on Information Extraction from Scientific Publications

pdf bib
Rethinking the Role of Entity Type in Relation Classification
Xiang Dai | Sarvnaz Karimi | Stephen Wan
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

2022

pdf bib
Investigating Metric Diversity for Evaluating Long Document Summarisation
Cai Yang | Stephen Wan
Proceedings of the Third Workshop on Scholarly Document Processing

Long document summarisation, a challenging summarisation scenario, is the focus of the recently proposed LongSumm shared task. One of the limitations of this shared task has been its use of a single family of metrics for evaluation (the ROUGE metrics). In contrast, other fields, like text generation, employ multiple metrics. We replicated the LongSumm evaluation using multiple test set samples (vs. the single test set of the official shared task) and investigated how different metrics might complement each other in this evaluation framework. We show that under this more rigorous evaluation, (1) some of the key learnings from Longsumm 2020 and 2021 still hold, but the relative ranking of systems changes, and (2) the use of additional metrics reveals additional high-quality summaries missed by ROUGE, and (3) we show that SPICE is a candidate metric for summarisation evaluation for LongSumm.

2021

pdf bib
Mention Flags (MF): Constraining Transformer-based Text Generators
Yufei Wang | Ian Wood | Stephen Wan | Mark Dras | Mark Johnson
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

This paper focuses on Seq2Seq (S2S) constrained text generation where the text generator is constrained to mention specific words which are inputs to the encoder in the generated outputs. Pre-trained S2S models or a Copy Mechanism are trained to copy the surface tokens from encoders to decoders, but they cannot guarantee constraint satisfaction. Constrained decoding algorithms always produce hypotheses satisfying all constraints. However, they are computationally expensive and can lower the generated text quality. In this paper, we propose Mention Flags (MF), which traces whether lexical constraints are satisfied in the generated outputs in an S2S decoder. The MF models can be trained to generate tokens in a hypothesis until all constraints are satisfied, guaranteeing high constraint satisfaction. Our experiments on the Common Sense Generation task (CommonGen) (Lin et al., 2020), End2end Restaurant Dialog task (E2ENLG) (Duˇsek et al., 2020) and Novel Object Captioning task (nocaps) (Agrawal et al., 2019) show that the MF models maintain higher constraint satisfaction and text quality than the baseline models and other constrained decoding algorithms, achieving state-of-the-art performance on all three tasks. These results are achieved with a much lower run-time than constrained decoding algorithms. We also show that the MF models work well in the low-resource setting.

pdf bib
Integrating Lexical Information into Entity Neighbourhood Representations for Relation Prediction
Ian Wood | Mark Johnson | Stephen Wan
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Relation prediction informed from a combination of text corpora and curated knowledge bases, combining knowledge graph completion with relation extraction, is a relatively little studied task. A system that can perform this task has the ability to extend an arbitrary set of relational database tables with information extracted from a document corpus. OpenKi[1] addresses this task through extraction of named entities and predicates via OpenIE tools then learning relation embeddings from the resulting entity-relation graph for relation prediction, outperforming previous approaches. We present an extension of OpenKi that incorporates embeddings of text-based representations of the entities and the relations. We demonstrate that this results in a substantial performance increase over a system without this information.

pdf bib
ECOL-R: Encouraging Copying in Novel Object Captioning with Reinforcement Learning
Yufei Wang | Ian Wood | Stephen Wan | Mark Johnson
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Novel Object Captioning is a zero-shot Image Captioning task requiring describing objects not seen in the training captions, but for which information is available from external object detectors. The key challenge is to select and describe all salient detected novel objects in the input images. In this paper, we focus on this challenge and propose the ECOL-R model (Encouraging Copying of Object Labels with Reinforced Learning), a copy-augmented transformer model that is encouraged to accurately describe the novel object labels. This is achieved via a specialised reward function in the SCST reinforcement learning framework (Rennie et al., 2017) that encourages novel object mentions while maintaining the caption quality. We further restrict the SCST training to the images where detected objects are mentioned in reference captions to train the ECOL-R model. We additionally improve our copy mechanism via Abstract Labels, which transfer knowledge from known to novel object types, and a Morphological Selector, which determines the appropriate inflected forms of novel object labels. The resulting model sets new state-of-the-art on the nocaps (Agrawal et al., 2019) and held-out COCO (Hendricks et al., 2016) benchmarks.

pdf bib
Demonstrating the Reliability of Self-Annotated Emotion Data
Anton Malko | Cecile Paris | Andreas Duenser | Maria Kangas | Diego Molla | Ross Sparks | Stephen Wan
Proceedings of the Seventh Workshop on Computational Linguistics and Clinical Psychology: Improving Access

Vent is a specialised iOS/Android social media platform with the stated goal to encourage people to post about their feelings and explicitly label them. In this paper, we study a snapshot of more than 100 million messages obtained from the developers of Vent, together with the labels assigned by the authors of the messages. We establish the quality of the self-annotated data by conducting a qualitative analysis, a vocabulary based analysis, and by training and testing an emotion classifier. We conclude that the self-annotated labels of our corpus are indeed indicative of the emotional contents expressed in the text and thus can support more detailed analyses of emotion expression on social media, such as emotion trajectories and factors influencing them.

pdf bib
Measuring Similarity of Opinion-bearing Sentences
Wenyi Tay | Xiuzhen Zhang | Stephen Wan | Sarvnaz Karimi
Proceedings of the Third Workshop on New Frontiers in Summarization

For many NLP applications of online reviews, comparison of two opinion-bearing sentences is key. We argue that, while general purpose text similarity metrics have been applied for this purpose, there has been limited exploration of their applicability to opinion texts. We address this gap in the literature, studying: (1) how humans judge the similarity of pairs of opinion-bearing sentences; and, (2) the degree to which existing text similarity metrics, particularly embedding-based ones, correspond to human judgments. We crowdsourced annotations for opinion sentence pairs and our main findings are: (1) annotators tend to agree on whether or not opinion sentences are similar or different; and (2) embedding-based metrics capture human judgments of “opinion similarity” but not “opinion difference”. Based on our analysis, we identify areas where the current metrics should be improved. We further propose to learn a similarity metric for opinion similarity via fine-tuning the Sentence-BERT sentence-embedding network based on review text and weak supervision by review ratings. Experiments show that our learned metric outperforms existing text similarity metrics and especially show significantly higher correlations with human annotations for differing opinions.

2019

pdf bib
How to Best Use Syntax in Semantic Role Labelling
Yufei Wang | Mark Johnson | Stephen Wan | Yifang Sun | Wei Wang
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

There are many different ways in which external information might be used in a NLP task. This paper investigates how external syntactic information can be used most effectively in the Semantic Role Labeling (SRL) task. We evaluate three different ways of encoding syntactic parses and three different ways of injecting them into a state-of-the-art neural ELMo-based SRL sequence labelling model. We show that using a constituency representation as input features improves performance the most, achieving a new state-of-the-art for non-ensemble SRL models on the in-domain CoNLL’05 and CoNLL’12 benchmarks.

pdf bib
Red-faced ROUGE: Examining the Suitability of ROUGE for Opinion Summary Evaluation
Wenyi Tay | Aditya Joshi | Xiuzhen Zhang | Sarvnaz Karimi | Stephen Wan
Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association

One of the most common metrics to automatically evaluate opinion summaries is ROUGE, a metric developed for text summarisation. ROUGE counts the overlap of word or word units between a candidate summary against reference summaries. This formulation treats all words in the reference summary equally. In opinion summaries, however, not all words in the reference are equally important. Opinion summarisation requires to correctly pair two types of semantic information: (1) aspect or opinion target; and (2) polarity of candidate and reference summaries. We investigate the suitability of ROUGE for evaluating opin-ion summaries of online reviews. Using three simulation-based experiments, we evaluate the behaviour of ROUGE for opinion summarisation on the ability to match aspect and polarity. We show that ROUGE cannot distinguish opinion summaries of similar or opposite polarities for the same aspect. Moreover,ROUGE scores have significant variance under different configuration settings. As a result, we present three recommendations for future work that uses ROUGE to evaluate opinion summarisation.

2017

pdf bib
Demographic Inference on Twitter using Recursive Neural Networks
Sunghwan Mac Kim | Qiongkai Xu | Lizhen Qu | Stephen Wan | Cécile Paris
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

In social media, demographic inference is a critical task in order to gain a better understanding of a cohort and to facilitate interacting with one’s audience. Most previous work has made independence assumptions over topological, textual and label information on social networks. In this work, we employ recursive neural networks to break down these independence assumptions to obtain inference about demographic characteristics on Twitter. We show that our model performs better than existing models including the state-of-the-art.

2016

pdf bib
The Role of Features and Context on Suicide Ideation Detection
Yufei Wang | Stephen Wan | Cécile Paris
Proceedings of the Australasian Language Technology Association Workshop 2016

pdf bib
Data61-CSIRO systems at the CLPsych 2016 Shared Task
Sunghwan Mac Kim | Yufei Wang | Stephen Wan | Cécile Paris
Proceedings of the Third Workshop on Computational Linguistics and Clinical Psychology

pdf bib
CSIRO Data61 at the WNUT Geo Shared Task
Gaya Jayasinghe | Brian Jin | James Mchugh | Bella Robinson | Stephen Wan
Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)

In this paper, we describe CSIRO Data61’s participation in the Geolocation shared task at the Workshop for Noisy User-generated Text. Our approach was to use ensemble methods to capitalise on four component methods: heuristics based on metadata, a label propagation method, timezone text classifiers, and an information retrieval approach. The ensembles we explored focused on examining the role of language technologies in geolocation prediction and also in examining the use of hard voting and cascading ensemble methods. Based on the accuracy of city-level predictions, our systems were the best performing submissions at this year’s shared task. Furthermore, when estimating the latitude and longitude of a user, our median error distance was accurate to within 30 kilometers.

pdf bib
The Effects of Data Collection Methods in Twitter
Sunghwan Mac Kim | Stephen Wan | Cécile Paris | Brian Jin | Bella Robinson
Proceedings of the First Workshop on NLP and Computational Social Science

pdf bib
Detecting Social Roles in Twitter
Sunghwan Mac Kim | Stephen Wan | Cécile Paris
Proceedings of the Fourth International Workshop on Natural Language Processing for Social Media

2015

pdf bib
Ranking election issues through the lens of social media
Stephen Wan | Cécile Paris
Proceedings of the 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH)

2014

pdf bib
Proceedings of the Australasian Language Technology Association Workshop 2014
Gabriela Ferraro | Stephen Wan
Proceedings of the Australasian Language Technology Association Workshop 2014

2013

pdf bib
A Study: From Electronic Laboratory Notebooks to Generated Queries for Literature Recommendation
Oldooz Dianat | Cécile Paris | Stephen Wan
Proceedings of the Australasian Language Technology Association Workshop 2013 (ALTA 2013)

2011

pdf bib
Proceedings of the Workshop on Monolingual Text-To-Text Generation
Katja Filippova | Stephen Wan
Proceedings of the Workshop on Monolingual Text-To-Text Generation

2009

pdf bib
Improving Grammaticality in Statistical Sentence Generation: Introducing a Dependency Spanning Tree Algorithm with an Argument Satisfaction Model
Stephen Wan | Mark Dras | Robert Dale | Cécile Paris
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

pdf bib
Designing a Citation-Sensitive Research Tool: An Initial Study of Browsing-Specific Information Needs
Stephen Wan | Cécile Paris | Michael Muthukrishna | Robert Dale
Proceedings of the 2009 Workshop on Text and Citation Analysis for Scholarly Digital Libraries (NLPIR4DL)

2008

pdf bib
In-Browser Summarisation: Generating Elaborative Summaries Biased Towards the Reading Context
Stephen Wan | Cécile Paris
Proceedings of ACL-08: HLT, Short Papers

pdf bib
Seed and Grow: Augmenting Statistically Generated Summary Sentences using Schematic Word Patterns
Stephen Wan | Robert Dale | Mark Dras | Cécile Paris
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing

2007

pdf bib
GLEU: Automatic Evaluation of Sentence-Level Fluency
Andrew Mutton | Mark Dras | Stephen Wan | Robert Dale
Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics

2006

pdf bib
Using Dependency-Based Features to Take the ’Para-farce’ out of Paraphrase
Stephen Wan | Mark Dras | Robert Dale | Cécile Paris
Proceedings of the Australasian Language Technology Workshop 2006

pdf bib
Proceedings of the Fourth International Natural Language Generation Conference
Nathalie Colineau | Cécile Paris | Stephen Wan | Robert Dale
Proceedings of the Fourth International Natural Language Generation Conference

2005

pdf bib
Towards Statistical Paraphrase Generation: Preliminary Evaluations of Grammaticality
Stephen Wan | Mark Dras | Robert Dale | Cécile Paris
Proceedings of the Third International Workshop on Paraphrasing (IWP2005)

pdf bib
Searching for Grammaticality: Propagating Dependencies in the Viterbi Algorithm
Stephen Wan | Robert Dale | Mark Dras
Proceedings of the Tenth European Workshop on Natural Language Generation (ENLG-05)

pdf bib
Proceedings of the ACL Student Research Workshop
Chris Callison-Burch | Stephen Wan
Proceedings of the ACL Student Research Workshop

2004

pdf bib
Generating Overview Summaries of Ongoing Email Thread Discussions
Stephen Wan | Kathy McKeown
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

pdf bib
Proceedings of the Australasian Language Technology Workshop 2004
Ash Asudeh | Cecile Paris | Stephen Wan
Proceedings of the Australasian Language Technology Workshop 2004

2003

pdf bib
Straight to the point: Discovering themes for summary generation
Stephen Wan | Mark Dras | Cecile Paris | Robert Dale
Proceedings of the Australasian Language Technology Workshop 2003

pdf bib
Using Thematic Information in Statistical Headline Generation
Stephen Wan | Mark Dras | Cécile Paris | Robert Dale
Proceedings of the ACL 2003 Workshop on Multilingual Summarization and Question Answering

1998

pdf bib
Automatic English-Chinese name transliteration for development of multilingual resources
Stephen Wan | Cornelia Maria Verspoor
36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 2

pdf bib
Automatic English-Chinese name transliteration for development of multilingual resources
Stephen Wan | Cornelia Maria Verspoor
COLING 1998 Volume 2: The 17th International Conference on Computational Linguistics