2022
WeCanTalk: A New Multi-language, Multi-modal Resource for Speaker Recognition
Karen Jones | Kevin Walker | Christopher Caruso | Jonathan Wright | Stephanie Strassel
Proceedings of the Thirteenth Language Resources and Evaluation Conference
The WeCanTalk (WCT) Corpus is a new multi-language, multi-modal resource for speaker recognition. The corpus contains Cantonese, Mandarin and English telephony and video speech data from over 200 multilingual speakers located in Hong Kong. Each speaker contributed at least 10 telephone conversations of 8-10 minutes’ duration collected via a custom telephone platform based in Hong Kong. Speakers also uploaded at least 3 videos in which they were both speaking and visible, along with one selfie image. At least half of the calls and videos for each speaker were in Cantonese, while their remaining recordings featured one or more different languages. Both calls and videos were made in a variety of noise conditions. All speech and video recordings were audited by experienced multilingual annotators for quality including presence of the expected language and for speaker identity. The WeCanTalk Corpus has been used to support the NIST 2021 Speaker Recognition Evaluation and will be published in the LDC catalog.
2020
The SAFE-T Corpus: A New Resource for Simulated Public Safety Communications
Dana Delgado | Kevin Walker | Stephanie Strassel | Karen Jones | Christopher Caruso | David Graff
Proceedings of the Twelfth Language Resources and Evaluation Conference
We introduce a new resource, the SAFE-T (Speech Analysis for Emergency Response Technology) Corpus, designed to simulate first-responder communications by inducing high vocal effort and urgent speech with situational background noise in a game-based collection protocol. Linguistic Data Consortium developed the SAFE-T Corpus to support the NIST (National Institute of Standards and Technology) OpenSAT (Speech Analytic Technologies) evaluation series, whose goal is to advance speech analytic technologies including automatic speech recognition, speech activity detection and keyword search in multiple domains including simulated public safety communications data. The corpus comprises over 300 hours of audio from 115 unique speakers engaged in a collaborative problem-solving activity representative of public safety communications in terms of speech content, noise types and noise levels. Portions of the corpus have been used in the OpenSAT 2019 evaluation and the full corpus will be published in the LDC catalog. We describe the design and implementation of the SAFE-T Corpus collection, discuss the approach of capturing spontaneous speech from study participants through game-based speech collection, and report on the collection results including several challenges associated with the collection.
Call My Net 2: A New Resource for Speaker Recognition
Karen Jones | Stephanie Strassel | Kevin Walker | Jonathan Wright
Proceedings of the Twelfth Language Resources and Evaluation Conference
We introduce the Call My Net 2 (CMN2) Corpus, a new resource for speaker recognition featuring Tunisian Arabic conversations between friends and family, incorporating both traditional telephony and VoIP data. The corpus contains data from over 400 Tunisian Arabic speakers collected via a custom-built platform deployed in Tunis, with each speaker making 10 or more calls each lasting up to 10 minutes. Calls include speech in various realistic and natural acoustic settings, both noisy and non-noisy. Speakers used a variety of handsets, including landline and mobile devices, and made VoIP calls from tablets or computers. All calls were subject to a series of manual and automatic quality checks, including speech duration, audio quality, language identity and speaker identity. The CMN2 corpus has been used in two NIST Speaker Recognition Evaluations (SRE18 and SRE19), and the SRE test sets as well as the full CMN2 corpus will be published in the Linguistic Data Consortium Catalog. We describe CMN2 corpus requirements, the telephone collection platform, and procedures for call collection. We review properties of the CMN2 dataset and discuss features of the corpus that distinguish it from prior SRE collection efforts, including some of the technical challenges encountered with collecting VoIP data.
2016
Multi-language Speech Collection for NIST LRE
Karen Jones | Stephanie Strassel | Kevin Walker | David Graff | Jonathan Wright
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
The Multi-language Speech (MLS) Corpus supports NIST’s Language Recognition Evaluation series by providing new conversational telephone speech and broadcast narrowband data in 20 languages/dialects. The corpus was built to test how well systems distinguish closely related or confusable linguistic varieties, and careful manual auditing of collected data was an important aspect of this work. This paper lists the specific data requirements for the collection and provides both a commentary on the rationale for those requirements and an outline of the steps taken to ensure all goals were met as specified. LDC conducted a large-scale recruitment effort involving the implementation of candidate assessment and interview techniques suitable for hiring a large contingent of telecommuting workers, and this recruitment effort is discussed in detail. We also describe the telephone and broadcast collection infrastructure and protocols, and provide details of the steps taken to pre-process collected data prior to auditing. Finally, annotation training, procedures and outcomes are presented in detail.
2014
The RATS Collection: Supporting HLT Research with Degraded Audio Data
David Graff | Kevin Walker | Stephanie Strassel | Xiaoyi Ma | Karen Jones | Ann Sawyer
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
The DARPA RATS program was established to foster development of language technology systems that can perform well on speaker-to-speaker communications over radio channels that evince a wide range in the type and extent of signal variability and acoustic degradation. Creating suitable corpora to address this need poses an equally wide range of challenges for the collection, annotation and quality assessment of relevant data. This paper describes LDC's multi-year effort to build the RATS data collection, summarizes the content and properties of the resulting corpora, and discusses the novel problems and approaches involved in ensuring that the data would satisfy its intended use: to provide speech recordings and annotations for training and evaluating HLT systems that perform four specific tasks on difficult radio channels, namely Speech Activity Detection (SAD), Language Identification (LID), Speaker Identification (SID) and Keyword Spotting (KWS).
2007
Last Words: Computational Linguistics: What About the Linguistics?
Karen Spärck Jones
Computational Linguistics, Volume 33, Number 3, September 2007
2005
ACL Lifetime Achievement Award: Some Points in a Time
Karen Spärck Jones
Computational Linguistics, Volume 31, Number 1, March 2005
1997
Summarising: Where are we now? Where should we go?
Karen Sparck Jones
Intelligent Scalable Text Summarization
1994
Towards Better NLP System Evaluation
Karen Sparck Jones
Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994
1993
Document retrieval and text retrieval
Karen Sparck Jones
Human Language Technology: Proceedings of a Workshop Held at Plainsboro, New Jersey, March 21-24, 1993
Summarising as a Lever for Studying Large-Scale Discourse Structure
Karen Sparck Jones
Intentionality and Structure in Discourse Relations
1988
User Models, Discourse Models, and Some Others
Karen Sparck Jones
Computational Linguistics, Volume 14, Number 3, September 1988
Book Reviews: Semantic Interpretation and the Resolution of Ambiguity
Karen Sparck Jones
Computational Linguistics, Volume 14, Number 4, December 1988
1987
A Note on a Study of Cases
Karen Sparck Jones | Branimir Boguraev
Computational Linguistics, Formerly the American Journal of Computational Linguistics, Volume 13, Numbers 1-2, January-June 1987
They say it’s a new sort of engine: but the SUMP’s still there
Karen Sparck Jones
Theoretical Issues in Natural Language Processing 3
1984
Panel: Natural Language and Databases, Again
Karen Sparck Jones
10th International Conference on Computational Linguistics and 22nd Annual Meeting of the Association for Computational Linguistics
Linguistically Motivated Descriptive Term Selection
K. Sparck Jones | J.I. Tait
10th International Conference on Computational Linguistics and 22nd Annual Meeting of the Association for Computational Linguistics
1983
How to Drive a Database Front End Using General Semantic Information
B.K. Boguraev | K. Sparck Jones
First Conference on Applied Natural Language Processing
Letters to the Editor: Re Ballard on the Need for Careful Description
Karen Sparck Jones
American Journal of Computational Linguistics, Volume 9, Number 2, April-June 1983
1961
Mechanised semantic classification
Karen Sparck-Jones
Proceedings of the International Conference on Machine Translation and Applied Language Analysis