SPACE-IDEAS: A Dataset for Salient Information Detection in Space Innovation

Andres Garcia-Silva, Cristian Berrio, Jose Manuel Gomez-Perez


Abstract
Detecting salient parts in text using natural language processing has been widely used to mitigate the effects of information overflow. Nevertheless, most of the datasets available for this task are derived mainly from academic publications. We introduce SPACE-IDEAS, a dataset for salient information detection from innovation ideas related to the Space domain. The text in SPACE-IDEAS varies greatly and includes informal, technical, academic and business-oriented writing styles. In addition to a manually annotated dataset, we release an extended version that is annotated using a large generative language model. We train different sentence and sequential sentence classifiers, and show that the automatically annotated dataset can be leveraged using multitask learning to train better classifiers.
Anthology ID:
2024.lrec-main.1311
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
15087–15092
URL:
https://aclanthology.org/2024.lrec-main.1311
Cite (ACL):
Andres Garcia-Silva, Cristian Berrio, and Jose Manuel Gomez-Perez. 2024. SPACE-IDEAS: A Dataset for Salient Information Detection in Space Innovation. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 15087–15092, Torino, Italia. ELRA and ICCL.
Cite (Informal):
SPACE-IDEAS: A Dataset for Salient Information Detection in Space Innovation (Garcia-Silva et al., LREC-COLING 2024)
PDF:
https://aclanthology.org/2024.lrec-main.1311.pdf