Ilham Firdausi Putra


2023

pdf bib
NusaCrowd: Open Source Initiative for Indonesian NLP Resources
Samuel Cahyawijaya | Holy Lovenia | Alham Fikri Aji | Genta Winata | Bryan Wilie | Fajri Koto | Rahmad Mahendra | Christian Wibisono | Ade Romadhony | Karissa Vincentio | Jennifer Santoso | David Moeljadi | Cahya Wirawan | Frederikus Hudi | Muhammad Satrio Wicaksono | Ivan Parmonangan | Ika Alfina | Ilham Firdausi Putra | Samsul Rahmadani | Yulianti Oenang | Ali Septiandri | James Jaya | Kaustubh Dhole | Arie Suryani | Rifki Afina Putri | Dan Su | Keith Stevens | Made Nindyatama Nityasya | Muhammad Adilazuarda | Ryan Hadiwijaya | Ryandito Diandaru | Tiezheng Yu | Vito Ghifari | Wenliang Dai | Yan Xu | Dyah Damapuspita | Haryo Wibowo | Cuk Tho | Ichwanul Karo Karo | Tirana Fatyanosa | Ziwei Ji | Graham Neubig | Timothy Baldwin | Sebastian Ruder | Pascale Fung | Herry Sujaini | Sakriani Sakti | Ayu Purwarianti
Findings of the Association for Computational Linguistics: ACL 2023

We present NusaCrowd, a collaborative initiative to collect and unify existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, we have brought together 137 datasets and 118 standardized data loaders. The quality of the datasets has been assessed manually and automatically, and their value is demonstrated through multiple experiments.NusaCrowd’s data collection enables the creation of the first zero-shot benchmarks for natural language understanding and generation in Indonesian and the local languages of Indonesia. Furthermore, NusaCrowd brings the creation of the first multilingual automatic speech recognition benchmark in Indonesian and the local languages of Indonesia. Our work strives to advance natural language processing (NLP) research for languages that are under-represented despite being widely spoken.

2020

pdf bib
Knowing Right from Wrong: Should We Use More Complex Models for Automatic Short-Answer Scoring in Bahasa Indonesia?
Ali Akbar Septiandri | Yosef Ardhito Winatmoko | Ilham Firdausi Putra
Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing

We compare three solutions to UKARA 1.0 challenge on automated short-answer scoring: single classical, ensemble classical, and deep learning. The task is to classify given answers to two questions, whether they are right or wrong. While recent development shows increasing model complexity to push the benchmark performances, they tend to be resource-demanding with mundane improvement. For the UKARA task, we found that bag-of-words and classical machine learning approaches can compete with ensemble models and Bi-LSTM model with pre-trained word2vec embedding from 200 million words. In this case, the single classical machine learning achieved less than 2% difference in F1 compared to the deep learning approach with 1/18 time for model training.