AutoScraper: A Progressive Understanding Web Agent for Web Scraper Generation

Wenhao Huang, Zhouhong Gu, Chenghao Peng, Jiaqing Liang, Zhixu Li, Yanghua Xiao, Liqian Wen, Zulong Chen


Abstract
Web scraping is a powerful technique that extracts data from websites, enabling automated data collection, enhancing data analysis capabilities, and minimizing manual data entry efforts. Among existing methods, wrapper-based methods suffer from limited adaptability and scalability when faced with a new website, while language agents, empowered by large language models (LLMs), exhibit poor reusability in diverse web environments. In this work, we introduce the paradigm of generating web scrapers with LLMs and propose AutoScraper, a two-stage framework that can handle diverse and changing web environments more efficiently. AutoScraper leverages the hierarchical structure of HTML and the similarity across different web pages to generate web scrapers. We also propose a new executability metric for better measuring the performance of web scraper generation tasks. We conduct comprehensive experiments with multiple LLMs and demonstrate the effectiveness of our framework. Our work is now open-source.
Anthology ID:
2024.emnlp-main.141
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
2371–2389
URL:
https://aclanthology.org/2024.emnlp-main.141/
DOI:
10.18653/v1/2024.emnlp-main.141
Cite (ACL):
Wenhao Huang, Zhouhong Gu, Chenghao Peng, Jiaqing Liang, Zhixu Li, Yanghua Xiao, Liqian Wen, and Zulong Chen. 2024. AutoScraper: A Progressive Understanding Web Agent for Web Scraper Generation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 2371–2389, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
AutoScraper: A Progressive Understanding Web Agent for Web Scraper Generation (Huang et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.141.pdf
Software:
 2024.emnlp-main.141.software.zip
Data:
 2024.emnlp-main.141.data.zip